Paper Title
Distances for Comparing Multisets and Sequences
Paper Authors
Abstract
Measuring the distance between data points is fundamental to many statistical techniques, such as dimension reduction or clustering algorithms. However, improvements in data collection technologies have led to a growing versatility of structured data for which standard distance measures are inapplicable. In this paper, we consider the problem of measuring the distance between sequences and multisets of points lying within a metric space, motivated by the analysis of an in-play football data set. Drawing on the wider literature, including that of time series analysis and optimal transport, we discuss various distances which are available in such an instance. For each distance, we state and prove theoretical properties, proposing possible extensions where they fail. Finally, via an example analysis of the in-play football data, we illustrate the usefulness of these distances in practice.