Paper title
Efficient Clustering from Distributions over Topics
Paper authors
Paper abstract
There are many scenarios in which we may want to find pairs of textually similar documents in a large corpus (e.g. a researcher doing a literature review, or an R&D project manager analyzing project proposals). Discovering those connections programmatically can help experts achieve those goals, but brute-force pairwise comparison is not computationally feasible when the document corpus is too large. Some algorithms in the literature divide the search space into regions containing potentially similar documents, which are later processed separately from the rest in order to reduce the number of pairs compared. However, such unsupervised methods still incur high time costs. In this paper, we present an approach that relies on the results of a topic modeling algorithm over the documents in a collection as a means to identify smaller subsets of documents on which the similarity function can then be computed. This approach has shown promising results when identifying similar documents in the domain of scientific publications. We have compared our approach against state-of-the-art clustering techniques and against different configurations of the topic modeling algorithm. Results suggest that our approach outperforms (> 0.5) the other analyzed techniques in terms of efficiency.
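The general idea of restricting similarity computation to topic-based subsets can be illustrated with a minimal sketch. The abstract does not specify how subsets are formed or which similarity function is used, so everything here is an assumption made for illustration: documents are grouped by their single most probable topic (the hypothetical `top_topics` key), similarity is one minus the Jensen-Shannon divergence, and the names `similar_pairs`, `k`, and `threshold` are invented for this example.

```python
import math
from collections import defaultdict
from itertools import combinations


def top_topics(dist, k=1):
    """Return the indices of the k most probable topics as a hashable key.

    Assumption: grouping by top-k topics is an illustrative choice, not
    necessarily the rule used in the paper.
    """
    ranked = sorted(range(len(dist)), key=lambda i: dist[i], reverse=True)
    return tuple(sorted(ranked[:k]))


def js_similarity(p, q):
    """Return 1 - Jensen-Shannon divergence (base 2); 1.0 means identical."""
    def kl(a, b):
        # Terms with a zero probability in `a` contribute nothing.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 1.0 - 0.5 * (kl(p, m) + kl(q, m))


def similar_pairs(doc_topics, k=1, threshold=0.9):
    """Compare only documents that share the same top-k topic key."""
    buckets = defaultdict(list)
    for doc_id, dist in doc_topics.items():
        buckets[top_topics(dist, k)].append(doc_id)

    pairs = []
    for members in buckets.values():
        for a, b in combinations(sorted(members), 2):
            score = js_similarity(doc_topics[a], doc_topics[b])
            if score >= threshold:
                pairs.append((a, b, score))
    return pairs


# Toy topic distributions (made-up numbers, three topics per document).
doc_topics = {
    "d1": [0.80, 0.15, 0.05],
    "d2": [0.75, 0.20, 0.05],
    "d3": [0.10, 0.85, 0.05],
    "d4": [0.05, 0.10, 0.85],
}
pairs = similar_pairs(doc_topics)
print(pairs)
```

With these four toy documents, only `d1` and `d2` share a dominant topic, so a single similarity computation is performed instead of the six that a brute-force pairwise comparison would need. The trade-off in this sketch is that similar documents whose dominant topics differ are never compared; a larger `k` changes how documents are grouped but does not remove that limitation.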