Paper Title
Effective Seed-Guided Topic Discovery by Integrating Multiple Types of Contexts
Paper Authors
Paper Abstract
Instead of mining coherent topics from a given text corpus in a completely unsupervised manner, seed-guided topic discovery methods leverage user-provided seed words to extract distinctive and coherent topics so that the mined topics can better cater to the user's interest. To model the semantic correlation between words and seeds for discovering topic-indicative terms, existing seed-guided approaches utilize different types of context signals, such as document-level word co-occurrences, sliding window-based local contexts, and generic linguistic knowledge brought by pre-trained language models. In this work, we analyze and show empirically that each type of context information has its value and limitation in modeling word semantics under seed guidance, but combining three types of contexts (i.e., word embeddings learned from local contexts, pre-trained language model representations obtained from general-domain training, and topic-indicative sentences retrieved based on seed information) allows them to complement each other for discovering quality topics. We propose an iterative framework, SeedTopicMine, which jointly learns from the three types of contexts and gradually fuses their context signals via an ensemble ranking process. Under various sets of seeds and on multiple datasets, SeedTopicMine consistently yields more coherent and accurate topics than existing seed-guided topic discovery approaches.
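The abstract describes fusing the signals from the three context types via an ensemble ranking process. As a generic illustration of how ranked candidate-term lists from several sources can be fused, the sketch below uses reciprocal-rank fusion; the function name, scoring scheme, and toy data are illustrative assumptions, not SeedTopicMine's actual procedure.

```python
# Generic sketch: fuse ranked term lists from multiple context sources
# via reciprocal-rank fusion. Illustrative only -- not the paper's method.

def ensemble_rank(rankings, k=60):
    """Fuse several ranked lists of terms (best term first).

    Each term's fused score is the sum of 1 / (k + rank) over the lists
    it appears in; the constant k damps the influence of low ranks.
    Returns all terms sorted by fused score, best first.
    """
    scores = {}
    for ranking in rankings:
        for rank, term in enumerate(ranking, start=1):
            scores[term] = scores.get(term, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Toy rankings, standing in for candidates surfaced by word embeddings,
# a pre-trained language model, and retrieved topic-indicative sentences.
emb = ["stocks", "market", "shares", "trade"]
plm = ["market", "stocks", "economy", "shares"]
ret = ["stocks", "economy", "market", "prices"]

fused = ensemble_rank([emb, plm, ret])
# "stocks" ranks first: it is top-ranked in two of the three lists.
```

Terms that rank highly under several context types float to the top of the fused list, which mirrors the abstract's point that the three signals complement one another.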