Paper title
Topic Discovery via Latent Space Clustering of Pretrained Language Model Representations
Paper authors
Paper abstract
Topic models have been prominent tools for automatic topic discovery from text corpora. Despite their effectiveness, topic models suffer from several limitations, including the inability to model word ordering information in documents, the difficulty of incorporating external linguistic knowledge, and the lack of inference methods that are both accurate and efficient for approximating the intractable posterior. Recently, pretrained language models (PLMs) have brought astonishing performance improvements to a wide variety of tasks due to their superior representations of text. Interestingly, there have not been standard approaches for deploying PLMs for topic discovery as better alternatives to topic models. In this paper, we begin by analyzing the challenges of using PLM representations for topic discovery, and then propose a joint latent space learning and clustering framework built upon PLM embeddings. In the latent space, topic-word and document-topic distributions are jointly modeled so that the discovered topics can be interpreted by coherent and distinctive terms while also serving as meaningful summaries of the documents. Our model effectively leverages the strong representation power and superb linguistic features brought by PLMs for topic discovery, and is conceptually simpler than topic models. On two benchmark datasets in different domains, our model generates significantly more coherent and diverse topics than strong topic models, and offers better topic-wise document representations, based on both automatic and human evaluations.
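To make the idea of clustering embeddings into topics concrete, here is a minimal sketch in Python. It is not the paper's framework: it swaps in plain spherical k-means with a deterministic farthest-point initialization, and it uses small synthetic vectors as stand-ins for PLM embeddings. The function names (`spherical_kmeans`, `topic_word_dist`, `doc_topic_dist`), the toy vocabulary, and the `temperature` parameter are all hypothetical choices made for illustration.

```python
import numpy as np


def normalize(x):
    """Project each row onto the unit sphere."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def spherical_kmeans(emb, k, n_iter=20):
    """Cluster embeddings by cosine similarity (a stand-in, not the paper's model)."""
    emb = normalize(emb)
    # Deterministic farthest-point initialization: start from the first point,
    # then repeatedly add the point least similar to the centers chosen so far.
    chosen = [emb[0]]
    for _ in range(1, k):
        sims = emb @ np.array(chosen).T
        chosen.append(emb[np.argmin(sims.max(axis=1))])
    centers = np.array(chosen)
    for _ in range(n_iter):
        labels = np.argmax(emb @ centers.T, axis=1)
        for j in range(k):
            members = emb[labels == j]
            if len(members) > 0:
                centers[j] = members.sum(axis=0)
        centers = normalize(centers)
    labels = np.argmax(emb @ centers.T, axis=1)
    return centers, labels


def topic_word_dist(emb, centers, temperature=0.1):
    """For each topic, a softmax over the vocabulary; returns shape (K, V)."""
    sims = normalize(emb) @ centers.T / temperature
    sims = sims - sims.max(axis=0, keepdims=True)
    p = np.exp(sims)
    return (p / p.sum(axis=0, keepdims=True)).T


def doc_topic_dist(word_ids, emb, centers, temperature=0.1):
    """Average the per-word soft topic assignments of one document."""
    sims = normalize(emb[word_ids]) @ centers.T / temperature
    sims = sims - sims.max(axis=1, keepdims=True)
    p = np.exp(sims)
    p = p / p.sum(axis=1, keepdims=True)
    return p.mean(axis=0)


# Synthetic "embeddings": two noisy groups around orthogonal directions.
# Assumption: real usage would substitute vectors produced by a PLM.
rng = np.random.default_rng(0)
vocab = ["goal", "match", "team", "stock", "market", "bank"]
base = np.eye(8)[:2]
emb = np.vstack([
    base[0] + 0.05 * rng.normal(size=(3, 8)),
    base[1] + 0.05 * rng.normal(size=(3, 8)),
])

centers, labels = spherical_kmeans(emb, k=2)
tw = topic_word_dist(emb, centers)
top_words = [[vocab[i] for i in np.argsort(-tw[t])[:3]] for t in range(2)]
dt = doc_topic_dist(np.array([0, 1, 2]), emb, centers)

print(top_words)
print(dt)
```

In this toy setup each topic is read off from the words most similar to its cluster center, and a document's topic distribution is the average of its words' soft assignments. The framework described in the abstract instead learns the latent space and the clustering jointly, which this sketch does not attempt.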