A novel sentence embedding based topic detection method for micro-blog

Paper Title

A novel sentence embedding based topic detection method for micro-blog

Authors

Wan, Cong; Jiang, Shan; Wang, Cuirong; Wang, Cong; Xu, Changming; Chen, Xianxia; Yuan, Ying

Abstract

Topic detection is a challenging task, especially when the exact number of topics is unknown. In this paper, we present a novel neural-network-based approach to detecting topics in micro-blog datasets. We use an unsupervised neural sentence embedding model to map blog posts into an embedding space. The model is a weighted power mean word embedding model whose weights are computed by an attention mechanism. Experimental results show that our embedding method outperforms the baselines in sentence clustering. In addition, we propose an improved clustering algorithm, relationship-aware DBSCAN (RADBSCAN). It discovers topics in a micro-blog dataset, and the number of topics is determined by the characteristics of the dataset itself. Moreover, to address the problem of parameter sensitivity, we use the blog forwarding relationship as a bridge between two otherwise independent clusters. Finally, we validate our approach on a dataset from Sina micro-blog. The results show that we can detect all the topics successfully and extract keywords for each topic.
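The abstract names two ideas that a small sketch may make more concrete: an attention-weighted power mean over word vectors, and the use of forwarding relationships to bridge clusters. The Python below is only an illustration under stated assumptions, not the authors' implementation. The attention scoring (dot product with the mean word vector), the choice of powers (1, +inf, -inf), and all function names are hypothetical. The merge step is a post-hoc simplification applied to labels from any DBSCAN-style clustering, where -1 marks noise; the paper's RADBSCAN may use the relationship differently.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(word_vectors):
    """Assumed attention: score each word by its dot product with the
    mean word vector, then normalize the scores with softmax."""
    n = len(word_vectors)
    query = [sum(col) / n for col in zip(*word_vectors)]
    scores = [sum(a * b for a, b in zip(v, query)) for v in word_vectors]
    return softmax(scores)

def weighted_power_mean(word_vectors, weights, p):
    """Per-dimension weighted power mean, limited here to p = 1, +inf, -inf."""
    columns = list(zip(*word_vectors))
    if p == 1:
        # Weighted arithmetic mean (weights are assumed to sum to 1).
        return [sum(w * x for w, x in zip(weights, col)) for col in columns]
    if p == math.inf:
        # Limit p -> +inf: per-dimension maximum (weights drop out).
        return [max(col) for col in columns]
    if p == -math.inf:
        # Limit p -> -inf: per-dimension minimum (weights drop out).
        return [min(col) for col in columns]
    raise ValueError("this sketch supports only p in {1, +inf, -inf}")

def embed_sentence(word_vectors, powers=(1, math.inf, -math.inf)):
    """Concatenate the weighted power means for each power in `powers`."""
    weights = attention_weights(word_vectors)
    embedding = []
    for p in powers:
        embedding.extend(weighted_power_mean(word_vectors, weights, p))
    return embedding

def merge_by_forwarding(labels, forwards):
    """Merge clusters connected by a forwarding pair (i, j) of post indices.

    labels[i] is the cluster id of post i; -1 means noise and is never merged.
    """
    parent = {c: c for c in set(labels) if c != -1}

    def find(c):
        # Union-find lookup with path halving.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for i, j in forwards:
        a, b = labels[i], labels[j]
        if a != -1 and b != -1:
            ra, rb = find(a), find(b)
            if ra != rb:
                # Keep the smaller id as the representative of the merged cluster.
                parent[max(ra, rb)] = min(ra, rb)
    return [c if c == -1 else find(c) for c in labels]
```

For example, `merge_by_forwarding([0, 0, 1, 1, 2, -1], [(1, 2)])` returns `[0, 0, 0, 0, 2, -1]`: the forward from post 1 to post 2 joins clusters 0 and 1, while cluster 2 and the noise point are left alone. Merging after clustering keeps the sketch independent of any particular DBSCAN implementation.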
