Paper Title
Improving unsupervised neural aspect extraction for online discussions using out-of-domain classification
Paper Authors
Paper Abstract
Deep learning architectures based on self-attention have recently achieved and surpassed state-of-the-art results in unsupervised aspect extraction and topic modeling. While models such as the neural attention-based aspect extraction model (ABAE) have been successfully applied to user-generated texts, they yield less coherent aspects when applied to traditional data sources such as news articles and newsgroup documents. In this work, we introduce a simple approach based on sentence filtering that improves the topical aspects learned from newsgroup content without modifying the underlying mechanism of ABAE. We train a probabilistic classifier to distinguish between out-of-domain texts (an outer dataset) and in-domain texts (the target dataset). Then, during data preparation, we filter out sentences that have a low probability of being in-domain and train the neural model on the remaining sentences. We demonstrate the positive effect of sentence filtering on topic coherence by comparing against aspect extraction models trained on unfiltered texts.
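The following is a minimal sketch of the sentence-filtering step described in the abstract. The abstract does not specify the classifier, features, or probability cutoff; this sketch assumes a TF-IDF plus logistic regression pipeline (scikit-learn) as the probabilistic classifier and a hypothetical threshold of 0.5, both of which are illustrative choices rather than the paper's actual configuration.

```python
# Sketch of sentence filtering via an in-domain vs. out-of-domain classifier.
# Assumptions (not from the paper): TF-IDF features, logistic regression,
# and a 0.5 probability threshold.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def filter_in_domain(target_sents, outer_sents, candidate_sents, threshold=0.5):
    """Keep only candidate sentences the classifier deems likely in-domain."""
    # Label target-dataset sentences 1 (in-domain), outer-dataset sentences 0.
    texts = list(target_sents) + list(outer_sents)
    labels = [1] * len(target_sents) + [0] * len(outer_sents)

    clf = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),
        LogisticRegression(max_iter=1000),
    )
    clf.fit(texts, labels)

    # P(in-domain) for each candidate sentence; drop low-probability ones.
    probs = clf.predict_proba(candidate_sents)[:, 1]
    return [s for s, p in zip(candidate_sents, probs) if p >= threshold]

# Usage: the filtered sentences are then fed to an unmodified ABAE model,
# e.g. filtered = filter_in_domain(target_sents, outer_sents, target_sents)
```

Because the filtering happens entirely during data preparation, the downstream aspect extraction model is trained exactly as before, only on a cleaner subset of sentences.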