Paper Title

Robust Document Representations using Latent Topics and Metadata

Paper Authors

Natraj Raman, Armineh Nourbakhsh, Sameena Shah, Manuela Veloso

Paper Abstract

Task specific fine-tuning of a pre-trained neural language model using a custom softmax output layer is the de facto approach of late when dealing with document classification problems. This technique is not adequate when labeled examples are not available at training time and when the metadata artifacts in a document must be exploited. We address these challenges by generating document representations that capture both text and metadata artifacts in a task agnostic manner. Instead of traditional auto-regressive or auto-encoding based training, our novel self-supervised approach learns a soft-partition of the input space when generating text embeddings. Specifically, we employ a pre-learned topic model distribution as surrogate labels and construct a loss function based on KL divergence. Our solution also incorporates metadata explicitly rather than just augmenting them with text. The generated document embeddings exhibit compositional characteristics and are directly used by downstream classification tasks to create decision boundaries from a small number of labeled examples, thereby eschewing complicated recognition methods. We demonstrate through extensive evaluation that our proposed cross-model fusion solution outperforms several competitive baselines on multiple datasets.
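The abstract describes learning a soft partition of the input space by treating pre-learned topic-model distributions (e.g. from LDA) as surrogate labels and minimizing a KL-divergence loss against the encoder's predicted topic distribution. The sketch below illustrates that idea only; it is not the authors' code, and the encoder interface, the projection head, and all dimensions (K_TOPICS, EMBED_DIM) are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the paper's implementation) of training a
# document encoder with a KL-divergence loss against pre-learned topic-model
# distributions used as surrogate labels.
import torch
import torch.nn as nn
import torch.nn.functional as F

K_TOPICS = 50    # assumed number of topics in the pre-learned topic model
EMBED_DIM = 768  # assumed size of the encoder's document embedding

class TopicSupervisedHead(nn.Module):
    """Projects a document embedding onto a soft partition over K topics."""
    def __init__(self, embed_dim: int = EMBED_DIM, k_topics: int = K_TOPICS):
        super().__init__()
        self.proj = nn.Linear(embed_dim, k_topics)

    def forward(self, doc_embedding: torch.Tensor) -> torch.Tensor:
        # Return log-probabilities, as required by F.kl_div's first argument.
        return F.log_softmax(self.proj(doc_embedding), dim=-1)

def topic_kl_loss(pred_log_probs: torch.Tensor,
                  topic_model_dist: torch.Tensor) -> torch.Tensor:
    """KL divergence between the topic model's document-topic distribution
    (the surrogate label) and the encoder head's predicted distribution."""
    return F.kl_div(pred_log_probs, topic_model_dist, reduction="batchmean")

# Usage with dummy tensors standing in for a batch of 8 documents:
head = TopicSupervisedHead()
doc_embeddings = torch.randn(8, EMBED_DIM)                # encoder output (assumed)
surrogate = torch.softmax(torch.randn(8, K_TOPICS), -1)   # pre-learned topic dists
loss = topic_kl_loss(head(doc_embeddings), surrogate)
loss.backward()
```

Note that F.kl_div expects log-probabilities as its first argument, which is why the head applies log_softmax; the explicit metadata fusion and the compositional properties claimed in the abstract are beyond the scope of this sketch.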
