Moto: Enhancing Embedding with Multiple Joint Factors for Chinese Text Classification

Paper title

Moto: Enhancing Embedding with Multiple Joint Factors for Chinese Text Classification

Authors

Xunzhu Tang, Rujie Zhu, Tiezhu Sun, Shi Wang

Abstract

Recently, language representation techniques have achieved strong performance in text classification. However, most existing representation models are designed specifically for English and may fail on Chinese because of the large differences between the two languages. In fact, few existing methods for Chinese text classification process text at a single level. However, as Chinese is a special kind of hieroglyphic script, the radicals of Chinese characters are good semantic carriers. In addition, Pinyin codes carry the semantics of tones, Wubi reflects stroke-structure information, etc. Unfortunately, previous research has neglected to find an effective way to distill the useful parts of these four factors and fuse them. In this work, we propose a novel model called Moto: Enhancing Embedding with Multiple Joint Factors. Specifically, we design an attention mechanism that distills the useful parts by fusing the four-level information above more effectively. We conduct extensive experiments on four popular tasks. The empirical results show that Moto achieves state-of-the-art results: 0.8316 (F1-score, 2.11% improvement) on Chinese news titles, 96.38 (1.24% improvement) on Fudan Corpus, and 0.9633 (3.26% improvement) on THUCNews.
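
The abstract does not spell out the attention mechanism, so the following is only a minimal sketch of one generic way to fuse several per-factor embeddings with attention weights. It is not the authors' architecture. The function name `fuse_factors`, the `query` vector, the scaled dot-product scoring, and the choice of "character" as the fourth factor alongside radical, Pinyin, and Wubi are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_factors(factor_vectors, query):
    """Attention-weighted sum of per-factor embeddings (hypothetical helper).

    factor_vectors: dict mapping a factor name to an embedding (list of floats).
    query: a vector of the same dimension; in a real model it would be learned.
    Returns the fused embedding and the attention weight given to each factor.
    """
    names = list(factor_vectors)
    dim = len(query)
    # Scaled dot-product score of each factor embedding against the query.
    scores = [
        sum(q * x for q, x in zip(query, factor_vectors[name])) / math.sqrt(dim)
        for name in names
    ]
    weights = softmax(scores)
    # Weighted sum of the factor embeddings, one dimension at a time.
    fused = [
        sum(w * factor_vectors[name][i] for w, name in zip(weights, names))
        for i in range(dim)
    ]
    return fused, dict(zip(names, weights))


# Toy 3-dimensional embeddings for one character; the values are made up.
factors = {
    "character": [0.2, 0.1, 0.4],
    "radical": [0.5, 0.0, 0.1],
    "pinyin": [0.1, 0.3, 0.2],
    "wubi": [0.0, 0.2, 0.6],
}
query = [1.0, 0.0, 1.0]
fused, weights = fuse_factors(factors, query)
```

Here `weights` sums to 1 across the four factors, and `fused` has the same dimension as each input embedding, so it could be passed to a downstream classifier in place of a single-level embedding.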
