Paper Title
Enhancing Sindhi Word Segmentation using Subword Representation Learning and Position-aware Self-attention
Paper Authors
Paper Abstract
Sindhi word segmentation is a challenging task due to space omission and insertion issues. The Sindhi language itself adds to this complexity: it is cursive and consists of characters with inherent joining and non-joining properties, independent of word boundaries. Existing Sindhi word segmentation methods rely on designing and combining hand-crafted features. However, these methods have limitations, such as difficulty handling out-of-vocabulary words, limited robustness across languages, and inefficiency on large amounts of noisy or raw text. Neural network-based models, in contrast, can automatically capture word boundary information without requiring prior knowledge. In this paper, we propose a Subword-Guided Neural Word Segmenter (SGNWS) that addresses word segmentation as a sequence labeling task. The SGNWS model incorporates subword representation learning through a bidirectional long short-term memory encoder, position-aware self-attention, and a conditional random field. Our empirical results demonstrate that the SGNWS model achieves state-of-the-art performance in Sindhi word segmentation on six datasets.
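To make the sequence-labeling formulation concrete: segmentation can be cast as assigning one boundary tag per character, which a BiLSTM-CRF then predicts. The abstract does not specify the paper's exact tag set, so the common BIES scheme (B = begin, I = inside, E = end, S = single-character word) used below is an assumption; a minimal round-trip sketch in plain Python:

```python
def words_to_tags(words):
    """Convert a segmented word list into character-level BIES tags.

    B/I/E mark the begin, inside, and end of a multi-character word;
    S marks a single-character word. (BIES is an assumed tag set,
    not necessarily the one used in the paper.)
    """
    chars, tags = [], []
    for w in words:
        chars.extend(w)
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["I"] * (len(w) - 2) + ["E"])
    return chars, tags


def tags_to_words(chars, tags):
    """Recover word boundaries from a predicted tag sequence (greedy decode)."""
    words, buf = [], []
    for c, t in zip(chars, tags):
        buf.append(c)
        if t in ("E", "S"):  # a word ends here
            words.append("".join(buf))
            buf = []
    if buf:  # tolerate a truncated final word with no closing E/S tag
        words.append("".join(buf))
    return words


# Round-trip on a toy example (Latin placeholders standing in for Sindhi text):
chars, tags = words_to_tags(["hello", "a", "world"])
assert tags == ["B", "I", "I", "I", "E", "S", "B", "I", "I", "I", "E"]
assert tags_to_words(chars, tags) == ["hello", "a", "world"]
```

In this formulation the model's job reduces to tagging each character; the CRF layer on top of the encoder scores whole tag sequences, discouraging invalid transitions such as `E` directly following `E`.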