Paper Title
Enhancing Sindhi Word Segmentation using Subword Representation Learning and Position-aware Self-attention
Paper Authors
Paper Abstract
Sindhi word segmentation is a challenging task due to space omission and insertion issues. The Sindhi language itself adds to this complexity: it is cursive and consists of characters with inherent joining and non-joining properties, independent of word boundaries. Existing Sindhi word segmentation methods rely on designing and combining hand-crafted features. However, these methods have limitations, such as difficulty handling out-of-vocabulary words, limited robustness across languages, and inefficiency on large amounts of noisy or raw text. Neural network-based models, in contrast, can automatically capture word boundary information without requiring prior knowledge. In this paper, we propose a Subword-Guided Neural Word Segmenter (SGNWS) that addresses word segmentation as a sequence labeling task. The SGNWS model incorporates subword representation learning through a bidirectional long short-term memory encoder, position-aware self-attention, and a conditional random field. Our empirical results demonstrate that the SGNWS model achieves state-of-the-art performance in Sindhi word segmentation on six datasets.
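To make the sequence-labeling formulation concrete: segmentation can be cast as assigning one boundary tag per character, which a BiLSTM-CRF then predicts. The abstract does not specify the paper's exact tag set, so the common BIES scheme (B = begin, I = inside, E = end, S = single-character word) used below is an assumption; a minimal round-trip sketch in plain Python:

```python
def words_to_tags(words):
    """Convert a segmented word list into character-level BIES tags.

    B/I/E mark the begin, inside, and end of a multi-character word;
    S marks a single-character word. (BIES is an assumed tag set,
    not necessarily the one used in the paper.)
    """
    chars, tags = [], []
    for w in words:
        chars.extend(w)
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["I"] * (len(w) - 2) + ["E"])
    return chars, tags


def tags_to_words(chars, tags):
    """Recover word boundaries from a predicted tag sequence (greedy decode)."""
    words, buf = [], []
    for c, t in zip(chars, tags):
        buf.append(c)
        if t in ("E", "S"):  # a word ends here
            words.append("".join(buf))
            buf = []
    if buf:  # tolerate a truncated final word with no closing E/S tag
        words.append("".join(buf))
    return words


# Round-trip on a toy example (Latin placeholders standing in for Sindhi text):
chars, tags = words_to_tags(["hello", "a", "world"])
assert tags == ["B", "I", "I", "I", "E", "S", "B", "I", "I", "I", "E"]
assert tags_to_words(chars, tags) == ["hello", "a", "world"]
```

In this formulation the model's job reduces to tagging each character; the CRF layer on top of the encoder scores whole tag sequences, discouraging invalid transitions such as `E` directly following `E`.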