Paper Title

TransLIST: A Transformer-Based Linguistically Informed Sanskrit Tokenizer

Paper Authors

Jivnesh Sandhan, Rathin Singha, Narein Rao, Suvendu Samanta, Laxmidhar Behera, Pawan Goyal

Paper Abstract

Sanskrit Word Segmentation (SWS) is essential in making digitized texts available and in deploying downstream tasks. It is, however, non-trivial because of the sandhi phenomenon that modifies the characters at word boundaries and requires special treatment. Existing lexicon-driven approaches for SWS make use of the Sanskrit Heritage Reader, a lexicon-driven shallow parser, to generate the complete candidate solution space, over which various methods are applied to produce the most valid solution. However, these approaches fail when encountering out-of-vocabulary tokens. On the other hand, purely engineering methods for SWS have made use of recent advances in deep learning, but cannot make use of latent word information when it is available. To mitigate the shortcomings of both families of approaches, we propose the Transformer-based Linguistically Informed Sanskrit Tokenizer (TransLIST), consisting of (1) a module that encodes the character input along with latent-word information, which takes into account the sandhi phenomenon specific to SWS and is apt to work with partial or no candidate solutions, (2) a novel soft-masked attention to prioritize potential candidate words, and (3) a novel path ranking algorithm to rectify corrupted predictions. Experiments on benchmark datasets for SWS show that TransLIST outperforms the current state-of-the-art system with an average 7.2-point absolute gain in terms of the perfect match (PM) metric. The codebase and datasets are publicly available at https://github.com/rsingha108/TransLIST
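
Since the abstract highlights a soft-masked attention that prioritizes lexicon candidate words, here is a minimal sketch of that general idea: scaled dot-product attention whose scores receive a soft additive bias toward character positions covered by candidate words, rather than a hard mask. This is an illustrative assumption, not the authors' exact formulation; the names `candidate_prior` and `alpha`, and the additive-bias form, are hypothetical.

```python
# Minimal sketch (assumptions, not the TransLIST implementation) of soft-masked attention.
import torch
import torch.nn.functional as F

def soft_masked_attention(q, k, v, candidate_prior, alpha=1.0):
    """q, k, v: (batch, seq_len, dim) character representations.
    candidate_prior: (batch, seq_len) values in [0, 1], higher where a position
    lies inside a likely lexicon candidate word (hypothetical soft mask)."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5            # (batch, seq, seq)
    # Soft mask: bias attention toward key positions covered by candidate words,
    # instead of hard-masking the remaining positions out.
    scores = scores + alpha * candidate_prior.unsqueeze(1)  # broadcast over query positions
    weights = F.softmax(scores, dim=-1)
    return weights @ v

# Toy usage with random tensors
q = k = v = torch.randn(2, 10, 64)
prior = torch.rand(2, 10)
out = soft_masked_attention(q, k, v, prior)                 # (2, 10, 64)
```

Because the bias is soft, positions with no candidate coverage (e.g., out-of-vocabulary segments) still receive attention, which matches the abstract's claim that the model works with partial or no candidate solutions.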
