Paper Title

Vocabulary Learning via Optimal Transport for Neural Machine Translation

Paper Authors

Jingjing Xu, Hao Zhou, Chun Gan, Zaixiang Zheng, Lei Li

Paper Abstract

The choice of token vocabulary affects the performance of machine translation. This paper aims to figure out what a good vocabulary is and whether one can find the optimal vocabulary without trial training. To answer these questions, we first provide an alternative understanding of the role of vocabulary from the perspective of information theory. Motivated by this, we formulate the quest for vocabularization -- finding the best token dictionary with a proper size -- as an optimal transport (OT) problem. We propose VOLT, a simple and efficient solution without trial training. Empirical results show that VOLT outperforms widely used vocabularies in diverse scenarios, including WMT-14 English-German and TED's 52 translation directions. For example, VOLT achieves an almost 70% vocabulary size reduction and a 0.5 BLEU gain on English-German translation. Compared to BPE-search, VOLT also reduces the search time from 384 GPU hours to 30 GPU hours on English-German translation. Code is available at https://github.com/Jingjing-NLP/VOLT.
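The abstract's information-theoretic view of vocabulary can be illustrated with a toy score: how much corpus entropy (per character) a candidate vocabulary achieves relative to its size. The sketch below is a minimal illustration, not the authors' implementation; the token counts, the two candidate vocabularies, and the `corpus_entropy_per_char` helper are all hypothetical, and the actual VOLT algorithm solves the size-entropy trade-off via optimal transport rather than by enumerating candidates.

```python
# Illustrative sketch (assumptions, not the VOLT implementation): compare
# hypothetical candidate vocabularies by a simple entropy-per-character score,
# in the spirit of the paper's information-theoretic view of vocabularies.
import math
from collections import Counter

def corpus_entropy_per_char(token_counts: Counter) -> float:
    """Shannon entropy of the token distribution, normalized by the
    average token length in characters (lower means the tokenized
    corpus is, in this toy sense, easier to model)."""
    total = sum(token_counts.values())
    entropy = -sum((c / total) * math.log2(c / total)
                   for c in token_counts.values())
    avg_len = sum(len(t) * c for t, c in token_counts.items()) / total
    return entropy / avg_len

# Two hypothetical candidates, e.g. BPE runs with different merge counts.
small_vocab = Counter({"th": 40, "e": 50, "cat": 10, "s": 30})
large_vocab = Counter({"the": 45, "cat": 10, "cats": 8, "s": 22})

for name, vocab in [("small", small_vocab), ("large", large_vocab)]:
    print(f"{name}: size={len(vocab)}, "
          f"entropy/char={corpus_entropy_per_char(vocab):.3f}")
```

A larger vocabulary usually lowers this entropy but at a growing size cost; VOLT's contribution is to balance the two by casting the search for the best token set as an OT problem solvable without trial training, rather than by training a model per candidate as BPE-search does.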
