OCR Improves Machine Translation for Low-Resource Languages

Paper Title

OCR Improves Machine Translation for Low-Resource Languages

Authors

Oana Ignat, Jean Maillard, Vishrav Chaudhary, Francisco Guzmán

Abstract

We aim to investigate the performance of current OCR systems on low-resource languages and low-resource scripts. We introduce and make publicly available a novel benchmark, OCR4MT, consisting of real and synthetic data, enriched with noise, for 60 low-resource languages in low-resource scripts. We evaluate state-of-the-art OCR systems on our benchmark and analyse the most common errors. We show that OCR monolingual data is a valuable resource that can increase the performance of Machine Translation models when used in backtranslation. We then perform an ablation study to investigate how OCR errors impact Machine Translation performance and to determine the minimum level of OCR quality needed for the monolingual data to be useful for Machine Translation.
