Preparing an Endangered Language for the Digital Age: The Case of Judeo-Spanish

Paper title

Preparing an Endangered Language for the Digital Age: The Case of Judeo-Spanish

Authors

Alp Öktem, Rodolfo Zevallos, Yasmin Moslem, Güneş Öztürk, Karen Şarhon

Abstract

We develop machine translation and speech synthesis systems to complement the efforts of revitalizing Judeo-Spanish, the exiled language of Sephardic Jews, which survived for centuries but now faces the threat of extinction in the digital age. Building on resources created by the Sephardic community of Turkey and elsewhere, we create corpora and tools that would help preserve this language for future generations. For machine translation, we first develop a Spanish-to-Judeo-Spanish rule-based machine translation system in order to generate large volumes of synthetic parallel data in the relevant language pairs: Turkish, English and Spanish. Then, we train baseline neural machine translation engines using this synthetic data and authentic parallel data created from translations by the Sephardic community. For text-to-speech synthesis, we present a 3.5-hour single-speaker speech corpus for building a neural speech synthesis engine. Resources, model weights and online inference engines are shared publicly.
