Paper Title
Domain Curricula for Code-Switched MT at MixMT 2022
Paper Authors
Paper Abstract
In multilingual colloquial settings, it is a habitual occurrence to compose expressions of text or speech containing tokens or phrases of different languages, a phenomenon popularly known as code-switching or code-mixing (CMX). We present our approach and results for the Code-mixed Machine Translation (MixMT) shared task at WMT 2022. The task consists of two subtasks: monolingual to code-mixed machine translation (Subtask-1) and code-mixed to monolingual machine translation (Subtask-2). Most non-synthetic code-mixed data come from social media, but gathering a significant amount of such data is laborious, and this form of data shows more writing variation than other domains; for both subtasks, we therefore experimented with data schedules that incorporate out-of-domain data. We jointly learn multiple domains of text by pretraining and fine-tuning, combined with a sentence alignment objective. We found that switching between domains improved performance on the domains seen earliest during training but degraded performance on the remaining domains. A continuous training run with strategically dispensed data from different domains showed significantly improved performance over fine-tuning.
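To make the idea of "strategically dispensed data of different domains" concrete, the sketch below shows one way a domain curriculum could interleave training examples during a single continuous run, shifting sampling weight across domains over phases. This is an illustrative assumption, not the authors' implementation; the domain names, weights, and helper `domain_curriculum` are hypothetical.

```python
# Minimal sketch (assumed, not the paper's code): sample training examples
# from several text domains according to phase-specific mixing weights,
# so out-of-domain data is dispensed gradually within one continuous run.
import random
from typing import Dict, Iterator, List


def domain_curriculum(
    domain_data: Dict[str, List[str]],      # example pool per domain
    schedule: List[Dict[str, float]],       # per-phase sampling weights
    steps_per_phase: int,
) -> Iterator[str]:
    """Yield examples, sampling a domain by the current phase's weights."""
    for phase_weights in schedule:
        domains = list(phase_weights)
        weights = [phase_weights[d] for d in domains]
        for _ in range(steps_per_phase):
            d = random.choices(domains, weights=weights, k=1)[0]
            yield random.choice(domain_data[d])


# Illustrative usage: begin with mostly out-of-domain parallel text,
# then shift weight toward noisier, code-mixed social-media-style data.
data = {
    "news": ["out-of-domain sentence pair ..."],
    "social_media": ["code-mixed sentence pair ..."],
}
schedule = [
    {"news": 0.9, "social_media": 0.1},   # early phase
    {"news": 0.3, "social_media": 0.7},   # later phase
]
for example in domain_curriculum(data, schedule, steps_per_phase=4):
    pass  # feed `example` to the MT training loop
```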