Paper Title
Very Low Resource Sentence Alignment: Luhya and Swahili
Paper Authors
Paper Abstract
Language-agnostic sentence embeddings generated by pre-trained models such as LASER and LaBSE are attractive options for mining large datasets to produce parallel corpora for low-resource machine translation. We test LASER and LaBSE in extracting bitext for two related low-resource African languages: Luhya and Swahili. For this work, we created a new parallel set of nearly 8000 Luhya-English sentences, which allows a new zero-shot test of LASER and LaBSE. We find that LaBSE significantly outperforms LASER on both languages. However, both LASER and LaBSE perform poorly at zero-shot alignment on Luhya, achieving just 1.5% and 22.0% successful alignments respectively (P@1 score). We fine-tune the embeddings on a small set of parallel Luhya sentences and show significant gains, improving the LaBSE alignment accuracy to 53.3%. Further, restricting the dataset to sentence embedding pairs with cosine similarity above 0.7 yields alignments with over 85% accuracy.
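The alignment-and-thresholding procedure the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes pre-computed embedding matrices, takes the nearest target by cosine similarity as the P@1 candidate, and treats the identity pairing (i-th source matches i-th target) as the gold alignment for scoring; the function names are invented for this sketch.

```python
import numpy as np

def align_sentences(src_emb, tgt_emb, threshold=0.7):
    """Greedy cosine-similarity alignment of source to target embeddings."""
    # L2-normalize rows so that dot products equal cosine similarities.
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sims = src @ tgt.T
    best = sims.argmax(axis=1)                      # P@1 candidate per source sentence
    best_sims = sims[np.arange(len(best)), best]
    keep = best_sims >= threshold                   # drop low-confidence pairs (cf. the 0.7 cutoff)
    return best, keep

def precision_at_1(best, keep=None):
    """Fraction of (optionally filtered) candidates matching the identity gold alignment."""
    correct = best == np.arange(len(best))
    if keep is not None:
        correct = correct[keep]
    return float(correct.mean()) if len(correct) else 0.0
```

In practice the embeddings would come from LASER or LaBSE, and restricting the evaluation to the `keep` mask is what trades corpus size for the higher-precision alignments reported in the abstract.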