Paper Title
Language Agnostic Multilingual Information Retrieval with Contrastive Learning
Paper Authors
Paper Abstract
Multilingual information retrieval (IR) is challenging because annotated training data is costly to obtain in many languages. We present an effective method for training multilingual IR systems when only English IR training data and some parallel corpora between English and other languages are available. We leverage parallel and non-parallel corpora to improve the cross-lingual transfer ability of pretrained multilingual language models. We design a semantic contrastive loss that aligns representations of parallel sentences sharing the same semantics in different languages, and a new language contrastive loss that leverages parallel sentence pairs to remove language-specific information from the sentence representations of non-parallel corpora. When trained on English IR data with these losses and evaluated zero-shot on non-English data, our model demonstrates significant improvement over prior work on retrieval performance while requiring much less computational effort. We also demonstrate the value of our model in the practical setting where parallel corpora are available for only a few languages, while many other low-resource languages still lack such resources. Our model works well even with a small number of parallel sentences and can be used as an add-on module with any backbone and on other tasks.
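The abstract does not spell out the exact formulation, but a semantic contrastive loss over parallel sentence pairs is commonly realized as an InfoNCE-style objective with in-batch negatives. The sketch below illustrates that general idea only; the function name `semantic_contrastive_loss`, the temperature value, and the symmetric cross-entropy form are illustrative assumptions, not the paper's actual loss.

```python
import torch
import torch.nn.functional as F

def semantic_contrastive_loss(en_emb: torch.Tensor,
                              xx_emb: torch.Tensor,
                              temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE-style alignment sketch (hypothetical, not the paper's code).

    Row i of `en_emb` and `xx_emb` are assumed to be sentence embeddings of
    an English sentence and its translation; all other rows in the batch
    serve as negatives.
    """
    # Cosine similarity via L2-normalized embeddings.
    en = F.normalize(en_emb, dim=-1)
    xx = F.normalize(xx_emb, dim=-1)
    logits = en @ xx.t() / temperature                     # (B, B) similarity matrix
    targets = torch.arange(en.size(0), device=en.device)   # diagonal = parallel pairs
    # Symmetric cross-entropy: each sentence should retrieve its translation
    # in both directions (English -> other language and back).
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```

In the training regime the abstract describes, such a term would be combined with the English IR training loss and the language contrastive loss; with in-batch negatives, the batch size controls how many negative pairs each sentence is contrasted against.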