Paper Title

Analysing the Robustness of Dual Encoders for Dense Retrieval Against Misspellings

Authors

Georgios Sidiropoulos, Evangelos Kanoulas

Abstract

Dense retrieval is becoming one of the standard approaches for document and passage ranking. The dual-encoder architecture is widely adopted for scoring question-passage pairs due to its efficiency and high performance. Typically, dense retrieval models are evaluated on clean and curated datasets. However, when deployed in real-life applications, these models encounter noisy user-generated text. That said, the performance of state-of-the-art dense retrievers can substantially deteriorate when exposed to noisy text. In this work, we study the robustness of dense retrievers against typos in the user question. We observe a significant drop in the performance of the dual-encoder model when encountering typos and explore ways to improve its robustness by combining data augmentation with contrastive learning. Our experiments on two large-scale passage ranking and open-domain question answering datasets show that our proposed approach outperforms competing approaches. Additionally, we perform a thorough analysis on robustness. Finally, we provide insights on how different typos affect the robustness of embeddings differently and how our method alleviates the effect of some typos but not of others.
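
To make the data augmentation idea in the abstract concrete, the following is a minimal sketch of how noisy variants of a clean question might be generated. It is not the authors' implementation: the function names `add_typo` and `augment_question`, the four edit types (swap, delete, insert, substitute), and the `typo_prob` parameter are all assumptions chosen for illustration. Each clean question and its noisy variant could then serve as a positive pair for a contrastive objective.

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def add_typo(word, rng):
    """Apply one random character-level edit to `word` (hypothetical helper).

    Words shorter than two characters are returned unchanged.
    The result can occasionally equal the input, for example when
    two identical neighbouring characters are swapped.
    """
    if len(word) < 2:
        return word
    op = rng.choice(["swap", "delete", "insert", "substitute"])
    i = rng.randrange(len(word) - 1)  # edit position
    letter = rng.choice(ALPHABET)     # used by insert/substitute
    if op == "swap":
        # Transpose the characters at positions i and i + 1.
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    if op == "delete":
        return word[:i] + word[i + 1:]
    if op == "insert":
        return word[:i] + letter + word[i:]
    # substitute
    return word[:i] + letter + word[i + 1:]


def augment_question(question, typo_prob=0.3, seed=None):
    """Return a noisy copy of `question`.

    Each whitespace-separated word receives one typo with
    probability `typo_prob` (an assumed, illustrative setting).
    """
    rng = random.Random(seed)
    words = question.split()
    noisy = [add_typo(w, rng) if rng.random() < typo_prob else w for w in words]
    return " ".join(noisy)


if __name__ == "__main__":
    clean = "who wrote the origin of species"
    print(augment_question(clean, typo_prob=0.5, seed=0))
```

Passing a fixed `seed` makes the augmentation reproducible, which helps when comparing models trained on the same noisy data.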
