Paper Title

SimANS: Simple Ambiguous Negatives Sampling for Dense Text Retrieval

Authors

Kun Zhou, Yeyun Gong, Xiao Liu, Wayne Xin Zhao, Yelong Shen, Anlei Dong, Jingwen Lu, Rangan Majumder, Ji-Rong Wen, Nan Duan, Weizhu Chen

Abstract

Sampling proper negatives from a large document pool is vital to effectively train a dense retrieval model. However, existing negative sampling strategies suffer from the uninformative or false negative problem. In this work, we empirically show that according to the measured relevance scores, the negatives ranked around the positives are generally more informative and less likely to be false negatives. Intuitively, these negatives are neither too hard (may be false negatives) nor too easy (uninformative). They are the ambiguous negatives and need more attention during training. Thus, we propose a simple ambiguous negatives sampling method, SimANS, which incorporates a new sampling probability distribution to sample more ambiguous negatives. Extensive experiments on four public datasets and one industry dataset show the effectiveness of our approach. We made the code and models publicly available at https://github.com/microsoft/SimXNS.
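
The abstract does not spell out the form of the sampling distribution, so the following is only a minimal sketch of the idea of favoring negatives scored close to the positive. It assumes a Gaussian-shaped weight centered on the positive's relevance score. The function names and the parameters a (sharpness) and b (offset from the positive's score) are hypothetical and chosen for illustration; they are not taken from the released code.

```python
import math
import random


def ambiguous_sampling_probs(neg_scores, pos_score, a=1.0, b=0.0):
    """Return sampling probabilities that peak for negatives scored near the positive.

    Assumption (not stated in the abstract): the unnormalized weight is
    exp(-a * (s_neg - s_pos - b)^2).
    """
    weights = [math.exp(-a * (s - pos_score - b) ** 2) for s in neg_scores]
    total = sum(weights)
    return [w / total for w in weights]


def sample_negatives(neg_ids, neg_scores, pos_score, k, a=1.0, b=0.0, seed=None):
    """Draw k negatives (with replacement) according to the weights above."""
    probs = ambiguous_sampling_probs(neg_scores, pos_score, a, b)
    rng = random.Random(seed)
    return rng.choices(neg_ids, weights=probs, k=k)


if __name__ == "__main__":
    # Hypothetical relevance scores for four candidate negatives of one query.
    ids = ["d1", "d2", "d3", "d4"]
    scores = [0.95, 0.72, 0.40, 0.05]
    print(ambiguous_sampling_probs(scores, pos_score=0.70, a=10.0))
    print(sample_negatives(ids, scores, pos_score=0.70, k=2, a=10.0, seed=0))
```

Under this assumed weighting, a negative scored far above the positive (a likely false negative) and one scored far below it (an uninformative one) both receive low probability, while larger values of a concentrate the sampling more tightly around the positive's score.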
