Paper Title

Neighborhood Contrastive Learning for Scientific Document Representations with Citation Embeddings

Paper Authors

Malte Ostendorff, Nils Rethmeier, Isabelle Augenstein, Bela Gipp, Georg Rehm

Paper Abstract

Learning scientific document representations can be substantially improved through contrastive learning objectives, where the challenge lies in creating positive and negative training samples that encode the desired similarity semantics. Prior work relies on discrete citation relations to generate contrast samples. However, discrete citations enforce a hard cut-off to similarity. This is counter-intuitive to similarity-based learning, and ignores that scientific papers can be very similar despite lacking a direct citation - a core problem of finding related research. Instead, we use controlled nearest neighbor sampling over citation graph embeddings for contrastive learning. This control allows us to learn continuous similarity, to sample hard-to-learn negatives and positives, and also to avoid collisions between negative and positive samples by controlling the sampling margin between them. The resulting method SciNCL outperforms the state-of-the-art on the SciDocs benchmark. Furthermore, we demonstrate that it can train (or tune) models sample-efficiently, and that it can be combined with recent training-efficient methods. Perhaps surprisingly, even training a general-domain language model this way outperforms baselines pretrained in-domain.
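
To make the sampling idea from the abstract concrete, below is a minimal sketch of controlled nearest-neighbor sampling over precomputed citation-graph embeddings. It is not the authors' released implementation; the rank bands `k_pos` and `k_neg`, the cosine-similarity ranking, and the synthetic embeddings are illustrative assumptions. The gap between the two rank bands plays the role of the sampling margin that keeps positives and negatives from colliding.

```python
import numpy as np

def sample_contrastive_pairs(embeddings, query_idx, k_pos=(1, 5), k_neg=(20, 25)):
    """Sample positives and hard negatives for one query paper by
    nearest-neighbor rank in citation-graph embedding space.

    Positives come from the closest neighbors (ranks k_pos[0]..k_pos[1]);
    hard negatives come from a farther band (ranks k_neg[0]..k_neg[1]).
    The unused ranks in between act as a sampling margin. All band
    boundaries here are illustrative, not the paper's exact settings.
    """
    query = embeddings[query_idx]
    # Cosine similarity of the query to every paper embedding.
    sims = embeddings @ query / (
        np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query) + 1e-12
    )
    order = np.argsort(-sims)            # ranked by similarity, closest first
    order = order[order != query_idx]    # drop the query paper itself

    positives = order[k_pos[0] - 1 : k_pos[1]]
    hard_negatives = order[k_neg[0] - 1 : k_neg[1]]
    return positives, hard_negatives


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in for precomputed citation-graph node embeddings.
    paper_embeddings = rng.normal(size=(1000, 128))
    pos, neg = sample_contrastive_pairs(paper_embeddings, query_idx=42)
    print("positive paper ids:", pos)
    print("hard negative paper ids:", neg)
```

The sampled triplets (query, positives, hard negatives) would then feed a standard contrastive objective over the language-model document encoder; that training step is not shown here.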
