Paper Title
The Inefficiency of Language Models in Scholarly Retrieval: An Experimental Walk-through
Paper Authors
Paper Abstract
Language models are increasingly popular in AI-powered scientific IR systems. This paper evaluates popular scientific language models on handling (i) short-query texts and (ii) textual neighbors. Our experiments show that these models fail to retrieve relevant documents for a short-query text even under the most relaxed conditions. Additionally, we leverage textual neighbors, generated by applying small perturbations to the original text, to demonstrate that not all perturbations lead to close neighbors in the embedding space. Further, an exhaustive categorization yields several classes of orthographically and semantically related, partially related, and completely unrelated neighbors. Retrieval performance turns out to be influenced more by the surface form than by the semantics of the text.
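As a rough illustration of the textual-neighbor setup described in the abstract, the sketch below embeds an original text and a slightly perturbed neighbor, then compares them by cosine similarity in embedding space. The model name (allenai/specter) and the example texts are assumptions chosen for illustration; the paper only refers to "popular scientific language models" and does not fix a specific model or perturbation here.

```python
# Minimal sketch (not the paper's exact setup): embed an original text and a
# perturbed "textual neighbor", then check whether the small surface-level
# perturbation stays close in the embedding space.
from sentence_transformers import SentenceTransformer, util

# Assumed model; any scientific text encoder could be substituted.
model = SentenceTransformer("allenai/specter")

original = "The Inefficiency of Language Models in Scholarly Retrieval"
# A hypothetical surface-level perturbation (single-character substitution).
neighbor = "The Inefficiency of Language Models in Scholarly Retrieva1"

# Encode both texts and compute their cosine similarity.
emb_orig, emb_neig = model.encode([original, neighbor], convert_to_tensor=True)
similarity = util.cos_sim(emb_orig, emb_neig).item()
print(f"cosine similarity: {similarity:.3f}")
```

A similarity close to 1.0 would indicate that the perturbed text remains a close neighbor in embedding space, while a lower value would illustrate the paper's point that not all small perturbations yield close neighbors.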