Paper Title

HC4: A New Suite of Test Collections for Ad Hoc CLIR

Authors

Dawn Lawrie, James Mayfield, Douglas Oard, Eugene Yang

Abstract

HC4 is a new suite of test collections for ad hoc Cross-Language Information Retrieval (CLIR), with Common Crawl News documents in Chinese, Persian, and Russian, topics in English and in the document languages, and graded relevance judgments. New test collections are needed because existing CLIR test collections built using pooling of traditional CLIR runs have systematic gaps in their relevance judgments when used to evaluate neural CLIR methods. The HC4 collections contain 60 topics and about half a million documents for each of Chinese and Persian, and 54 topics and five million documents for Russian. Active learning was used to determine which documents to annotate after being seeded using interactive search and judgment. Documents were judged on a three-grade relevance scale. This paper describes the design and construction of the new test collections and provides baseline results for demonstrating their utility for evaluating systems.
