Paper Title

MeSHup: A Corpus for Full Text Biomedical Document Indexing

Authors

Xindi Wang, Robert E. Mercer, Frank Rudzicz

Abstract

Medical Subject Heading (MeSH) indexing refers to the problem of assigning a given biomedical document with the most relevant labels from an extremely large set of MeSH terms. Currently, the vast number of biomedical articles in the PubMed database are manually annotated by human curators, which is time consuming and costly; therefore, a computational system that can assist the indexing is highly valuable. When developing supervised MeSH indexing systems, the availability of a large-scale annotated text corpus is desirable. A publicly available, large corpus that permits robust evaluation and comparison of various systems is important to the research community. We release a large scale annotated MeSH indexing corpus, MeSHup, which contains 1,342,667 full text articles in English, together with the associated MeSH labels and metadata, authors, and publication venues that are collected from the MEDLINE database. We train an end-to-end model that combines features from documents and their associated labels on our corpus and report the new baseline.
