Paper title
TeamTat: a collaborative text annotation tool
Paper authors
Paper abstract
Manually annotated data is key to developing text-mining and information-extraction algorithms. However, human annotation requires considerable time, effort and expertise. Given the rapid growth of biomedical literature, it is paramount to build tools that facilitate speed and maintain expert quality. While existing text annotation tools may provide user-friendly interfaces to domain experts, limited support is available for image display, project management, and multi-user team annotation. In response, we developed TeamTat (teamtat.org), a web-based annotation tool (local setup available), equipped to manage team annotation projects engagingly and efficiently. TeamTat is a novel tool for managing multi-user, multi-label document annotation, reflecting the entire production life cycle. Project managers can specify annotation schema for entities and relations, select annotator(s), and distribute documents anonymously to prevent bias. Document input format can be plain text, PDF or BioC (uploaded locally or automatically retrieved from PubMed or PMC), and output format is BioC with inline annotations. TeamTat displays figures from the full text for the annotators' convenience. Multiple users can work on the same document independently in their workspaces, and the team manager can track task completion. TeamTat provides corpus-quality assessment via inter-annotator agreement statistics, and a user-friendly interface convenient for annotation review and inter-annotator disagreement resolution to improve corpus quality.
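The abstract does not say which inter-annotator agreement statistics TeamTat reports. As a rough illustration of the idea only, the following sketch computes exact-match agreement between two annotators on one document. The `Span` record, the function names and the choice of Jaccard and F1 are all assumptions made for this example and are not taken from TeamTat.

```python
from __future__ import annotations

from typing import NamedTuple


class Span(NamedTuple):
    """Hypothetical entity annotation: character offsets plus a type label."""
    start: int
    end: int
    label: str


def pairwise_agreement(a: set[Span], b: set[Span]) -> dict[str, float]:
    """Exact-match agreement between two annotators on the same document.

    Two annotations agree only if offsets and label are identical.
    Assumption: two empty annotation sets count as full agreement.
    """
    both = len(a & b)          # annotations made by both annotators
    union = len(a | b)         # annotations made by at least one annotator
    total = len(a) + len(b)    # all annotations, counted per annotator
    return {
        "jaccard": both / union if union else 1.0,
        "f1": 2 * both / total if total else 1.0,
    }


def disagreements(a: set[Span], b: set[Span]) -> list[Span]:
    """Annotations made by exactly one annotator, sorted by position."""
    return sorted(a ^ b)


# Invented example data: the annotators disagree on the type of one span.
annotator_1 = {Span(0, 5, "Gene"), Span(10, 18, "Disease"), Span(25, 30, "Chemical")}
annotator_2 = {Span(0, 5, "Gene"), Span(10, 18, "Chemical"), Span(25, 30, "Chemical")}

scores = pairwise_agreement(annotator_1, annotator_2)
to_review = disagreements(annotator_1, annotator_2)
```

The list returned by `disagreements` is the kind of input a review step would need when resolving differences between annotators. A chance-corrected measure such as Cohen's kappa could be substituted if annotations are compared token by token.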