Paper Title
A Practical Chinese Dependency Parser Based on A Large-scale Dataset
Paper Authors
Paper Abstract
Dependency parsing is a longstanding natural language processing task whose outputs are crucial to various downstream tasks. Recently, neural network based (NN-based) dependency parsing has made significant progress and obtained state-of-the-art results. NN-based approaches are known to require massive amounts of labeled training data, which is very expensive to produce because it requires annotation by human experts. As a result, few industry-oriented dependency parsing tools are publicly available. In this report, we present Baidu Dependency Parser (DDParser), a new Chinese dependency parser trained on a large-scale, manually labeled dataset called Baidu Chinese Treebank (DuCTB). DuCTB consists of about one million annotated sentences from multiple sources, including search logs, Chinese newswire, various forum discourses, and conversation programs. DDParser extends the graph-based biaffine parser to accommodate the characteristics of the Chinese dataset. We conduct experiments on two test sets: a standard test set with the same distribution as the training set, and a random test set sampled from other sources. The labeled attachment scores (LAS) on these two sets are 92.9% and 86.9%, respectively. DDParser achieves state-of-the-art results and is released at https://github.com/baidu/DDParser.
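The abstract reports results as labeled attachment scores (LAS). As a minimal sketch of the standard definition, assuming each token is represented by a (head index, relation label) pair, LAS is the fraction of tokens whose predicted head and label both match the gold annotation, while the unlabeled score (UAS) checks the head only. The function name, the data layout, and the example labels below are illustrative assumptions, and the abstract does not state evaluation details such as punctuation handling, so this is not necessarily the authors' exact scoring script.

```python
from typing import List, Tuple

# Hypothetical layout: one (head_index, relation_label) pair per token,
# with head index 0 denoting the artificial root.
Arc = Tuple[int, str]


def attachment_scores(gold: List[List[Arc]], pred: List[List[Arc]]) -> Tuple[float, float]:
    """Return (UAS, LAS) as fractions computed over all tokens."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted corpora must have the same number of sentences")
    total = uas_hits = las_hits = 0
    for gold_sent, pred_sent in zip(gold, pred):
        if len(gold_sent) != len(pred_sent):
            raise ValueError("gold and predicted sentences must have the same length")
        for (g_head, g_rel), (p_head, p_rel) in zip(gold_sent, pred_sent):
            total += 1
            if g_head == p_head:
                uas_hits += 1  # head is correct
                if g_rel == p_rel:
                    las_hits += 1  # head and label are both correct
    if total == 0:
        raise ValueError("no tokens to score")
    return uas_hits / total, las_hits / total


# Toy three-token sentence; the relation labels are illustrative only.
gold = [[(2, "SBV"), (0, "HED"), (2, "VOB")]]
pred = [[(2, "SBV"), (0, "HED"), (2, "ADV")]]
uas, las = attachment_scores(gold, pred)
print(f"UAS={uas:.3f} LAS={las:.3f}")
```

In this toy case all three heads are correct but one label is wrong, so the labeled score is lower than the unlabeled one. Scores are accumulated per token over the whole corpus rather than averaged per sentence, which is the usual convention for reporting LAS.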