Title

Eeny, meeny, miny, moe. How to choose data for morphological inflection

Authors

Saliha Muradoglu, Mans Hulden

Abstract

Data scarcity is a widespread problem in numerous natural language processing (NLP) tasks for low-resource languages. Within morphology, the labour-intensive work of tagging/glossing data is a serious bottleneck for both NLP and language documentation. Active learning (AL) aims to reduce the cost of data annotation by selecting data that is most informative for improving the model. In this paper, we explore four sampling strategies for the task of morphological inflection using a Transformer model: a pair of oracle experiments where data is chosen based on whether the model already can or cannot inflect the test forms correctly, as well as strategies based on high/low model confidence, entropy, and random selection. We investigate the robustness of each strategy across 30 typologically diverse languages. We also perform a more in-depth case study of Natügu. Our results show a clear benefit to selecting data based on model confidence and entropy. Unsurprisingly, the oracle experiment, where only incorrectly handled forms are chosen for further training, and which is presented as a proxy for linguist/language consultant feedback, shows the most improvement. This is followed closely by choosing low-confidence and high-entropy predictions. We also show that despite the conventional wisdom of larger data sets yielding better accuracy, introducing more instances of high-confidence or low-entropy forms, or forms that the model can already inflect correctly, can reduce model performance.
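The abstract itself contains no implementation details, but the two non-oracle acquisition strategies it names, low-confidence and high-entropy sampling, are simple to sketch. The snippet below is a minimal illustration, not the paper's code: `model.predict_proba`, the pool items, and all helper names are assumed interfaces, where the model is taken to return one output-vocabulary probability distribution per decoding step for each candidate lemma/tag pair.

```python
# Minimal sketch of confidence- and entropy-based active-learning selection
# for morphological inflection. `model.predict_proba` is a hypothetical
# interface: given an unlabelled item (e.g., a lemma + morphological tags),
# it returns a list of per-decoding-step probability distributions from a
# trained seq2seq inflection model such as a Transformer.

import math

def sequence_confidence(step_probs):
    """Confidence score: log-probability of the greedy prediction,
    i.e., the sum of log(max prob) over decoding steps."""
    return sum(math.log(max(p)) for p in step_probs)

def mean_token_entropy(step_probs):
    """Average Shannon entropy H = -sum(p log p) over decoding steps."""
    def h(p):
        return -sum(q * math.log(q) for q in p if q > 0.0)
    return sum(h(p) for p in step_probs) / len(step_probs)

def select_for_annotation(pool, model, k, strategy="low_confidence"):
    """Rank an unlabelled pool and return the k items most worth annotating.

    "low_confidence" picks the items the model is least sure about;
    "high_entropy" picks the items with the most uncertain output
    distributions. Both are proxies for informativeness, unlike the
    paper's oracle strategies, which require gold forms.
    """
    scored = []
    for item in pool:
        probs = model.predict_proba(item)  # hypothetical interface
        if strategy == "low_confidence":
            score = sequence_confidence(probs)   # ascending: least confident first
        elif strategy == "high_entropy":
            score = -mean_token_entropy(probs)   # negate so ascending sort works
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        scored.append((score, item))
    scored.sort(key=lambda t: t[0])
    return [item for _, item in scored[:k]]
```

Note the contrast with the oracle experiments: those compare predictions against gold inflected forms, which presupposes labels and is why the paper treats them as a proxy for linguist/consultant feedback rather than a deployable strategy.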
