Paper Title


Automated classification for open-ended questions with BERT

Paper Authors

Gweon, Hyukjun; Schonlau, Matthias

Paper Abstract


Manual coding of text data from open-ended questions into different categories is time consuming and expensive. Automated coding uses statistical/machine learning to train on a small subset of manually coded text answers. Recently, pre-training a general language model on vast amounts of unrelated data and then adapting the model to the specific application has proven effective in natural language processing. Using two data sets, we empirically investigate whether BERT, the currently dominant pre-trained language model, is more effective at automated coding of answers to open-ended questions than other non-pre-trained statistical learning approaches. We find that fine-tuning the pre-trained BERT parameters is essential; otherwise, BERT is not competitive. Second, we find that fine-tuned BERT barely beats the non-pre-trained statistical learning approaches in terms of classification accuracy when trained on 100 manually coded observations. However, BERT's relative advantage increases rapidly when more manually coded observations (e.g., 200-400) are available for training. We conclude that for automatically coding answers to open-ended questions, BERT is preferable to non-pre-trained models such as support vector machines and boosting.
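The abstract gives no implementation details, so the following is only a minimal sketch of the kind of fine-tuning it describes: updating all pre-trained BERT parameters on a small set of manually coded answers, here assuming PyTorch and the Hugging Face transformers library. The model name (bert-base-uncased), the AnswerDataset helper, and all hyperparameters are illustrative assumptions, not taken from the paper.

```python
# Illustrative sketch (not the authors' code): fine-tune a pre-trained BERT
# model to classify open-ended survey answers into categories.
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import BertTokenizerFast, BertForSequenceClassification

class AnswerDataset(Dataset):
    """Wraps manually coded (answer text, category label) pairs."""
    def __init__(self, texts, labels, tokenizer, max_len=128):
        self.enc = tokenizer(texts, truncation=True, padding="max_length",
                             max_length=max_len, return_tensors="pt")
        self.labels = torch.tensor(labels)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        item = {k: v[i] for k, v in self.enc.items()}
        item["labels"] = self.labels[i]
        return item

def fine_tune(texts, labels, num_classes, epochs=3, lr=2e-5, batch_size=16):
    """Fine-tune ALL BERT parameters (not just a classification head)."""
    tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
    model = BertForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=num_classes)
    loader = DataLoader(AnswerDataset(texts, labels, tokenizer),
                        batch_size=batch_size, shuffle=True)
    optim = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for batch in loader:
            optim.zero_grad()
            out = model(**batch)   # returns a loss because "labels" are supplied
            out.loss.backward()
            optim.step()
    return tokenizer, model
```

The non-pre-trained baselines named in the abstract (support vector machines, boosting) would typically be trained on bag-of-words or n-gram features of the same manually coded answers; the contrast the abstract draws is that every BERT parameter is updated during fine-tuning rather than kept frozen.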
