Paper Title
Inexpensive Domain Adaptation of Pretrained Language Models: Case Studies on Biomedical NER and Covid-19 QA
Paper Authors
Paper Abstract
Domain adaptation of Pretrained Language Models (PTLMs) is typically achieved by unsupervised pretraining on target-domain text. While successful, this approach is expensive in terms of hardware, runtime and CO_2 emissions. Here, we propose a cheaper alternative: We train Word2Vec on target-domain text and align the resulting word vectors with the wordpiece vectors of a general-domain PTLM. We evaluate on eight biomedical Named Entity Recognition (NER) tasks and compare against the recently proposed BioBERT model. We cover over 60% of the BioBERT-BERT F1 delta, at 5% of BioBERT's CO_2 footprint and 2% of its cloud compute cost. We also show how to quickly adapt an existing general-domain Question Answering (QA) model to an emerging domain: the Covid-19 pandemic.
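
The alignment step the abstract describes (mapping target-domain Word2Vec vectors into the wordpiece embedding space of a general-domain PTLM) can be sketched in a few lines. The following is a minimal Python sketch, not the authors' released code: it assumes gensim for Word2Vec, Hugging Face transformers for the general-domain BERT, ordinary least squares over shared-vocabulary anchor words as the alignment method, and a placeholder corpus variable standing in for tokenized target-domain text.

    import numpy as np
    from gensim.models import Word2Vec
    from transformers import AutoModel, AutoTokenizer

    # 1. Train Word2Vec on tokenized target-domain sentences.
    #    `corpus` is a placeholder: an iterable of token lists.
    corpus = [["the", "patient", "was", "treated", "with", "dexamethasone"]]
    w2v = Word2Vec(sentences=corpus, vector_size=768, min_count=1, epochs=5)

    # 2. Load the general-domain PTLM and its wordpiece embedding matrix.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
    bert = AutoModel.from_pretrained("bert-base-cased")
    wordpiece_emb = bert.get_input_embeddings().weight.detach().numpy()

    # 3. Collect anchor pairs: words that are both Word2Vec vocabulary
    #    entries and single wordpieces in the BERT vocabulary.
    vocab = tokenizer.get_vocab()
    anchors = [w for w in w2v.wv.index_to_key if w in vocab]
    X = np.stack([w2v.wv[w] for w in anchors])                 # Word2Vec side
    Y = np.stack([wordpiece_emb[vocab[w]] for w in anchors])   # BERT side

    # 4. Solve min_W ||XW - Y||^2 for the alignment matrix W
    #    (least squares is an assumption here, not taken from the abstract).
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)

    # 5. Project a target-domain word vector into the wordpiece space,
    #    e.g. to extend the PTLM's input vocabulary with domain terms.
    aligned = w2v.wv["dexamethasone"] @ W
    print(aligned.shape)  # (768,)

Projected vectors for out-of-vocabulary domain terms (such as drug names) live in the same space as the PTLM's wordpiece embeddings, which is what makes it possible to extend a general-domain model without any target-domain pretraining.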