Paper Title
Inexpensive Domain Adaptation of Pretrained Language Models: Case Studies on Biomedical NER and Covid-19 QA
Paper Authors
Paper Abstract
Domain adaptation of Pretrained Language Models (PTLMs) is typically achieved by unsupervised pretraining on target-domain text. While successful, this approach is expensive in terms of hardware, runtime and CO_2 emissions. Here, we propose a cheaper alternative: We train Word2Vec on target-domain text and align the resulting word vectors with the wordpiece vectors of a general-domain PTLM. We evaluate on eight biomedical Named Entity Recognition (NER) tasks and compare against the recently proposed BioBERT model. We cover over 60% of the BioBERT-BERT F1 delta, at 5% of BioBERT's CO_2 footprint and 2% of its cloud compute cost. We also show how to quickly adapt an existing general-domain Question Answering (QA) model to an emerging domain: the Covid-19 pandemic.
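
The alignment step the abstract describes (mapping target-domain Word2Vec vectors into the wordpiece embedding space of a general-domain PTLM) can be sketched in a few lines. The following is a minimal Python sketch, not the authors' released code: it assumes gensim for Word2Vec, Hugging Face transformers for the general-domain BERT, ordinary least squares over shared-vocabulary anchor words as the alignment method, and a placeholder corpus variable standing in for tokenized target-domain text.

    import numpy as np
    from gensim.models import Word2Vec
    from transformers import AutoModel, AutoTokenizer

    # 1. Train Word2Vec on tokenized target-domain sentences.
    #    `corpus` is a placeholder: an iterable of token lists.
    corpus = [["the", "patient", "was", "treated", "with", "dexamethasone"]]
    w2v = Word2Vec(sentences=corpus, vector_size=768, min_count=1, epochs=5)

    # 2. Load the general-domain PTLM and its wordpiece embedding matrix.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
    bert = AutoModel.from_pretrained("bert-base-cased")
    wordpiece_emb = bert.get_input_embeddings().weight.detach().numpy()

    # 3. Collect anchor pairs: words that are both Word2Vec vocabulary
    #    entries and single wordpieces in the BERT vocabulary.
    vocab = tokenizer.get_vocab()
    anchors = [w for w in w2v.wv.index_to_key if w in vocab]
    X = np.stack([w2v.wv[w] for w in anchors])                 # Word2Vec side
    Y = np.stack([wordpiece_emb[vocab[w]] for w in anchors])   # BERT side

    # 4. Solve min_W ||XW - Y||^2 for the alignment matrix W
    #    (least squares is an assumption here, not taken from the abstract).
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)

    # 5. Project a target-domain word vector into the wordpiece space,
    #    e.g. to extend the PTLM's input vocabulary with domain terms.
    aligned = w2v.wv["dexamethasone"] @ W
    print(aligned.shape)  # (768,)

Projected vectors for out-of-vocabulary domain terms (such as drug names) live in the same space as the PTLM's wordpiece embeddings, which is what makes it possible to extend a general-domain model without any target-domain pretraining.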