Paper Title

CharacterBERT: Reconciling ELMo and BERT for Word-Level Open-Vocabulary Representations From Characters

Paper Authors

El Boukkouri, Hicham, Ferret, Olivier, Lavergne, Thomas, Noji, Hiroshi, Zweigenbaum, Pierre, Tsujii, Junichi

Paper Abstract

Due to the compelling improvements brought by BERT, many recent representation models adopted the Transformer architecture as their main building block, consequently inheriting the wordpiece tokenization system despite it not being intrinsically linked to the notion of Transformers. While this system is thought to achieve a good balance between the flexibility of characters and the efficiency of full words, using predefined wordpiece vocabularies from the general domain is not always suitable, especially when building models for specialized domains (e.g., the medical domain). Moreover, adopting a wordpiece tokenization shifts the focus from the word level to the subword level, making the models conceptually more complex and arguably less convenient in practice. For these reasons, we propose CharacterBERT, a new variant of BERT that drops the wordpiece system altogether and uses a Character-CNN module instead to represent entire words by consulting their characters. We show that this new model improves the performance of BERT on a variety of medical domain tasks while at the same time producing robust, word-level and open-vocabulary representations.
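
To make the architectural idea concrete, below is a minimal sketch of a Character-CNN word encoder in the spirit of the abstract: each word is embedded from its characters with 1-D convolutions of several widths, max-pooled over character positions, and projected to the model dimension, yielding one vector per whole word that can replace a wordpiece embedding lookup at the input of a Transformer encoder. The class name and all hyperparameters (character vocabulary size, filter widths, dimensions) are illustrative assumptions, not the authors' exact configuration; the module actually used in CharacterBERT, borrowed from ELMo, is more elaborate.

# Minimal sketch (illustrative, not the released implementation) of a
# Character-CNN word encoder: characters -> word vector, open vocabulary.
import torch
import torch.nn as nn

class CharacterCNNWordEncoder(nn.Module):
    def __init__(self, num_chars=262, char_dim=16,
                 filters=((1, 32), (2, 32), (3, 64), (4, 128), (5, 256)),
                 output_dim=768):
        super().__init__()
        self.char_emb = nn.Embedding(num_chars, char_dim, padding_idx=0)
        self.convs = nn.ModuleList([
            nn.Conv1d(char_dim, n_filters, kernel_size=width)
            for width, n_filters in filters
        ])
        self.proj = nn.Linear(sum(n for _, n in filters), output_dim)

    def forward(self, char_ids):
        # char_ids: (batch, seq_len, max_word_len) integer character ids
        b, s, w = char_ids.shape
        x = self.char_emb(char_ids.view(b * s, w))   # (b*s, w, char_dim)
        x = x.transpose(1, 2)                        # (b*s, char_dim, w)
        # Convolve over character positions, max-pool each filter map
        pooled = [conv(x).max(dim=-1).values for conv in self.convs]
        x = torch.relu(self.proj(torch.cat(pooled, dim=-1)))
        return x.view(b, s, -1)                      # one vector per word

# Usage: context-independent, word-level embeddings for arbitrary words
encoder = CharacterCNNWordEncoder()
char_ids = torch.randint(1, 262, (2, 8, 50))  # 2 sentences, 8 words, 50 chars
print(encoder(char_ids).shape)                # torch.Size([2, 8, 768])

Because every word is built from its characters rather than looked up in a fixed wordpiece vocabulary, unseen or domain-specific terms (e.g., medical vocabulary) still receive meaningful input representations.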
