Title

Multi hash embeddings in spaCy

Authors

Lester James Miranda, Ákos Kádár, Adriane Boyd, Sofie Van Landeghem, Anders Søgaard, Matthew Honnibal

Abstract

The distributed representation of symbols is one of the key technologies in machine learning systems today, playing a pivotal role in modern natural language processing. Traditional word embeddings associate a separate vector with each word. While this approach is simple and leads to good performance, it requires a lot of memory for representing a large vocabulary. To reduce the memory footprint, the default embedding layer in spaCy is a hash embeddings layer. It is a stochastic approximation of traditional embeddings that provides unique vectors for a large number of words without explicitly storing a separate vector for each of them. To be able to compute meaningful representations for both known and unknown words, hash embeddings represent each word as a summary of the normalized word form, subword information and word shape. Together, these features produce a multi-embedding of a word. In this technical report we lay out a bit of history and introduce the embedding methods in spaCy in detail. Second, we critically evaluate the hash embedding architecture with multi-embeddings on Named Entity Recognition datasets from a variety of domains and languages. The experiments validate most key design choices behind spaCy's embedders, but we also uncover a few surprising results.
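To make the mechanism concrete, the sketch below shows how such a hash embedding can be put together in a few lines of Python. It is an illustrative approximation only, not spaCy's implementation: the MD5-based hashing, the feature extractors, and the table sizes are all stand-ins. spaCy's MultiHashEmbed layer uses MurmurHash over the NORM, PREFIX, SUFFIX, and SHAPE lexeme attributes and mixes the per-feature vectors with a learned layer rather than simply concatenating them.

```python
import hashlib
import numpy as np

def bucket(key: str, seed: int, n_rows: int) -> int:
    """Deterministically map a (seed, string) pair to a row index.
    spaCy itself uses MurmurHash; MD5 here is just a stand-in."""
    digest = hashlib.md5(f"{seed}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "little") % n_rows

def hash_embed(key: str, table: np.ndarray, num_hashes: int = 4) -> np.ndarray:
    """Sum several hashed rows of a fixed-size table. Because each key
    draws on multiple rows, two keys rarely collide on all of them,
    so most keys still receive a near-unique vector."""
    return sum(table[bucket(key, seed, table.shape[0])] for seed in range(num_hashes))

def word_features(word: str) -> dict:
    """Crude stand-ins for spaCy's NORM, PREFIX, SUFFIX, and SHAPE attributes."""
    shape = "".join(
        "X" if c.isupper() else "x" if c.islower() else "d" if c.isdigit() else c
        for c in word
    )
    return {"NORM": word.lower(), "PREFIX": word[:1], "SUFFIX": word[-3:], "SHAPE": shape}

# One small table per feature instead of one row per vocabulary item.
rng = np.random.default_rng(0)
tables = {name: rng.normal(size=(5000, 32)) for name in ("NORM", "PREFIX", "SUFFIX", "SHAPE")}

def multi_embed(word: str) -> np.ndarray:
    """Concatenate the hashed embeddings of each feature into one vector."""
    return np.concatenate([hash_embed(v, tables[k]) for k, v in word_features(word).items()])

print(multi_embed("spaCy").shape)  # (128,) -- defined even for words never seen in training
```

The memory saving in this sketch is the point of the design: the parameter count is fixed by the table sizes, not by the vocabulary, and any string, seen or unseen, hashes to a usable vector.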
