Paper Title
All Word Embeddings from One Embedding
Paper Authors
Paper Abstract
In neural network-based models for natural language processing (NLP), word embeddings often account for the largest portion of the parameters. Conventional models prepare a large embedding matrix whose size depends on the vocabulary size; storing these models in memory and on disk is therefore costly. In this study, to reduce the total number of parameters, the embeddings for all words are represented by transforming a single shared embedding. The proposed method, ALONE (all word embeddings from one), constructs the embedding of a word by modifying the shared embedding with a filter vector, which is word-specific but non-trainable. The constructed embedding is then fed into a feed-forward neural network to increase its expressiveness. Naively implemented, the filter vectors occupy the same amount of memory as a conventional embedding matrix, which depends on the vocabulary size. To solve this issue, we also introduce a memory-efficient filter construction approach. We show that ALONE is sufficient as a word representation through an experiment on the reconstruction of pre-trained word embeddings. In addition, we conduct experiments on downstream NLP tasks: machine translation and summarization. We combined ALONE with the current state-of-the-art encoder-decoder model, the Transformer, and achieved comparable scores on WMT 2014 English-to-German translation and DUC 2004 very short summarization with fewer parameters.
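To make the construction concrete, below is a minimal PyTorch sketch of the idea as described in the abstract: one shared trainable embedding, word-specific non-trainable filter vectors, and a feed-forward network applied to the filtered result. The codebook-based filter construction (the names AloneEmbedding, codebooks, codes, num_blocks, and codebook_size, as well as all sizes) is an illustrative assumption about how a memory-efficient filter might be built, not the paper's exact procedure.

```python
# Sketch of an ALONE-style embedding: all word vectors derived from one shared
# embedding via word-specific, non-trainable filters and a small FFN.
import torch
import torch.nn as nn


class AloneEmbedding(nn.Module):
    def __init__(self, vocab_size, dim, num_blocks=8, codebook_size=64, hidden=512, seed=0):
        super().__init__()
        assert dim % num_blocks == 0
        self.num_blocks = num_blocks
        self.block_dim = dim // num_blocks

        # The single shared, trainable embedding (one vector for all words).
        self.shared = nn.Parameter(torch.randn(dim))

        # Hypothetical memory-efficient filter construction: each word
        # deterministically selects one small random vector per block, so we store
        # num_blocks * codebook_size block vectors plus integer codes instead of a
        # vocab_size x dim filter matrix.
        g = torch.Generator().manual_seed(seed)
        codebooks = torch.randn(num_blocks, codebook_size, self.block_dim, generator=g)
        codes = torch.randint(0, codebook_size, (vocab_size, num_blocks), generator=g)
        self.register_buffer("codebooks", codebooks)  # non-trainable
        self.register_buffer("codes", codes)          # non-trainable

        # Feed-forward network to increase the expressiveness of the filtered embedding.
        self.ffn = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def filter_vector(self, word_ids):
        # Look up one block vector per block for each word and concatenate them.
        idx = self.codes[word_ids]                                    # (batch, num_blocks)
        blocks = [self.codebooks[b, idx[:, b]] for b in range(self.num_blocks)]
        return torch.cat(blocks, dim=-1)                              # (batch, dim)

    def forward(self, word_ids):
        # Word embedding = FFN(shared embedding modified element-wise by the filter).
        return self.ffn(self.shared * self.filter_vector(word_ids))


emb = AloneEmbedding(vocab_size=32000, dim=512)
vectors = emb(torch.tensor([3, 41, 592]))  # three distinct word embeddings
print(vectors.shape)                       # torch.Size([3, 512])
```

With these illustrative sizes, the trainable parameters are one 512-dimensional shared vector plus the FFN weights, and the non-trainable filter state is 8 x 64 block vectors plus per-word integer codes, rather than a full 32000 x 512 embedding matrix.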