Paper Title
Understanding the Downstream Instability of Word Embeddings
Paper Authors
Paper Abstract
Many industrial machine learning (ML) systems require frequent retraining to keep up-to-date with constantly changing data. This retraining exacerbates a large challenge facing ML systems today: model training is unstable, i.e., small changes in training data can cause significant changes in the model's predictions. In this paper, we work on developing a deeper understanding of this instability, with a focus on how a core building block of modern natural language processing (NLP) pipelines---pre-trained word embeddings---affects the instability of downstream NLP models. We first empirically reveal a tradeoff between stability and memory: increasing the embedding memory 2x can reduce the disagreement in predictions due to small changes in training data by 5% to 37% (relative). To theoretically explain this tradeoff, we introduce a new measure of embedding instability---the eigenspace instability measure---which we prove bounds the disagreement in downstream predictions introduced by the change in word embeddings. Practically, we show that the eigenspace instability measure can be a cost-effective way to choose embedding parameters to minimize instability without training downstream models, outperforming other embedding distance measures and performing competitively with a nearest neighbor-based measure. Finally, we demonstrate that the observed stability-memory tradeoffs extend to other types of embeddings as well, including knowledge graph and contextual word embeddings.
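The abstract refers to two measurable quantities: the disagreement in downstream predictions between models trained on two versions of a word embedding, and an eigenspace-based measure of how much the embeddings themselves have changed. Below is a minimal sketch of both, not the paper's reference implementation. The function names (`prediction_disagreement`, `eigenspace_overlap`) are hypothetical, and `eigenspace_overlap` implements the simpler, unweighted eigenspace overlap score as a stand-in; the paper's eigenspace instability measure additionally weights subspace directions (e.g., by eigenvalues), which is omitted here.

```python
# Illustrative sketch only; function names and the unweighted overlap
# formulation are assumptions, not the paper's exact definitions.
import numpy as np

def prediction_disagreement(preds_a, preds_b):
    """Fraction of test points on which two downstream models disagree.

    preds_a, preds_b: label predictions from models trained on embeddings
    learned from slightly different training data.
    """
    preds_a, preds_b = np.asarray(preds_a), np.asarray(preds_b)
    return float(np.mean(preds_a != preds_b))

def eigenspace_overlap(X, X_tilde):
    """Unweighted eigenspace overlap score between two embedding matrices.

    X: (n_words, d) and X_tilde: (n_words, d_tilde) embeddings over a shared
    vocabulary. Returns a value in [0, 1]; higher overlap means the two
    embeddings span more similar subspaces, suggesting lower downstream
    instability. (The paper's eigenspace instability measure is a weighted
    refinement of this idea.)
    """
    U, _, _ = np.linalg.svd(X, full_matrices=False)        # left singular vectors of X
    U_t, _, _ = np.linalg.svd(X_tilde, full_matrices=False)
    d_max = max(X.shape[1], X_tilde.shape[1])
    return float(np.linalg.norm(U.T @ U_t, "fro") ** 2 / d_max)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((1000, 100))                   # embeddings from corpus v1
    X_tilde = X + 0.1 * rng.standard_normal((1000, 100))   # after a small data change
    print(f"eigenspace overlap: {eigenspace_overlap(X, X_tilde):.3f}")
```

Under a proxy like this, lower subspace overlap between two embedding versions would be read as higher expected downstream disagreement, which is the intuition behind using such a measure to choose embedding parameters (e.g., dimension) without training any downstream models.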