Paper Title
Problems with Cosine as a Measure of Embedding Similarity for High Frequency Words
Paper Authors
Paper Abstract
Cosine similarity of contextual embeddings is used in many NLP tasks (e.g., QA, IR, MT) and metrics (e.g., BERTScore). Here, we uncover systematic ways in which word similarities estimated by cosine over BERT embeddings are understated and trace this effect to training data frequency. We find that relative to human judgements, cosine similarity underestimates the similarity of frequent words with other instances of the same word or other words across contexts, even after controlling for polysemy and other factors. We conjecture that this underestimation of similarity for high frequency words is due to differences in the representational geometry of high and low frequency words and provide a formal argument for the two-dimensional case.
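The measurement the abstract describes, cosine similarity between contextual embeddings of the same word in different contexts, can be illustrated with a short sketch. This is not the authors' code; it is a minimal example assuming the HuggingFace `transformers` library, the `bert-base-uncased` checkpoint, and a target word that tokenizes to a single WordPiece. The sentences and the word "bank" are illustrative choices, not examples from the paper.

```python
# Minimal sketch: cosine similarity between last-layer BERT embeddings of
# the same word in two contexts. Assumes `transformers` and `torch` are
# installed and the target word maps to a single WordPiece token.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def contextual_embedding(sentence: str, word: str) -> torch.Tensor:
    """Return the last-layer embedding of `word` within `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, hidden_dim)
    # Locate the target word's position (assumes it is a single token).
    word_id = tokenizer.convert_tokens_to_ids(word)
    position = (inputs["input_ids"][0] == word_id).nonzero()[0].item()
    return hidden[position]

# Same word, same sense, two contexts. Per the paper's finding, cosine
# tends to understate this similarity when the word is high-frequency.
emb_a = contextual_embedding("She deposited the cash at the bank.", "bank")
emb_b = contextual_embedding("He opened an account at the bank.", "bank")
similarity = torch.nn.functional.cosine_similarity(emb_a, emb_b, dim=0)
print(f"cosine similarity: {similarity.item():.3f}")
```

Comparing such scores against human similarity judgements across words of different training-data frequencies, while controlling for polysemy, is the kind of analysis the abstract summarizes.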