Paper Title
To Know by the Company Words Keep and What Else Lies in the Vicinity
Authors
Abstract
The development of state-of-the-art (SOTA) Natural Language Processing (NLP) systems has steadily been establishing new techniques to absorb the statistics of linguistic data. These techniques often trace well-known constructs from traditional theories, and we study these connections to close gaps around key NLP methods as a means to orient future work. For this, we introduce an analytic model of the statistics learned by seminal algorithms (including GloVe and Word2Vec), and derive insights for systems that use these algorithms and the statistics of co-occurrence in general. In this work we derive, to the best of our knowledge, the first known solution to Word2Vec's softmax-optimized skip-gram algorithm. This result presents exciting potential for future development as a direct solution to a deep learning (DL) language model's (LM's) matrix factorization. However, we use the solution to demonstrate the seemingly universal existence of a property that word vectors exhibit, one which allows for the prophylactic discernment of biases in data prior to their absorption by DL models. To qualify our work, we conduct an analysis of independence, i.e., of the density of statistical dependencies in co-occurrence models, which in turn yields insight into the partial fulfillment of the distributional hypothesis by co-occurrence statistics.
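For readers unfamiliar with the two statistical objects the abstract refers to, the sketch below illustrates them on a toy example: a window-based co-occurrence count matrix (the statistics consumed by GloVe) and the softmax probability that the skip-gram objective assigns to a context word given a target word. This is a minimal illustration of the standard formulations only, not the paper's derivation; the corpus, window size, embedding dimension, and the helper name p_context_given_target are hypothetical choices made for this sketch.

```python
# Minimal sketch (assumed toy corpus, window size, and random embeddings;
# not taken from the paper): builds window-based co-occurrence counts and
# evaluates the skip-gram softmax probability p(context | target).
import numpy as np

corpus = "the quick brown fox jumps over the lazy dog".split()
window = 2

vocab = sorted(set(corpus))
index = {w: i for i, w in enumerate(vocab)}

# Symmetric co-occurrence counts within the window (the kind of statistics GloVe fits).
cooc = np.zeros((len(vocab), len(vocab)))
for pos, word in enumerate(corpus):
    for offset in range(1, window + 1):
        if pos + offset < len(corpus):
            context = corpus[pos + offset]
            cooc[index[word], index[context]] += 1
            cooc[index[context], index[word]] += 1

# Skip-gram softmax: p(c | t) = exp(u_c . v_t) / sum_w exp(u_w . v_t),
# with target ("input") vectors v and context ("output") vectors u,
# randomly initialized here for illustration.
dim = 8
rng = np.random.default_rng(0)
V = rng.normal(size=(len(vocab), dim))  # target embeddings
U = rng.normal(size=(len(vocab), dim))  # context embeddings

def p_context_given_target(context, target):
    scores = U @ V[index[target]]
    probs = np.exp(scores - scores.max())  # numerically stable softmax
    probs /= probs.sum()
    return probs[index[context]]

print(cooc[index["quick"], index["brown"]])      # co-occurrence count
print(p_context_given_target("quick", "brown"))  # skip-gram softmax probability
```

In a trained skip-gram model, U and V are fit so that these softmax probabilities match the observed context distributions; the paper's contribution concerns solving that softmax-optimized objective analytically rather than by gradient descent.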