Linguistic Classification using Instance-Based Learning

Paper Title

Linguistic Classification using Instance-Based Learning

Authors

Nayak, Priya S.; Girdhar, Rhythm; Prabhu, Shreekanth M.

Abstract

Traditionally, linguists have organized the world's languages into language families modelled as trees. In this work we take a contrarian approach and question the tree-based model, which is rather restrictive. For example, the affinity that Sanskrit independently has with languages across the Indo-European family is better illustrated using a network model. The same holds for the inter-relationships between languages in India, which are better discovered than assumed. To enable such discovery, we use instance-based learning techniques to assign language labels to words. We vocalize each word and then classify it using our custom linguistic distance metric, computed between the word and training sets containing language labels. We construct the training sets from word clusters, assigning a language and category label to each cluster. Further, we use clustering coefficients as a quality metric for our research. We believe our work has the potential to usher in a new era in linguistics. We have limited this work to important languages in India. This work can be further strengthened by applying AdaBoost for classification, coupled with structural equivalence concepts from social network analysis.
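To make the two techniques named in the abstract concrete, here is a minimal sketch of instance-based (k-nearest-neighbour) word labelling and of a clustering coefficient. It is not the authors' implementation. The abstract does not define the custom linguistic distance metric, so plain Levenshtein edit distance stands in for it. It also does not say how the clustering coefficient is computed or applied, so the standard local clustering coefficient of an undirected graph is shown. The names `edit_distance`, `classify`, `local_clustering`, `TOY_TRAINING` and `TOY_GRAPH`, the placeholder labels `lang_A` and `lang_B`, and all toy strings are assumptions made for illustration only.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance. ASSUMPTION: a stand-in for the paper's
    custom linguistic distance metric, which the abstract does not define."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def classify(word, training_set, k=3):
    """Instance-based learning: return the majority label among the
    k training words closest to `word`."""
    nearest = sorted(training_set,
                     key=lambda item: edit_distance(word, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def local_clustering(graph, node):
    """Standard local clustering coefficient of `node` in an undirected
    graph given as {node: set_of_neighbours}."""
    neighbours = graph[node]
    k = len(neighbours)
    if k < 2:
        return 0.0
    # Count each edge between two neighbours of `node` exactly once.
    links = sum(1 for u in neighbours for v in neighbours
                if u < v and v in graph[u])
    return 2.0 * links / (k * (k - 1))


# Hypothetical toy data: invented strings with placeholder labels,
# not taken from the paper.
TOY_TRAINING = [
    ("pani", "lang_A"), ("paani", "lang_A"), ("panee", "lang_A"),
    ("neeru", "lang_B"), ("niru", "lang_B"), ("neer", "lang_B"),
]

# Hypothetical toy graph: a triangle a-b-c plus a pendant node d.
TOY_GRAPH = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}

if __name__ == "__main__":
    print(classify("pane", TOY_TRAINING, k=3))
    print(local_clustering(TOY_GRAPH, "a"))
```

Because the distance function is only passed into the `sorted` key, a phonetic or otherwise linguistically motivated metric could replace `edit_distance` without changing `classify`.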
