Linguistic Resources for Bhojpuri, Magahi and Maithili: Statistics about them, their Similarity Estimates, and Baselines for Three Applications

Paper title

Linguistic Resources for Bhojpuri, Magahi and Maithili: Statistics about them, their Similarity Estimates, and Baselines for Three Applications

Authors

Mundotiya, Rajesh Kumar; Singh, Manish Kumar; Kapur, Rahul; Mishra, Swasti; Singh, Anil Kumar

Abstract

Preparing corpora for low-resource languages, and developing human language technology to analyze or computationally process them, is a laborious task, primarily due to the unavailability of expert linguists who are native speakers of these languages and also due to the time and resources required. Bhojpuri, Magahi, and Maithili, languages of the Purvanchal region of India (in the north-eastern parts), are low-resource languages belonging to the Indo-Aryan (or Indic) family. They are closely related to Hindi, a relatively high-resource language, which is why we compare them with Hindi. We collected corpora for these three languages from various sources and cleaned them to the extent possible, without changing the data in them. The text belongs to different domains and genres. We calculated some basic statistical measures for these corpora at the character, word, syllable, and morpheme levels. These corpora were also annotated with part-of-speech (POS) and chunk tags. The basic statistical measures were both absolute and relative and were expected to be indicative of linguistic properties such as morphological, lexical, phonological, and syntactic complexity (or richness). The results were compared with a standard Hindi corpus. For most of the measures, we tried to keep the corpus size the same across the languages to avoid the effect of corpus size, but in some cases it turned out that using the full corpus was better, even if the sizes were very different. Although the results are not very clear, we try to draw some conclusions about the languages and the corpora. For POS tagging and chunking, the BIS tagset was used to manually annotate the data. The POS-tagged data sizes are 16067, 14669, and 12310 sentences for Bhojpuri, Magahi, and Maithili, respectively. The sizes for chunking are 9695 and 1954 sentences for Bhojpuri and Maithili, respectively.
