Paper title
The Development of a Labelled te reo Māori-English Bilingual Database for Language Technology
Paper authors
Abstract
Te reo Māori (referred to as Māori), New Zealand's indigenous language, is under-resourced in language technology. Māori speakers are bilingual, and their Māori is code-switched with English. Unfortunately, minimal resources are available for Māori language technology, language detection and code-switch detection for the Māori-English pair. Both English and Māori use Roman-derived orthography, which makes rule-based systems for detecting language and code-switching restrictive. Most Māori language detection is done manually by language experts. This research builds a Māori-English bilingual database of 66,016,807 words with word-level language annotation. The New Zealand Parliament Hansard debate reports were used to build the database. The language labels are assigned using language-specific rules and expert manual annotations. Some words have the same spelling but different meanings in Māori and English. These words could not be categorised as Māori or English based on word-level language rules; hence, manual annotations were necessary. An analysis of various aspects of the database, such as metadata, year-wise breakdown, frequently occurring words, sentence length and N-grams, is also reported. The database developed here is a valuable tool for future language and speech technology development for Aotearoa New Zealand. The methodology followed to label the database can also be applied to other low-resourced language pairs.
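
The abstract does not state the labelling rules themselves, so the following is only a minimal sketch of what word-level rule-based labelling could look like, not the authors' method. It assumes a rule based on standard Māori orthography (consonants h, k, m, n, p, r, t, w, the digraphs ng and wh, five vowels with optional macrons, and every consonant followed by a vowel). The names `MAORI_WORD`, `AMBIGUOUS`, `label_word` and `label_sentence` are hypothetical, and the `AMBIGUOUS` set is a tiny illustrative sample of shared spellings that would be sent for manual annotation.

```python
import re
import unicodedata

# Assumed rule: a Māori word is a sequence of units, each an optional
# consonant (h k m n p r t w, or the digraphs ng / wh) followed by one vowel.
# This is an illustration, not the rule set used to build the database.
MAORI_WORD = re.compile(r"^(?:(?:ng|wh|[hkmnprtw])?[aeiouāēīōū])+$")

# Hypothetical, incomplete sample of spellings that are valid words in both
# languages and therefore cannot be resolved by a word-level rule.
AMBIGUOUS = {"a", "i", "he", "me", "to", "one", "take", "more"}


def label_word(word):
    """Return 'maori', 'english' or 'ambiguous' for a single word."""
    w = unicodedata.normalize("NFC", word).lower()
    if not MAORI_WORD.match(w):
        return "english"    # violates the assumed Māori orthography rule
    if w in AMBIGUOUS:
        return "ambiguous"  # shared spelling: defer to a human annotator
    return "maori"


def label_sentence(sentence):
    """Tokenise on runs of letters and label each token."""
    text = unicodedata.normalize("NFC", sentence)
    tokens = re.findall(r"[^\W\d_]+", text)
    return [(token, label_word(token)) for token in tokens]


if __name__ == "__main__":
    for token, label in label_sentence("Kia ora, the Speaker of the whānau"):
        print(token, label)
```

A rule of this kind will mislabel any English word that happens to fit Māori spelling patterns and is missing from the shared-spelling list, which is consistent with the abstract's point that such rules are restrictive and that expert annotation is needed for the remaining cases.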