Paper title
KenSwQuAD -- A Question Answering Dataset for Swahili Low Resource Language
Paper authors
Paper abstract
The need for question answering datasets in low-resource languages motivated this research, which led to the development of the Kencorpus Swahili Question Answering Dataset, KenSwQuAD. The dataset is annotated from raw story texts in Swahili, a low-resource language that is predominantly spoken in Eastern Africa and in other parts of the world. Question Answering (QA) datasets are important for machine comprehension of natural language in tasks such as internet search and dialog systems. Machine learning systems need training data such as the gold-standard QA set developed in this research. The research engaged annotators to formulate QA pairs from Swahili texts collected by the Kencorpus project, a Kenyan languages corpus. The project annotated 1,445 of the total 2,585 texts with at least 5 QA pairs each, resulting in a final dataset of 7,526 QA pairs. A quality assurance set of 12.5% of the annotated texts confirmed that the QA pairs were all correctly annotated. A proof of concept applying the set to the QA task confirmed that the dataset is usable for such tasks. KenSwQuAD has also contributed to the resourcing of the Swahili language.