Simple or Complex? Learning to Predict Readability of Bengali Texts

Paper Title

Simple or Complex? Learning to Predict Readability of Bengali Texts

Authors

Susmoy Chakraborty, Mir Tafseer Nayeem, Wasi Uddin Ahmad

Abstract

Determining the readability of a text is the first step toward simplifying it. In this paper, we present a readability analysis tool capable of analyzing text written in the Bengali language to provide in-depth information on its readability and complexity. Despite being the 7th most spoken language in the world, with 230 million native speakers, Bengali suffers from a lack of fundamental resources for natural language processing. Readability-related research on the Bengali language so far can be considered narrow and sometimes faulty due to this lack of resources. Therefore, we correctly adopt document-level readability formulas traditionally used in the U.S.-based education system for the Bengali language, with a proper age-to-age comparison. Due to the unavailability of large-scale human-annotated corpora, we further divide the document-level task into a sentence-level task and experiment with neural architectures, which will serve as a baseline for future work on Bengali readability prediction. In the process, we present several human-annotated corpora and dictionaries: a document-level dataset comprising 618 documents across 12 different grade levels; a large-scale sentence-level dataset comprising more than 96K sentences with simple and complex labels; a consonant conjunct count algorithm and a corpus of 341 words to validate the effectiveness of the algorithm; a list of 3,396 easy words; and an updated pronunciation dictionary with more than 67K words. These resources can be useful for several other tasks in this low-resource language. We make our code and dataset publicly available at https://github.com/tafseer-nayeem/BengaliReadability for reproducibility.
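The abstract names a consonant conjunct count algorithm but does not describe how it works. As a rough illustration of what such a count could look like, here is a minimal Python sketch. It is not the authors' algorithm. It assumes a conjunct is a run of two or more Bengali consonants joined by the hasanta sign (U+09CD). The function name `count_conjuncts` and the chosen consonant ranges are hypothetical, picked only for this example.

```python
import re

# Assumption: consonant letters of the Unicode Bengali block
# (U+0995..U+09B9 plus the precomposed letters U+09DC, U+09DD, U+09DF).
CONSONANT = "[\u0995-\u09B9\u09DC\u09DD\u09DF]"
NUKTA = "\u09BC"    # optional dot that may follow a consonant
HASANTA = "\u09CD"  # virama sign that joins consonants into a conjunct

# One consonant unit, then one or more "hasanta + consonant unit" links.
UNIT = f"{CONSONANT}{NUKTA}?"
CONJUNCT = re.compile(f"{UNIT}(?:{HASANTA}{UNIT})+")


def count_conjuncts(text: str) -> int:
    """Count consonant clusters joined by hasanta (hypothetical helper)."""
    return len(CONJUNCT.findall(text))


if __name__ == "__main__":
    # The word "bidyalay" (school), written with explicit code points.
    word = "\u09AC\u09BF\u09A6\u09CD\u09AF\u09BE\u09B2\u09AF\u09BC"
    print(count_conjuncts(word))
```

Under this definition a cluster of three joined consonants counts as one conjunct. A real implementation might instead count each join, or treat special forms such as ya-phala and reph separately. The 341-word validation corpus mentioned in the abstract would be the natural way to check which convention matches the authors' intent.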
