Paper title
A Baseline Readability Model for Cebuano
Paper authors
Paper abstract
In this study, we developed the first baseline readability model for the Cebuano language. Cebuano is the second most-used native language in the Philippines, with about 27.5 million speakers. As the baseline, we extracted traditional or surface-based features, syllable patterns based on Cebuano's documented orthography, and neural embeddings from the multilingual BERT model. Results show that combining the first two handcrafted linguistic feature sets gave the best performance: an optimized Random Forest model trained on them reached approximately 87% across all metrics. The feature sets and algorithm used are also similar to those in previous results on readability assessment for the Filipino language, suggesting potential for cross-lingual application. To encourage more work on readability assessment in Philippine languages such as Cebuano, we open-sourced both code and data.
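As a rough illustration of the two handcrafted feature families named in the abstract, the sketch below computes a few surface-based features and consonant-vowel patterns for a text. It is not the authors' implementation: the function names, the choice of features, the five-letter vowel set, the treatment of "ng" as a single consonant, and the use of word-level (rather than per-syllable) patterns are all assumptions made for this example.

```python
import re
from collections import Counter

# Assumption: five vowel letters; every other letter counts as a consonant.
VOWELS = set("aeiou")


def tokenize(text):
    """Lowercase the text and return ASCII alphabetic tokens.

    Internal hyphens are kept (e.g. "tan-aw"). Non-ASCII letters are not
    handled in this simplified sketch.
    """
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())


def surface_features(text):
    """Compute simple surface-based counts and averages for one document."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = tokenize(text)
    n_words = len(words)
    n_sents = len(sentences)
    total_letters = sum(len(w.replace("-", "")) for w in words)
    return {
        "word_count": n_words,
        "sentence_count": n_sents,
        "avg_word_length": total_letters / n_words if n_words else 0.0,
        "avg_sentence_length": n_words / n_sents if n_sents else 0.0,
    }


def cv_pattern(word):
    """Map a word to its consonant-vowel skeleton, e.g. "bata" -> "cvcv".

    Assumption: the digraph "ng" is treated as one consonant. Hyphens and
    other non-letters are skipped.
    """
    letters = word.lower().replace("ng", "N")  # "N" is a placeholder consonant
    return "".join("v" if ch in VOWELS else "c" for ch in letters if ch.isalpha())


def cv_pattern_counts(text):
    """Count how often each word-level consonant-vowel pattern occurs."""
    return Counter(cv_pattern(w) for w in tokenize(text))


if __name__ == "__main__":
    sample = "Ang bata nagdula. Nalipay siya."
    print(surface_features(sample))
    print(cv_pattern_counts(sample))
```

Feature dictionaries like these could be concatenated into one vector per document and passed to a classifier such as a Random Forest. A faithful reproduction of the syllable-pattern features would additionally need a syllabifier that follows the documented orthography, which this sketch does not attempt.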