Title
CSCD-NS: a Chinese Spelling Check Dataset for Native Speakers
Authors
Abstract
In this paper, we present CSCD-NS, the first Chinese spelling check (CSC) dataset designed for native speakers, containing 40,000 samples from a Chinese social platform. Compared with existing CSC datasets aimed at Chinese learners, CSCD-NS is ten times larger in scale and exhibits a distinct error distribution, with a significantly higher proportion of word-level errors. To further enhance the data resource, we propose a novel method that simulates the input process through an input method, generating large-scale and high-quality pseudo data that closely resembles the actual error distribution and outperforms existing methods. Moreover, we investigate the performance of various models in this scenario, including large language models (LLMs) such as ChatGPT. The results indicate that generative models underperform BERT-like classification models due to strict length and pronunciation constraints. The high prevalence of word-level errors also makes CSC for native speakers challenging, leaving substantial room for improvement.
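The abstract only sketches the pseudo-data idea, so a toy illustration may help. The snippet below is a minimal sketch, not the authors' actual pipeline: it assumes a tiny hand-written table of same-pinyin input-method (IME) candidates and a hypothetical `corrupt` helper that swaps a correct word for a confusable candidate, mimicking the word-level error a native speaker makes when picking the wrong entry from an IME candidate list. The pinyin keys, candidate lists, and error rate are all illustrative assumptions.

```python
import random

# Toy IME-style candidate lists keyed by toneless pinyin (illustrative only).
IME_CANDIDATES = {
    "shiyan": ["试验", "实验"],   # "trial" vs. "experiment", same pinyin
    "xingshi": ["形式", "形势"],  # "form" vs. "situation", same pinyin
    "jihua": ["计划", "记画"],    # real word vs. a same-pinyin non-word
}

# Reverse index: word -> pinyin key, used to look up confusable candidates.
WORD_TO_PINYIN = {w: p for p, ws in IME_CANDIDATES.items() for w in ws}


def corrupt(sentence: str, error_rate: float = 1.0, seed: int = 0) -> str:
    """Replace known words with same-pinyin alternatives to create pseudo errors."""
    rng = random.Random(seed)
    for word, pinyin in WORD_TO_PINYIN.items():
        if word in sentence and rng.random() < error_rate:
            alternatives = [w for w in IME_CANDIDATES[pinyin] if w != word]
            if alternatives:
                sentence = sentence.replace(word, rng.choice(alternatives), 1)
    return sentence


if __name__ == "__main__":
    clean = "我们计划下周进行一次试验。"
    noisy = corrupt(clean)
    print(noisy, "->", clean)  # (noisy, clean) forms one pseudo training pair
```

Each (noisy, clean) pair produced this way can serve as one pseudo training example; a realistic pipeline would draw candidates from a full IME lexicon and calibrate the error rate to the observed error distribution.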