Paper Title
Non-Standard Vietnamese Word Detection and Normalization for Text-to-Speech
Paper Authors
Paper Abstract
Converting written texts into their spoken forms is an essential problem in any text-to-speech (TTS) system. However, building an effective text normalization solution for a real-world TTS system faces two main challenges: (1) the semantic ambiguity of non-standard words (NSWs), e.g., numbers, dates, ranges, scores, and abbreviations, and (2) transforming NSWs into pronounceable syllables, such as URLs, email addresses, hashtags, and contact names. In this paper, we propose a new two-phase normalization approach to deal with these challenges. First, a model-based tagger is designed to detect NSWs. Then, depending on the NSW type, a rule-based normalizer expands those NSWs into their final verbal forms. We conducted three empirical experiments for NSW detection using Conditional Random Fields (CRFs), BiLSTM-CNN-CRF, and BERT-BiGRU-CRF models on a manually annotated dataset comprising 5819 sentences extracted from Vietnamese news articles. In the second phase, we propose a forward lexicon-based maximum matching algorithm to split hashtags, emails, URLs, and contact names into their component words. The experimental results of the tagging phase show that the average F1 scores of the BiLSTM-CNN-CRF and CRF models are above 90.00%, with the BERT-BiGRU-CRF model reaching the highest F1 of 95.00%. Overall, our approach achieves low sentence error rates: 8.15% with the CRF tagger, 7.11% with the BiLSTM-CNN-CRF tagger, and only 6.67% with the BERT-BiGRU-CRF tagger.
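The second-phase splitting step can be illustrated with a minimal sketch of forward lexicon-based maximum matching. This is a generic greedy implementation, assuming a simple set-based lexicon and a character-level fallback; the paper's actual lexicon, tokenization, and tie-breaking rules are not specified in the abstract and may differ.

```python
def forward_max_match(text, lexicon, max_word_len=10):
    """Greedily split `text` into the longest lexicon words, left to right.

    Characters not covered by any lexicon word are emitted one at a time,
    so the result always covers the whole input string.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a match is found;
        # a single character is always accepted as a last resort.
        for j in range(min(len(text), i + max_word_len), i, -1):
            if text[i:j] in lexicon or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

# Hypothetical example: splitting a hashtag-like string into words
# that a TTS front end could then verbalize.
lexicon = {"hello", "vietnam", "news"}
print(forward_max_match("hellovietnamnews", lexicon))
# -> ['hello', 'vietnam', 'news']
```

The greedy longest-match-first choice is what makes this "maximum" matching: at each position the algorithm commits to the longest dictionary word starting there, rather than searching over all possible segmentations.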