Paper Title
Multi-Modal Transformers for Utterance-Level Code-Switching Detection
Paper Authors
Paper Abstract
An utterance that contains speech from multiple languages is known as a code-switched sentence. In this work, we propose a novel technique to predict whether a given audio utterance is monolingual or code-switched. We propose a multi-modal learning approach that utilises phoneme information along with audio features for code-switch detection. Our model consists of a Phoneme Network (PN), which processes the phoneme sequence, and an Audio Network (AN), which processes the MFCC features. We fuse the representations learned from the two networks to predict whether the utterance is code-switched. Both the Audio Network and the Phoneme Network consist of initial convolution, Bi-LSTM, and transformer encoder layers. The transformer encoder layer uses self-attention to select important and relevant features for better classification. We show that utilising the phoneme sequence of the utterance along with the MFCC features significantly improves the performance of code-switch detection. We train and evaluate our model on the Microsoft code-switching challenge datasets for the Telugu, Tamil, and Gujarati languages. Our experiments show that the multi-modal learning approach significantly improves accuracy over uni-modal approaches on the Telugu-English, Gujarati-English, and Tamil-English datasets. We also study the system performance with different neural layers and show that transformers help obtain better performance.
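The architecture the abstract describes (two parallel branches of convolution, Bi-LSTM, and transformer encoder layers, fused for a binary decision) can be sketched as follows. This is not the authors' code; all layer sizes, the phoneme vocabulary, the mean-pooling step, and concatenation-based late fusion are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class Branch(nn.Module):
    """Shared branch design: Conv1d -> Bi-LSTM -> Transformer encoder -> pooled vector."""
    def __init__(self, in_dim, hidden=64):
        super().__init__()
        self.conv = nn.Conv1d(in_dim, hidden, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True, bidirectional=True)
        enc_layer = nn.TransformerEncoderLayer(d_model=2 * hidden, nhead=4,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=1)

    def forward(self, x):                       # x: (batch, time, in_dim)
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)
        h, _ = self.lstm(h)                     # Bi-LSTM doubles the feature dim
        h = self.encoder(h)                     # self-attention weighs relevant frames
        return h.mean(dim=1)                    # simple mean-pool over time (assumed)

class CodeSwitchDetector(nn.Module):
    def __init__(self, n_phonemes=50, phone_emb=32, mfcc_dim=13, hidden=64):
        super().__init__()
        self.phone_emb = nn.Embedding(n_phonemes, phone_emb)
        self.phoneme_net = Branch(phone_emb, hidden)   # Phoneme Network (PN)
        self.audio_net = Branch(mfcc_dim, hidden)      # Audio Network (AN), MFCC input
        self.classifier = nn.Linear(4 * hidden, 2)     # fused vector -> {mono, CS}

    def forward(self, phonemes, mfcc):
        p = self.phoneme_net(self.phone_emb(phonemes))
        a = self.audio_net(mfcc)
        return self.classifier(torch.cat([p, a], dim=-1))  # late fusion by concat

model = CodeSwitchDetector()
# Toy batch: 2 utterances, 20 phonemes each, 100 frames of 13-dim MFCCs.
logits = model(torch.randint(0, 50, (2, 20)), torch.randn(2, 100, 13))
print(logits.shape)  # torch.Size([2, 2])
```

The two branches are kept structurally identical here for brevity; only their input representations (phoneme embeddings vs. MFCC frames) differ, which matches the abstract's description of both networks using convolution, Bi-LSTM, and transformer encoder layers.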