Paper Title

Speech Resources in the Tamasheq Language

Authors

Boito, Marcely Zanon; Bougares, Fethi; Barbier, Florentin; Gahbiche, Souhir; Barrault, Loïc; Rouvier, Mickael; Estève, Yannick

Abstract

In this paper we present two datasets for Tamasheq, a developing language mainly spoken in Mali and Niger. These two datasets were made available for the IWSLT 2022 low-resource speech translation track, and they consist of collections of radio recordings from daily broadcast news in Niger (Studio Kalangou) and Mali (Studio Tamani). We share (i) a massive amount of unlabeled audio data (671 hours) in five languages: French from Niger, Fulfulde, Hausa, Tamasheq and Zarma, and (ii) a smaller 17-hour parallel corpus of audio recordings in Tamasheq, with utterance-level translations in French. All of this data is shared under the Creative Commons BY-NC-ND 3.0 license. We hope these resources will inspire the speech community to develop and benchmark models using the Tamasheq language.
