Bloom Library: Multimodal Datasets in 300+ Languages for a Variety of Downstream Tasks

Paper Title

Bloom Library: Multimodal Datasets in 300+ Languages for a Variety of Downstream Tasks

Authors

Colin Leong, Joshua Nemecek, Jacob Mansdorfer, Anna Filighera, Abraham Owodunni, Daniel Whitenack

Abstract

We present Bloom Library, a linguistically diverse set of multimodal and multilingual datasets for language modeling, image captioning, visual storytelling, and speech synthesis/recognition. These datasets represent either the most, or among the most, multilingual datasets for each of the included downstream tasks. In total, the initial release of the Bloom Library datasets covers 363 languages across 32 language families. We train downstream task models for various languages represented in the data, showing the viability of the data for future work in low-resource, multimodal NLP and establishing the first known baselines for these downstream tasks in certain languages (e.g., Bisu [bzi], with an estimated population of 700 users). Some of these first-of-their-kind baselines are comparable to state-of-the-art performance for higher-resourced languages. The Bloom Library datasets are released under Creative Commons licenses on the Hugging Face datasets hub to catalyze more linguistically diverse research in the included downstream tasks.

Paper Title