BAN-Cap: A Multi-Purpose English-Bangla Image Descriptions Dataset

Paper Title

BAN-Cap: A Multi-Purpose English-Bangla Image Descriptions Dataset

Authors

Mohammad Faiyaz Khan, S. M. Sadiq-Ur-Rahman Shifath, Md Saiful Islam

Abstract

As computers have become efficient at understanding visual information and transforming it into a written representation, research interest in tasks like automatic image captioning has seen a significant leap over the last few years. While most of the research attention is given to the English language in a monolingual setting, resource-constrained languages like Bangla remain out of focus, predominantly due to a lack of standard datasets. Addressing this issue, we present a new dataset BAN-Cap following the widely used Flickr8k dataset, where we collect Bangla captions of the images provided by qualified annotators. Our dataset represents a wider variety of image caption styles annotated by trained people from different backgrounds. We present a quantitative and qualitative analysis of the dataset and the baseline evaluation of the recent models in Bangla image captioning. We investigate the effect of text augmentation and demonstrate that an adaptive attention-based model combined with text augmentation using Contextualized Word Replacement (CWR) outperforms all state-of-the-art models for Bangla image captioning. We also present this dataset's multipurpose nature, especially on machine translation for Bangla-English and English-Bangla. This dataset and all the models will be useful for further research.
