Paper Title
LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking
Paper Authors
Paper Abstract
Self-supervised pre-training techniques have achieved remarkable progress in Document AI. Most multimodal pre-trained models use a masked language modeling objective to learn bidirectional representations on the text modality, but they differ in pre-training objectives for the image modality. This discrepancy adds difficulty to multimodal representation learning. In this paper, we propose \textbf{LayoutLMv3} to pre-train multimodal Transformers for Document AI with unified text and image masking. Additionally, LayoutLMv3 is pre-trained with a word-patch alignment objective to learn cross-modal alignment by predicting whether the corresponding image patch of a text word is masked. The simple unified architecture and training objectives make LayoutLMv3 a general-purpose pre-trained model for both text-centric and image-centric Document AI tasks. Experimental results show that LayoutLMv3 achieves state-of-the-art performance not only in text-centric tasks, including form understanding, receipt understanding, and document visual question answering, but also in image-centric tasks such as document image classification and document layout analysis. The code and models are publicly available at \url{https://aka.ms/layoutlmv3}.
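To make the word-patch alignment (WPA) objective concrete, the following is a minimal PyTorch sketch of how such alignment labels could be derived, assuming a 224x224 page image split into a 14x14 grid of 16x16 patches. The function names (bbox_to_patch, wpa_labels) and the label convention (1 = aligned, 0 = unaligned, -100 = masked word excluded from the loss) are illustrative assumptions, not the authors' implementation.

import torch

def bbox_to_patch(word_bboxes, image_size=224, patch_size=16):
    """Map each word's bounding-box center (x0, y0, x1, y1, in pixels)
    to the index of the image patch that contains it."""
    centers_x = (word_bboxes[:, 0] + word_bboxes[:, 2]) / 2
    centers_y = (word_bboxes[:, 1] + word_bboxes[:, 3]) / 2
    patches_per_row = image_size // patch_size
    col = (centers_x // patch_size).long().clamp(0, patches_per_row - 1)
    row = (centers_y // patch_size).long().clamp(0, patches_per_row - 1)
    return row * patches_per_row + col

def wpa_labels(word_bboxes, masked_patches, masked_words):
    """Per-word WPA label: 1 ("aligned") when the word's image patch is
    unmasked, 0 ("unaligned") when it is masked; words that are
    themselves text-masked are excluded from the loss (-100)."""
    patch_idx = bbox_to_patch(word_bboxes)
    aligned = (~masked_patches[patch_idx]).long()  # 1 iff patch not masked
    aligned[masked_words] = -100                   # drop masked words
    return aligned

# Toy usage: 3 words on a 224x224 page.
words = torch.tensor([[10., 10., 30., 30.],
                      [100., 100., 120., 120.],
                      [200., 200., 220., 220.]])
masked_patches = torch.zeros(14 * 14, dtype=torch.bool)
masked_patches[bbox_to_patch(words)[1]] = True     # image-mask the 2nd word's patch
masked_words = torch.tensor([False, False, True])  # 3rd word is text-masked
print(wpa_labels(words, masked_patches, masked_words))  # tensor([1, 0, -100])

A model pre-trained this way receives a binary classification signal per unmasked word, which is one plausible way to realize the cross-modal alignment described in the abstract.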