LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding

Paper title

LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding

Authors

Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou

Abstract

Pre-training of text and layout has proved effective in a variety of visually-rich document understanding tasks due to its effective model architecture and the advantage of large-scale unlabeled scanned/digital-born documents. We propose LayoutLMv2 architecture with new pre-training tasks to model the interaction among text, layout, and image in a single multi-modal framework. Specifically, with a two-stream multi-modal Transformer encoder, LayoutLMv2 uses not only the existing masked visual-language modeling task but also the new text-image alignment and text-image matching tasks, which make it better capture the cross-modality interaction in the pre-training stage. Meanwhile, it also integrates a spatial-aware self-attention mechanism into the Transformer architecture so that the model can fully understand the relative positional relationship among different text blocks. Experiment results show that LayoutLMv2 outperforms LayoutLM by a large margin and achieves new state-of-the-art results on a wide variety of downstream visually-rich document understanding tasks, including FUNSD (0.7895 → 0.8420), CORD (0.9493 → 0.9601), SROIE (0.9524 → 0.9781), Kleister-NDA (0.8340 → 0.8520), RVL-CDIP (0.9443 → 0.9564), and DocVQA (0.7295 → 0.8672). We made our model and code publicly available at https://aka.ms/layoutlmv2.
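
The abstract mentions a spatial-aware self-attention mechanism but does not spell out how it is computed. As a rough illustration only, the sketch below assumes one common way to make attention position-aware: adding bias terms, indexed by clipped relative offsets, to the scaled dot-product scores before the softmax. The function names, the dictionary-based bias tables, the `limit` parameter, and the use of a single (x, y) coordinate per token are all hypothetical choices for this example and are not taken from the released code.

```python
import math


def clip(offset, limit):
    """Clamp a relative offset to [-limit, limit] so it indexes a finite bias table."""
    return max(-limit, min(limit, offset))


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def spatial_aware_attention(queries, keys, coords, bias_1d, bias_x, bias_y, limit):
    """Attention weights with relative-position biases added to the raw scores.

    queries, keys: one float vector per token (all of the same length).
    coords: hypothetical (x, y) integer position of each token's text block.
    bias_1d, bias_x, bias_y: dicts mapping a clipped offset to a scalar bias;
    they stand in for parameters that a real model would learn.
    """
    dim = len(queries[0])
    weights = []
    for i, q in enumerate(queries):
        row = []
        for j, k in enumerate(keys):
            # Standard scaled dot-product score.
            score = sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            # Assumed additive biases: sequence offset, then x and y offsets.
            score += bias_1d[clip(j - i, limit)]
            score += bias_x[clip(coords[j][0] - coords[i][0], limit)]
            score += bias_y[clip(coords[j][1] - coords[i][1], limit)]
            row.append(score)
        weights.append(softmax(row))
    return weights


# Toy usage: three tokens with identical content, two of them on the same row.
limit = 2
offsets = range(-limit, limit + 1)
zero_bias = {o: 0.0 for o in offsets}
same_row_bias = {o: (1.0 if o == 0 else 0.0) for o in offsets}

vectors = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
coords = [(0, 0), (5, 0), (0, 7)]
weights = spatial_aware_attention(
    vectors, vectors, coords, zero_bias, zero_bias, same_row_bias, limit
)
for row in weights:
    print([round(w, 3) for w in row])
```

In this toy setup the token contents are identical, so any difference in the attention weights comes purely from the layout bias: tokens that share a row attend to each other more strongly than to the token on a different row.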
