Paper title
MATrIX -- Modality-Aware Transformer for Information eXtraction
Paper authors
Paper abstract
We present MATrIX, a Modality-Aware Transformer for Information eXtraction in the Visual Document Understanding (VDU) domain. VDU covers information extraction from visually rich documents such as forms, invoices, receipts, tables, graphs, presentations, or advertisements. In these documents, text semantics and visual information complement each other to provide a global understanding of the document. MATrIX is pre-trained in an unsupervised way with specifically designed tasks that require the use of multi-modal information (spatial, visual, or textual). We consider the spatial and text modalities all at once in a single token set. To make the attention more flexible, we use a learned modality-aware relative bias in the attention mechanism to modulate the attention between tokens of different modalities. We evaluate MATrIX on three different datasets, each with strong baselines.
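The idea of a modality-aware bias in attention can be illustrated with a minimal sketch. The abstract does not give the parameterization, so everything below is an assumption made for illustration: the function name `modality_biased_attention`, the two modality ids `TEXT` and `SPATIAL`, and the use of one scalar bias per (query modality, key modality) pair. In particular, the sketch ignores relative positions and uses fixed numbers where a real model would learn the bias table, so it should not be read as the MATrIX implementation.

```python
import math

TEXT, SPATIAL = 0, 1  # hypothetical modality ids, not taken from the paper


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def modality_biased_attention(queries, keys, values, modalities, bias):
    """Single-head attention over one mixed token set.

    bias[m_q][m_k] is added to the scaled dot-product score whenever a
    query of modality m_q attends to a key of modality m_k. In a trained
    model this table would be a learned parameter; here it is an input.
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    weights = []
    for i, q in enumerate(queries):
        scores = []
        for j, k in enumerate(keys):
            dot = sum(a * b for a, b in zip(q, k))
            scores.append(dot * scale + bias[modalities[i]][modalities[j]])
        weights.append(softmax(scores))
    width = len(values[0])
    outputs = [
        [sum(w * v[c] for w, v in zip(row, values)) for c in range(width)]
        for row in weights
    ]
    return outputs, weights


# Two text tokens and one spatial token share a single token set.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
modalities = [TEXT, TEXT, SPATIAL]
# Illustrative values only: text queries are discouraged from attending to
# spatial keys, spatial queries are encouraged to attend to text keys.
bias = [[0.0, -1.0], [0.5, 0.0]]

outputs, weights = modality_biased_attention(
    tokens, tokens, tokens, modalities, bias
)
```

With an all-zero bias table the function reduces to ordinary scaled dot-product attention, which makes the effect of the modality term easy to isolate when experimenting.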