Paper Title


MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering

Authors

Fangyu Liu, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Yasemin Altun, Nigel Collier, Julian Martin Eisenschlos

Abstract


Visual language data such as plots, charts, and infographics are ubiquitous in the human world. However, state-of-the-art vision-language models do not perform well on these data. We propose MatCha (Math reasoning and Chart derendering pretraining) to enhance visual language models' capabilities in jointly modeling charts/plots and language data. Specifically, we propose several pretraining tasks that cover plot deconstruction and numerical reasoning, which are key capabilities in visual language modeling. We perform the MatCha pretraining starting from Pix2Struct, a recently proposed image-to-text visual language model. On standard benchmarks such as PlotQA and ChartQA, the MatCha model outperforms state-of-the-art methods by as much as nearly 20%. We also examine how well MatCha pretraining transfers to domains such as screenshots, textbook diagrams, and document figures, and observe overall improvement, verifying the usefulness of MatCha pretraining on broader visual language tasks.
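The chart-derendering objective described above treats a chart image as input and a textual rendering of its underlying data table as the generation target. As a hedged illustration only (the exact serialization format is defined by the paper and its released code, not reproduced here), the sketch below linearizes a small data table into a single target string; the `linearize_table` helper, its field order, and the `<0x0A>` row separator are assumptions for this example.

```python
def linearize_table(title, header, rows):
    """Flatten a chart's underlying data table into one text string,
    a plausible stand-in for a chart-derendering pretraining target.
    (Illustrative format only; not the exact MatCha serialization.)"""
    lines = [f"title | {title}"]               # chart title as a labeled field
    lines.append(" | ".join(header))           # column names
    for row in rows:                           # one line per data row
        lines.append(" | ".join(str(v) for v in row))
    # "<0x0A>" stands in for a newline token; the real separator is an assumption here
    return " <0x0A> ".join(lines)

target = linearize_table("Chart title", ["x", "y"], [["A", 10], ["B", 20]])
print(target)
```

During pretraining, an image-to-text model such as Pix2Struct would be trained to emit a string like `target` given only the rendered chart, forcing it to recover the plot's structure and values.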
