Key Information Extraction in Purchase Documents using Deep Learning and Rule-based Corrections

Paper Title

Key Information Extraction in Purchase Documents using Deep Learning and Rule-based Corrections

Authors

Roberto Arroyo, Javier Yebes, Elena Martínez, Héctor Corrales, Javier Lorenzo

Abstract

Deep Learning (DL) has recently come to dominate the fields of Natural Language Processing (NLP) and Computer Vision (CV). However, DL commonly relies on the availability of large amounts of annotated data, so alternative or complementary pattern-based techniques can help to improve results. In this paper, we build upon Key Information Extraction (KIE) in purchase documents using both DL and rule-based corrections. Our system initially relies on Optical Character Recognition (OCR) and text understanding based on entity tagging to identify purchase facts of interest (e.g., product codes, descriptions, quantities, or prices). These facts are then linked to the same product group, which is recognized by means of line detection and some grouping heuristics. After these DL stages have run, we contribute several mechanisms consisting of rule-based corrections for improving the baseline DL predictions. We demonstrate the enhancements provided by these rule-based corrections over the baseline DL results in the presented experiments on purchase documents from public and NielsenIQ datasets.
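
The pipeline outlined in the abstract (tagged OCR tokens, grouping into product lines, then rule-based corrections) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Token` structure, the label names (`DESC`, `QTY`, `PRICE`), the vertical tolerance `tol`, and the letter-to-digit confusion table are all assumptions made for illustration, and the grouping step is a simple y-coordinate heuristic standing in for the line detection mentioned in the abstract.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    text: str   # OCR text of the token
    label: str  # hypothetical entity tag: "DESC", "QTY" or "PRICE"
    y: float    # vertical center of the OCR bounding box


def group_by_line(tokens, tol=5.0):
    """Group tokens whose vertical center lies within `tol` of the
    first (topmost) token of the current group (assumed heuristic)."""
    groups = []
    for tok in sorted(tokens, key=lambda t: t.y):
        if groups and abs(tok.y - groups[-1][0].y) <= tol:
            groups[-1].append(tok)
        else:
            groups.append([tok])
    return groups


# Hypothetical table of letters that OCR often confuses with digits.
_OCR_FIXES = str.maketrans(
    {"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"}
)


def correct_numeric(text: str) -> Optional[float]:
    """Rule-based correction: map confusable letters to digits and
    normalize a decimal comma; return None if still not a number."""
    cleaned = text.translate(_OCR_FIXES).replace(",", ".")
    try:
        return float(cleaned)
    except ValueError:
        return None


def extract_products(tokens, tol=5.0):
    """Build one record per detected line, correcting numeric fields."""
    products = []
    for group in group_by_line(tokens, tol):
        record = {"description": None, "quantity": None, "price": None}
        for tok in group:
            if tok.label == "DESC":
                record["description"] = tok.text
            elif tok.label == "QTY":
                record["quantity"] = correct_numeric(tok.text)
            elif tok.label == "PRICE":
                record["price"] = correct_numeric(tok.text)
        products.append(record)
    return products
```

In this sketch, a misread price such as `1,S0` would be repaired by the correction rule before being stored, while the tagger's label decides which field it fills. Real purchase documents would need richer rules (for example, currency symbols or multi-line descriptions), which are outside the scope of this illustration.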
