Paper Title
Ground Truth Inference for Weakly Supervised Entity Matching
Paper Authors
Paper Abstract
Entity matching (EM) refers to the problem of identifying pairs of data records in one or more relational tables that refer to the same real-world entity. Supervised machine learning (ML) models currently achieve state-of-the-art matching performance; however, they require many labeled examples, which are often expensive or infeasible to obtain. This has motivated us to approach data labeling for EM using weak supervision. In particular, we use the labeling function abstraction popularized by Snorkel, where each labeling function (LF) is a user-provided program that can generate many noisy match/non-match labels quickly and cheaply. Given a set of user-written LFs, the quality of data labeling depends on a labeling model to accurately infer the ground-truth labels. In this work, we first propose a simple but powerful labeling model for general weak supervision tasks. We then tailor the labeling model specifically to the task of entity matching by considering the EM-specific transitivity property. The general form of our labeling model is simple while substantially outperforming the best existing method across ten general weak supervision datasets. To tailor the labeling model for EM, we formulate an approach to ensure that the final predictions of the labeling model satisfy the transitivity property required in EM, using an exact solution where possible and an ML-based approximation in the remaining cases. On two single-table and nine two-table real-world EM datasets, we show that our labeling model yields an F1 score that is, on average, 9% higher than that of the best existing method. We also show that a deep learning EM end model (DeepMatcher) trained on labels generated by our weak supervision approach is comparable to an end model trained on tens of thousands of ground-truth labels, demonstrating that our approach can significantly reduce the labeling effort required in EM.
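The abstract describes the weak-supervision pipeline (user-written LFs producing noisy votes, a labeling model inferring the ground truth, and an EM-specific transitivity constraint) without code. The Python sketch below is only an assumed illustration of those pieces: the record schema, thresholds, and function names are hypothetical, majority vote stands in for the paper's labeling model, and a union-find transitive closure stands in for its exact/ML-based transitivity handling.

```python
# Minimal sketch (assumed, not from the paper) of the labeling-function (LF)
# abstraction, a stand-in labeling model, and the EM transitivity property.
# The record schema ("title", "year"), thresholds, and function names are
# hypothetical; the paper's labeling model and its exact/ML-based transitivity
# handling are NOT reproduced here.

from difflib import SequenceMatcher

MATCH, NON_MATCH, ABSTAIN = 1, 0, -1


def lf_title_similarity(a, b):
    """Vote MATCH/NON_MATCH from fuzzy title similarity; abstain in between."""
    sim = SequenceMatcher(None, a["title"].lower(), b["title"].lower()).ratio()
    if sim > 0.9:
        return MATCH
    if sim < 0.4:
        return NON_MATCH
    return ABSTAIN


def lf_different_year(a, b):
    """Records with different years are unlikely to refer to the same entity."""
    if a.get("year") and b.get("year") and a["year"] != b["year"]:
        return NON_MATCH
    return ABSTAIN


def apply_lfs(pairs, lfs):
    """Build the noisy label matrix L: one row per candidate pair, one column per LF."""
    return [[lf(a, b) for lf in lfs] for a, b in pairs]


def majority_vote(label_matrix):
    """Naive stand-in for a labeling model: combine LF votes by majority,
    ignoring abstentions (ties and all-abstain rows default to non-match).
    The paper's labeling model is far more powerful; this only shows where
    such a model plugs into the pipeline."""
    preds = []
    for row in label_matrix:
        votes = [v for v in row if v != ABSTAIN]
        preds.append(MATCH if votes.count(MATCH) > votes.count(NON_MATCH) else NON_MATCH)
    return preds


def transitive_closure(record_ids, predicted_match_pairs):
    """Toy illustration of the transitivity property: if (a, b) and (b, c) are
    matches, then (a, c) must be a match as well.  Here we simply take the
    transitive closure of predicted matches via union-find; the paper instead
    resolves transitivity violations exactly where possible and with an ML
    approximation otherwise."""
    parent = {r: r for r in record_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in predicted_match_pairs:
        parent[find(a)] = find(b)  # merge the two clusters

    # All pairs within the same cluster are matches after closure (O(n^2); toy only).
    return {(a, b) for a in record_ids for b in record_ids
            if a < b and find(a) == find(b)}
```

In the setting the abstract describes, the LF votes form the input matrix to the proposed labeling model, and its match predictions are post-processed so that transitivity holds before the resulting labels are used to train the end model (DeepMatcher).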