Paper Title
On the Importance of Adaptive Data Collection for Extremely Imbalanced Pairwise Tasks
Paper Authors
Paper Abstract
Many pairwise classification tasks, such as paraphrase detection and open-domain question answering, naturally have extreme label imbalance (e.g., $99.99\%$ of examples are negatives). In contrast, many recent datasets heuristically choose examples to ensure label balance. We show that these heuristics lead to trained models that generalize poorly: state-of-the-art models trained on QQP and WikiQA each have only $2.4\%$ average precision when evaluated on realistically imbalanced test data. We instead collect training data with active learning, using a BERT-based embedding model to efficiently retrieve uncertain points from a very large pool of unlabeled utterance pairs. By creating balanced training data with more informative negative examples, active learning greatly improves average precision to $32.5\%$ on QQP and $20.1\%$ on WikiQA.
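As a rough illustration of the uncertainty-based retrieval step described in the abstract, the sketch below scores candidate utterance pairs and selects those whose predicted probability is closest to 0.5. This is a minimal sketch, not the authors' code: the pair featurization, the logistic scorer, and the brute-force scan over precomputed embeddings are all simplifying assumptions, whereas the paper retrieves uncertain pairs efficiently from a very large unlabeled pool.

```python
# Minimal sketch of uncertainty-based retrieval for pairwise tasks, assuming
# each utterance has already been encoded into a fixed BERT-style embedding.
# The featurization and logistic scorer below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)


def pair_features(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Combine two utterance embeddings into one pair representation
    (concatenation plus element-wise product; a common, assumed choice)."""
    return np.concatenate([a, b, a * b], axis=-1)


def predict_proba(w: np.ndarray, feats: np.ndarray) -> np.ndarray:
    """Logistic scorer standing in for the trained pairwise classifier."""
    return 1.0 / (1.0 + np.exp(-feats @ w))


def retrieve_uncertain_pairs(w, emb_a, emb_b, k):
    """Return indices of the k candidate pairs whose predicted probability is
    closest to 0.5, i.e. the points the current model is least certain about."""
    feats = pair_features(emb_a, emb_b)
    probs = predict_proba(w, feats)
    uncertainty = -np.abs(probs - 0.5)  # higher = more uncertain
    return np.argsort(uncertainty)[-k:]


# Toy pool of candidate pairs with 128-dim "BERT" embeddings (random here).
dim, pool_size = 128, 10_000
emb_a = rng.normal(size=(pool_size, dim))
emb_b = rng.normal(size=(pool_size, dim))
# Weights of the current model, scaled to keep logits in a reasonable range.
w = rng.normal(size=3 * dim) / np.sqrt(3 * dim)

to_label = retrieve_uncertain_pairs(w, emb_a, emb_b, k=32)
print("Pair indices to send for labeling:", to_label)
```

In each active-learning round, the newly labeled pairs would be added to the training set and the classifier retrained before the next retrieval step; the brute-force scan here is only workable for small pools.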