Paper Title
On the Importance of Adaptive Data Collection for Extremely Imbalanced Pairwise Tasks
Paper Authors
Paper Abstract
Many pairwise classification tasks, such as paraphrase detection and open-domain question answering, naturally have extreme label imbalance (e.g., $99.99\%$ of examples are negatives). In contrast, many recent datasets heuristically choose examples to ensure label balance. We show that these heuristics lead to trained models that generalize poorly: state-of-the-art models trained on QQP and WikiQA each have only $2.4\%$ average precision when evaluated on realistically imbalanced test data. We instead collect training data with active learning, using a BERT-based embedding model to efficiently retrieve uncertain points from a very large pool of unlabeled utterance pairs. By creating balanced training data with more informative negative examples, active learning greatly improves average precision to $32.5\%$ on QQP and $20.1\%$ on WikiQA.
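As a rough illustration of the uncertainty-based retrieval step described in the abstract, the sketch below scores candidate utterance pairs and selects those whose predicted probability is closest to 0.5. This is a minimal sketch, not the authors' code: the pair featurization, the logistic scorer, and the brute-force scan over precomputed embeddings are all simplifying assumptions, whereas the paper retrieves uncertain pairs efficiently from a very large unlabeled pool.

```python
# Minimal sketch of uncertainty-based retrieval for pairwise tasks, assuming
# each utterance has already been encoded into a fixed BERT-style embedding.
# The featurization and logistic scorer below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)


def pair_features(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Combine two utterance embeddings into one pair representation
    (concatenation plus element-wise product; a common, assumed choice)."""
    return np.concatenate([a, b, a * b], axis=-1)


def predict_proba(w: np.ndarray, feats: np.ndarray) -> np.ndarray:
    """Logistic scorer standing in for the trained pairwise classifier."""
    return 1.0 / (1.0 + np.exp(-feats @ w))


def retrieve_uncertain_pairs(w, emb_a, emb_b, k):
    """Return indices of the k candidate pairs whose predicted probability is
    closest to 0.5, i.e. the points the current model is least certain about."""
    feats = pair_features(emb_a, emb_b)
    probs = predict_proba(w, feats)
    uncertainty = -np.abs(probs - 0.5)  # higher = more uncertain
    return np.argsort(uncertainty)[-k:]


# Toy pool of candidate pairs with 128-dim "BERT" embeddings (random here).
dim, pool_size = 128, 10_000
emb_a = rng.normal(size=(pool_size, dim))
emb_b = rng.normal(size=(pool_size, dim))
# Weights of the current model, scaled to keep logits in a reasonable range.
w = rng.normal(size=3 * dim) / np.sqrt(3 * dim)

to_label = retrieve_uncertain_pairs(w, emb_a, emb_b, k=32)
print("Pair indices to send for labeling:", to_label)
```

In each active-learning round, the newly labeled pairs would be added to the training set and the classifier retrained before the next retrieval step; the brute-force scan here is only workable for small pools.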