Tradeoffs in Resampling and Filtering for Imbalanced Classification

Paper Title

Tradeoffs in Resampling and Filtering for Imbalanced Classification

Authors

Ryan Muther, David Smith

Abstract

Imbalanced classification problems are extremely common in natural language processing and are addressed with a variety of resampling and filtering techniques, which often involve deciding how to select training data or which test examples the model should label. We examine the tradeoffs in model performance involved in choices of how to sample and filter training and test data in heavily imbalanced token classification tasks, and we examine the relationship between the magnitude of these tradeoffs and the base rate of the phenomenon of interest. In experiments on sequence tagging to detect rare phenomena in English and Arabic texts, we find that different methods of selecting training data bring tradeoffs in effectiveness and efficiency. We also see that in highly imbalanced cases, filtering test data using first-pass retrieval models is as important for model performance as selecting training data. The base rate of a rare positive class has a clear effect on the magnitude of the changes in performance caused by the selection of training or test data. As the base rate increases, the differences brought about by those choices decrease.
