Can Strategic Data Collection Improve the Performance of Poverty Prediction Models?

Paper Title

Can Strategic Data Collection Improve the Performance of Poverty Prediction Models?

Authors

Satej Soman, Emily Aiken, Esther Rolf, Joshua Blumenstock

Abstract

Machine learning-based estimates of poverty and wealth are increasingly being used to guide the targeting of humanitarian aid and the allocation of social assistance. However, the ground truth labels used to train these models are typically borrowed from existing surveys that were designed to produce national statistics -- not to train machine learning models. Here, we test whether adaptive sampling strategies for ground truth data collection can improve the performance of poverty prediction models. Through simulations, we compare the status quo sampling strategies (uniform at random and stratified random sampling) to alternatives that prioritize acquiring training data based on model uncertainty or model performance on sub-populations. Perhaps surprisingly, we find that none of these active learning methods improve over uniform-at-random sampling. We discuss how these results can help shape future efforts to refine machine learning-based estimates of poverty.
