Paper Title
Identifying the Context Shift between Test Benchmarks and Production Data
Paper Authors
Paper Abstract
Machine learning models are often brittle on production data despite achieving high accuracy on benchmark datasets. Benchmark datasets have traditionally served dual purposes: first, benchmarks offer a standard on which machine learning researchers can compare different methods, and second, benchmarks provide a model, albeit an imperfect one, of the real world. The incompleteness of test benchmarks (and of the data upon which models are trained) hinders robustness in machine learning, enables shortcut learning, and leaves models systematically prone to err on out-of-distribution and adversarially perturbed data. The mismatch between a single static benchmark dataset and a production dataset has traditionally been described as dataset shift. To clarify how to address the mismatch between test benchmarks and production data, we introduce context shift to describe semantically meaningful changes in the underlying data generation process. Moreover, we identify three methods for addressing context shift that would otherwise lead to model prediction errors: first, we describe how human intuition and expert knowledge can identify semantically meaningful features on which models systematically fail; second, we detail how dynamic benchmarking, with its focus on capturing the data generation process, can promote generalizability through corroboration; and third, we highlight that clarifying a model's limitations can reduce unexpected errors. Robust machine learning is concerned with model performance beyond benchmarks, and as such, we consider three model organism domains (facial expression recognition, deepfake detection, and medical diagnosis) to highlight how implicit assumptions in benchmark tasks lead to errors in practice. By paying close attention to the role of context, researchers can design more comprehensive benchmarks, reduce context shift errors, and increase generalizability.
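As a point of reference for the distinction the abstract draws, dataset shift is commonly formalized as a change in the joint distribution of inputs X and labels Y between the benchmark and production environments. The decomposition below is the standard one from the dataset shift literature; the subscripts (bench, prod) are illustrative notation introduced here, not the paper's:

\[
P_{\mathrm{bench}}(X, Y) \neq P_{\mathrm{prod}}(X, Y)
\]

Factoring the joint distribution as \( P(X, Y) = P(Y \mid X)\,P(X) \) separates two standard special cases:

\[
\text{covariate shift:} \quad P_{\mathrm{bench}}(X) \neq P_{\mathrm{prod}}(X), \qquad P_{\mathrm{bench}}(Y \mid X) = P_{\mathrm{prod}}(Y \mid X)
\]
\[
\text{concept shift:} \quad P_{\mathrm{bench}}(Y \mid X) \neq P_{\mathrm{prod}}(Y \mid X), \qquad P_{\mathrm{bench}}(X) = P_{\mathrm{prod}}(X)
\]

Context shift, as defined in the abstract, is not an alternative factorization: it names a semantically meaningful change in the underlying data generation process, which may surface statistically as either special case above.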