Paper title
Data Readiness Report
Paper authors
Paper abstract
Data exploration and quality analysis are important yet tedious steps in the AI pipeline. Current practices of data cleaning and data readiness assessment for machine learning tasks are mostly conducted in an arbitrary manner, which limits their reuse and results in lost productivity. We introduce the concept of a Data Readiness Report as accompanying documentation for a dataset that allows data consumers to get detailed insights into the quality of the input data. Data characteristics and challenges along various quality dimensions are identified and documented, keeping in mind the principles of transparency and explainability. The Data Readiness Report also serves as a record of all data assessment operations, including applied transformations. This provides a detailed lineage for the purpose of data governance and management. In effect, the report captures and documents the actions taken by various personas in a data readiness and assessment workflow. Over time, this becomes a repository of best practices and can potentially drive a recommendation system for building automated data readiness workflows along the lines of AutoML [8]. We anticipate that, together with Datasheets [9], the Dataset Nutrition Label [11], FactSheets [1], and Model Cards [15], the Data Readiness Report makes significant progress towards Data and AI lifecycle documentation.