Paper title
A Journey from Wild to Textbook Data to Reproducibly Refresh the Wages Data from the National Longitudinal Survey of Youth Database
Paper authors
Abstract
Textbook data is essential for teaching statistics and data science methods because it is clean, allowing the instructor to focus on methodology. Ideally, textbook data sets are refreshed regularly, especially when they are subsets taken from an ongoing data collection. It is also important to use contemporary data for teaching, to convey the sense that the methodology is relevant today. This paper describes the trials and tribulations of refreshing a textbook data set on wages, extracted from the National Longitudinal Survey of Youth (NLSY79) in the early 1990s. The data is useful for teaching modeling and exploratory analysis of longitudinal data. Subsets of NLSY79, including the wages data, can be found in supplementary files from numerous textbooks and research articles. The NLSY79 database has been continuously updated through 2018, so new records are available. Here we describe our journey to refresh the wages data, and document the process so that the data can be regularly updated in the future. Our journey was difficult because the steps and decisions taken to get from the raw data to the wages textbook subset have not been clearly articulated. We have taken care to provide a reproducible workflow for others to follow, which we hope will inspire more attempts at refreshing data for teaching. Three new data sets and the code to produce them are provided in the open source R package `yowie`.