Title
Navigating the challenges in creating complex data systems: a development philosophy
Authors
Abstract
In this perspective, we argue that despite the democratization of powerful tools for data science and machine learning over the last decade, developing the code for a trustworthy and effective data science system (DSS) is getting harder. Perverse incentives and a lack of widespread software engineering (SE) skills are among the many root causes we identify that naturally give rise to the current systemic crisis in the reproducibility of DSSs. We analyze why SE and the building of large complex systems are, in general, hard. Based on these insights, we identify how SE addresses those difficulties and how we can apply and generalize SE methods to construct DSSs that are fit for purpose. We advocate two key development philosophies: one should grow DSSs incrementally rather than plan and build them in two separate phases, and one should always employ two types of feedback loops during development, one that tests the code's correctness and another that evaluates the code's efficacy.