Paper Title

Detect, Distill and Update: Learned DB Systems Facing Out of Distribution Data

Authors

Meghdad Kurmanji, Peter Triantafillou

Abstract

Machine learning (ML) is changing DBs, as many DB components are being replaced by ML models. One open problem in this setting is how to update such ML models in the presence of data updates. We start this investigation by focusing on data insertions (the dominant updates in analytical DBs). We study how to update neural network (NN) models when new data follows a different distribution (that is, when it is "out-of-distribution", or OOD), rendering previously trained NNs inaccurate. A requirement in our problem setting is that learned DB components should ensure high accuracy for tasks on both old and new data (e.g., for approximate query processing (AQP), cardinality estimation (CE), synthetic data generation (DG), etc.). This paper proposes a novel updatability framework (DDUp). DDUp can provide updatability for different learned DB system components, even ones based on different NNs, without the high cost of retraining the NNs from scratch. DDUp entails two components: first, a novel, efficient, and principled statistical-testing approach to detect OOD data; second, a novel model-updating approach, grounded in the principles of transfer learning with knowledge distillation, to update learned models efficiently while still ensuring high accuracy. We develop and showcase DDUp's applicability for three different learned DB components, AQP, CE, and DG, each employing a different type of NN. Detailed experimental evaluations using real and benchmark datasets for AQP, CE, and DG detail DDUp's performance advantages.
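
To make the "detect" step more concrete, here is a minimal sketch of a statistical test that flags a batch of newly inserted data as OOD. The abstract does not specify which test DDUp uses, so everything below is an assumption made for illustration: the choice of a two-sample permutation test, the idea of comparing per-sample model losses on old and new data, and the names `permutation_p_value`, `is_out_of_distribution`, and `alpha` are all hypothetical and not taken from the paper.

```python
import random
from statistics import mean


def permutation_p_value(old_losses, new_losses, n_permutations=2000, seed=0):
    """Two-sided permutation test on the difference of mean losses.

    Hypothetical helper, not the paper's test. Returns an estimated p-value:
    the share of random relabelings whose absolute mean difference is at
    least as large as the observed one.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = abs(mean(new_losses) - mean(old_losses))
    pooled = list(old_losses) + list(new_losses)
    n_old = len(old_losses)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[n_old:]) - mean(pooled[:n_old]))
        if diff >= observed:
            hits += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return (hits + 1) / (n_permutations + 1)


def is_out_of_distribution(old_losses, new_losses, alpha=0.05):
    """Flag the new batch as OOD when the p-value falls below alpha (assumed level)."""
    return permutation_p_value(old_losses, new_losses) < alpha


# Toy per-sample losses of an already trained model (made-up numbers).
old_batch = [1.0 + 0.01 * i for i in range(30)]
shifted_batch = [2.0 + 0.01 * i for i in range(30)]  # clearly higher losses

print(is_out_of_distribution(old_batch, shifted_batch))
print(is_out_of_distribution(old_batch, list(old_batch)))
```

In a pipeline of the kind the abstract describes, a positive result from such a test would trigger the model-update step (knowledge distillation in DDUp), while a negative result would leave the trained model as it is.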
