Paper Title
Revisiting Runtime Dynamic Optimization for Join Queries in Big Data Management Systems
Paper Authors
Paper Abstract
Query optimization remains an open problem for Big Data Management Systems. Traditional optimizers are cost-based and use statistical estimates of intermediate result cardinalities to assign costs and pick the best plan. However, such estimates tend to become less accurate in the presence of filtering conditions that arise from undetected correlations between multiple predicates local to a single dataset, from predicates with query parameters, or from predicates involving user-defined functions (UDFs). Consequently, traditional query optimizers tend to ignore or miscalculate the effect of such predicates, which leads to suboptimal execution plans. Given the volume of today's data, a suboptimal plan can quickly become very inefficient. In this work, we revisit the old idea of runtime dynamic optimization and adapt it to a shared-nothing distributed database system, AsterixDB. The optimization runs in stages (re-optimization points), starting with the execution of all predicates local to a single dataset. The intermediate result created at each stage is used to re-optimize the remaining query. This re-optimization approach avoids inaccurate intermediate result cardinality estimations and thus leads to much better execution plans. While it introduces the overhead of materializing these intermediate results, our experiments show that this overhead is relatively small and is an acceptable price to pay given the optimization benefits. In fact, our experimental evaluation shows that runtime dynamic optimization leads to much better execution plans than the current default AsterixDB plans, plans produced by static cost-based optimization (i.e., based on the initial dataset statistics), and other state-of-the-art approaches.
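The staged procedure described in the abstract can be illustrated with a small, single-process sketch. This is not AsterixDB's implementation: the function names (`hash_join`, `run_dynamic`), the in-memory lists standing in for partitioned datasets, and the cost heuristic (product of the two input cardinalities) are all assumptions made for illustration. The sketch only shows the control flow: apply local predicates first, materialize, then repeatedly pick the next join using actual rather than estimated cardinalities. It also assumes a connected join graph and distinct column names across datasets.

```python
def hash_join(left, right, left_key, right_key):
    """Equi-join two lists of dict rows; builds the hash table on `left`."""
    index = {}
    for row in left:
        index.setdefault(row[left_key], []).append(row)
    return [{**l, **r} for r in right for l in index.get(r[right_key], [])]


def keep_all(row):
    return True


def run_dynamic(datasets, local_predicates, join_conditions):
    """Toy staged execution (hypothetical helper, not an AsterixDB API).

    datasets:         name -> list of dict rows
    local_predicates: name -> callable(row) -> bool
    join_conditions:  list of (name_a, key_a, name_b, key_b)
    """
    # Stage 0: execute all dataset-local predicates and materialize the results.
    materialized = {
        name: [row for row in rows if local_predicates.get(name, keep_all)(row)]
        for name, rows in datasets.items()
    }
    # owner[name] is the id of the materialized result that currently holds `name`.
    owner = {name: name for name in datasets}
    pending = list(join_conditions)
    order = []

    while pending:
        # Re-optimization point: rank the remaining joins by the ACTUAL sizes
        # of their materialized inputs (assumed heuristic: size product).
        def cost(cond):
            a, _, b, _ = cond
            return len(materialized[owner[a]]) * len(materialized[owner[b]])

        best = min(pending, key=cost)
        pending.remove(best)
        a, key_a, b, key_b = best
        left_id, right_id = owner[a], owner[b]
        order.append((a, b))

        if left_id == right_id:
            # Both sides were already joined earlier: apply as a filter.
            materialized[left_id] = [
                r for r in materialized[left_id] if r[key_a] == r[key_b]
            ]
            continue

        result = hash_join(materialized[left_id], materialized[right_id], key_a, key_b)
        new_id = f"({left_id} JOIN {right_id})"
        del materialized[left_id], materialized[right_id]
        materialized[new_id] = result  # materialize the intermediate result
        for name in list(owner):
            if owner[name] in (left_id, right_id):
                owner[name] = new_id

    final_rows = materialized[owner[next(iter(owner))]]
    return final_rows, order


# Toy data: the filter on customers is highly selective.
customers = [{"c_id": i, "country": "GR" if i == 0 else "US"} for i in range(4)]
orders = [{"o_id": i, "o_cust": i % 4} for i in range(8)]
items = [{"i_id": i, "i_order": i % 8} for i in range(16)]

rows, order = run_dynamic(
    {"customers": customers, "orders": orders, "items": items},
    {"customers": lambda r: r["country"] == "GR"},
    [("orders", "o_id", "items", "i_order"),
     ("customers", "c_id", "orders", "o_cust")],
)
print(order, len(rows))
```

On this toy data, the filter leaves a single customer row, so the customers-orders join is chosen first even though it is listed second, and the orders-items join then runs against a two-row intermediate result instead of all eight orders. A real system would weigh this benefit against the cost of materializing each intermediate result, which the sketch does not model.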