Paper Title

Density-optimized Intersection-free Mapping and Matrix Multiplication for Join-Project Operations (extended version)

Authors

Huang, Zichun; Chen, Shimin

Abstract

A Join-Project operation is a join operation followed by a duplicate-eliminating projection operation. It is used in a wide variety of applications, including entity matching, set analytics, and graph analytics. Previous work proposes a hybrid design that uses the classical solution (i.e., join and deduplication) and MM (matrix multiplication) to process the sparse and the dense portions of the input data, respectively. However, we observe three problems in the state-of-the-art solution: 1) the outputs of the sparse and dense portions overlap, requiring an extra deduplication step; 2) its table-to-matrix transformation makes an over-simplified assumption about the attribute values; and 3) there is a mismatch between the MM routines in BLAS packages and the characteristics of the Join-Project operation. In this paper, we propose DIM3, an optimized algorithm for the Join-Project operation. To address 1), we propose an intersection-free partition method that completely removes the final deduplication step. For 2), we develop an optimized design for mapping attribute values to natural numbers. For 3), we propose the DenseEC and SparseBMM algorithms, which exploit the structure of Join-Project for better efficiency. Moreover, we extend DIM3 to consider partial result caching and to support Join-op queries, including Join-Aggregate and MJP (Multi-way Joins with Projection). Experimental results on both real-world and synthetic data sets show that DIM3 outperforms previous Join-Project solutions by a factor of 2.3x to 18x. Compared to RDBMSs, DIM3 achieves orders-of-magnitude speedups.
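As a rough illustration of the two building blocks the abstract contrasts, the sketch below computes a Join-Project of R(x, y) and S(z, y) on y in two ways: a hash join followed by deduplication, and a Boolean matrix product over values mapped to natural numbers. It is a toy example under assumed names (`join_project_classic`, `join_project_mm`, `index_values`, and the sample relations are all hypothetical). It is not the DIM3 algorithm, and the bitset product is a simple stand-in, not DenseEC or SparseBMM.

```python
from collections import defaultdict


def join_project_classic(r, s):
    """Hash join R(x, y) with S(z, y) on y, then deduplicate the (x, z) pairs."""
    s_by_y = defaultdict(list)
    for z, y in s:
        s_by_y[y].append(z)
    # The set comprehension performs the duplicate-eliminating projection.
    return {(x, z) for x, y in r for z in s_by_y.get(y, ())}


def index_values(values):
    """Map distinct attribute values to natural numbers 0..n-1 (first-seen order)."""
    return {v: i for i, v in enumerate(dict.fromkeys(values))}


def join_project_mm(r, s):
    """Same result via a Boolean matrix product C = A x B, with A[x][y] and B[y][z]."""
    x_id = index_values(x for x, _ in r)
    z_id = index_values(z for z, _ in s)
    y_id = index_values([y for _, y in r] + [y for _, y in s])

    # Each row of B is a Python int used as a bitset over z indices.
    b_rows = [0] * len(y_id)
    for z, y in s:
        b_rows[y_id[y]] |= 1 << z_id[z]

    # Row x of C is the OR of the rows of B selected by the nonzeros in row x of A.
    c_rows = [0] * len(x_id)
    for x, y in r:
        c_rows[x_id[x]] |= b_rows[y_id[y]]

    x_vals = list(x_id)
    z_vals = list(z_id)
    return {
        (x_vals[i], z_vals[j])
        for i, row in enumerate(c_rows)
        for j in range(len(z_vals))
        if (row >> j) & 1
    }


# Hypothetical sample data: the join yields 5 tuples, 4 of them distinct.
r = [("a", 1), ("a", 2), ("b", 2), ("c", 3)]
s = [("u", 1), ("u", 2), ("v", 2), ("w", 4)]
result_classic = join_project_classic(r, s)
result_mm = join_project_mm(r, s)
```

In this toy input the pair ("a", "u") is produced through both y = 1 and y = 2, so the classical path has to deduplicate it, whereas the Boolean product never materializes the duplicate. That difference is why a matrix formulation tends to pay off on dense inputs, where many join paths lead to the same output pair.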
