Paper Title
Popularity Driven Data Integration
Paper Authors
Paper Abstract
With the growing focus on large-scale analytics, we are increasingly confronted with the need to integrate data from multiple sources. The problem is that these data are impossible to reuse as-is. The net result is high cost, with the further drawback that the resulting integrated data will again be hardly reusable as-is. iTelos is a general-purpose methodology that aims to minimize the effects of this process. The intuition is that data will be treated differently based on their popularity: the more a given set of data has been reused, the more it will be reused and the less it will change across reuses, thus decreasing overall data preprocessing costs while increasing backward compatibility and future sharing.