Combining Sampling and Synopses with Worst-Case Optimal Runtime and Quality Guarantees for Graph Pattern Cardinality Estimation

Combining Sampling and Synopses with Worst-Case Optimal Runtime and Quality Guarantees for Graph Pattern Cardinality Estimation
发布时间：2022-06-23

报告时间：2022年6月23日（周四）上午09:30

报告地点：在线报告

报告人：Wook-Shin Han

报告题目：Combining Sampling and Synopses with Worst-Case Optimal Runtime and Quality Guarantees for Graph Pattern Cardinality Estimation

Abstract: Graph pattern cardinality estimation is the problem of estimating the number of embeddings |M| of a query graph in a data graph. This fundamental problem arises, for example, during query planning in subgraph matching algorithms. There are two major approaches to solving the problem: sampling and synopsis. Synopsis (or summary)-based methods are fast and accurate if synopses capture information of graphs well. However, these methods suffer from large errors due to loss of information during summarization and inherent assumptions. Sampling-based methods are unbiased but suffer from large estimation variance due to large sample space.

To address these limitations, we propose Alley, a hybrid method that combines both sampling and synopses. Alley employs 1) a novel sampling strategy, random walk with intersection, which effectively reduces the sample space, 2) branching to further reduce variance, and 3) a novel mining approach that extracts and indexes tangled patterns as synopses which are inherently difficult to estimate by sampling. By using them in the online estimation phase, we can effectively reduce the sample space while still ensuring unbi-asedness. We establish that Alley has worst-case optimal runtime and approximation quality guarantees for any given error bound and required confidence. In addition to the theoretical aspect of Alley, our extensive experiments show that Alley outperforms the state-of-the-art methods by up to orders of magnitude higher accuracy with similar efficiency.

Biography: Professor Wook-Shin Han leads the database group at POSTECH. He is currently a Full Professor in the Graduate School of Artificial Intelligence, the Department of Convergence IT Engineering, and the Department of Computer Science and Engineering in POSTECH. He obtained his Ph.D. from KAIST in 2001. His primary research efforts have been devoted to developing new techniques in DBMS "engine research." He has developed an object-relational DBMS supporting multiple language bindings. He has also developed the tight coupling technology of DBMS with IR features during his Ph.D. study. At the IBM Almaden Research Center, he has developed progressive query optimization in the parallel DB2 as a postdoc. He also invented the new concept of "parallelizing query optimization" for faster query compilation by exploiting multi-core architecture. Recently, he has developed three systems called TurboGraph++ (SIGMOD2018), iTurboGraph (SIGMOD2021), and TurboFlux (SIGMOD2018) for trillion-scale, incremental graph analytics. He e tensively published at major international journals and conferences, including SIGMOD, VLDB, KDD, ICDE, WWW, IEEE Transactions on Knowledge and Data Engineering (TKDE), and VLDB Journal. He regularly serves as a PC member for SIGMOD, VLDB, and ICDE. He serves/served as an associate editor of several international journals including the VLDB Journal, SIGMOD Record, IEEE TKDE, and Information Sciences. He also serves/served as a trustee for VLDB Endowment, a PC area vice chair for ICDE 2019, and an industrial co-chair for ICDE 2015.

Combining Sampling and Synopses with Worst-Case Optimal Runtime and Quality Guarantees for Graph Pattern Cardinality Estimation 发布时间：2022-06-23

Combining Sampling and Synopses with Worst-Case Optimal Runtime and Quality Guarantees for Graph Pattern Cardinality Estimation
发布时间：2022-06-23