Modern database management systems (DBMSs) employ sophisticated query optimization techniques that enable the generation of efficient execution plans for queries over very large data sets. A variety of other applications also process large data sets, but cannot leverage database-style query optimization for their code. In this project, an open-source programming language compiler will be enhanced with database-style query optimization capabilities. Data-intensive parts of ordinary programs will be executed in chunks using different execution plans. Based on feedback from earlier chunks, alternative plans might be used for later chunks. The compiler could be used for a variety of data-intensive applications, allowing all of them to benefit from this class of performance optimizations.

Existing query optimization techniques for in-memory processing are limited in several ways: (a) they are not extensively used outside relational database management systems; (b) they are limited to a handful of relational operators, and do not cover access patterns or dynamically-defined functions found in other data-analysis scenarios; (c) they treat the underlying compiler as a black-box, with unpredictable performance depending on which compiler is used with which compiler settings; (d) they often bake-in design choices that may be appropriate for usage within a particular DBMS, but not for more general cases. This project directly addresses these challenges by optimizing data-analysis style queries expressed as tight loops in a conventional imperative programming language. This project will extend an open-source compiler (GraalVM/Truffle) with both known and novel optimization techniques that will automatically be applied whenever the compiler identifies that a loop is time-consuming. Integration into the compiler allows many applications to efficiently process large data sets. The system will support dynamic queries involving user-defined functions and arbitrary access patterns. Database-style and compiler optimizations will co-exist in one system, eliminating some of the mismatches that happen when the compiler is used as a black-box by a DBMS. The system will tune a variety of run-time execution parameters automatically, with minimal guidance from the programmer. The extended compiler will be validated by developing an interactive data exploration application that will allow users to dynamically specify and analyze a variety of large in-memory datasets.

Principal Investigators

Open Source Software

Publications

JoinBoost: Grow Trees Over Normalized Data Using Only SQL
Zezhou Huang, Rathijit Sen, Jiaxiang Liu, Eugene Wu
VLDB 2023
Random Forests over Normalized Data in CPU-GPU DBMSes
Zezhou Huang, Pavan Kalyan, Rathijit Sen, Eugene Wu
DaMoN at SIGMOD 2023
SmokedDuck Demonstration: SQLStepper
Haneen Mohammed, Charlie Summers, Sughosh Kaushik, Eugene Wu
SIGMOD Demo 2023
Amulet: Adaptive matrix-multiplication-like tasks
Junyoung Kim, Kenneth A Ross, Eric Sedlar, Lukas Stadler
DaMoN workshop 2023
GAMUT: Matrix Multiplication-Like Tasks on GPUs
Xincheng Xie, Junyoung Kim and Kenneth Ross
ADMS workshop 2023
DIEL: Interactive Visualization Beyond the Here and Now
Yifan Wu, Remco Chang, Joe Hellerstein, Arvind Satyanarayan, and Eugene Wu
VIS 2021
Adaptive Code Generation for Data-Intensive Analytics
Wangda Zhang, Junyoung Kim, Kenneth A. Ross, Eric Sedlar, Lucas Stadler
VLDB 2021

Query Optimization for Data Intensive Applications

Principal Investigators

Open Source Software

Publications