NSF III-2312991

Machine-learning tools have become ubiquitous in modern information systems. The data inputs used by these tools often originate from relational databases. The outputs generated by those tools are often stored in databases, where they can be used for subsequent data analysis. Typically, however, the learning process itself is performed outside the database system. This project investigates opportunities to perform more of the machine learning work within the database itself, avoiding expensive (and often redundant) data export and import. In partnership with researchers from Relational-AI and Microsoft, the Columbia University team will design and build two interacting open-source systems named MARQUE and ZORK. These systems will make data analysis more efficient and effective for database-resident information. Improved efficiency will lead to faster, more cost-effective machine learning, and executing ML within the DBMS will reduce operational complexity and benefit from DBMS features such as scalability, access control, and data management. Ultimately, this work will broaden the adoption of machine learning technologies in a wide range of data-intensive disciplines.

MARQUE will be a database management system that supports machine learning primitives such as linear algebra operations within the context of a query processing engine. The system will efficiently compile SQL queries using embedded machine learning models within the database, combining state-of-the-art query processing techniques with highly engineered linear algebra algorithms. MARQUE will allow components of the machine-learning pipeline itself to be formulated as in-database operations, avoiding unnecessary data copying. Conventional SQL analytic queries that can be reformulated using extensions of operators like matrix multiplication can be optimized to use efficient execution plans involving specialized algorithms for such operators. To further support in-database machine learning, the project investigators will build ZORK, a system to support machine learning at scale that will make use of the infrastructure provided by MARQUE. ZORK will scale to very large datasets by processing factorized representations of the data rather than explicitly materializing large joins. This project will develop new and innovative query processing techniques for queries involving both conventional relational operators and generalized linear algebra operators. Tight integration will facilitate query optimization within and between operators. Using this system, a range of machine learning techniques will be developed that operate entirely within the database management system, avoiding data export and simplifying concerns such as data privacy administration.
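The idea of processing factorized data instead of materializing a join can be illustrated with a small sketch. The code below is not taken from MARQUE or ZORK; the relation layout, function names, and sample data are hypothetical and chosen only for illustration. It computes SUM(x * y) over the join of R(k, x) and S(k, y) in two ways: by first building every joined row, and by pushing per-key partial sums below the join so that the joined rows are never created.

```python
from collections import defaultdict


def sum_product_materialized(r, s):
    """Baseline: build every joined (x, y) pair, then aggregate SUM(x * y)."""
    joined = [(x, y) for (kr, x) in r for (ks, y) in s if kr == ks]
    return sum(x * y for x, y in joined)


def sum_product_factorized(r, s):
    """Same aggregate without building the join.

    For each key k, the sum over all joined pairs of x * y equals
    (sum of x in R for k) * (sum of y in S for k), so one pass over
    each relation is enough.
    """
    sum_x = defaultdict(float)
    sum_y = defaultdict(float)
    for k, x in r:
        sum_x[k] += x
    for k, y in s:
        sum_y[k] += y
    # Only keys present in both relations contribute to the join.
    return sum(sum_x[k] * sum_y[k] for k in sum_x.keys() & sum_y.keys())


# Hypothetical sample relations as (key, value) tuples.
R = [(1, 2.0), (1, 3.0), (2, 4.0)]
S = [(1, 10.0), (1, 20.0), (2, 5.0), (3, 7.0)]

if __name__ == "__main__":
    print(sum_product_materialized(R, S))
    print(sum_product_factorized(R, S))
```

Both functions return the same value, but the factorized version does work proportional to the sizes of R and S rather than to the size of their join. The same per-key sum-of-products pattern is what a matrix multiplication computes over its shared dimension, which is one way to see why join-aggregate queries and linear algebra operators can share execution strategies.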

Principal Investigators

Open Source Software

Publications

  1. FaDE: More Than a Million What-ifs Per Second
    Haneen Mohammed, Alexander Yao, Lampros Flokas, Charlie Summers, Hongbin Zhong, Gromit Chan, Subrata Mitra, Eugene Wu
    In Review 2024
  2. SET: Searching Effective Supervised Learning Augmentations in Large Tabular Data Repositories
    Jerry Liu, Zachary Huang, Eugene Wu
    GUIDEAI Workshop at SIGMOD 2024
  3. Lightweight Materialization for Fast Dashboards Over Joins
    Zachary Huang, Eugene Wu
    SIGMOD 2024
  4. The Fast and the Private: Task-based Dataset Search
    Zezhou Huang, Jiaxiang Liu, Haonan Wang, Eugene Wu
    CIDR 2024
  5. Saibot: A Differentially Private Data Search Platform
    Zezhou Huang, Jiaxiang Liu, Daniel Alabi, Raul Castro Fernandez, Eugene Wu
    VLDB 2023