Answering "why did this happen?"

Data analysis is rarely a one-off linear process – it requires performing analyses, and carefully studying and understanding the results. The latter process is particularly challenging because analysts lack tools to help understand why analysis results look strange, contain outliers, or have patterns that differ from their expectations. Our lab develops tools and algorithms to provide user-understandable explanations.

Scorpion

Consider a simple analytic query that computes a company’s total expenses by month, and shows that last month’s expenses was unexpectedly high. The analyst will naturally want to understand why – perhaps the company has put more resources into a new customer demographic, or a department is overspending.

Currently, the analyst must manually split the input data along different dimensions (e.g., dept, customer age), and hope that re-running the query will cause changes. If there is more than one outlier, or many dimensions in the dataset, this ad-hoc process quickly becomes untenable.

Scorpion is a library that helps answer these “why” questions. The user simply selects examples of outlier and normal results and Scorpion will look for subsets of the input table that potentially explains the outliers and finds a combination of attribute values (predicate) that describes that subset. To find such predicates, we develop a score for how much influence a predicate on the outliers, and design efficient algorithms to find predicates with lots of influence.

Get Scorpion

Scorpion is a python library that can be installed using pip:

pip install scorpion

After Scorpion is installed, DBWipes will be able to use the library to interactively explain outliers in its visualization.

Check out the github page for code and usage instructions.

QFix

Traditional database cleaning and explanation research has focused on techniques such as anomaly/outlier detection, dimensionality reduction, data summarization, predicate synthesis, ETL, and others, as ways to prevent, detect, or resolve (clean) data errors.

However, these data oriented approaches ignore a key vector of data errors – queries! The majority of user facing systems – billing, HR, customer service, web application – generate queries based on user submissions. Can we use the sequence of past queries from the transaction log to both find and repair incorrect errors due to user slip-ups?

QFix is a fundamentally novel approach towards getting to the root cause of errors in your database. Instead of describing data errors by examining data in the database, explains and fixes data errors by analyzing your past queries and suggests ways to repair the incorrect queries that it finds.

DBWipes

DBWipes is a lightweight interactive visualization tool that lets users run aggregation queries over a database and interactively filter the dataset along different dimensions. Users can execute SQL GROUP-BY queries through a simple form interface, and render the results as a scatterplot. DBWipes renders a histogram of each attribute that users use to select and filter subsets of the attribute’s values. Using this tool, users can easily compare the results of an aggregation query over different subsets of the data.

DBWipes currently connects to a PostgreSQL backend, which executes the queries that are issued by the interface.

Get DBWipes

DBWipes is a python web-application can be installed using pip:

pip install dbwipes

Check out the github page for code and usage instructions

Publications

  1. Skipping-oriented Partitioning for Columnar Layouts
    Liwen Sun, Michael J. Franklin, Jiannan Wang, Eugene Wu
    VLDB 2017
  2. QFix: Diagnosing errors through query histories
    Xiaolan Wang, Alexandra Meliou, Eugene Wu
    SIGMOD 2017
  3. QFix: Demonstrating error diagnosis in query histories
    Xiaolan Wang, Alexandra Meliou, Eugene Wu
    SIGMOD 2016 Demo
  4. QFix: Diagnosing errors through query histories
    Xiaolan Wang, Alexandra Meliou, Eugene Wu
    Arxiv 2016
  5. Scorpion: Explaining Away Outliers in Aggregate Queries
    Eugene Wu, Samuel Madden
    VLDB 2013 (Selected as one of the best papers of the conference!) Slides
  6. SubZero: a Fine-Grained Lineage System for Scientific Databases
    Eugene Wu, Samuel Madden, Michael Stonebraker
    ICDE 2013 (Best of conference)
  7. Partitioning Techniques for Fine-Grained Indexing
    Eugene Wu, Sam Madden
    ICDE 2011
  8. No Bits Left Behind
    Eugene Wu, Carlo Curino, Sam Madden
    CIDR 2011
  9. Relational Cloud: A Database-as-a-Service for the Cloud
    Carlo Curino, Evan Jones, Raluca Popa, Nirmesh Malviya, Eugene Wu, Sam Madden, Hari Balakrishnan, Nickolai Zeldovich
    CIDR 2011
  10. Relational Cloud: The Case for a Database Service
    Carlo Curino, Evan Jones, Yang Zhang, Eugene Wu, Sam Madden
  11. TrajStore: An Adaptive Storage System for Very Large Trajectory Data Sets
    Philippe Cudre-Mauroux, Eugene Wu, Sam Madden
    ICDE 2010
  12. Demonstration of the TrajStore System
    Eugene Wu, Philippe Cudre-Mauroux, Sam Madden
    VLDB 2009
  13. The Case for RodentStore: An Adaptive, Declarative Storage System
    Philippe Cudre-Mauroux, Eugene Wu, Sam Madden
    CIDR 2009
  14. WebTables: Exploring the Power of Tables on the Web
    Michael Cafarella, Alon Halevy, Daisy Wang, Eugene Wu, Yang Zhang
    VLDB 2008
  15. Uncovering the Relational Web
    Michael Cafarella, Nodira Khoussainova, Daisy Wang, Eugene Wu, Yang Zhang, Alon Halevy
    WebDB 2008
  16. SASE: Complex Event Processing over Streams (Demo)
    Daniel Gyllstrom, Eugene Wu, Hee-Jin Chae, Yanlei Diao, Patrick Stahlberg, Gordon Anderson
    CIDR 2007
  17. High-performance complex event processing over streams
    Eugene Wu, Yanlei Diao, Shariq Rizvi
    SIGMOD 2006
  18. SASE: Complex Event Processing over Streams
    Daniel Gyllstrom, Eugene Wu, Hee-Jin Chae, Yanlei Diao, Patrick Stahlberg, Gordon Anderson
    CoRR 2006
  19. Probabilistic Data Management for Pervasive Computing: The Data Furnace Project
    Minos N. Garofalakis, Kurt P. Brown, Michael J. Franklin, Joseph M. Hellerstein, Daisy Zhe Wang, Eirinaios Michelakis, Liviu Tancau, Eugene Wu, Shawn R. Jeffery, Ryan Aipperspach
    IEEE Data Eng. Bull.
  20. Design Considerations for High Fan-In Systems: The HiFi Approach
    Michael J. Franklin, Shawn R. Jeffery, Sailesh Krishnamurthy, Frederick Reiss, Shariq Rizvi, Eugene Wu, Owen Cooper, Anil Edakkunni, Wei Hong
    CIDR 2005
  21. HiFi: A Unified Architecture for High Fan-in Systems
    Owen Cooper, Anil Edakkunni, Michael J. Franklin, Wei Hong, Shawn R. Jeffery, Sailesh Krishnamurthy, Frederick Reiss, Shariq Rizvi, Eugene Wu
    VLDB 2004 Demo