The holy grail of data cleaning is a system that does it for you. BoostClean is our first step towards this vision.

Many data science projects and companies follow the same process: collect (dirty) data from a variety of domains, perform extensive data cleaning, and then develop a model. Ideally, the data cleaning effort improves the quality of the model (but as ActiveClean showed, this may not be the case!). In addition, a huge amount of time is spent on simple but tedious tasks such as outlier removal, duplicate elimination, and imputation. These tasks are structurally similar across projects, but differ in their details for every domain and dataset. Thus the data scientist goes through a list of data cleaning functions (e.g., Python cleaning functions), manually checks whether each one applies, and, if so, works out how to parameterize it.

BoostClean attempts to automate this process by treating it as a boosting problem. Each data cleaning operation effectively adds a new cleaning feature to the input of the downstream ML model, and a combination of boosting and feature selection can then identify a sequence of cleaning operations that best improves the ML model!
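To make the boosting view concrete, here is a minimal sketch, not the BoostClean implementation. It assumes a toy one-feature dataset, a hypothetical placeholder value `MISSING` for corrupted readings, three made-up candidate cleaning operations, and decision stumps as weak learners. In each AdaBoost-style round it trains one stump per cleaning operation and keeps the one with the lowest weighted error. The feature selection step mentioned above is not covered.

```python
import math

MISSING = 999.0  # hypothetical placeholder code for a corrupted reading

# Toy records: (raw feature, label in {-1, +1}). Two low readings are corrupted.
DATA = [(1.0, -1), (2.0, -1), (MISSING, -1), (MISSING, -1),
        (7.0, +1), (8.0, +1), (9.0, +1), (10.0, +1)]

# Hypothetical candidate cleaning operations: each repairs one raw feature value.
CANDIDATES = {
    "no_cleaning": lambda x: x,
    "replace_with_zero": lambda x: 0.0 if x == MISSING else x,
    "replace_with_eight": lambda x: 8.0 if x == MISSING else x,
}


def fit_stump(xs, ys, ws):
    """Return (weighted_error, threshold, sign) of the best stump.

    The stump predicts sign * (+1 if x > threshold else -1).
    """
    values = sorted(set(xs))
    thresholds = [values[0] - 1.0] + [(a + b) / 2 for a, b in zip(values, values[1:])]
    best = None
    for t in thresholds:
        for sign in (+1, -1):
            err = sum(w for x, y, w in zip(xs, ys, ws)
                      if sign * (1 if x > t else -1) != y)
            if best is None or err < best[0]:
                best = (err, t, sign)
    return best


def predict_one(model, x):
    """Clean the raw value with the model's own cleaning operation, then apply its stump."""
    clean, t, sign = model
    return sign * (1 if clean(x) > t else -1)


def boost_clean(data, candidates, rounds=3):
    """AdaBoost-style loop whose weak learners differ in the cleaning they apply."""
    ys = [y for _, y in data]
    ws = [1.0 / len(data)] * len(data)
    ensemble = []  # entries are (alpha, cleaning operation name, model)
    for _ in range(rounds):
        scored = []
        for name, clean in candidates.items():
            err, t, sign = fit_stump([clean(x) for x, _ in data], ys, ws)
            scored.append((err, name, (clean, t, sign)))
        err, name, model = min(scored, key=lambda s: s[0])
        if err >= 0.5:      # no candidate beats chance on the current weights
            break
        perfect = err == 0
        err = max(err, 1e-9)  # avoid division by zero for a perfect learner
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, name, model))
        if perfect:
            break
        # Increase the weight of records the chosen learner got wrong.
        ws = [w * math.exp(-alpha * y * predict_one(model, x))
              for (x, y), w in zip(data, ws)]
        total = sum(ws)
        ws = [w / total for w in ws]
    return ensemble


def predict(ensemble, x):
    """Weighted vote of all selected (cleaning operation, stump) pairs."""
    score = sum(alpha * predict_one(model, x) for alpha, _, model in ensemble)
    return 1 if score > 0 else -1


ensemble = boost_clean(DATA, CANDIDATES)
print([name for _, name, _ in ensemble])
```

On this toy data only the stump trained after `replace_with_zero` separates the two classes, so it is the operation the loop selects. With noisier data, later rounds could add stumps built on other cleaning operations to the ensemble.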

ActiveClean

Databases can be corrupted by various errors such as missing, incorrect, or inconsistent values. Increasingly, modern data analysis pipelines involve Machine Learning, and the effects of dirty data can be difficult to debug. Dirty data is often sparse, and naive sampling solutions are not suited for high-dimensional models. The following figures show how partial or naive data cleaning can degrade the machine learning model.

Shows how systematic corruption of data (from circles to crosses) can lead to a shifted, incorrect model.

Illustrates the true model if the full dataset were cleaned.

Shows how combining two cleaned records (blue) with the dirty records leads to a worse model than no cleaning.

Shows how only using the two cleaned records can also result in a worse model due to sampling error.
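The mixing effect can be reproduced with a small least-squares example. This is a toy construction for illustration, not the data behind the figures: it assumes a noise-free relationship y = x, a systematic error that adds 10 to every y value, and a cleaning pass that repairs only the last two records.

```python
def fit_line(points):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


clean = [(float(x), float(x)) for x in range(1, 11)]  # true relationship: y = x
dirty = [(x, y + 10.0) for x, y in clean]             # systematic error: every y is 10 too high
mixed = dirty[:8] + clean[8:]                         # only the last two records are cleaned

fits = {name: fit_line(points)
        for name, points in [("dirty", dirty), ("clean", clean), ("mixed", mixed)]}
for name, (slope, intercept) in fits.items():
    print(f"{name}: slope={slope:.2f}, intercept={intercept:.2f}")
```

In this toy setup the all-dirty fit still recovers the true slope of 1 (only its intercept is off), while the mixed fit is almost flat: the two cleaned records pull the right end of the line down. Because the example is noise-free, it does not show the sampling error that arises when fitting on the two cleaned records alone.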

ActiveClean is an iterative cleaning framework that correctly retrains the machine learning model as data is cleaned, and provides a set of optimizations for selecting the best data to clean next. As a result, you only need to clean a small subset of the data to produce a model similar to the one you would obtain by cleaning the full dataset.
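The iterative loop can be sketched as follows. This is a simplified illustration under stated assumptions, not the ActiveClean codebase: it assumes a one-parameter least-squares model y = w * x, uniform random sampling of records to clean (none of the selection optimizations), and a hypothetical `clean_fn` oracle standing in for the analyst. The way the two gradient terms are weighted is also an assumption made for this sketch.

```python
import random


def grad(w, records):
    """Mean gradient of the squared loss 0.5 * (w * x - y) ** 2 over records."""
    if not records:
        return 0.0
    return sum((w * x - y) * x for x, y in records) / len(records)


def active_clean(dirty, clean_fn, w0, batch=2, steps=5, lr=0.5, seed=0):
    """Alternate between cleaning a small batch and taking one gradient step."""
    rng = random.Random(seed)
    todo = list(dirty)  # records that have not been cleaned yet
    done = []           # records that have been cleaned
    w = w0
    n = len(dirty)
    for _ in range(steps):
        if not todo:
            break
        rng.shuffle(todo)
        sample, todo = todo[:batch], todo[batch:]
        cleaned = [clean_fn(record) for record in sample]
        # Already-cleaned records count in proportion to their share of the data;
        # the fresh sample stands in for everything that was still uncleaned.
        frac_done = len(done) / n
        g = frac_done * grad(w, done) + (1 - frac_done) * grad(w, cleaned)
        w -= lr * g
        done.extend(cleaned)
    return w, done


# Toy data: the true model is y = 2 * x, and every other record has y zeroed out.
xs = [(i + 1) / 10 for i in range(10)]
dirty = [(x, 0.0 if i % 2 == 0 else 2 * x) for i, x in enumerate(xs)]

# Starting point: closed-form least-squares fit on the dirty data.
w_dirty = sum(x * y for x, y in dirty) / sum(x * x for x, _ in dirty)

# Clean 3 batches of 2 records (6 of 10 records) and update the model after each batch.
w_final, cleaned = active_clean(dirty, lambda r: (r[0], 2 * r[0]), w_dirty, steps=3)
print(f"dirty fit: w={w_dirty:.3f}, after cleaning {len(cleaned)} records: w={w_final:.3f}")
```

Every gradient in this sketch is computed on cleaned values only, so dirty and cleaned records are never mixed in a single fit, and each step moves the parameter from the dirty starting point toward the true value of 2.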

Code

The ActiveClean codebase is written in Python and includes the core ActiveClean algorithm, a data cleaning benchmark, and (in the future) a dirty data detector:

The Data Cleaning Benchmark automatically injects errors into your datasets to test how robust your machine learning models are to them. It can be installed using pip.
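As a rough illustration of what error injection means, here is a minimal sketch. It is not the benchmark's API or its install command: the function names, the `rate` and `scale` parameters, and the toy rows are all hypothetical.

```python
import random


def inject_missing(rows, column, rate, seed=0):
    """Return a copy of rows with about `rate` of the values in `column` set to None."""
    rng = random.Random(seed)
    n_corrupt = round(rate * len(rows))
    targets = set(rng.sample(range(len(rows)), n_corrupt))
    return [{**row, column: None} if i in targets else dict(row)
            for i, row in enumerate(rows)]


def inject_outliers(rows, column, rate, scale=100.0, seed=0):
    """Return a copy of rows with about `rate` of the values in `column` multiplied by `scale`."""
    rng = random.Random(seed)
    n_corrupt = round(rate * len(rows))
    targets = set(rng.sample(range(len(rows)), n_corrupt))
    return [{**row, column: row[column] * scale} if i in targets else dict(row)
            for i, row in enumerate(rows)]


# Hypothetical toy table with ten rows.
rows = [{"age": 20 + i, "income": 1000.0 * (i + 1)} for i in range(10)]

with_missing = inject_missing(rows, "income", rate=0.3)
with_outliers = inject_outliers(rows, "income", rate=0.2)

print(sum(r["income"] is None for r in with_missing), "missing values injected")
print(sum(a["income"] != b["income"] for a, b in zip(rows, with_outliers)), "outliers injected")
```

A robustness test would then train the same model on `rows` and on each corrupted copy and compare their accuracy on held-out clean data.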

News

Collaborators

ActiveClean is a collaboration between the WuLab at Columbia University, the AMPLab at University of California, Berkeley, and Jiannan Wang at Simon Fraser University.

Publications

From Cleaning Before ML to Cleaning For ML
Felix Neutatz, Binger Chen, Ziawasch Abedjan, Eugene Wu
Invited, IEEE Data Engineering Bulletin 2021

ActiveDeeper: A Model-based Active Data Enrichment system
Liang Zhao, Qingcan Li, Pei Wang, Jiannan Wang, Eugene Wu
VLDB 2020 demo

Towards Complaint-driven ML Workflow Debugging
Lampros Flokas, Young Wu, Jiannan Wang, Eugene Wu
MLOps 2020

AlphaClean: Automatic Generation of Data Cleaning Pipelines
Sanjay Krishnan, Eugene Wu
arXiv 2019

Deeper: A Data Enrichment System Powered by Deep Web
Pei Wang, Yongjun He, Ryan Shea, Jiannan Wang, Eugene Wu
SIGMOD (demo) 2018

BoostClean: Automated Error Detection and Repair for Machine Learning
Sanjay Krishnan, Michael J. Franklin, Ken Goldberg, Eugene Wu
Tech Report 2017

Towards Reliable Interactive Data Cleaning: A User Survey and Recommendations
Sanjay Krishnan, Daniel Haas, Michael J. Franklin, Eugene Wu
HILDA 2016

ActiveClean: An Interactive Data Cleaning Framework For Modern Machine Learning
Sanjay Krishnan, Michael J. Franklin, Ken Goldberg, Jiannan Wang, Eugene Wu
SIGMOD 2016 Demo (Demo Award Winner!)

ActiveClean: Interactive Data Cleaning While Learning Convex Loss Models
Sanjay Krishnan, Jiannan Wang, Eugene Wu, Michael J. Franklin, Ken Goldberg
arXiv 2016

SampleClean: Fast and Reliable Analytics on Dirty Data (overview paper)
Sanjay Krishnan, Jiannan Wang, Michael J. Franklin, Ken Goldberg, Tim Kraska, Tova Milo, Eugene Wu
IEEE Data Eng. Bulletin 2015

A Demonstration of DBWipes: Clean as You Query
Eugene Wu, Samuel Madden, Michael Stonebraker
VLDB 2012
