High-quality data acquisition with humans

Web services such as Yelp, Quora, and Amazon rely on collecting high-quality human-generated content (reviews, questions, answers, opinions) to present to and share with their communities of users. Similarly, machine learning applications such as fake news detection, structured data extraction, and self-driving cars rely on high-quality training data labeled by humans. The key challenge in these applications is to quickly collect high-quality data from humans.

We have developed human-powered database and workflow systems that answer complex, open-world SQL queries using a combination of database operators and crowdsourced tasks. Two such systems, Qurk and Wisteria, focus on SQL query processing and data cleaning pipelines, respectively.
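As a rough illustration of mixing database operators with crowdsourced tasks, the sketch below chains an ordinary machine-evaluated selection with a selection whose predicate is answered by people. It is not the interface of Qurk or Wisteria: the function names, the `ask_worker` callback, the three-vote majority rule, and the sample rows are all assumptions made for the example, and the "worker" is simulated by reading a ground-truth field.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


def machine_filter(rows: List[Row], predicate: Callable[[Row], bool]) -> List[Row]:
    """Ordinary database-style selection, evaluated by the machine."""
    return [row for row in rows if predicate(row)]


def crowd_filter(
    rows: List[Row],
    question: str,
    ask_worker: Callable[[str, Row], bool],
    votes: int = 3,
) -> List[Row]:
    """Selection whose predicate is answered by people.

    ask_worker is a hypothetical stand-in for posting one yes/no task
    to a crowd platform. Each row is asked `votes` times and kept only
    if a strict majority of answers is yes (an assumed quality rule).
    """
    kept = []
    for row in rows:
        answers = [ask_worker(question, row) for _ in range(votes)]
        if sum(answers) * 2 > votes:
            kept.append(row)
    return kept


def simulated_worker(question: str, row: Row) -> bool:
    # Stand-in for a human answer: reads a ground-truth field.
    return bool(row["is_restaurant"])


# Hypothetical sample data.
rows = [
    {"name": "Cafe A", "rating": 4.5, "is_restaurant": True},
    {"name": "Shop B", "rating": 4.8, "is_restaurant": False},
    {"name": "Diner C", "rating": 2.0, "is_restaurant": True},
]

# Run the machine predicate first, then ask people about the survivors.
highly_rated = machine_filter(rows, lambda r: r["rating"] >= 4.0)
result = crowd_filter(highly_rated, "Is this business a restaurant?", simulated_worker)
names = [row["name"] for row in result]
```

In this sketch the machine predicate runs before the crowd predicate, so only two of the three rows generate crowd tasks. Whether that ordering is the right choice in a real system depends on the cost and selectivity of each step.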

We also develop techniques, such as CLAMShell, that reduce large-scale data acquisition latencies from minutes or hours to seconds, as well as techniques that directly improve the quality of the collected data by providing targeted, context-specific feedback in the data collection interface.

Publications

  1. PreCog: Improving Crowdsourced Data Quality Before Acquisition
    Hamed Nilforoshan, Jiannan Wang, Eugene Wu
    arXiv 2017
  2. Segment-Predict-Explain for Automatic Writing Feedback
    Hamed Nilforoshan, James Sands, Kevin Lin, Rahul Khanna, Eugene Wu
    Collective Intelligence 2017
  3. Dialectic: Enhancing Text Input Fields with Automatic Feedback to Improve Social Content Writing Quality
    Hamed Nilforoshan, James Sands, Kevin Lin, Rahul Khanna, Eugene Wu
    arXiv 2017
  4. CLAMShell: Speeding up Crowds for Low-latency Data Labeling
    Daniel Haas, Jiannan Wang, Eugene Wu, Michael J. Franklin
    VLDB 2016
  5. Wisteria: Nurturing Scalable Data Cleaning Infrastructure (Demo)
    Daniel Haas, Sanjay Krishnan, Jiannan Wang, Michael J. Franklin, Eugene Wu
    VLDB 2015
  6. Human-powered Sorts and Joins
    Adam Marcus, Eugene Wu, David Karger, Samuel Madden, Robert Miller
    VLDB 2012
  7. Demonstration of Qurk: A Query Processor for Human Operators
    Adam Marcus, Eugene Wu, David Karger, Samuel Madden, Robert Miller
    SIGMOD 2011
  8. Crowdsourced Databases: Query Processing with People
    Adam Marcus, Eugene Wu, Sam Madden, Robert Miller
    CIDR 2011