High-quality data acquisition with humans

Web services such as Yelp, Quora, and Amazon rely on collecting high-quality human-generated content – reviews, questions, answers, opinions – to present to and share with their communities of users. Similarly, machine learning applications such as fake news detection, structured data extraction, and self-driving cars rely on high-quality training data labeled by humans. The key challenge in these applications is to quickly collect high-quality data from humans.

We have developed human-powered database and workflow systems that answer complex, open-world SQL queries using a combination of database operators and crowdsourced tasks. Two such systems, Qurk and Wisteria, focus on SQL query processing and data cleaning pipelines, respectively.

We also develop techniques such as CLAMShell that reduce large-scale data acquisition latencies from minutes or hours to seconds, as well as techniques that directly improve the quality of the collected data by providing targeted, context-specific feedback in the data collection interface.

Publications

  1. PopFactor: Live-Streamer Behavior and Popularity
    Robert Netzorg, Lauren Arnett, Augustin Chaintreau, Eugene Wu
    ICWSM 2021
  2. Progressive Deep Web Crawling Through Keyword Queries For Data Enrichment
    Pei Wang, Jiannan Wang, Ryan Shea, Eugene Wu
    SIGMOD 2019
  3. Cross-platform Interactions and Popularity in the Live-streaming Community
    Lauren Arnett, Robert Netzorg, Augustin Chaintreau, Eugene Wu
    CHI Latebreaking 2019
  4. Leveraging Quality Prediction Models for Automatic Writing Feedback
    Hamed Nilforoshan, Eugene Wu
    ICWSM 2018
  5. PreCog: Improving Crowdsourced Data Quality Before Acquisition
    Hamed Nilforoshan, Jiannan Wang, Eugene Wu
    arXiv 2017
  6. Segment-Predict-Explain for Automatic Writing Feedback
    Hamed Nilforoshan, James Sands, Kevin Lin, Rahul Khanna, Eugene Wu
    Collective Intelligence 2017
  7. Dialectic: Enhancing Text Input Fields with Automatic Feedback to Improve Social Content Writing Quality
    Hamed Nilforoshan, James Sands, Kevin Lin, Rahul Khanna, Eugene Wu
    arXiv 2017
  8. CLAMShell: Speeding up Crowds for Low-latency Data Labeling
    Daniel Haas, Jiannan Wang, Eugene Wu, Michael J. Franklin
    VLDB 2016
  9. Wisteria: Nurturing Scalable Data Cleaning Infrastructure
    Daniel Haas, Sanjay Krishnan, Jiannan Wang, Michael J. Franklin, Eugene Wu
    VLDB 2015 demo
  10. Human-powered Sorts and Joins
    Adam Marcus, Eugene Wu, David Karger, Samuel Madden, Robert Miller
    VLDB 2012
  11. Demonstration of Qurk: A Query Processor for Human Operators
    Adam Marcus, Eugene Wu, David Karger, Samuel Madden, Robert Miller
    SIGMOD 2011
  12. Crowdsourced Databases: Query Processing with People
    Adam Marcus, Eugene Wu, Sam Madden, Robert Miller
    CIDR 2011