Web services such as Yelp, Quora, and Amazon rely on collecting high-quality human-generated content (reviews, questions, answers, opinions) and sharing it with their communities of users. Similarly, machine learning applications such as fake news detection, structured data extraction, and self-driving cars rely on high-quality training data labeled by humans. The key challenge in these applications is to collect high-quality data from humans quickly.
We have developed human-powered database and workflow systems that answer complex, open-world SQL queries by combining database operators with crowdsourced tasks. Two such systems, Qurk and Wisteria, focus on SQL query processing and data cleaning pipelines, respectively.
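To make the idea of mixing database operators with crowdsourced tasks concrete, here is a minimal sketch of a query that applies a cheap machine predicate first and then a crowd-evaluated predicate. It is an illustration under stated assumptions, not the Qurk or Wisteria API: the names `majority_vote`, `crowd_filter`, and `hybrid_query`, the `rating` field, and the use of plain Python callables as stand-ins for crowd workers are all hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for a crowd worker: given a row, answer yes or no.
Row = Dict[str, object]
Worker = Callable[[Row], bool]


def majority_vote(answers: List[bool]) -> bool:
    """Return True when strictly more than half of the answers are True."""
    return sum(answers) * 2 > len(answers)


def crowd_filter(rows: List[Row], workers: List[Worker]) -> List[Row]:
    """Keep the rows that a majority of the (simulated) workers accept."""
    return [row for row in rows if majority_vote([w(row) for w in workers])]


def hybrid_query(rows: List[Row], min_rating: int, workers: List[Worker]) -> List[Row]:
    """Apply a machine predicate, then a crowd predicate.

    Running the cheap machine filter first means fewer rows are sent to
    the crowd, which is the expensive and slow part of the plan.
    """
    candidates = [r for r in rows if r["rating"] >= min_rating]
    return crowd_filter(candidates, workers)
```

In a real deployment the workers would be people answering tasks on a crowdsourcing platform, and redundancy plus voting is only one of several ways to handle noisy answers.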
We also develop techniques such as CLAMShell that reduce large-scale data acquisition latency from minutes or hours to seconds, as well as techniques that directly improve the quality of collected data by providing targeted, context-specific feedback in the data collection interface.