Data Cleaning: Overview and Emerging Challenges [Tutorial 1]
SUNDAY, June 26, 2016 (8:30am - 12:00pm)
Abstract: Detecting and repairing dirty data is one of the perennial challenges in data analytics, and failure to do so can result in inaccurate analytics and unreliable decisions. Over the past few years, there has been a surge of interest from both industry and academia in data cleaning problems, including new abstractions, interfaces, approaches for scalability, and statistical techniques. To better understand the new advances in the field, we will first present a taxonomy of the data cleaning literature in which we highlight the recent interest in techniques that use constraints, rules, or patterns to detect errors, which we call qualitative data cleaning. We will describe the state-of-the-art techniques and also highlight their limitations with a series of illustrative examples. While such approaches have traditionally been distinct from quantitative approaches such as outlier detection, we will also discuss recent work that casts them into a statistical estimation framework, including using machine learning to improve the efficiency and accuracy of data cleaning and considering the effects of data cleaning on statistical analysis.
URL for the Slides:
Xu Chu is a PhD student in the Cheriton School of Computer Science at the University of Waterloo. His main research interests are data quality and data cleaning. He won the prestigious Microsoft Research PhD Fellowship in 2015. Xu also received a Cheriton Fellowship from the University of Waterloo (2013-2015).
Ihab Ilyas is a professor in the Cheriton School of Computer Science at the University of Waterloo. He received his PhD in computer science from Purdue University, West Lafayette. His main research is in the area of database systems, with special interest in data quality, managing uncertain data, rank-aware query processing, and information extraction. Ihab is a recipient of the Ontario Early Researcher Award (2009), a Cheriton Faculty Fellowship (2013), an NSERC Discovery Accelerator Award (2014), and a Google Faculty Award (2014), and he is an ACM Distinguished Scientist. Ihab is a co-founder of Tamr, a startup focusing on large-scale data integration and cleaning. He serves on the VLDB Board of Trustees, and he is an associate editor of the ACM Transactions on Database Systems (TODS).
Sanjay Krishnan is a Computer Science PhD candidate in the Algorithms, Machines, and People Lab (AMPLab) and in the Berkeley Laboratory for Automation Science and Engineering at UC Berkeley. His research studies techniques for data analytics on dirty data and data representation problems in physical systems.
Jiannan Wang is an Assistant Professor of Computing Science at Simon Fraser University. His research is focused on developing algorithms and systems for extracting value from dirty data. Prior to joining Simon Fraser University, he was a postdoc in the AMPLab at UC Berkeley. He obtained his PhD from the Computer Science Department at Tsinghua University. During his PhD, he was a visiting scholar at the Chinese University of Hong Kong and UC Berkeley, and an intern at Qatar Computing Research Institute. His PhD research was supported by a Google PhD Fellowship, a Boeing Scholarship, and a New PhD Researcher Award from the Chinese Ministry of Education. His PhD dissertation won the China Computer Federation (CCF) Distinguished Dissertation Award. His similarity-join algorithm won first place in the EDBT String Similarity Search/Join Competition.