DaQuaTa International Workshop 2017

Lyon, France - December 11-12, 2017


Data Profiling

Felix Naumann

 

Abstract:

Data profiling comprises a broad range of methods to efficiently analyze a given data set. In a typical scenario, which mirrors the capabilities of commercial data profiling tools, tables of a relational database are scanned to derive metadata, such as data types and value patterns, completeness and uniqueness of columns, keys and foreign keys, and occasionally functional dependencies and association rules. Individual research projects have proposed several additional profiling tasks, such as the discovery of inclusion dependencies or conditional functional dependencies.
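To make the column-level metadata above concrete, the following minimal Python sketch (not part of the talk; the table and column names are purely illustrative) computes completeness, uniqueness, and single-column key candidacy for a small in-memory table:

# A minimal sketch of the single-column profiling metrics mentioned above:
# completeness, uniqueness, and candidate-key detection.
# Table and column names are illustrative only.

def profile_column(values):
    """Return basic profiling metadata for one column, given as a list."""
    total = len(values)
    non_null = [v for v in values if v is not None]
    distinct = set(non_null)
    return {
        "completeness": len(non_null) / total if total else 0.0,   # share of non-null values
        "uniqueness": len(distinct) / total if total else 0.0,     # share of distinct values
        "is_candidate_key": total > 0 and len(distinct) == total,  # all values present and distinct
    }

if __name__ == "__main__":
    # Hypothetical example table with three columns.
    table = {
        "id":    [1, 2, 3, 4, 5],
        "email": ["a@x.org", "b@x.org", None, "d@x.org", "a@x.org"],
        "city":  ["Lyon", "Lyon", "Potsdam", "Berlin", "Lyon"],
    }
    for name, values in table.items():
        print(name, profile_column(values))

Multi-column tasks such as key, foreign-key, or functional dependency discovery require far more elaborate algorithms than this per-column pass, which is exactly the scalability concern raised below.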

Data profiling deserves a fresh look for several reasons: First, the area itself is neither established nor defined in any principled way, despite significant research activity on individual parts in the past. Second, current data profiling techniques hardly scale beyond what can only be called small data. Third, more and more data beyond traditional relational databases are being created and beg to be profiled, including linked open data. The talk highlights the state of the art and use cases in data quality, and proposes new research directions and challenges.
 


Bio:
Felix Naumann studied mathematics, economics, and computer science at the Technical University of Berlin. He completed his PhD thesis on "Quality-driven Query Answering" in 2000. In 2001 and 2002 he worked at the IBM Almaden Research Center on data integration. From 2003 to 2006 he was an assistant professor for information integration at the Humboldt University of Berlin. Since then he has held the chair for information systems at the Hasso Plattner Institute at the University of Potsdam in Germany. He is editor-in-chief of Information Systems, and his research interests include data profiling, data cleansing, and text mining.