In real-world situations, data scientists must be able to use data from many dirty, autonomous, and heterogeneous data sources that are far from being ready to be analyzed. Preparing the data for analysis (often referred to as “data wrangling”) involves four different tasks: cleaning, sampling, transformation, and integration.
Cleaning is the detection and removal of noise, i.e., dirty data, from a data set. Speaking very broadly, an instance is considered “dirty” if it is, in some way, inaccurate or duplicate.
Sampling is drawing a representative subset of the population of interest from the data set. Sampling may be used either to reduce the data set to a tractable size or to isolate the population of interest from the remainder of the data.
Transformation involves taking an existing data set and mapping it from its existing schema to the schema required for the desired analysis. May include restructuring the schema and/or enriching it with additional data from other sources.
Integration is the process of combining two or more sets of data into a consistent, unified view. The data to be integrated often is stored in multiple data sources which differ in their storage formats, query languages, schema/metadata languages, and provenance. Integration occurs at both the schema and instance levels, and includes entity resolution, which is the detection of when multiple data instances refer to the same real-world entity.
For each of these tasks, interactive tools are useful both for preparing small data sets as well as for investigating the general quality or structure of a large data set. When dealing with large data sets measuring in many thousands or millions of rows, however, programmatic quantitative approaches are an absolute necessity to make data preparation a realistic task. This course covers both interactive tools and quantitative approaches to each of these tasks. Because data preparation is a focus of significant R&D and small advances may have major impacts on one’s productivity, the course also introduces students to the communities of research and practice that continue to advance the state of the art enabling students to stay abreast of valuable advances in this area.
Students will be able to apply descriptive statistics to explore a data set
Students will be able to use data visualization tools to understand and explain the characteristics of a data set
Students will be able to write programs to clean data sets
Students will be able to use transformation, integration, and sampling to derive new data sets, that are ready for analysis, from existing data sets||en_US