COM 3590: Data Cleaning & Transformation

Date

2021-01

Abstract

Description

In real-world situations, data scientists must be able to use data from many dirty, autonomous, and heterogeneous data sources that are far from ready to be analyzed. Preparing the data for analysis (often referred to as “data wrangling”) involves four different tasks: cleaning, sampling, transformation, and integration.

• Cleaning is the detection and removal of noise, i.e., dirty data, from a data set. Speaking very broadly, an instance is considered “dirty” if it is, in some way, inaccurate or a duplicate.
• Sampling is drawing a representative subset of the population of interest from the data set. Sampling may be used either to reduce the data set to a tractable size or to isolate the population of interest from the remainder of the data.
• Transformation involves taking an existing data set and mapping it from its existing schema to the schema required for the desired analysis. It may include restructuring the schema and/or enriching it with additional data from other sources.
• Integration is the process of combining two or more sets of data into a consistent, unified view. The data to be integrated is often stored in multiple data sources that differ in their storage formats, query languages, schema/metadata languages, and provenance. Integration occurs at both the schema and instance levels, and includes entity resolution, which is the detection of when multiple data instances refer to the same real-world entity.

For each of these tasks, interactive tools are useful both for preparing small data sets and for investigating the general quality or structure of a large data set. When dealing with large data sets measuring in many thousands or millions of rows, however, programmatic quantitative approaches are an absolute necessity to make data preparation a realistic task. This course covers both interactive tools and quantitative approaches to each of these tasks.

Because data preparation is a focus of significant R&D and small advances may have major impacts on one’s productivity, the course also introduces students to the communities of research and practice that continue to advance the state of the art, enabling students to stay abreast of valuable advances in this area.

Course Outcomes

• Students will be able to apply descriptive statistics to explore a data set
• Students will be able to use data visualization tools to understand and explain the characteristics of a data set
• Students will be able to write programs to clean data sets
• Students will be able to use transformation, integration, and sampling to derive new data sets that are ready for analysis from existing data sets
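
As a rough illustration of the four wrangling tasks named in the description, the sketch below runs a toy data set through cleaning, transformation, integration, and sampling using only the Python standard library. It is not taken from the course materials: the records, field names, and helper functions (`clean`, `transform`, `integrate`, `sample`) are hypothetical, "dirty" is reduced to duplicate ids and implausible ages, and entity resolution is simplified to an exact match on `id`.

```python
import random

# Hypothetical toy records; all names and fields are illustrative assumptions.
raw = [
    {"id": 1, "name": "Ada Lovelace ", "age": "36"},
    {"id": 2, "name": "Alan Turing", "age": "41"},
    {"id": 2, "name": "Alan Turing", "age": "41"},   # duplicate instance
    {"id": 3, "name": "Grace Hopper", "age": "-5"},  # inaccurate value
]
# A second, separately maintained source keyed by the same id.
departments = {1: "Mathematics", 2: "Computer Science"}


def clean(rows):
    """Cleaning: drop repeated ids and rows whose age is not a non-negative integer."""
    seen, kept = set(), []
    for row in rows:
        age = row["age"].strip()
        if row["id"] in seen or not age.isdigit():
            continue
        seen.add(row["id"])
        kept.append(row)
    return kept


def transform(rows):
    """Transformation: map to the target schema (trimmed name, integer age)."""
    return [
        {"id": r["id"], "name": r["name"].strip(), "age": int(r["age"])}
        for r in rows
    ]


def integrate(rows, dept_by_id):
    """Integration: enrich each row from the second source, matching on id."""
    return [{**r, "department": dept_by_id.get(r["id"])} for r in rows]


def sample(rows, k, seed=0):
    """Sampling: simple random sample of at most k rows, seeded for repeatability."""
    rng = random.Random(seed)
    return rng.sample(rows, min(k, len(rows)))


prepared = integrate(transform(clean(raw)), departments)
subset = sample(prepared, k=1)
```

On data sets with thousands or millions of rows, the same steps would normally be expressed with a data-frame or database tool rather than plain lists, but the order of operations stays recognizable.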

Description

Course syllabus / YU only

Keywords

data cleaning, data transformation, computer science

Citation

Medina, Francis Patricia. (2021, Spring), Syllabus, COM 3590: Data Cleaning & Transformation, Yeshiva College, Yeshiva University.