COM 3590: Data Cleaning & Transformation

Medina, Francis Patricia

Please use this identifier to cite or link to this item: https://hdl.handle.net/20.500.12202/7724

Title:	COM 3590: Data Cleaning & Transformation
Authors:	Medina, Francis Patricia
Keywords:	data cleaning; data transformation; computer science
Issue Date:	Jan-2021
Citation:	Medina, Francis Patricia. (2021, Spring), Syllabus, COM 3590: Data Cleaning & Transformation, Yeshiva College, Yeshiva University.
Series/Report no.:	Yeshiva College Syllabi;COM 3590
Abstract:	Description
In real-world situations, data scientists must be able to use data from many dirty, autonomous, and heterogeneous data sources that are far from being ready to be analyzed. Preparing the data for analysis (often referred to as “data wrangling”) involves four different tasks: cleaning, sampling, transformation, and integration.
Cleaning is the detection and removal of noise, i.e., dirty data, from a data set. Speaking very broadly, an instance is considered “dirty” if it is, in some way, inaccurate or duplicate.
Sampling is drawing a representative subset of the population of interest from the data set. Sampling may be used either to reduce the data set to a tractable size or to isolate the population of interest from the remainder of the data.
Transformation involves taking an existing data set and mapping it from its existing schema to the schema required for the desired analysis. It may include restructuring the schema and/or enriching it with additional data from other sources.
Integration is the process of combining two or more sets of data into a consistent, unified view. The data to be integrated is often stored in multiple data sources that differ in their storage formats, query languages, schema/metadata languages, and provenance. Integration occurs at both the schema and instance levels, and includes entity resolution, which is the detection of when multiple data instances refer to the same real-world entity.
For each of these tasks, interactive tools are useful both for preparing small data sets and for investigating the general quality or structure of a large data set. When dealing with large data sets measuring in many thousands or millions of rows, however, programmatic quantitative approaches are an absolute necessity to make data preparation a realistic task. This course covers both interactive tools and quantitative approaches to each of these tasks.
Because data preparation is a focus of significant R&D and small advances may have major impacts on one’s productivity, the course also introduces students to the communities of research and practice that continue to advance the state of the art, enabling students to stay abreast of valuable advances in this area.
Course Outcomes
Students will be able to apply descriptive statistics to explore a data set.
Students will be able to use data visualization tools to understand and explain the characteristics of a data set.
Students will be able to write programs to clean data sets.
Students will be able to use transformation, integration, and sampling to derive new, analysis-ready data sets from existing data sets.
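As a rough illustration of the four tasks named in the abstract (cleaning, transformation, integration, and sampling), the following Python sketch runs each one on a tiny made-up data set. It is not taken from the syllabus: the records, field names, age bounds, and helper functions are all hypothetical, and uniform random sampling stands in for whatever sampling methods the course actually covers.

```python
import random

# Hypothetical records from two sources with different schemas (assumed data).
source_a = [
    {"name": "Ada Lovelace", "age": "36"},
    {"name": "Ada Lovelace", "age": "36"},   # exact duplicate
    {"name": "Alan Turing", "age": "-5"},    # inaccurate: negative age
    {"name": "Grace Hopper", "age": "85"},
]
source_b = [
    {"full_name": "  grace HOPPER ", "field": "compilers"},
    {"full_name": "Edsger Dijkstra", "field": "algorithms"},
]

def clean(rows):
    """Cleaning: drop exact duplicates and rows with an implausible age."""
    seen, kept = set(), []
    for row in rows:
        key = (row["name"], row["age"])
        if key in seen or not (0 <= int(row["age"]) <= 120):
            continue
        seen.add(key)
        kept.append(row)
    return kept

def transform(row):
    """Transformation: map a source_b row onto source_a's naming scheme."""
    name = " ".join(row["full_name"].split()).title()
    return {"name": name, "field": row["field"]}

def integrate(rows_a, rows_b):
    """Integration: merge rows that resolve to the same entity.

    Entity resolution here is a naive match on the lowercased name.
    """
    merged = {row["name"].lower(): dict(row) for row in rows_a}
    for row in rows_b:
        merged.setdefault(row["name"].lower(), {}).update(row)
    return list(merged.values())

def sample(rows, k, seed=0):
    """Sampling: draw k rows uniformly at random, reproducibly."""
    return random.Random(seed).sample(rows, k)

cleaned = clean(source_a)
unified = integrate(cleaned, [transform(r) for r in source_b])
subset = sample(unified, 2)
```

Matching on a normalized name is the simplest possible form of entity resolution; real data would usually need fuzzier matching and more than one attribute.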
Description:	Course syllabus / YU only
URI:	https://hdl.handle.net/20.500.12202/7724
Appears in Collections:	Yeshiva College Syllabi -- 2021 - 2022 courses (past versions for reference ONLY) -- COMP SCI (Computer Science)

Files in This Item:

File	Description	Size	Format
COM-3590 Data Cleaning and Transformation MEDINA O.pdf Restricted Access		803.24 kB	Adobe PDF

This item is licensed under a Creative Commons License