dc.contributor.author: Medina, Francis Patricia
dc.identifier.citation: Medina, Francis Patricia. (2021, Spring), Syllabus, COM 3590: Data Cleaning & Transformation, Yeshiva College, Yeshiva University. [en_US]
dc.description: Course syllabus / YU only [en_US]
dc.description.abstract: Description: In real-world situations, data scientists must be able to use data from many dirty, autonomous, and heterogeneous data sources that are far from being ready to be analyzed. Preparing the data for analysis (often referred to as “data wrangling”) involves four different tasks: cleaning, sampling, transformation, and integration.
Cleaning is the detection and removal of noise, i.e., dirty data, from a data set. Speaking very broadly, an instance is considered “dirty” if it is, in some way, inaccurate or a duplicate.
Sampling is drawing a representative subset of the population of interest from the data set. Sampling may be used either to reduce the data set to a tractable size or to isolate the population of interest from the remainder of the data.
Transformation involves taking an existing data set and mapping it from its existing schema to the schema required for the desired analysis. It may include restructuring the schema and/or enriching it with additional data from other sources.
Integration is the process of combining two or more sets of data into a consistent, unified view. The data to be integrated are often stored in multiple data sources that differ in their storage formats, query languages, schema/metadata languages, and provenance. Integration occurs at both the schema and instance levels, and includes entity resolution, which is the detection of when multiple data instances refer to the same real-world entity.
For each of these tasks, interactive tools are useful both for preparing small data sets and for investigating the general quality or structure of a large data set. When dealing with large data sets measuring in many thousands or millions of rows, however, programmatic quantitative approaches are an absolute necessity to make data preparation a realistic task. This course covers both interactive tools and quantitative approaches to each of these tasks.
Because data preparation is a focus of significant R&D and small advances may have major impacts on one’s productivity, the course also introduces students to the communities of research and practice that continue to advance the state of the art, enabling students to stay abreast of valuable advances in this area.
Course Outcomes:
Students will be able to apply descriptive statistics to explore a data set.
Students will be able to use data visualization tools to understand and explain the characteristics of a data set.
Students will be able to write programs to clean data sets.
Students will be able to use transformation, integration, and sampling to derive new data sets that are ready for analysis from existing data sets. [en_US]
dc.relation.ispartofseries: Yeshiva College Syllabi; COM 3590
dc.rights: Attribution-NonCommercial-NoDerivs 3.0 United States [*]
dc.subject: data cleaning [en_US]
dc.subject: data transformation [en_US]
dc.subject: computer science [en_US]
dc.title: COM 3590: Data Cleaning & Transformation [en_US]
dc.type: Learning Object [en_US]
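
Illustrative note: the four data-preparation tasks named in the abstract (cleaning, sampling, transformation, integration) can be sketched programmatically as below. This is a minimal sketch and is not taken from the syllabus. The records, field names, the 0 to 120 age range, and the helper functions clean, sample, transform, and integrate are all hypothetical, and duplicate detection by id is a deliberate simplification of entity resolution.

```python
import random

# Hypothetical records from two small sources; every name and value is made up.
customers = [
    {"id": 1, "name": " Ada Lovelace ", "age": "36"},
    {"id": 2, "name": "Alan Turing", "age": "41"},
    {"id": 2, "name": "Alan Turing", "age": "41"},   # duplicate instance
    {"id": 3, "name": "Grace Hopper", "age": "-5"},  # inaccurate (implausible) age
]
orders = [
    {"customer_id": 1, "total": 20.0},
    {"customer_id": 2, "total": 35.5},
    {"customer_id": 2, "total": 4.5},
]


def clean(rows):
    """Cleaning: drop rows with a repeated id or an age outside an assumed 0..120 range."""
    seen, kept = set(), []
    for row in rows:
        age = int(row["age"])
        if row["id"] in seen or not 0 <= age <= 120:
            continue
        seen.add(row["id"])
        kept.append(row)
    return kept


def sample(rows, k, seed=0):
    """Sampling: draw k rows uniformly at random, without replacement."""
    return random.Random(seed).sample(rows, k)


def transform(rows):
    """Transformation: map the source schema onto the schema the analysis expects."""
    return [
        {"customer_id": r["id"], "full_name": r["name"].strip(), "age": int(r["age"])}
        for r in rows
    ]


def integrate(customer_rows, order_rows):
    """Integration: combine both sources into one unified view per customer."""
    totals = {}
    for order in order_rows:
        key = order["customer_id"]
        totals[key] = totals.get(key, 0.0) + order["total"]
    return [dict(c, order_total=totals.get(c["customer_id"], 0.0)) for c in customer_rows]


cleaned = clean(customers)
unified = integrate(transform(cleaned), orders)
subset = sample(unified, k=1)
print(unified)
```

On real data sets with many thousands or millions of rows, the same steps would more likely be written against a data-frame or database library; the plain-Python version is only meant to make the four definitions concrete.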

Except where otherwise noted, this item's license is described as Attribution-NonCommercial-NoDerivs 3.0 United States