Cleaning Data For Effective Data Science
Released 6/2025
By David Mertz
MP4 | Video: h264, 1280x720 | Audio: AAC, 44.1 KHz, 2 Ch
Genre: eLearning | Language: English | Duration: 4h 49m | Size: 1.2 GB
Walks you through the real work of data science--getting to clean data
Overview
Cleaning Data for Effective Data Science introduces you to the first stage in the data science process--establishing useful data. This is often both the most important and the most time-consuming stage of that process.
Learn How To
Distinguish between different types of data formats
Work with tabular data formats
Work with hierarchical data formats
Extract and ingest data from repurposed sources
Detect and mark anomalous data
Assess data quality
Remediate missing or problematic data
Identify and address outliers
Impute plausible values for missing or unreliable data
Utilize sampling principles
Who Should Take This Course
Developers, data scientists, and engineers who are interested in improving the quality of datasets
Course Requirements
Proficiency in any programming language used for data processing and machine learning
Lesson Descriptions
Lesson 1: Data Ingestion--Tabular Formats
Lesson 1 is about tabular data formats. In most data science work, the data we wish to work with has a tabular format, ideally a so-called tidy format. Sometimes that data starts out in physical formats that are basically tabular, but many of these formats have characteristic pitfalls. CSV and fixed-width files are plain text and often carry data-typing ambiguities. Spreadsheets are widely used to record data but are prone to problems when more precise analysis is performed. Other, more specialized tabular formats avoid most of these problems but are used less often.
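As a small illustration of the CSV typing ambiguities this lesson covers, the sketch below (Python with pandas, using made-up data and a hypothetical "?" placeholder, not code from the course) shows a numeric column silently read as strings until the placeholder and date column are declared.

import io
import pandas as pd

csv_text = """name,height_cm,joined
alice,157.5,2019-01-04
bob,?,2020-07-19
carol,162,2021-03-02
"""

# Without guidance, height_cm is read as strings ("object") because of the "?" token.
df = pd.read_csv(io.StringIO(csv_text))
print(df.dtypes)

# Declaring the placeholder and the date column recovers numeric and datetime types.
df = pd.read_csv(io.StringIO(csv_text), na_values=["?"], parse_dates=["joined"])
print(df.dtypes)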
Lesson 2: Data Ingestion--Hierarchical Formats
Lesson 2 looks at the common hierarchical formats XML and JSON, as well as data stored in NoSQL databases. A first goal is to convert the underlying data to a more tabular representation, but each specific format also has some characteristic dangers. XML is fairly complex, though good tools exist for pulling data points out of XML documents or streams. JSON has become ubiquitous and is generally a simple and useful format. Sometimes JSON takes the form of JSON Lines, which represents many small documents streamed in sequence. JSON is loosely structured, which often merits using JSON Schema to define more controlled hierarchies. Data also lives in NoSQL databases, sometimes called document databases, and this lesson looks at utilizing data from those sources as well.
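A minimal sketch of flattening JSON Lines into a tabular representation (Python with pandas and hypothetical records, not material from the course):

import json
import pandas as pd

# Two small documents in JSON Lines form: one JSON object per line.
jsonl = "\n".join([
    '{"id": 1, "name": "alice", "address": {"city": "Oslo", "zip": "0150"}}',
    '{"id": 2, "name": "bob", "address": {"city": "Lima"}}',
])

records = [json.loads(line) for line in jsonl.splitlines()]

# json_normalize flattens the nested "address" object into dotted columns;
# bob's missing zip becomes NaN rather than breaking the table.
df = pd.json_normalize(records)
print(df)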
Lesson 3: Data Ingestion--Repurposing Data Sources
Sometimes data is embedded inside web pages, PDF documents, or images. These common formats require initial work to pull out information for analytic, visualization, or modeling purposes. This lesson covers useful techniques for that kind of data extraction.
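For example, pulling an HTML table out of a web page into a DataFrame can look like the sketch below (Python with pandas, a toy page rather than a real source; read_html also requires an HTML parser such as lxml to be installed):

import io
import pandas as pd

# A toy HTML fragment standing in for a scraped web page.
html = """
<table>
  <tr><th>country</th><th>population_m</th></tr>
  <tr><td>Iceland</td><td>0.38</td></tr>
  <tr><td>Malta</td><td>0.53</td></tr>
</table>
"""

# read_html returns a list of DataFrames, one per <table> found in the page.
tables = pd.read_html(io.StringIO(html))
print(tables[0])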
Lesson 4: Anomaly Detection
Collected data often contains anomalous data points. Detecting and marking these problematic data points is an important element of the data science pipeline. Data are sometimes simply missing, though how absence is represented varies considerably among formats and sources. This lesson examines several ways absence is marked, including a discussion of the various sentinels that are often used and how to work with them. Although sentinels can require work to recognize and transform, a trickier question arises with data points that are detectably wrong but have the general form of good data. These bad data points can sometimes result from direct miscoding. Other times, problems can be deduced from the known bounds of a data field or from extreme variance from expectations within the data. Each of these ways that data can go wrong is discussed in a sub-lesson, as are simple outliers, which can be identified by relatively easy statistical tests. Sometimes outliers require multivariate analysis to locate; the final sub-lesson addresses multivariate outliers.
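A compact sketch of two of these ideas, sentinel replacement and a simple z-score outlier test (Python with pandas, invented readings and an assumed -999 sentinel, not course code):

import numpy as np
import pandas as pd

# Hypothetical sensor readings in which -999 is a sentinel for "no reading"
# and one value is plausible-looking but extreme.
s = pd.Series([21.4, 22.1, -999.0, 23.0, 21.8, 95.0, 22.5])

# Replace the sentinel with a true missing value before computing statistics.
s = s.replace(-999.0, np.nan)

# Flag simple univariate outliers with a z-score cutoff; missing rows are skipped.
z = (s - s.mean()) / s.std()
print(s[z.abs() > 2])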
Lesson 5: Data Quality
This lesson looks for systematic trends in the availability of data. Many such trends can be characterized as statistical bias, which may result from sampling bias. Sometimes bias is detectable from data distributions, and at other times from domain knowledge that provides expectations. Benford's Law is also discussed as a mechanism for detecting a certain kind of bias. Remediation is addressed as well, including class imbalance in datasets, which may result from bias or may simply reflect the underlying distribution of the data but nonetheless warrants sample weighting. A final sub-lesson looks at normalization and scaling, which are also helpful in preparing data for many kinds of analysis and modeling. This lesson provides the tools to create a well-rounded dataset for the final steps of your data science pipeline.
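As one example, a quick Benford's-Law screen compares observed first-digit frequencies against the expected distribution log10(1 + 1/d); the sketch below uses synthetic positive values rather than any real dataset:

import numpy as np

rng = np.random.default_rng(0)
values = rng.lognormal(mean=10, sigma=3, size=10_000)  # stand-in for real amounts

# Leading digit of each value, taken from its scientific-notation form.
first_digits = np.array([int(f"{v:e}"[0]) for v in values])

observed = np.array([(first_digits == d).mean() for d in range(1, 10)])
expected = np.log10(1 + 1 / np.arange(1, 10))

for d, (o, e) in enumerate(zip(observed, expected), start=1):
    print(f"digit {d}: observed {o:.3f}  expected {e:.3f}")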
Lesson 6: Value Imputation
The final lesson discusses value imputation. Earlier lessons may have identified or marked data points as missing or unreliable; a common follow-up to such marking is imputing new, plausible values for that missing data. The sub-lessons look first at imputation of typical values, which might be global or based on locality in the parameter space. As a more sophisticated technique, imputation might also reflect trends identified within the data. Many trends are temporal, with timestamped data points tending to resemble those with similar timestamps. Many trends are also non-temporal, however, whether by spatial location or along more abstract continuities. The final sub-lesson reviews oversampling and undersampling. Sampling is a mechanism to impute data based on the data we already have, and oversampling can use techniques more nuanced than mere duplication of existing data rows.
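A small sketch contrasting a global typical-value fill with a trend-aware, temporal interpolation (Python with pandas, a made-up daily series, not course material):

import numpy as np
import pandas as pd

# A hypothetical daily temperature series with gaps.
idx = pd.date_range("2024-01-01", periods=8, freq="D")
temps = pd.Series([10.0, 11.0, np.nan, 13.0, np.nan, np.nan, 16.0, 17.0], index=idx)

# Typical-value imputation: fill every gap with one global statistic.
global_fill = temps.fillna(temps.median())

# Trend-aware imputation: interpolate along the time index so filled values
# follow the local temporal pattern rather than the overall center.
trend_fill = temps.interpolate(method="time")

print(pd.DataFrame({"raw": temps, "global": global_fill, "trend": trend_fill}))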