Clean Data: Tips, Tricks, and Techniques

Use Python to check your data consistency and get rid of any missing or duplicate data.
Instructor: Packt Publishing
Learn to spot outliers in your data and analyze sensor data to find omissions.
Tokenize data and clean stop words to make it more robust.
Analyze and extract features from unstructured text data.
Clean and handle duplicates in your big data analytics and statistics.
Find and remove global row duplicates.
Learn to handle data cleaning for numbers.

“Give me six hours to chop down a tree and I will spend the first four sharpening the axe.” Do you apply the same principle when doing Data Science?

Effective data cleaning is one of the most important aspects of good Data Science. It involves acquiring raw data and preparing it for analysis; if this is not done well, you will not get the accuracy or results you are looking for, no matter how good your algorithm is.

Data cleaning is often the hardest part of big data and ML work. This course equips you with the skills you need to clean your data in Python using tried and tested techniques. You’ll find a plethora of tips and tricks that help you get the job done in a smart, easy, and efficient way.

About the Author

Tomasz Lelek is a software engineer who programs mostly in Java and Scala. He is a fan of microservice architectures and functional programming. He dedicates considerable time and effort to being better every day. Recently, he’s been delving into big data technologies such as Apache Spark and Hadoop. He is passionate about nearly everything associated with software development.

Tomasz thinks that we should always consider different solutions and approaches before solving a problem. Recently, he spoke at several conferences in Poland, namely Confitura and JDD (Java Developer’s Day), and also at the Krakow Scala User Group. A video of his ML Spark talk at JDD is also available. He also conducted a live coding session at the Geecon Conference. He is currently working with TypeScript.

Identifying the Most Important Data Issues

1
The Course Overview

This video provides an overview of the entire course.

2
Setting Up the Work Environment

In this video, we will learn how to set up our work environment and install Python.

   •  Open our project in PyCharm

   •  Validate that the environment works as expected

3
Finding Outliers in the Input Data

In this video, we will learn about standard deviation.

   •  Understand what normal distribution of data is

   •  Remove outliers using normal distribution and standard deviation
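A minimal sketch of the standard-deviation approach, assuming the data is roughly normally distributed. The function name, the sample readings, and the two-standard-deviation threshold are illustrative choices, not taken from the course.

```python
import statistics


def remove_outliers_std(values, num_std=2.0):
    """Keep only values within num_std standard deviations of the mean."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        # All values are identical, so nothing can be an outlier.
        return list(values)
    return [v for v in values if abs(v - mean) <= num_std * std]


# Hypothetical sensor readings with one obviously wrong value.
readings = [10, 12, 11, 13, 12, 11, 10, 95]
cleaned = remove_outliers_std(readings)
print(cleaned)
```

Note that a single extreme value inflates both the mean and the standard deviation, which is one reason to look at more robust methods.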

4
Reconcile Missing Values to Give Data More Meaning

In this video, we will learn how to remove outliers in a more robust way.

   •  Understand the IQR method

   •  Learn new ways to eliminate outliers

5
Implementing and Testing the IQR Method

In this video, we will learn how to implement the IQR method.

   •  Test the IQR method
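The IQR method could be sketched and tested roughly as follows. This version uses the standard library's `statistics.quantiles` and the conventional 1.5 multiplier; the function name and sample data are assumptions made for illustration.

```python
import statistics


def remove_outliers_iqr(values, factor=1.5):
    """Keep values inside [Q1 - factor * IQR, Q3 + factor * IQR]."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower = q1 - factor * iqr
    upper = q3 + factor * iqr
    return [v for v in values if lower <= v <= upper]


# Hypothetical readings; 95 lies far outside the interquartile range.
readings = [10, 12, 11, 13, 12, 11, 10, 95]
cleaned = remove_outliers_iqr(readings)

# A simple test: the extreme value is dropped, everything else is kept.
assert 95 not in cleaned
assert len(cleaned) == len(readings) - 1
```

Because quartiles barely move when one extreme value is added, the bounds stay tight around the bulk of the data.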

Cleaning Text Data

1
Tokenizing Input Data

In this video, we will learn how to split input text into tokens.

   •  Learn how to tokenize input data into tokens

   •  Use Natural Language Toolkit
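To illustrate what tokenization produces, here is a dependency-free sketch. The course uses the Natural Language Toolkit; this example swaps in a plain regular expression as a stand-in, so the function and its pattern are assumptions rather than NLTK's behavior.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens.

    Letters, digits, and apostrophes are kept; punctuation is dropped.
    """
    return re.findall(r"[a-z0-9']+", text.lower())


tokens = tokenize("Clean data isn't optional: it's step one!")
print(tokens)
```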

2
Cleaning Stop Words

In this video, we will learn how to clean stop words from text data.

   •  Learn how to remove stop words from our input
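Stop-word removal can be sketched as a simple filter over tokens. The stop-word set below is a tiny hypothetical list chosen for the example; a real project would typically use a fuller list from a library such as NLTK.

```python
# Small illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"a", "an", "the", "is", "of", "and", "to", "in"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return tokens that are not in the stop-word set (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stop_words]


filtered = remove_stop_words(["The", "quality", "of", "the", "data", "is", "poor"])
print(filtered)
```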

3
Removing Data-Specific Words That Have a Negative Impact

In this video, we will learn how to find text-specific words.

   •  Filter out words that have no semantic meaning

4
Handling White Spaces and Language-Agnostic Phrases

In this video, we will learn what whitespace is and techniques for filtering it out of the input.

   •  Understand the concept of whitespace

   •  Learn how to filter out whitespace
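One common way to filter whitespace, shown here as a sketch with hypothetical function names, relies on `str.split()` with no arguments, which splits on any run of spaces, tabs, or newlines.

```python
def normalize_whitespace(text):
    """Collapse runs of whitespace into single spaces and trim the ends."""
    return " ".join(text.split())


def drop_blank_tokens(tokens):
    """Remove tokens that are empty or contain only whitespace."""
    return [t for t in tokens if t.strip()]


normalized = normalize_whitespace("  sensor\t42 \n\n  ok  ")
non_blank = drop_blank_tokens(["a", " ", "", "\t", "b"])
print(repr(normalized), non_blank)
```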

Dealing with Unstructured Data (Text)

1
Analyzing Unstructured Text Input Data

In this video, we will learn how to deal with text data.

   •  Understand how to prepare data to extract features

2
Extracting Features from Data and Transforming Text into Vector

In this video, we will learn algorithms for transforming text into a vector of numbers.

   •  Understand how text is represented as a vector

   •  Create a Word2Vec model

   •  Understand Skip-Gram

3
Bag-Of-Words

In this video, we will learn about the bag-of-words model.

   •  Use gensim for Word2Vec implementation
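As a rough illustration of the bag-of-words idea named in this lecture's title, the sketch below turns each document into a vector of word counts over a shared vocabulary. It uses only the standard library; the function name and sample documents are assumptions, and the gensim Word2Vec usage from the course is not reproduced here.

```python
from collections import Counter


def bag_of_words(documents):
    """Return (vocabulary, vectors) where each vector counts vocabulary words."""
    vocabulary = sorted({word for doc in documents for word in doc.split()})
    vectors = []
    for doc in documents:
        counts = Counter(doc.split())
        # Counter returns 0 for words that do not occur in this document.
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


vocabulary, vectors = bag_of_words(["clean data wins", "dirty data loses data"])
print(vocabulary)
print(vectors)
```

Word order is discarded in this representation; only the counts survive.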

4
Reducing Noise in Data by Using Skip-Gram

In this video, we will learn how to reduce noise in data by using Skip-Gram.

   •  Understand the Skip-Gram method
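The core of Skip-Gram is pairing each word with the words around it. The sketch below only generates those (center, context) training pairs; it does not train a model, and the function name and window size are illustrative assumptions.

```python
def skip_gram_pairs(tokens, window=1):
    """Pair every token with each neighbor up to `window` positions away."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


pairs = skip_gram_pairs(["clean", "your", "data"])
print(pairs)
```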

Duplicates

1
Analyzing Rows – Finding Duplicate Columns

In this video, we will learn how to find duplicate columns.

   •  Drop duplicated columns
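One possible way to drop columns whose contents are identical, assuming pandas is installed. The DataFrame and column names are made up for the example, and transposing is only practical for small to medium tables.

```python
import pandas as pd

# Hypothetical data: "a_copy" repeats the values of "a".
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "a_copy": [1, 2, 3]})

# Transpose so columns become rows, then flag rows that repeat an earlier one.
is_duplicate = df.T.duplicated()

# Keep only the columns that were not flagged.
deduped = df.loc[:, ~is_duplicate]
print(list(deduped.columns))
```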

2
Finding Global Row Duplicates

In this video, we will learn how to find global row duplicates.

   •  Find duplicates on all columns in a dataset
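Finding rows that are duplicated across all columns might look like this in pandas. The sample data is hypothetical; `duplicated` and `drop_duplicates` compare every column by default.

```python
import pandas as pd

# Hypothetical dataset in which rows 0 and 2 are identical.
df = pd.DataFrame({"id": [1, 2, 1, 3], "value": ["x", "y", "x", "z"]})

# keep=False marks every member of a duplicate group, not just the repeats.
dupes = df[df.duplicated(keep=False)]

# drop_duplicates keeps the first occurrence of each row.
unique_rows = df.drop_duplicates()
print(dupes)
print(unique_rows)
```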

3
Handling Duplicates by Implementing Idempotent Processing

In this video, we will understand what idempotent logic is.

   •  Understand how to implement idempotent logic
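Idempotent processing means that handling the same input twice has the same effect as handling it once. A minimal sketch, assuming each record carries a unique ID; the class and its fields are hypothetical and not the course's implementation.

```python
class IdempotentProcessor:
    """Adds amounts to a running total, ignoring IDs it has already seen."""

    def __init__(self):
        self._seen_ids = set()
        self.total = 0

    def process(self, event_id, amount):
        """Apply the event once; return False if it was a duplicate."""
        if event_id in self._seen_ids:
            return False
        self._seen_ids.add(event_id)
        self.total += amount
        return True


processor = IdempotentProcessor()
results = [
    processor.process("e1", 10),
    processor.process("e2", 5),
    processor.process("e1", 10),  # redelivered event, ignored
]
print(results, processor.total)
```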

4
Duplicates That Have Meaning

In this video, we will delve into the pandas library.

   •  Use counter to implement de-duplication logic
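When duplicates carry meaning, counting them can be more useful than discarding them. This sketch assumes the "counter" mentioned above is `collections.Counter` from the standard library; the event names are invented.

```python
from collections import Counter

# Hypothetical event stream where repeats tell us how often something happened.
events = ["login", "click", "click", "logout", "click"]

counts = Counter(events)

# Counter preserves first-seen order, so this is the de-duplicated list.
unique_events = list(counts)
print(unique_events, counts["click"])
```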

Reasoning about Types and Default

1
Interpreting Not a Number – Cleaning for Numeric Data

In this video, we will learn how to detect missing values (NaN).

   •  Leverage the isnull() helper method

   •  Install pandas
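Detecting NaN with pandas could look like the following. The DataFrame and its column names are hypothetical; `isnull()` returns a boolean mask of the same shape as the data.

```python
import pandas as pd

nan = float("nan")

# Hypothetical sensor table with two missing readings.
df = pd.DataFrame({
    "temperature": [21.5, nan, 19.0],
    "humidity": [40.0, 42.0, nan],
})

missing_mask = df.isnull()               # True where a value is NaN
missing_per_column = missing_mask.sum()  # count of NaN values per column
rows_with_missing = df[missing_mask.any(axis=1)]
print(missing_per_column)
```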

2
Replacing NaN with Scalar Data

In this video, we will learn how to make NaN meaningful to processing.

   •  Replace NaN with scalar values
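Replacing NaN with a scalar is a one-liner in pandas via `fillna`. The series and the choice of 0.0 as the fill value are assumptions for the example; the right default depends on what the column means.

```python
import pandas as pd

readings = pd.Series([1.0, float("nan"), 3.0])

# fillna returns a new Series; the original is left unchanged.
filled = readings.fillna(0.0)
print(list(filled))
```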

3
Backward Fill and Forward Fill

In this video, we will understand what a backward fill is.

   •  Understand what a forward fill is
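The difference between the two fills is easiest to see side by side. A short sketch using pandas' `ffill` and `bfill` on a made-up series:

```python
import pandas as pd

nan = float("nan")
readings = pd.Series([1.0, nan, nan, 4.0])

# Forward fill copies the last known value forward into the gap.
forward = readings.ffill()

# Backward fill copies the next known value backward into the gap.
backward = readings.bfill()
print(list(forward), list(backward))
```

A forward fill cannot repair a NaN at the very start of a series, and a backward fill cannot repair one at the very end, because there is no neighbor to copy from.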

4
Replacing Generic Values

In this video, we will learn how to handle an outlier by replacing it with a meaningful name.

   •  Implement logic using replace
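A sketch of replacing generic placeholder values with one meaningful label using pandas' `replace`. The placeholder strings and the label "missing" are hypothetical; the mapping would come from knowledge of the actual dataset.

```python
import pandas as pd

colors = pd.Series(["N/A", "red", "unknown", "blue"])

# Map each generic placeholder to a single meaningful label.
cleaned = colors.replace({"N/A": "missing", "unknown": "missing"})
print(list(cleaned))
```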

30-Day Money-Back Guarantee

Includes

2 hours on-demand video
Full lifetime access
Access on mobile and TV
Certificate of Completion