Clean Data: Tips, Tricks, and Techniques
“Give me six hours to chop down a tree and I will spend the first four sharpening the axe.” Do you apply the same principle when doing Data Science?
Effective data cleaning is one of the most important aspects of good Data Science. It involves acquiring raw data and preparing it for analysis. If this step is not done well, you will not get the accuracy or results that you are looking for, no matter how good your algorithm is.
Data cleaning is often the hardest part of big data and ML work. To address this, this course will equip you with the skills you need to clean your data in Python, using tried and tested techniques. You’ll find a plethora of tips and tricks that will help you get the job done in a smart, easy, and efficient way.
About the Author
Tomasz Lelek is a software engineer who programs mostly in Java and Scala. He is a fan of microservice architectures and functional programming. He dedicates considerable time and effort to being better every day. Recently, he’s been delving into big data technologies such as Apache Spark and Hadoop. He is passionate about nearly everything associated with software development.
Tomasz thinks that we should always try to consider different solutions and approaches before solving a problem. Recently, he was a speaker at several conferences in Poland (Confitura and JDD, Java Developer’s Day) and also at the Krakow Scala User Group. His JDD talk on ML with Spark is available on video. He also conducted a live coding session at the Geecon conference. He is currently working with TypeScript. The following links showcase his work:
Identifying the Most Important Data Issues
This video provides an overview of the entire course.
In this video, we will learn how to set up our work environment and install Python.
• Open our project in PyCharm
• Validate that the environment works as expected
In this video, we will learn about standard deviation.
• Understand what normal distribution of data is
• Remove outliers using normal distribution and standard deviation
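As a rough illustration of the standard deviation approach listed above, the following sketch keeps only the values that lie within k standard deviations of the mean. The function name, the sample data, and the threshold k=2 are assumptions made for this example, not material taken from the course.

```python
import statistics


def remove_outliers_std(values, k=2.0):
    """Keep values within k standard deviations of the mean (k is an assumed threshold)."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        # All values are identical, so there is nothing to remove.
        return list(values)
    return [v for v in values if abs(v - mean) <= k * std]


# Hypothetical sample: one reading is far away from the rest.
data = [10, 12, 11, 13, 12, 11, 10, 95]
cleaned = remove_outliers_std(data)
print(cleaned)
```

Note that a single extreme value also inflates the mean and the standard deviation themselves, which is one reason a more robust method can be preferable.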
In this video, we will learn how to remove outliers in a more robust way.
• Understand the IQR method
• Learn new ways to eliminate outliers
In this video, we will learn how to implement the IQR method.
• Test the IQR method
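The IQR method can be sketched with the standard library alone. The version below assumes the common convention of fences at 1.5 times the interquartile range beyond the first and third quartiles; the function name, the factor, and the sample data are illustrative choices rather than the course's own code.

```python
import statistics


def remove_outliers_iqr(values, factor=1.5):
    """Keep values inside [Q1 - factor*IQR, Q3 + factor*IQR]."""
    # quantiles(n=4) returns the three quartile cut points: Q1, median, Q3.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - factor * iqr, q3 + factor * iqr
    return [v for v in values if low <= v <= high]


# Hypothetical sample with one extreme reading.
data = [10, 12, 11, 13, 12, 11, 10, 95]
cleaned = remove_outliers_iqr(data)
print(cleaned)
```

Because quartiles depend on the ordering of the data rather than on its magnitude, the extreme value barely influences the fences.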
Cleaning Text Data
In this video, we will learn how to tokenize text data.
• Learn how to split input data into tokens
• Use Natural Language Toolkit
• Learn how to remove stop words from our input
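To make the tokenization and stop-word steps concrete, here is a minimal sketch that uses only the standard library. The course itself uses the Natural Language Toolkit, which provides its own tokenizers and stop-word lists; the regular expression and the tiny STOP_WORDS set below are assumptions chosen to keep the example self-contained.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "of", "and", "to", "in"}


def tokenize(text):
    """Lowercase the text and split it into word-like tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in stop_words]


tokens = tokenize("The quality of the data is the key to a good model.")
filtered = remove_stop_words(tokens)
print(filtered)
```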
In this video, we will learn how to find text-specific words.
• Filter out words that have no semantic meaning
In this video, we will learn what whitespace is and which techniques we can use to filter it out.
• Understand the concept of whitespace
• Learn how to filter out whitespaces
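Whitespace filtering might look like the sketch below: one helper collapses runs of spaces, tabs, and newlines inside a string, and another drops tokens that are empty or whitespace-only. Both function names are hypothetical.

```python
def normalize_whitespace(text):
    """Collapse any run of whitespace into a single space and trim the ends."""
    # str.split() with no arguments splits on any whitespace and skips empty pieces.
    return " ".join(text.split())


def drop_blank_tokens(tokens):
    """Strip each token and discard those that are empty afterwards."""
    return [t.strip() for t in tokens if t.strip()]


line = normalize_whitespace("  clean \t data\n\n matters ")
tokens = drop_blank_tokens(["data", " ", "", "\t", " cleaning "])
print(line)
print(tokens)
```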
Dealing with Unstructured Data (Text)
In this video, we will learn how to deal with text data.
• Understand how to prepare data to extract features
In this video, we will learn algorithms for transforming text into a vector of numbers.
• Understand variable declaration scope
• Create Word2Vec
• Understand Skip-Gram
In this video, we will implement Word2Vec in Python.
• Use gensim for Word2Vec implementation
In this video, we will learn about the Skip-Gram method.
• Understand the Skip-Gram method
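The core idea behind Skip-Gram is that each word is used to predict the words around it. The sketch below only generates the (target, context) training pairs for a given window size; it does not train any vectors. The function name and the window size are assumptions for illustration. In gensim, the Word2Vec class exposes an sg parameter that selects the Skip-Gram variant.

```python
def skip_gram_pairs(tokens, window=2):
    """Return (target, context) pairs for every token and its neighbours within the window."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


pairs = skip_gram_pairs(["clean", "data", "wins"], window=1)
print(pairs)
```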
In this video, we will learn how to create a lazy iterator.
• Drop duplicated columns
In this video, we will learn how to find duplicates across all columns of a dataset.
• Find duplicates on all columns in a dataset
In this video, we will understand what idempotent logic is.
• Understand how to implement an idempotent logic
In this video, we will delve into the pandas library.
• Use counter to implement de-duplication logic
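One possible way to combine the ideas above, counting with collections.Counter and keeping the logic idempotent, is sketched here on rows represented as tuples. The helper names and the sample rows are hypothetical. pandas offers DataFrame.duplicated and DataFrame.drop_duplicates for the same purpose on data frames.

```python
from collections import Counter


def find_duplicates(rows):
    """Return each row that occurs more than once, with its count."""
    counts = Counter(rows)
    return {row: n for row, n in counts.items() if n > 1}


def deduplicate(rows):
    """Keep the first occurrence of each row, preserving the original order."""
    # Counter is a dict subclass, so its keys keep first-seen insertion order.
    return list(Counter(rows))


rows = [("alice", 30), ("bob", 25), ("alice", 30), ("carol", 41), ("bob", 25)]
duplicates = find_duplicates(rows)
unique_rows = deduplicate(rows)
print(duplicates)
print(unique_rows)
```

Running deduplicate on its own output returns the same list, which is what makes the operation idempotent: applying it twice is as safe as applying it once.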
Reasoning about Types and Default
In this video, we will learn how to detect missing values (NaN).
• Leverage the isnull() helper method
• Install pandas
In this video, we will learn how to make NaN meaningful to processing.
• Replace NaN with scalar values
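Detecting missing values and replacing them with a scalar can be sketched without any third-party library, as below. The course works with pandas, where isnull() and fillna() play these roles; the plain-Python helpers, the sample readings, and the fill value 0.0 are assumptions for illustration.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def fill_missing(values, fill_value=0.0):
    """Replace every missing entry with a scalar fill value."""
    return [fill_value if is_missing(v) else v for v in values]


readings = [1.5, float("nan"), 3.0, None, 4.5]
mask = [is_missing(v) for v in readings]
filled = fill_missing(readings, fill_value=0.0)
print(mask)
print(filled)
```

Which scalar is meaningful depends on the data: zero suits some counters, while a mean or median may suit measurements better.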
In this video, we will understand what a backward fill is.
• Understand what a forward fill is
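A forward fill carries the last known value forward into a gap, while a backward fill pulls the next known value back into it. The plain-Python sketch below shows both on a list; the function names are hypothetical, and pandas provides ffill() and bfill() for the same operations.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def forward_fill(values):
    """Replace each missing entry with the most recent non-missing value before it."""
    filled = []
    last, have_last = None, False
    for v in values:
        if is_missing(v):
            # Leading gaps have nothing to carry forward, so they stay as they are.
            filled.append(last if have_last else v)
        else:
            last, have_last = v, True
            filled.append(v)
    return filled


def backward_fill(values):
    """Replace each missing entry with the next non-missing value after it."""
    return forward_fill(values[::-1])[::-1]


series = [None, 1.0, None, None, 4.0, None]
print(forward_fill(series))
print(backward_fill(series))
```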
In this video, we will learn how to handle an outlier by replacing it with a meaningful name.
• Implement logic using replace
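As one possible reading of this step, the sketch below swaps out-of-range sentinel values for descriptive labels through a lookup table. The sentinel values, the labels, and the function name are all assumptions made for the example; in pandas, the replace method accepts a similar mapping.

```python
def replace_values(values, mapping):
    """Replace any value found in the mapping; leave every other value untouched."""
    return [mapping.get(v, v) for v in values]


# Hypothetical column where -1 and 999 were used as placeholder values.
ages = [34, -1, 27, 999, 41]
labelled = replace_values(ages, {-1: "unknown", 999: "invalid"})
print(labelled)
```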