Learn how to engineer features and build more powerful machine learning models.
This is the most comprehensive, yet easy to follow, course for feature engineering available online. Throughout this course you will learn a variety of techniques used worldwide for data cleaning and feature transformation, gathered from data competition websites, white papers, scientific articles, and from the instructor’s experience as a Data Scientist.
You will have at your fingertips, all in one place, a variety of techniques that you can apply to extract as much insight as possible from the features of your data set.
The course starts by describing the simplest and most widely used methods for feature engineering, and then moves on to more advanced and innovative techniques that automatically capture insight from your variables. Each lecture includes an explanation of the feature engineering technique, the rationale for using it, its advantages and limitations, and the assumptions it makes about the data. It also includes full code that you can apply directly to your own data sets.
This course is suitable for complete beginners in data science taking their first steps in data pre-processing, as well as for intermediate and advanced data scientists seeking to level up their skills.
With more than 50 lectures and 10 hours of video, this comprehensive course covers every aspect of variable transformation. It includes several techniques for missing data imputation, categorical variable encoding, numerical variable transformation and discretisation, as well as methods to extract useful features from date and time variables. Throughout the course we use Python as our main language, together with open source packages for feature engineering, including the package Feature-engine, which was specifically designed for this course.
This course comes with a 30-day money-back guarantee. In the unlikely event you don’t find this course useful, you’ll get your money back.
So what are you waiting for? Enrol today, embrace the power of feature engineering and build better machine learning models.
Introduction
Variable Types
Variable Characteristics
Table illustrating the advantages and disadvantages of different machine learning algorithms, as well as their requirements in terms of feature engineering, and common applications.
Engineering missing values (NA) in numerical variables
In this lecture, I describe complete case analysis: what it is, what assumptions it makes, and the implications and consequences of handling missing values with this method.
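To make the idea concrete, here is a minimal sketch of complete case analysis in pandas, using a made-up toy data set (the variable names and values are illustrative, not taken from the course material):

```python
import numpy as np
import pandas as pd

# Toy data set with missing values (illustrative only)
df = pd.DataFrame({
    "age": [25, np.nan, 40, 31, np.nan],
    "income": [50000, 62000, np.nan, 48000, 55000],
})

# Complete case analysis: keep only the rows with no missing values at all
complete_cases = df.dropna()

# The fraction of data retained -- the key cost of this method:
# if many rows contain at least one NA, a large share of the data is lost
retained = len(complete_cases) / len(df)
```

Note how quickly the retained fraction shrinks once several variables have missing values, which is one of the main limitations discussed in the lecture.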
In this lecture, I describe what it means to replace missing values with the mean or median of the variable, the assumptions, advantages and disadvantages of doing so, and how it may affect the performance of machine learning algorithms.
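A short sketch of median imputation with pandas follows; the data is invented for illustration. The one detail worth highlighting in code is that the imputation value should be learned from the training data only and then applied to both train and test sets, to avoid leakage:

```python
import numpy as np
import pandas as pd

# Toy training data (illustrative only)
df = pd.DataFrame({"age": [25.0, np.nan, 40.0, 31.0, np.nan]})

# Learn the replacement value from the observed training values...
median_age = df["age"].median()

# ...then fill the missing slots with that single constant
df["age_imputed"] = df["age"].fillna(median_age)
```

The same `median_age` computed on the training set would then be reused to impute any future or test data.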
In this lecture, I describe what random sample imputation is, its advantages, and the precautions that should be taken if this method is implemented in a business setting.
This lecture continues the discussion of random sample imputation, its advantages, and the precautions required to implement it in a business setting.
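The technique covered in these two lectures can be sketched as follows; the data is a toy example, and fixing the random seed stands in for the reproducibility precaution a business setting requires (deployed models should not return different predictions on each run):

```python
import numpy as np
import pandas as pd

# Toy data with two missing values (illustrative only)
df = pd.DataFrame({"age": [25.0, np.nan, 40.0, 31.0, np.nan, 52.0]})

# Pool of observed values to draw replacements from
observed = df["age"].dropna()
n_missing = df["age"].isna().sum()

# Draw one random replacement per missing value; the fixed seed
# makes the imputation reproducible between runs
draws = observed.sample(n=n_missing, replace=True, random_state=0)
draws.index = df.index[df["age"].isna()]  # align draws with the missing slots

df["age_imputed"] = df["age"].fillna(draws)
```

Because replacements are drawn from the observed distribution, this method preserves the variable's distribution better than constant imputation, at the cost of the randomness that must be controlled in production.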
Here I describe the process of adding one additional binary variable to flag those observations where data is missing.
Engineering missing values (NA) in categorical variables
Bonus: More on engineering missing values
Engineering outliers in numerical variables
In this lecture I will describe a common method to handle outliers in numerical variables, one that is widely used in surveys as well as in other business settings.
This lecture continues from the previous one.
I continue describing this outlier-handling method and its use in surveys and other business settings.
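One widely used way to handle outliers is capping (top- and bottom-coding) at boundaries derived from the inter-quartile range; this sketch, on made-up data, shows the idea and is not necessarily the exact variant taught in these lectures:

```python
import pandas as pd

# Toy variable with one obvious outlier (illustrative only)
s = pd.Series([12, 14, 15, 13, 16, 14, 95])

# IQR proximity rule: values beyond 1.5 * IQR from the quartiles
# are considered outliers and capped at the boundary
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Capping replaces extreme values with the boundary instead of dropping rows
s_capped = s.clip(lower=lower, upper=upper)
```

Unlike removing outlying rows, capping keeps the observation in the data set, which matters when other variables in the same row carry useful information.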
Engineering rare values in categorical variables
In this lecture I will describe and compare 2 methods commonly used to replace rare labels, i.e. those categories within a categorical variable that contain very few observations and may therefore hurt the performance of tree based machine learning algorithms.
In this lecture I will focus on variables with one predominant category.
Continuing with the 2 methods commonly used to replace rare labels in categorical variables.
In this lecture I will focus on variables with few categories.
Continuing with the 2 methods commonly used to replace rare labels in categorical variables.
In this lecture I will focus on variables with high cardinality.
In this lecture I will focus on variables with several categories, using a different dataset to better illustrate the benefits of engineering rare labels.
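One of the common approaches to rare labels discussed in this section is grouping all infrequent categories under a single "Rare" label. A minimal sketch on invented data, with the 20% frequency threshold chosen arbitrarily for illustration:

```python
import pandas as pd

# Toy categorical variable (illustrative only)
s = pd.Series(["blue", "blue", "red", "blue", "red", "green", "yellow"])

# Relative frequency of each category
freq = s.value_counts(normalize=True)

# Categories seen in fewer than 20% of observations count as rare here;
# the threshold is a modelling choice, not a fixed rule
rare = freq[freq < 0.2].index

# Replace every rare category with the single grouped label "Rare"
s_grouped = s.where(~s.isin(rare), "Rare")
```

Grouping rare labels reduces cardinality and gives tree-based models enough observations per category to find meaningful splits.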