How to Handle Missing Values – Feature Engineering and Feature Selection in Data Mining
In this article, I will discuss:
- How to check for missing values in a given dataset
- Listwise deletion – Deleting the missing values
- Arbitrary Value Imputation
- Mean/Median/Mode Imputation
- Random Imputation
The dataset used for demonstration in this article is the titanic.csv file.
First, we will import the required libraries.
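    import pandas as pd
    import numpy as np
    import seaborn as sns
    import matplotlib.pyplot as plt
    import os

    # Matplotlib 3.6+ renamed this style to 'seaborn-v0_8-colorblind'.
    plt.style.use('seaborn-colorblind')
    # IPython/Jupyter magic; remove this line when running as a plain script.
    %matplotlib inline

    from data_exploration import explore
    # Note: the later snippets call helper functions through the alias `ms`;
    # the import that defines `ms` is not shown in this article.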
Next, we use the read_csv() function from the pandas library to read the dataset. We are interested in only a few columns, so we create a list, use_cols, with the required column names and pass it to the usecols parameter.
    use_cols = ['Pclass', 'Sex', 'Age', 'Fare', 'SibSp', 'Survived']
    data = pd.read_csv('./data/titanic.csv', usecols=use_cols)
    print(data.shape)
    data.head(8)
To confirm that the dataset was read successfully, we display its first eight rows with data.head(8). We also print the shape of the dataset using the shape attribute. In this case the shape is (891, 6), which means the dataset has 891 rows and 6 columns.
|   | Survived | Pclass | Sex | Age | SibSp | Fare |
|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 7.2500 |
| 1 | 1 | 1 | female | 38.0 | 1 | 71.2833 |
| 2 | 1 | 3 | female | 26.0 | 0 | 7.9250 |
| 3 | 1 | 1 | female | 35.0 | 1 | 53.1000 |
| 4 | 0 | 3 | male | 35.0 | 0 | 8.0500 |
| 5 | 0 | 3 | male | NaN | 0 | 8.4583 |
| 6 | 0 | 1 | male | 54.0 | 0 | 51.8625 |
| 7 | 0 | 3 | male | 2.0 | 3 | 21.0750 |
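The column-selection step can be tried without the real file. The following is a minimal sketch that assumes a hypothetical three-row stand-in for titanic.csv (the `csv_text` string and the `sample` variable are made up for illustration). It also shows why the displayed column order follows the file rather than the order of the use_cols list: pandas ignores the order of the names passed to usecols.

```python
import io

import pandas as pd

# Hypothetical stand-in for titanic.csv, with one extra column (PassengerId).
csv_text = """PassengerId,Survived,Pclass,Sex,Age,SibSp,Fare
1,0,3,male,22,1,7.25
2,1,1,female,38,1,71.2833
3,0,3,male,,0,8.4583
"""

use_cols = ["Pclass", "Sex", "Age", "Fare", "SibSp", "Survived"]

# usecols keeps only the listed columns; their order comes from the file.
sample = pd.read_csv(io.StringIO(csv_text), usecols=use_cols)

print(sample.shape)          # (number of rows, number of columns)
print(list(sample.columns))  # file order, not use_cols order
```

The empty Age field in the last row is read as NaN, which is how pandas represents a missing value.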
Missing value checking
The check_missing() function from the missing library (called here through the alias ms) reports the total number and the proportion of missing values for each variable of a pandas DataFrame.
    # Only the variable Age has missing values: 177 cases in total.
    # The result is saved to the output directory, if one is given.
    ms.check_missing(data=data, output_path=r'./output/')
|   | total missing | proportion |
|---|---|---|
| Survived | 0 | 0.000000 |
| Pclass | 0 | 0.000000 |
| Sex | 0 | 0.000000 |
| Age | 177 | 0.198653 |
| SibSp | 0 | 0.000000 |
| Fare | 0 | 0.000000 |
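For readers who do not have the helper module, a similar summary can be produced with plain pandas. This is a minimal sketch, not the helper's actual implementation; the function name `missing_summary` and the toy DataFrame are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and proportion of missing values per column."""
    return pd.DataFrame({
        "total missing": df.isnull().sum(),   # number of NaN entries
        "proportion": df.isnull().mean(),     # share of NaN entries
    })


# Toy data (hypothetical): one of the four Age values is missing.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.93, 53.10],
})

summary = missing_summary(toy)
print(summary)
```

On the Titanic data, the proportion for Age is 177 / 891 ≈ 0.198653, which matches the table above.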
Listwise deletion
The drop_missing() function deletes every example (row) that contains a missing value; this is called listwise deletion. We then display the shape of the dataset after the deletion. After the 177 rows with missing values are removed from the original 891, the shape is (714, 6).
    # The 177 cases that contain NA have been dropped.
    data2 = ms.drop_missing(data=data)
    data2.shape
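Listwise deletion can also be sketched with the pandas dropna() method. The toy DataFrame below is hypothetical, and this is an illustration of the technique rather than the code behind drop_missing().

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): rows 1 and 2 each contain a missing value.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 71.28, np.nan, 53.10],
})

# axis=0 drops rows; how="any" drops a row if any of its columns is missing.
complete = toy.dropna(axis=0, how="any")

print(toy.shape, "->", complete.shape)
```

Note that a row is removed even when only one of its values is missing, so listwise deletion can discard a large share of the data when missing values are spread across many rows.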
Add a variable to denote NA
add_var_denote_NA() function is used to create an additional variable indicating whether the data was missing for that observation.
    # Age_is_NA is created: 0 = not missing, 1 = missing for that observation.
    data3 = ms.add_var_denote_NA(data=data, NA_col=['Age'])
    print(data3.Age_is_NA.value_counts())
    data3.head(8)

In the new Age_is_NA column, observations with a missing Age are marked 1 and all others are marked 0. The Age column itself is left unchanged.
|   | Survived | Pclass | Sex | Age | SibSp | Fare | Age_is_NA |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 7.2500 | 0 |
| 1 | 1 | 1 | female | 38.0 | 1 | 71.2833 | 0 |
| 2 | 1 | 3 | female | 26.0 | 0 | 7.9250 | 0 |
| 3 | 1 | 1 | female | 35.0 | 1 | 53.1000 | 0 |
| 4 | 0 | 3 | male | 35.0 | 0 | 8.0500 | 0 |
| 5 | 0 | 3 | male | NaN | 0 | 8.4583 | 1 |
| 6 | 0 | 1 | male | 54.0 | 0 | 51.8625 | 0 |
| 7 | 0 | 3 | male | 2.0 | 3 | 21.0750 | 0 |
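A missing-value indicator of this kind can be built directly in pandas. The sketch below uses a hypothetical toy DataFrame and reuses the column name Age_is_NA from the table above; it illustrates the idea and is not the helper's own code.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): the second and fourth Age values are missing.
toy = pd.DataFrame({"Age": [22.0, np.nan, 26.0, np.nan]})

# Work on a copy so the original DataFrame is not modified.
flagged = toy.copy()

# isnull() gives True/False; astype(int) turns that into 1/0.
flagged["Age_is_NA"] = flagged["Age"].isnull().astype(int)

print(flagged)
```

The indicator is usually combined with an imputation method, so that a model can still see which values were originally missing.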
Arbitrary Value Imputation
Arbitrary value imputation replaces the missing values (represented by NA) with an arbitrary value chosen by the analyst. The impute_NA_with_arbitrary() function performs this replacement. Here NA is replaced with -999, and the result is stored in a new column, Age_-999.
    data4 = ms.impute_NA_with_arbitrary(data=data, impute_value=-999, NA_col=['Age'])
    data4.head(8)
|   | Survived | Pclass | Sex | Age | SibSp | Fare | Age_-999 |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 7.2500 | 22.0 |
| 1 | 1 | 1 | female | 38.0 | 1 | 71.2833 | 38.0 |
| 2 | 1 | 3 | female | 26.0 | 0 | 7.9250 | 26.0 |
| 3 | 1 | 1 | female | 35.0 | 1 | 53.1000 | 35.0 |
| 4 | 0 | 3 | male | 35.0 | 0 | 8.0500 | 35.0 |
| 5 | 0 | 3 | male | NaN | 0 | 8.4583 | -999.0 |
| 6 | 0 | 1 | male | 54.0 | 0 | 51.8625 | 54.0 |
| 7 | 0 | 3 | male | 2.0 | 3 | 21.0750 | 2.0 |
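The same replacement can be sketched with the pandas fillna() method. The toy DataFrame is hypothetical; the new column name follows the Age_-999 pattern seen in the table above.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): the second Age value is missing.
toy = pd.DataFrame({"Age": [22.0, np.nan, 26.0]})

impute_value = -999  # arbitrary value, chosen to lie outside the real Age range

# Keep the original column and store the imputed version in a new one.
imputed = toy.copy()
imputed[f"Age_{impute_value}"] = imputed["Age"].fillna(impute_value)

print(imputed)
```

Choosing a value far outside the observed range keeps imputed entries easy to tell apart from real ones, though it also distorts the distribution of the variable.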
Mean / Median / Mode Imputation
Missing values (NA) are replaced with the mean, median, or mode of that column. The impute_NA_with_avg() function performs this imputation; set strategy to 'mean', 'median', or 'mode' to choose which statistic is used.
    print(data.Age.mean())  # Mean is 29.69911764705882
    data5 = ms.impute_NA_with_avg(data=data, strategy='mean', NA_col=['Age'])
    data5.head(8)
|   | Survived | Pclass | Sex | Age | SibSp | Fare | Age_impute_mean |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 7.2500 | 22.000000 |
| 1 | 1 | 1 | female | 38.0 | 1 | 71.2833 | 38.000000 |
| 2 | 1 | 3 | female | 26.0 | 0 | 7.9250 | 26.000000 |
| 3 | 1 | 1 | female | 35.0 | 1 | 53.1000 | 35.000000 |
| 4 | 0 | 3 | male | 35.0 | 0 | 8.0500 | 35.000000 |
| 5 | 0 | 3 | male | NaN | 0 | 8.4583 | 29.699118 |
| 6 | 0 | 1 | male | 54.0 | 0 | 51.8625 | 54.000000 |
| 7 | 0 | 3 | male | 2.0 | 3 | 21.0750 | 2.000000 |
    print(data.Age.median())  # Median is 28.0
    data5 = ms.impute_NA_with_avg(data=data, strategy='median', NA_col=['Age'])
    data5.head(8)
|   | Survived | Pclass | Sex | Age | SibSp | Fare | Age_impute_mean |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 7.2500 | 22.000000 |
| 1 | 1 | 1 | female | 38.0 | 1 | 71.2833 | 38.000000 |
| 2 | 1 | 3 | female | 26.0 | 0 | 7.9250 | 26.000000 |
| 3 | 1 | 1 | female | 35.0 | 1 | 53.1000 | 35.000000 |
| 4 | 0 | 3 | male | 35.0 | 0 | 8.0500 | 35.000000 |
| 5 | 0 | 3 | male | NaN | 0 | 8.4583 | 28.000000 |
| 6 | 0 | 1 | male | 54.0 | 0 | 51.8625 | 54.000000 |
| 7 | 0 | 3 | male | 2.0 | 3 | 21.0750 | 2.000000 |
    print(data.Age.mode()[0])  # Mode is 24
    data5 = ms.impute_NA_with_avg(data=data, strategy='mode', NA_col=['Age'])
    data5.head(8)
|   | Survived | Pclass | Sex | Age | SibSp | Fare | Age_impute_mean |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 7.2500 | 22.000000 |
| 1 | 1 | 1 | female | 38.0 | 1 | 71.2833 | 38.000000 |
| 2 | 1 | 3 | female | 26.0 | 0 | 7.9250 | 26.000000 |
| 3 | 1 | 1 | female | 35.0 | 1 | 53.1000 | 35.000000 |
| 4 | 0 | 3 | male | 35.0 | 0 | 8.0500 | 35.000000 |
| 5 | 0 | 3 | male | NaN | 0 | 8.4583 | 24.000000 |
| 6 | 0 | 1 | male | 54.0 | 0 | 51.8625 | 54.000000 |
| 7 | 0 | 3 | male | 2.0 | 3 | 21.0750 | 2.000000 |
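The three strategies can be compared side by side with plain pandas. This is a minimal sketch on a hypothetical toy DataFrame; the output column names (Age_impute_mean and so on) are chosen here for illustration and are not confirmed to match what impute_NA_with_avg() produces for every strategy.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): one missing Age among five observations.
toy = pd.DataFrame({"Age": [20.0, 24.0, 24.0, np.nan, 40.0]})
age = toy["Age"]

# pandas skips NaN by default when computing these statistics.
fill_values = {
    "mean": age.mean(),
    "median": age.median(),
    "mode": age.mode()[0],  # mode() returns a Series; take its first entry
}

# One imputed column per strategy, leaving the original Age untouched.
imputed = toy.copy()
for name, value in fill_values.items():
    imputed[f"Age_impute_{name}"] = age.fillna(value)

print(fill_values)
print(imputed)
```

The median is often preferred over the mean for skewed variables, because a few extreme values pull the mean but barely move the median.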
Random Imputation
Here each missing value is replaced with a value drawn at random from the pool of available (non-missing) observations of the same variable. The impute_NA_with_random() function performs this replacement.
    data7 = ms.impute_NA_with_random(data=data, NA_col=['Age'])
    data7.head(8)
|   | Survived | Pclass | Sex | Age | SibSp | Fare | Age_impute_mean |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 7.2500 | 22.000000 |
| 1 | 1 | 1 | female | 38.0 | 1 | 71.2833 | 38.000000 |
| 2 | 1 | 3 | female | 26.0 | 0 | 7.9250 | 26.000000 |
| 3 | 1 | 1 | female | 35.0 | 1 | 53.1000 | 35.000000 |
| 4 | 0 | 3 | male | 35.0 | 0 | 8.0500 | 35.000000 |
| 5 | 0 | 3 | male | NaN | 0 | 8.4583 | 28.000000 |
| 6 | 0 | 1 | male | 54.0 | 0 | 51.8625 | 54.000000 |
| 7 | 0 | 3 | male | 2.0 | 3 | 21.0750 | 2.000000 |
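Random sample imputation can be sketched in pandas by sampling from the observed values. The toy DataFrame, the column name `Age_random`, and the fixed seed are all assumptions made for illustration; this is not the implementation of impute_NA_with_random().

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): two of the five Age values are missing.
toy = pd.DataFrame({"Age": [22.0, np.nan, 26.0, np.nan, 35.0]})

observed = toy["Age"].dropna()     # pool of available observations
missing_mask = toy["Age"].isnull() # True where Age is missing

# Draw one observed value per missing entry, with replacement.
# A fixed random_state makes the draw repeatable.
draws = observed.sample(
    n=int(missing_mask.sum()), replace=True, random_state=0
)

# Store the result in a new column and fill only the missing positions.
imputed = toy.copy()
imputed["Age_random"] = toy["Age"]
imputed.loc[missing_mask, "Age_random"] = draws.to_numpy()

print(imputed)
```

Because the filled values come from the observed data, this approach tends to preserve the distribution of the variable better than filling every gap with a single constant.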
Summary
This article introduced how to handle missing values as part of feature engineering and feature selection in data mining: checking for missing values, listwise deletion, adding a variable to denote NA, arbitrary value imputation, mean/median/mode imputation, and random imputation.