How to Detect and Handle Outliers – Feature Engineering and Feature Selection in Data Mining
In this article, I will discuss:
- Detect outliers by an arbitrary boundary
- Detect outliers using Interquartile Ranges Rule
- Detect outliers using Mean and Standard Deviation Method
- Imputation of outliers with an arbitrary value
- Imputation of outliers with Mean, Median, Mode
- How to Discard outliers
The titanic.csv dataset is used in this article for demonstration.
First, we import the required libraries: pandas, NumPy, os, and the outlier module from feature_cleaning.
import pandas as pd
import numpy as np
import os
from feature_cleaning import outlier as ot
Next, we use the read_csv() function from the pandas library to read the dataset. Since we are interested in only a few columns, we list them in use_cols and pass that list to the usecols parameter.
use_cols = ['Pclass', 'Sex', 'Age', 'Fare', 'SibSp', 'Survived']
data = pd.read_csv('./data/titanic.csv', usecols=use_cols)
print(data.shape)
data.head(8)
The data.head(8) call displays the first eight rows so that we can confirm the dataset was read successfully; the first three are shown below. The shape attribute gives the dimensions of the dataset. Here the shape is (891, 6), which means the dataset has 891 rows and 6 columns.
| | Survived | Pclass | Sex | Age | SibSp | Fare |
|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 7.2500 |
| 1 | 1 | 1 | female | 38.0 | 1 | 71.2833 |
| 2 | 1 | 3 | female | 26.0 | 0 | 7.9250 |
Detect outliers by an arbitrary boundary
Use the outlier_detect_arbitrary() function to find the outliers based on arbitrary boundaries.
index, para = ot.outlier_detect_arbitrary(data=data, col='Fare', upper_fence=300, lower_fence=5)
print('Upper bound:', para[0], '\nLower bound:', para[1])
Num of outlier detected: 19
The proportion of outliers detected was 0.02132435465768799
Upper bound: 300
Lower bound: 5
Use the following expression to display the detected outliers.
data.loc[index, 'Fare'].sort_values()
Output:
179 0.0000
806 0.0000
732 0.0000
674 0.0000
633 0.0000
597 0.0000
815 0.0000
466 0.0000
481 0.0000
302 0.0000
277 0.0000
271 0.0000
263 0.0000
413 0.0000
822 0.0000
378 4.0125
679 512.3292
737 512.3292
258 512.3292
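For readers who do not have the feature_cleaning module at hand, the same check can be sketched in plain pandas. This is a minimal sketch and not the library's implementation: the helper name detect_arbitrary and the toy fares are hypothetical, and it assumes a value counts as an outlier when it is strictly above the upper fence or strictly below the lower fence.

```python
import pandas as pd


def detect_arbitrary(series, upper_fence, lower_fence):
    """Hypothetical helper: boolean mask of values outside the two fences."""
    return (series > upper_fence) | (series < lower_fence)


# Toy data for illustration only, not the full Titanic Fare column.
fares = pd.Series([7.25, 0.0, 71.2833, 512.3292, 4.0125, 26.0], name="Fare")

mask = detect_arbitrary(fares, upper_fence=300, lower_fence=5)
print("Num of outliers detected:", int(mask.sum()))
print(fares[mask].sort_values())
```

Returning a boolean mask keeps the result aligned with the original index, so it can be passed directly to .loc for display or imputation.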
Detect outliers using Interquartile Ranges Rule
In the interquartile range (IQR) rule, we first find the first quartile Q1 and the third quartile Q3. The IQR is the difference between them, Q3 - Q1. The lower boundary is then Q1 - threshold × IQR and the upper boundary is Q3 + threshold × IQR. Any value that falls below the lower boundary or above the upper boundary is considered an outlier. The threshold is commonly set to 1.5; the call below uses 5, which widens the boundaries so that only more extreme values are flagged.
index, para = ot.outlier_detect_IQR(data=data, col='Fare', threshold=5)
print('Upper bound:', para[0], '\nLower bound:', para[1])
Num of outlier detected: 31
The proportion of outliers detected was 0.03479236812570146
Upper bound: 146.448
Lower bound: -107.53760000000001
Use the following expression to display the detected outliers.
data.loc[index, 'Fare'].sort_values()
Output:
31 146.5208
195 146.5208
305 151.5500
708 151.5500
297 151.5500
498 151.5500
609 153.4625
332 153.4625
268 153.4625
318 164.8667
856 164.8667
730 211.3375
779 211.3375
689 211.3375
377 211.5000
527 221.7792
700 227.5250
716 227.5250
557 227.5250
380 227.5250
299 247.5208
118 247.5208
311 262.3750
742 262.3750
341 263.0000
88 263.0000
438 263.0000
27 263.0000
679 512.3292
258 512.3292
737 512.3292
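The IQR rule can also be sketched in plain pandas. This is a minimal sketch under stated assumptions, not the library's code: the helper name iqr_bounds and the toy values are hypothetical, and quartiles are computed with pandas' default linear interpolation. The bounds are returned as (upper, lower) to mirror the order of para[0] and para[1] used above.

```python
import pandas as pd


def iqr_bounds(series, threshold=1.5):
    """Hypothetical helper: return (upper, lower) bounds from the IQR rule."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    upper = q3 + threshold * iqr
    lower = q1 - threshold * iqr
    return upper, lower


# Toy data for illustration only.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

upper, lower = iqr_bounds(values, threshold=1.5)
outliers = values[(values > upper) | (values < lower)]
print("Upper bound:", upper, "\nLower bound:", lower)
print(outliers)
```

For this toy series Q1 is 3 and Q3 is 7, so the IQR is 4 and the bounds are 13 and -3; only the value 100 falls outside them.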
Detect outliers using Mean and Standard Deviation Method
index, para = ot.outlier_detect_mean_std(data=data, col='Fare', threshold=3)
print('Upper bound:', para[0], '\nLower bound:', para[1])
Num of outlier detected: 20
The proportion of outliers detected was 0.02244668911335578
Upper bound: 181.2844937601173
Lower bound: -116.87607782296811
Use the following expression to display the detected outliers.
data.loc[index, 'Fare'].sort_values()
Output:
779 211.3375
730 211.3375
689 211.3375
377 211.5000
527 221.7792
716 227.5250
700 227.5250
380 227.5250
557 227.5250
118 247.5208
299 247.5208
311 262.3750
742 262.3750
27 263.0000
341 263.0000
88 263.0000
438 263.0000
258 512.3292
737 512.3292
679 512.3292
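The mean and standard deviation method can be sketched the same way. This is a minimal sketch, assuming the bounds are mean ± threshold × standard deviation and that the sample standard deviation (the pandas default, ddof=1) is used. The helper name mean_std_bounds and the toy values are hypothetical.

```python
import pandas as pd


def mean_std_bounds(series, threshold=3):
    """Hypothetical helper: return (upper, lower) as mean +/- threshold * std."""
    mean = series.mean()
    std = series.std()  # pandas default is the sample std (ddof=1)
    return mean + threshold * std, mean - threshold * std


# Toy data for illustration only: nineteen ordinary values and one extreme one.
values = pd.Series([10] * 19 + [100])

upper, lower = mean_std_bounds(values, threshold=3)
outliers = values[(values > upper) | (values < lower)]
print("Upper bound:", upper, "\nLower bound:", lower)
print(outliers)
```

Because the mean and standard deviation are themselves pulled toward extreme values, this method tends to flag fewer points than the IQR rule on skewed columns such as Fare.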
Imputation of outliers with an arbitrary value
First, we need to find the outliers using any of the methods discussed above. Once the outliers are detected, they can be handled in different ways. Here we detect the outliers using the arbitrary boundary method and display rows 261 to 272. In the Fare column, rows 263 and 271 contain outliers.
index, para = ot.outlier_detect_arbitrary(data=data, col='Fare', upper_fence=300, lower_fence=5)
data[261:273]
| | Survived | Pclass | Sex | Age | SibSp | Fare |
|---|---|---|---|---|---|---|
| 261 | 1 | 3 | male | 3.0 | 4 | 31.3875 |
| 262 | 0 | 1 | male | 52.0 | 1 | 79.6500 |
| 263 | 0 | 1 | male | 40.0 | 0 | 0.0000 |
| 264 | 0 | 3 | female | NaN | 0 | 7.7500 |
| 265 | 0 | 2 | male | 36.0 | 0 | 10.5000 |
| 266 | 0 | 3 | male | 16.0 | 4 | 39.6875 |
| 267 | 1 | 3 | male | 25.0 | 1 | 7.7750 |
| 268 | 1 | 1 | female | 58.0 | 0 | 153.4625 |
| 269 | 1 | 1 | female | 35.0 | 0 | 135.6333 |
| 270 | 0 | 1 | male | NaN | 0 | 31.0000 |
| 271 | 1 | 3 | male | 25.0 | 0 | 0.0000 |
| 272 | 1 | 2 | female | 41.0 | 0 | 19.5000 |
Now, we replace all outliers with an arbitrary value -999.
data2 = ot.impute_outlier_with_arbitrary(data=data, outlier_index=index, value=-999, col=['Fare'])
data2[261:273]
| | Survived | Pclass | Sex | Age | SibSp | Fare |
|---|---|---|---|---|---|---|
| 261 | 1 | 3 | male | 3.0 | 4 | 31.3875 |
| 262 | 0 | 1 | male | 52.0 | 1 | 79.6500 |
| 263 | 0 | 1 | male | 40.0 | 0 | -999 |
| 264 | 0 | 3 | female | NaN | 0 | 7.7500 |
| 265 | 0 | 2 | male | 36.0 | 0 | 10.5000 |
| 266 | 0 | 3 | male | 16.0 | 4 | 39.6875 |
| 267 | 1 | 3 | male | 25.0 | 1 | 7.7750 |
| 268 | 1 | 1 | female | 58.0 | 0 | 153.4625 |
| 269 | 1 | 1 | female | 35.0 | 0 | 135.6333 |
| 270 | 0 | 1 | male | NaN | 0 | 31.0000 |
| 271 | 1 | 3 | male | 25.0 | 0 | -999 |
| 272 | 1 | 2 | female | 41.0 | 0 | 19.5000 |
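Replacing flagged values with a constant can be sketched in plain pandas as follows. This is a minimal sketch, not the library's implementation: the helper name impute_with_value and the toy frame are hypothetical, and the sketch assumes the original frame should be left unmodified, so it works on a copy.

```python
import pandas as pd


def impute_with_value(df, outlier_mask, col, value):
    """Hypothetical helper: return a copy of df with flagged rows of col set to value."""
    out = df.copy()  # assumption: leave the caller's frame untouched
    out.loc[outlier_mask, col] = value
    return out


# Toy data for illustration only.
df = pd.DataFrame({"Fare": [31.3875, 0.0, 7.75, 512.3292]})
mask = (df["Fare"] > 300) | (df["Fare"] < 5)

df2 = impute_with_value(df, mask, "Fare", -999)
print(df2)
```

A sentinel such as -999 keeps the rows in the dataset while making the imputed entries easy to spot, but it only makes sense when the value cannot occur naturally in the column.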
Imputation of outliers with Mean
data5 = ot.impute_outlier_with_avg(data=data, col='Fare', outlier_index=index, strategy='mean')
data5[261:273]
| | Survived | Pclass | Sex | Age | SibSp | Fare |
|---|---|---|---|---|---|---|
| 261 | 1 | 3 | male | 3.0 | 4 | 31.3875 |
| 262 | 0 | 1 | male | 52.0 | 1 | 79.6500 |
| 263 | 0 | 1 | male | 40.0 | 0 | 32.204208 |
| 264 | 0 | 3 | female | NaN | 0 | 7.7500 |
| 265 | 0 | 2 | male | 36.0 | 0 | 10.5000 |
| 266 | 0 | 3 | male | 16.0 | 4 | 39.6875 |
| 267 | 1 | 3 | male | 25.0 | 1 | 7.7750 |
| 268 | 1 | 1 | female | 58.0 | 0 | 153.4625 |
| 269 | 1 | 1 | female | 35.0 | 0 | 135.6333 |
| 270 | 0 | 1 | male | NaN | 0 | 31.0000 |
| 271 | 1 | 3 | male | 25.0 | 0 | 32.204208 |
| 272 | 1 | 2 | female | 41.0 | 0 | 19.5000 |
Imputation of outliers with Median
data5 = ot.impute_outlier_with_avg(data=data, col='Fare', outlier_index=index, strategy='median')
data5[261:273]
| | Survived | Pclass | Sex | Age | SibSp | Fare |
|---|---|---|---|---|---|---|
| 261 | 1 | 3 | male | 3.0 | 4 | 31.3875 |
| 262 | 0 | 1 | male | 52.0 | 1 | 79.6500 |
| 263 | 0 | 1 | male | 40.0 | 0 | 14.4542 |
| 264 | 0 | 3 | female | NaN | 0 | 7.7500 |
| 265 | 0 | 2 | male | 36.0 | 0 | 10.5000 |
| 266 | 0 | 3 | male | 16.0 | 4 | 39.6875 |
| 267 | 1 | 3 | male | 25.0 | 1 | 7.7750 |
| 268 | 1 | 1 | female | 58.0 | 0 | 153.4625 |
| 269 | 1 | 1 | female | 35.0 | 0 | 135.6333 |
| 270 | 0 | 1 | male | NaN | 0 | 31.0000 |
| 271 | 1 | 3 | male | 25.0 | 0 | 14.4542 |
| 272 | 1 | 2 | female | 41.0 | 0 | 19.5000 |
Imputation of outliers with Mode
data5 = ot.impute_outlier_with_avg(data=data, col='Fare', outlier_index=index, strategy='mode')
data5[261:273]
| | Survived | Pclass | Sex | Age | SibSp | Fare |
|---|---|---|---|---|---|---|
| 261 | 1 | 3 | male | 3.0 | 4 | 31.3875 |
| 262 | 0 | 1 | male | 52.0 | 1 | 79.6500 |
| 263 | 0 | 1 | male | 40.0 | 0 | 8.0500 |
| 264 | 0 | 3 | female | NaN | 0 | 7.7500 |
| 265 | 0 | 2 | male | 36.0 | 0 | 10.5000 |
| 266 | 0 | 3 | male | 16.0 | 4 | 39.6875 |
| 267 | 1 | 3 | male | 25.0 | 1 | 7.7750 |
| 268 | 1 | 1 | female | 58.0 | 0 | 153.4625 |
| 269 | 1 | 1 | female | 35.0 | 0 | 135.6333 |
| 270 | 0 | 1 | male | NaN | 0 | 31.0000 |
| 271 | 1 | 3 | male | 25.0 | 0 | 8.0500 |
| 272 | 1 | 2 | female | 41.0 | 0 | 19.5000 |
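The three strategies above differ only in the statistic used as the fill value, which can be sketched in plain pandas. This is a minimal sketch, not the library's code: the helper name impute_with_stat and the toy frame are hypothetical, and it assumes the statistic is computed over the whole column, outliers included. An unknown strategy name raises an error so that a misspelled option is caught immediately.

```python
import pandas as pd


def impute_with_stat(df, outlier_mask, col, strategy="mean"):
    """Hypothetical helper: fill flagged rows of col with the column's mean, median, or mode."""
    if strategy == "mean":
        fill = df[col].mean()
    elif strategy == "median":
        fill = df[col].median()
    elif strategy == "mode":
        fill = df[col].mode().iloc[0]  # first mode if there are ties
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    out = df.copy()  # assumption: leave the caller's frame untouched
    out.loc[outlier_mask, col] = fill
    return out


# Toy data for illustration only.
df = pd.DataFrame({"Fare": [8.0, 8.0, 10.0, 14.0, 500.0]})
mask = df["Fare"] > 300

by_mean = impute_with_stat(df, mask, "Fare", strategy="mean")
by_median = impute_with_stat(df, mask, "Fare", strategy="median")
by_mode = impute_with_stat(df, mask, "Fare", strategy="mode")
print(by_mean["Fare"].iloc[-1], by_median["Fare"].iloc[-1], by_mode["Fare"].iloc[-1])
```

In the toy data the single extreme value drags the mean far above the typical fare, while the median and mode stay close to it, which is why the median is often the safer choice for skewed columns.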
How to Discard outliers
Finally, we can delete the rows with the outliers using the drop_outlier() function.
data4 = ot.drop_outlier(data=data, outlier_index=index)
print(data4.shape)
Output:
(872, 6)
This shows that the 19 rows containing outliers were removed, leaving 872 of the original 891 rows.
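Dropping the flagged rows can likewise be sketched with the standard DataFrame.drop method. This is a minimal sketch on hypothetical toy data, assuming the outliers are identified by their index labels as in the calls above.

```python
import pandas as pd

# Toy data for illustration only.
df = pd.DataFrame({"Fare": [31.3875, 0.0, 7.75, 512.3292, 19.5]})

# Flag values outside the arbitrary fences and collect their index labels.
mask = (df["Fare"] > 300) | (df["Fare"] < 5)
outlier_index = df.index[mask]

# drop() returns a new frame; the original df keeps all five rows.
df_clean = df.drop(index=outlier_index)
print(df_clean.shape)
```

The remaining rows keep their original index labels, so call reset_index(drop=True) afterwards if a contiguous index is needed.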
Summary
This article showed how to detect outliers using an arbitrary boundary, the interquartile range rule, and the mean and standard deviation method, and how to handle them by imputing an arbitrary value, the mean, the median, or the mode, or by discarding the affected rows.