How to Detect and Handle Outliers

How to Detect and Handle Outliers – Feature Engineering and Feature Selection in Data Mining

In this article, I will discuss:

Detect outliers by an arbitrary boundary
Detect outliers using Interquartile Ranges Rule
Detect outliers using Mean and Standard Deviation Method
Imputation of outliers with an arbitrary value
Imputation of outliers with Mean, Median, Mode
How to Discard outliers

The titanic.csv dataset is used in this article for demonstration.

First, we import the required libraries: pandas, NumPy, os, and the outlier module from feature_cleaning.

import pandas as pd
import numpy as np
import os
from feature_cleaning import outlier as ot

Next, we use the read_csv() function from the pandas library to read the dataset. We are interested in only a few columns, so we list them in use_cols and pass that list to the usecols parameter.

use_cols = ['Pclass', 'Sex', 'Age', 'Fare', 'SibSp', 'Survived']

data = pd.read_csv('./data/titanic.csv', usecols=use_cols)

print(data.shape)

data.head(8)

We print the shape of the dataset using the shape attribute and display the first eight rows with data.head(8) to confirm that the dataset was read successfully. In this case, the shape of the dataset is (891, 6), which indicates that the dataset has 891 rows and 6 columns.

Detect outliers by an arbitrary boundary

Use the outlier_detect_arbitrary() function to find outliers based on boundaries that you choose yourself. Here, any Fare above upper_fence=300 or below lower_fence=5 is flagged as an outlier.

index,para = ot.outlier_detect_arbitrary(data=data,col='Fare',upper_fence=300,lower_fence=5)
print('Upper bound:',para[0],'\nLower bound:',para[1])

Num of outlier detected: 19
The proportion of outliers detected was 0.02132435465768799
Upper bound: 300
Lower bound: 5

Use the following expression to display the detected outliers, sorted by fare.

data.loc[index,'Fare'].sort_values()

Output:
179      0.0000
806      0.0000
732      0.0000
674      0.0000
633      0.0000
597      0.0000
815      0.0000
466      0.0000
481      0.0000
302      0.0000
277      0.0000
271      0.0000
263      0.0000
413      0.0000
822      0.0000
378      4.0125
679    512.3292
737    512.3292
258    512.3292
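
To make the rule concrete, here is a minimal pandas sketch of the arbitrary boundary check on a small made-up series. It assumes that a value counts as an outlier when it lies strictly above the upper fence or strictly below the lower fence. It is not the feature_cleaning implementation, and the toy values and variable names are only for illustration.

```python
import pandas as pd

# Made-up fares for illustration, not the Titanic data.
fares = pd.Series([0.0, 4.0125, 7.25, 31.0, 79.65, 512.3292], name="Fare")

upper_fence = 300  # same fences as in the call above
lower_fence = 5

# Assumption: a value is flagged when it lies strictly outside the fences.
is_outlier = (fares > upper_fence) | (fares < lower_fence)
outlier_index = fares.index[is_outlier]

print("Num of outliers detected:", int(is_outlier.sum()))
print("Proportion of outliers:", is_outlier.mean())
print(fares[outlier_index].sort_values())
```

Because the fences are chosen by hand, this method is only as good as your domain knowledge about which fares are implausible.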

Detect outliers using Interquartile Ranges Rule

As shown in the diagram, the Interquartile Range (IQR) rule first finds the first and third quartiles, Q1 and Q3. The IQR is then calculated as the difference between Q3 and Q1. Finally, the minimum and maximum boundaries are calculated using the formula given in the diagram below. Any value that falls below the minimum or above the maximum is considered an outlier.

index,para = ot.outlier_detect_IQR(data=data,col='Fare',threshold=5)
print('Upper bound:',para[0],'\nLower bound:',para[1])

Num of outlier detected: 31
The proportion of outliers detected was 0.03479236812570146
Upper bound: 146.448
Lower bound: -107.53760000000001

Use the following expression to display the detected outliers, sorted by fare.

data.loc[index,'Fare'].sort_values()

Output:
31     146.5208
195    146.5208
305    151.5500
708    151.5500
297    151.5500
498    151.5500
609    153.4625
332    153.4625
268    153.4625
318    164.8667
856    164.8667
730    211.3375
779    211.3375
689    211.3375
377    211.5000
527    221.7792
700    227.5250
716    227.5250
557    227.5250
380    227.5250
299    247.5208
118    247.5208
311    262.3750
742    262.3750
341    263.0000
88     263.0000
438    263.0000
27     263.0000
679    512.3292
258    512.3292
737    512.3292
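
The IQR rule can also be written directly in pandas. The sketch below assumes the bounds are Q1 - threshold * IQR and Q3 + threshold * IQR, which is consistent with the bounds printed above for threshold=5. The toy data and variable names are made up for illustration.

```python
import pandas as pd

# Made-up fares for illustration, not the Titanic data.
fares = pd.Series([7.0, 8.0, 10.0, 13.0, 15.0, 26.0, 31.0, 500.0], name="Fare")
threshold = 5  # multiplier applied to the IQR, as in the call above

q1 = fares.quantile(0.25)  # first quartile
q3 = fares.quantile(0.75)  # third quartile
iqr = q3 - q1              # interquartile range

# Assumed form of the rule: fences sit threshold * IQR beyond the quartiles.
lower_bound = q1 - threshold * iqr
upper_bound = q3 + threshold * iqr

is_outlier = (fares < lower_bound) | (fares > upper_bound)

print("Upper bound:", upper_bound)
print("Lower bound:", lower_bound)
print(fares[is_outlier].sort_values())
```

A smaller threshold such as 1.5 is a common choice for this rule and flags more values; the larger value used here flags only the most extreme fares.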

Detect outliers using Mean and Standard Deviation Method

index,para = ot.outlier_detect_mean_std(data=data,col='Fare',threshold=3)
print('Upper bound:',para[0],'\nLower bound:',para[1])

Num of outlier detected: 20
The proportion of outliers detected was 0.02244668911335578
Upper bound: 181.2844937601173
Lower bound: -116.87607782296811

Use the following expression to display the detected outliers, sorted by fare.

data.loc[index,'Fare'].sort_values()

Output:
779    211.3375
730    211.3375
689    211.3375
377    211.5000
527    221.7792
716    227.5250
700    227.5250
380    227.5250
557    227.5250
118    247.5208
299    247.5208
311    262.3750
742    262.3750
27     263.0000
341    263.0000
88     263.0000
438    263.0000
258    512.3292
737    512.3292
679    512.3292
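
In the mean and standard deviation method, a value is treated as an outlier when it lies more than threshold standard deviations away from the mean. Here is a minimal pandas sketch on made-up data, assuming the bounds are mean - threshold * std and mean + threshold * std, with the sample standard deviation that pandas computes by default. The variable names are only for illustration.

```python
import pandas as pd

# Made-up fares for illustration: 19 ordinary values and one extreme value.
fares = pd.Series([float(x) for x in range(5, 24)] + [500.0], name="Fare")
threshold = 3  # number of standard deviations, as in the call above

mean = fares.mean()
std = fares.std()  # pandas uses the sample standard deviation (ddof=1)

# Assumed form of the rule: bounds sit threshold * std around the mean.
upper_bound = mean + threshold * std
lower_bound = mean - threshold * std

is_outlier = (fares > upper_bound) | (fares < lower_bound)

print("Upper bound:", upper_bound)
print("Lower bound:", lower_bound)
print(fares[is_outlier])
```

Note that extreme values inflate both the mean and the standard deviation, so on very small samples this rule may flag nothing at all.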

Imputation of outliers with an arbitrary value

First, we need to find the outliers using any of the methods discussed above. Once the outliers are detected, they can be handled in different ways. Here we find the outliers using the arbitrary boundary method and display rows 261 to 272. In the Fare column, rows 263 and 271 contain outliers.

index,para = ot.outlier_detect_arbitrary(data=data,col='Fare',upper_fence=300,lower_fence=5)
data[261:273]

     Survived  Pclass     Sex   Age  SibSp       Fare
261         1       3    male   3.0      4    31.3875
262         0       1    male  52.0      1    79.6500
263         0       1    male  40.0      0     0.0000
264         0       3  female   NaN      0     7.7500
265         0       2    male  36.0      0    10.5000
266         0       3    male  16.0      4    39.6875
267         1       3    male  25.0      1     7.7750
268         1       1  female  58.0      0   153.4625
269         1       1  female  35.0      0   135.6333
270         0       1    male   NaN      0    31.0000
271         1       3    male  25.0      0     0.0000
272         1       2  female  41.0      0    19.5000

Now we replace all outliers with the arbitrary value -999.

data2 = ot.impute_outlier_with_arbitrary(data=data,outlier_index=index,value=-999,col=['Fare'])
data2[261:273]

     Survived  Pclass     Sex   Age  SibSp       Fare
261         1       3    male   3.0      4    31.3875
262         0       1    male  52.0      1    79.6500
263         0       1    male  40.0      0       -999
264         0       3  female   NaN      0     7.7500
265         0       2    male  36.0      0    10.5000
266         0       3    male  16.0      4    39.6875
267         1       3    male  25.0      1     7.7750
268         1       1  female  58.0      0   153.4625
269         1       1  female  35.0      0   135.6333
270         0       1    male   NaN      0    31.0000
271         1       3    male  25.0      0       -999
272         1       2  female  41.0      0    19.5000
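
For readers who want to see the idea without the helper library, replacing outliers with a fixed value is a boolean-mask assignment in pandas. This is a minimal sketch on a made-up DataFrame. It works on a copy so that the original data stays untouched, which is an assumption about the desired behaviour rather than a statement about feature_cleaning.

```python
import pandas as pd

# Made-up rows for illustration, not the Titanic data.
df = pd.DataFrame({"Fare": [31.3875, 79.65, 0.0, 7.75, 512.3292]})

# Flag outliers with the same arbitrary fences as above.
is_outlier = (df["Fare"] > 300) | (df["Fare"] < 5)

imputed = df.copy()                     # leave the original frame unchanged
imputed.loc[is_outlier, "Fare"] = -999  # overwrite only the flagged rows

print(imputed)
```

A sentinel such as -999 keeps the rows but makes them easy to recognise later, provided the value cannot occur naturally in the column.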

Imputation of outliers with Mean

data5 = ot.impute_outlier_with_avg(data=data,col='Fare',outlier_index=index,strategy='mean')
data5[261:273]

     Survived  Pclass     Sex   Age  SibSp       Fare
261         1       3    male   3.0      4    31.3875
262         0       1    male  52.0      1    79.6500
263         0       1    male  40.0      0  32.204208
264         0       3  female   NaN      0     7.7500
265         0       2    male  36.0      0    10.5000
266         0       3    male  16.0      4    39.6875
267         1       3    male  25.0      1     7.7750
268         1       1  female  58.0      0   153.4625
269         1       1  female  35.0      0   135.6333
270         0       1    male   NaN      0    31.0000
271         1       3    male  25.0      0  32.204208
272         1       2  female  41.0      0    19.5000

Imputation of outliers with Median

data5 = ot.impute_outlier_with_avg(data=data,col='Fare',outlier_index=index,strategy='median')
data5[261:273]

     Survived  Pclass     Sex   Age  SibSp       Fare
261         1       3    male   3.0      4    31.3875
262         0       1    male  52.0      1    79.6500
263         0       1    male  40.0      0    14.4542
264         0       3  female   NaN      0     7.7500
265         0       2    male  36.0      0    10.5000
266         0       3    male  16.0      4    39.6875
267         1       3    male  25.0      1     7.7750
268         1       1  female  58.0      0   153.4625
269         1       1  female  35.0      0   135.6333
270         0       1    male   NaN      0    31.0000
271         1       3    male  25.0      0    14.4542
272         1       2  female  41.0      0    19.5000

Imputation of outliers with Mode

data5 = ot.impute_outlier_with_avg(data=data,col='Fare',outlier_index=index,strategy='mode')
data5[261:273]

     Survived  Pclass     Sex   Age  SibSp       Fare
261         1       3    male   3.0      4    31.3875
262         0       1    male  52.0      1    79.6500
263         0       1    male  40.0      0     8.0500
264         0       3  female   NaN      0     7.7500
265         0       2    male  36.0      0    10.5000
266         0       3    male  16.0      4    39.6875
267         1       3    male  25.0      1     7.7750
268         1       1  female  58.0      0   153.4625
269         1       1  female  35.0      0   135.6333
270         0       1    male   NaN      0    31.0000
271         1       3    male  25.0      0     8.0500
272         1       2  female  41.0      0    19.5000
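
The mean, median, and mode strategies differ only in the statistic used as the replacement value. The following sketch uses a hypothetical helper, impute_with_statistic, written for this illustration. It assumes the statistic is computed over the whole column, outliers included, and it raises an error for an unrecognised strategy so that a misspelled name does not pass silently. The data is made up.

```python
import pandas as pd

# Made-up rows for illustration, not the Titanic data.
df = pd.DataFrame({"Fare": [8.05, 8.05, 10.5, 31.0, 0.0, 512.3292]})
is_outlier = (df["Fare"] > 300) | (df["Fare"] < 5)


def impute_with_statistic(frame, col, mask, strategy="mean"):
    """Hypothetical helper: replace masked values with a column statistic."""
    if strategy == "mean":
        fill = frame[col].mean()
    elif strategy == "median":
        fill = frame[col].median()
    elif strategy == "mode":
        fill = frame[col].mode()[0]  # first of the most frequent values
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    out = frame.copy()               # leave the original frame unchanged
    out.loc[mask, col] = fill
    return out


for strategy in ("mean", "median", "mode"):
    imputed = impute_with_statistic(df, "Fare", is_outlier, strategy)
    print(strategy, imputed.loc[is_outlier, "Fare"].tolist())
```

The median and mode are less affected by the extreme values than the mean, which is one reason to prefer them when the outliers are large.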

How to Discard outliers

Finally, we can delete the rows with the outliers using the drop_outlier() function.

data4 = ot.drop_outlier(data=data,outlier_index=index)
print(data4.shape)

Output:

(872, 6)

It shows that the 19 rows with outliers were removed, so 872 of the original 891 rows remain.
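
Dropping the flagged rows is equivalent to keeping the rows where the outlier mask is False. Here is a minimal pandas sketch on made-up data. It is not the feature_cleaning implementation, and the names are only for illustration.

```python
import pandas as pd

# Made-up rows for illustration, not the Titanic data.
df = pd.DataFrame({"Fare": [31.3875, 79.65, 0.0, 7.75, 512.3292]})
is_outlier = (df["Fare"] > 300) | (df["Fare"] < 5)

# Keep only the rows that were not flagged; the original index labels are preserved.
kept = df[~is_outlier]

print(df.shape, "->", kept.shape)
```

If later code relies on a contiguous index, kept.reset_index(drop=True) renumbers the remaining rows.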

Summary

This article showed how to detect outliers using an arbitrary boundary, the Interquartile Range rule, and the mean and standard deviation method, and how to handle them by imputing an arbitrary value, the mean, the median, or the mode, or by discarding the affected rows.
