Feature Scaling: Normalization and Standardization in Data Mining
In this article, I will discuss:
- Normalization – Standardization (Z-score scaling)
- Min-Max scaling
- Robust scaling
Video Tutorial – Feature Scaling Normalization Standardization
Click here to download the titanic.csv dataset file, which is used in this article for demonstration.
First, we will import the required libraries: pandas, NumPy, and os, as well as the train_test_split function from sklearn.model_selection.
```python
import pandas as pd
import numpy as np
import os
from sklearn.model_selection import train_test_split
```
Next, we use the read_csv() function from the pandas library to read the dataset. Since we are interested in only a few columns, we create a list called use_cols with the required column names and pass it to the usecols parameter.
```python
use_cols = ['Pclass', 'Sex', 'Age', 'Fare', 'SibSp', 'Survived']
data = pd.read_csv('./data/titanic.csv', usecols=use_cols)
print(data.shape)
data.head(3)
```
Now we display the first three rows using data.head(3) to confirm that the dataset was read successfully. We also display the shape of the dataset using the shape attribute. In this case the shape is (891, 6), which indicates that there are 891 rows and 6 columns in the dataset.
| | Survived | Pclass | Sex | Age | SibSp | Fare |
|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 7.2500 |
| 1 | 1 | 1 | female | 38.0 | 1 | 71.2833 |
| 2 | 1 | 3 | female | 26.0 | 0 | 7.9250 |
Note that we pass the full dataset (including the target variable) as both arguments to train_test_split, so the target column remains in X_train. This is done here for demonstration convenience and is not the standard way of using train_test_split.
```python
X_train, X_test, y_train, y_test = train_test_split(data, data, test_size=0.3, random_state=0)
X_train.shape, X_test.shape
```
Output:
((623, 6), (268, 6))
Normalization – Standardization (Z-score scaling)
First, we check whether the data is already standardized. If the mean is 0 and the standard deviation is 1, the data is already standardized and there is no need for feature scaling.
```python
print(X_train['Fare'].mean())
print(X_train['Fare'].std())
```

Output:

```
32.458272552166925
48.257658284816124
```
The Z-score scaling is performed using the below formula.
z = (X – X.mean) / X.std
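Before applying it to the dataset, the formula can be checked against StandardScaler on a few toy values (illustrative numbers, not taken from titanic.csv):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy values standing in for the Fare column (illustrative only)
fare = np.array([[7.25], [71.28], [7.93], [26.55], [76.73]])

scaled = StandardScaler().fit_transform(fare)

# Manual z = (X - mean) / std; StandardScaler uses the population
# standard deviation (ddof=0)
manual = (fare - fare.mean()) / fare.std(ddof=0)

print(np.allclose(scaled, manual))  # True
```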
```python
# add the newly created feature
from sklearn.preprocessing import StandardScaler

ss = StandardScaler().fit(X_train[['Fare']])
X_train_copy = X_train.copy(deep=True)
X_train_copy['Fare_zscore'] = ss.transform(X_train_copy[['Fare']])
print(X_train_copy.head(6))
```

Output:

```
     Survived  Pclass     Sex   Age  SibSp     Fare  Fare_zscore
857         1       1    male  51.0      0  26.5500    -0.122530
52          1       1  female  49.0      1  76.7292     0.918124
386         0       3    male   1.0      5  46.9000     0.299503
124         0       1    male  54.0      0  77.2875     0.929702
578         0       3  female   NaN      1  14.4583    -0.373297
549         1       2    male   8.0      1  36.7500     0.089005
```
Now we find the mean and standard deviation of the scaled feature.
```python
print(X_train_copy['Fare_zscore'].mean())
print(X_train_copy['Fare_zscore'].std())
```

Output:

```
5.916437306188636e-17
1.0008035356861
```
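Two details worth noting. The standard deviation above is 1.0008 rather than exactly 1 because pandas' .std() uses the sample estimate (ddof=1), while StandardScaler divides by the population standard deviation (ddof=0). Also, the fitted scaler should be reused to transform X_test with the training mean and standard deviation, never refit on the test set. A minimal sketch with toy fares (hypothetical values, not the real split):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy train/test fares (hypothetical values)
train = pd.DataFrame({'Fare': [7.25, 71.28, 7.93, 26.55, 76.73, 14.46]})
test = pd.DataFrame({'Fare': [8.05, 53.10]})

ss = StandardScaler().fit(train[['Fare']])   # fit on training data only

train_scaled = ss.transform(train[['Fare']])
test_scaled = ss.transform(test[['Fare']])   # reuse training mean/std: no leakage

print(train_scaled.std(ddof=0))  # close to 1.0 (population std)
print(train_scaled.std(ddof=1))  # slightly above 1, like pandas' .std()
```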
Min-Max scaling
In Min-Max scaling, the scaled values are calculated using the below formula.
X_scaled = (X – X.min) / (X.max – X.min)
```python
from sklearn.preprocessing import MinMaxScaler

mms = MinMaxScaler().fit(X_train[['Fare']])
X_train_copy = X_train.copy(deep=True)
X_train_copy['Fare_minmax'] = mms.transform(X_train_copy[['Fare']])
print(X_train_copy.head(6))
```

Output:

```
     Survived  Pclass     Sex   Age  SibSp     Fare  Fare_minmax
857         1       1    male  51.0      0  26.5500     0.051822
52          1       1  female  49.0      1  76.7292     0.149765
386         0       3    male   1.0      5  46.9000     0.091543
124         0       1    male  54.0      0  77.2875     0.150855
578         0       3  female   NaN      1  14.4583     0.028221
549         1       2    male   8.0      1  36.7500     0.071731
```
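On the training data, MinMaxScaler maps the minimum to 0 and the maximum to 1; values seen later that fall outside the training range map outside [0, 1]. A small sketch with made-up fares (not from the dataset):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up fares; 512.33 plays the role of an extreme fare
fare = np.array([[7.25], [71.28], [26.55], [512.33]])

mms = MinMaxScaler().fit(fare)
scaled = mms.transform(fare)

# Manual (X - min) / (max - min) agrees with the scaler
manual = (fare - fare.min()) / (fare.max() - fare.min())
print(np.allclose(scaled, manual))   # True

# A value above the training maximum scales above 1
print(mms.transform([[600.0]]))
```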
Robust scaling
In Robust scaling, the scaled values are calculated according to the quantile range (which defaults to the IQR, the interquartile range).
X_scaled = (X – X.median) / IQR
```python
from sklearn.preprocessing import RobustScaler

rs = RobustScaler().fit(X_train[['Fare']])
X_train_copy = X_train.copy(deep=True)
X_train_copy['Fare_robust'] = rs.transform(X_train_copy[['Fare']])
print(X_train_copy.head(6))
```

Output:

```
     Survived  Pclass     Sex   Age  SibSp     Fare  Fare_robust
857         1       1    male  51.0      0  26.5500     0.492275
52          1       1  female  49.0      1  76.7292     2.630973
386         0       3    male   1.0      5  46.9000     1.359616
124         0       1    male  54.0      0  77.2875     2.654768
578         0       3  female   NaN      1  14.4583    -0.023088
549         1       2    male   8.0      1  36.7500     0.927011
```
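Because the median and IQR ignore extreme values, robust scaling is far less sensitive to outliers than Z-score scaling. A sketch on toy values (illustrative, not from the dataset) also confirms the formula above:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Toy fares with one extreme outlier
fare = np.array([[7.0], [8.0], [9.0], [10.0], [11.0], [500.0]])

rs_scaled = RobustScaler().fit_transform(fare)

# Manual (X - median) / IQR, where IQR = 75th - 25th percentile
median = np.median(fare)
iqr = np.percentile(fare, 75) - np.percentile(fare, 25)
manual = (fare - median) / iqr

print(np.allclose(rs_scaled, manual))   # True
# The five typical fares stay in a narrow band; only the outlier lands far away
```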
Summary
This article introduced feature scaling using standardization (Z-score scaling), Min-Max scaling, and Robust scaling. If you found the material helpful, share it with your friends. Like the Facebook page for regular updates and subscribe to the YouTube channel for video tutorials.