Data Preprocessing in Data Mining


Introduction to Data Preprocessing – Feature Engineering and Feature Selection in Data Mining

In this article, I will discuss:

  • Motivation for Data Preprocessing
  • Steps in Data Preprocessing

Motivation for Data Preprocessing

Real-world datasets are often degraded by noise, missing values, redundancy, outliers, and inconsistencies. A low-quality dataset leads to poor performance or outright failure of a machine learning or deep learning project.

Nowadays, a large number of machine learning, deep learning, and transfer learning algorithms have been designed. But the success or failure of these models largely depends on the quality of the dataset used and the features selected.

Hence, Data Preprocessing, also known as Feature Engineering and Feature Selection, is a very important stage in building a usable machine learning or deep learning project.


There are mainly two steps in data preprocessing:

  1. Data Preparation
  2. Data Reduction

The following are the forms of Data Preparation:

Forms of Data preparation

Data Cleaning

Data cleaning is the process of correcting bad data, filtering incorrect data out of the dataset, and reducing unnecessary detail in the data.
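As a minimal sketch of data cleaning in plain Python, assume a hypothetical list of records with `age` and `city` fields (the field names, values, and the valid age range are illustrative assumptions, not taken from a real dataset). The snippet corrects inconsistent formatting, filters out impossible values, and drops records that become duplicates once the formatting is fixed.

```python
# Hypothetical raw records; field names and values are illustrative only.
raw_records = [
    {"age": 34, "city": " Bangalore "},
    {"age": -5, "city": "Mysore"},        # impossible age: filtered out
    {"age": 34, "city": "bangalore"},     # duplicate once formatting is fixed
    {"age": 51, "city": "MYSORE"},
]


def clean(records, min_age=0, max_age=120):
    """Normalize text fields, drop out-of-range ages, remove duplicates."""
    seen = set()
    cleaned = []
    for rec in records:
        city = rec["city"].strip().title()   # correct inconsistent formatting
        age = rec["age"]
        if not (min_age <= age <= max_age):  # filter out incorrect data
            continue
        key = (age, city)
        if key in seen:                      # skip records already kept
            continue
        seen.add(key)
        cleaned.append({"age": age, "city": city})
    return cleaned


cleaned_records = clean(raw_records)
print(cleaned_records)
```

In practice the rules for what counts as "bad" data depend on the domain, so the thresholds above should be read as placeholders.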

Data Transformation

Data transformation is the process of transforming or consolidating data into a form in which the mining process can be applied, or applied more efficiently.
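One simple kind of consolidation is aggregation. The sketch below assumes hypothetical daily sales rows with ISO-formatted date strings (both the data and the `monthly_totals` helper are illustrative) and rolls them up into monthly totals, a coarser form that a mining algorithm may handle more efficiently.

```python
from collections import defaultdict

# Hypothetical daily sales as (ISO date string, amount) pairs.
daily_sales = [
    ("2024-01-03", 120.0),
    ("2024-01-17", 80.0),
    ("2024-02-05", 200.0),
]


def monthly_totals(rows):
    """Aggregate daily amounts into one total per 'YYYY-MM' month."""
    totals = defaultdict(float)
    for date_str, amount in rows:
        month = date_str[:7]          # 'YYYY-MM' prefix of an ISO date
        totals[month] += amount
    return dict(totals)


monthly = monthly_totals(daily_sales)
print(monthly)
```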

Data Integration

Data integration is the process of collecting and merging data from multiple data stores.
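A rough illustration of merging, assuming two hypothetical stores that share a customer ID as key (the store names and fields are made up for the example). The helper behaves like an outer join: a key present in only one store keeps just that store's fields.

```python
# Two hypothetical data stores keyed by customer ID (illustrative values).
crm_store = {101: {"name": "Asha"}, 102: {"name": "Ravi"}}
billing_store = {101: {"balance": 250.0}, 103: {"balance": 80.0}}


def integrate(left, right):
    """Merge two keyed stores, keeping every key found in either one."""
    merged = {}
    for key in sorted(set(left) | set(right)):
        record = {}
        record.update(left.get(key, {}))    # fields from the first store
        record.update(right.get(key, {}))   # fields from the second store
        merged[key] = record
    return merged


integrated = integrate(crm_store, billing_store)
for customer_id, record in integrated.items():
    print(customer_id, record)
```

Real integration work usually also has to reconcile conflicting field names and units between sources, which this sketch does not attempt.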

Data Normalization

Data normalization is the process of expressing data in the same measurement terms, such as units, scale, or range.
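Two widely used ways to bring a numeric attribute onto a common scale are min-max scaling and z-score standardization. The sketch below implements both with the Python standard library; the `heights_cm` sample is an illustrative assumption.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:                    # constant column: nothing to scale
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


heights_cm = [150, 160, 170, 180, 190]   # illustrative sample
scaled = min_max_scale(heights_cm)
standardized = z_score(heights_cm)
print(scaled)
print(standardized)
```

Min-max scaling fixes the range but is sensitive to extreme values, while z-scores keep relative distances in units of standard deviation; which one suits a project depends on the algorithm used afterwards.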

Missing Data Imputation

The collected data may contain missing values. Imputation methods fill in the missing values of a variable with plausible estimates, such as the mean of the observed values.
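Mean imputation is one of the simplest imputation methods. The following sketch assumes missing entries are represented as `None` (an assumption of this example; real datasets may use other markers) and replaces them with the mean of the observed values.

```python
from statistics import mean


def impute_mean(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


incomes = [30.0, None, 50.0, 40.0, None]   # illustrative column
imputed = impute_mean(incomes)
print(imputed)
```

Mean imputation shrinks the variance of the column, so the median or a model-based estimate may be preferable when many values are missing.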

Noise Identification

Noise identification is the process of detecting random errors or variance in a measured variable.
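One common heuristic for flagging suspicious measurements is the interquartile range (IQR) rule, sketched below. The factor 1.5 and the sample `readings` are assumptions of this example, and a flagged value is only a candidate for noise that still needs to be checked against domain knowledge.

```python
from statistics import quantiles


def flag_noise_iqr(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)    # quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Illustrative sensor readings with one obviously wrong measurement.
readings = [20.1, 19.8, 20.3, 20.0, 95.0, 19.9, 20.2]
noisy = flag_noise_iqr(readings)
print(noisy)
```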

The following are the forms of Data Reduction:

Forms of Data Reduction

Feature Selection

Feature selection reduces the dataset by removing irrelevant or redundant features (or dimensions).
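A minimal filter-style sketch of this idea: treat a near-constant feature as irrelevant and a feature that is highly correlated with one already kept as redundant. The thresholds, the column names, and the `select_features` helper are illustrative assumptions; many other selection criteria exist.

```python
from statistics import mean, pvariance


def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def select_features(columns, min_variance=1e-12, max_corr=0.95):
    """Keep features that vary and are not near-copies of a kept feature."""
    kept = []
    for name, values in columns.items():
        if pvariance(values) <= min_variance:      # near-constant: irrelevant
            continue
        if any(abs(pearson(values, columns[k])) > max_corr for k in kept):
            continue                               # redundant with a kept one
        kept.append(name)
    return kept


# Hypothetical feature columns (illustrative values).
columns = {
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "height_m": [1.5, 1.6, 1.7, 1.8],       # same information as height_cm
    "sensor_id": [7.0, 7.0, 7.0, 7.0],      # constant
    "weight_kg": [70.0, 52.0, 81.0, 60.0],
}
selected = select_features(columns)
print(selected)
```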

Instance Selection

Instance selection consists of choosing a subset of the total available data that achieves the original purpose of the application as if the whole dataset had been used.
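Stratified random sampling is one simple way to pick such a subset while keeping the class proportions of the full dataset. The sketch below is an illustration under assumed data (100 hypothetical instances labelled "spam" or "ham"); it does not by itself guarantee that a model trained on the subset performs like one trained on everything.

```python
import random
from collections import defaultdict


def stratified_sample(instances, label_of, fraction, seed=0):
    """Sample the same fraction from each class, at least one per class."""
    rng = random.Random(seed)             # fixed seed for repeatable runs
    by_label = defaultdict(list)
    for inst in instances:
        by_label[label_of(inst)].append(inst)
    subset = []
    for group in by_label.values():
        k = max(1, round(len(group) * fraction))
        subset.extend(rng.sample(group, k))
    return subset


# Hypothetical dataset: (id, label) pairs, 20 "spam" and 80 "ham".
instances = [(i, "spam" if i % 5 == 0 else "ham") for i in range(100)]
subset = stratified_sample(instances, label_of=lambda inst: inst[1], fraction=0.1)
print(len(subset), sum(1 for _, label in subset if label == "spam"))
```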


Discretization

Discretization transforms quantitative data into qualitative data, that is, numerical attributes into nominal attributes with a finite number of intervals.
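The transformation described above can be sketched with equal-width binning, one of the simplest ways to turn a numerical attribute into a nominal one. The `ages` sample and the three labels are illustrative assumptions.

```python
def equal_width_bins(values, labels):
    """Map each number to a label using len(labels) equal-width intervals."""
    n_bins = len(labels)
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    result = []
    for v in values:
        if width == 0:                    # all values identical: single bin
            idx = 0
        else:
            # The maximum value falls on the last edge, so clamp the index.
            idx = min(int((v - lo) / width), n_bins - 1)
        result.append(labels[idx])
    return result


ages = [18, 25, 33, 47, 52, 68, 78]       # illustrative numeric attribute
age_groups = equal_width_bins(ages, ["young", "middle", "senior"])
print(age_groups)
```

Equal-width bins can end up very unevenly filled when the data is skewed; equal-frequency binning is a common alternative in that case.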

Feature Extraction/Instance Generation

Extends both feature and instance selection by allowing the modification of the internal values that represent each example or attribute.


This article introduces Data Preprocessing – Feature Engineering and Feature Selection in Data Mining.
