Introduction to Data Preprocessing – Feature Engineering and Feature Selection in Data Mining
In this article, I will discuss:
- Motivation for Data Preprocessing
- Steps in Data Preprocessing
Motivation for Data Preprocessing
Real-world datasets are often degraded by noise, missing values, redundancy, outliers, and inconsistencies. A low-quality dataset leads to poor performance, or outright failure, of a machine learning or deep learning project.
Nowadays, a large number of machine learning, deep learning, and transfer learning algorithms are available. But the success or failure of these models largely depends on the quality of the dataset used and the features selected.
Hence, Data Preprocessing, which covers Feature Engineering and Feature Selection, is a very important stage in building a usable machine learning or deep learning project.
Steps in Data Preprocessing
There are two main steps in data preprocessing:
- Data Preparation
- Data Reduction
The following are the forms of Data Preparation:
Data Cleaning
Data cleaning is the process of correcting bad data, filtering incorrect data out of the dataset, and reducing unnecessary detail in the data.
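As a minimal sketch of these ideas, the snippet below tidies a text field, filters out implausible values, and drops duplicates. The record layout, the `clean_records` helper, and the 0 to 120 age range are assumptions made for illustration only.

```python
def clean_records(records):
    """Normalise names, drop implausible ages, and remove duplicate rows."""
    seen = set()
    cleaned = []
    for rec in records:
        # Correct bad data: strip stray whitespace and fix capitalisation.
        name = rec["name"].strip().title()
        age = rec["age"]
        # Filter out incorrect data (assumed plausible range: 0 to 120).
        if not 0 <= age <= 120:
            continue
        # Skip exact duplicates of a row we have already kept.
        key = (name, age)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"name": name, "age": age})
    return cleaned


raw = [
    {"name": " alice ", "age": 30},
    {"name": "Alice", "age": 30},   # duplicate after normalisation
    {"name": "bob", "age": -5},     # impossible age
    {"name": "carol", "age": 41},
]
cleaned = clean_records(raw)
print(cleaned)
```

In a real project a library such as pandas would usually do this work; plain Python is used here only to keep the example self-contained.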
Data Transformation
Data transformation is the process of converting or consolidating data into a form suitable for mining, so that the mining process can be applied and may be more efficient.
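One common form of consolidation is aggregation. The sketch below rolls daily sales up into monthly totals; the `monthly_totals` helper and the `YYYY-MM-DD` date format are assumptions for this example.

```python
from collections import defaultdict


def monthly_totals(daily_sales):
    """Aggregate (date, amount) pairs into one total per month."""
    totals = defaultdict(float)
    for date, amount in daily_sales:
        month = date[:7]  # "YYYY-MM-DD" -> "YYYY-MM"
        totals[month] += amount
    return dict(totals)


daily_sales = [
    ("2023-01-05", 120.0),
    ("2023-01-20", 80.0),
    ("2023-02-02", 50.0),
]
totals = monthly_totals(daily_sales)
print(totals)
```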
Data Integration
Data integration is the process of collecting and merging data from multiple data stores.
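A rough illustration: the sketch below merges a customer list with an order list that share a customer identifier. The two sources, their field names, and the `integrate` helper are hypothetical.

```python
def integrate(customers, orders):
    """Attach each customer's total order amount (a simple left join on id)."""
    amounts_by_id = {}
    for order in orders:
        amounts_by_id.setdefault(order["customer_id"], []).append(order["amount"])

    merged = []
    for customer in customers:
        amounts = amounts_by_id.get(customer["id"], [])
        merged.append(
            {
                "id": customer["id"],
                "name": customer["name"],
                "total_spent": sum(amounts),  # 0 when the customer has no orders
            }
        )
    return merged


customers = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]
orders = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": 5.5},
]
merged = integrate(customers, orders)
print(merged)
```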
Data Normalization
Data normalization is the process of expressing data in the same units, scale, or range.
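Two widely used ways to do this are min-max scaling (mapping values to the range 0 to 1) and z-score standardization (zero mean, unit standard deviation). The sketch below shows both; the function names and sample heights are illustrative.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Centre values on the mean and divide by the population std deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


heights_cm = [150, 160, 170, 180, 190]
scaled = min_max_scale(heights_cm)
standardized = z_score(heights_cm)
print(scaled)
print(standardized)
```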
Missing Data Imputation
The collected data may contain missing values. Imputation methods fill these missing values with plausible estimates, such as the mean or the most frequent value of the variable.
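A minimal sketch of mean imputation for a numeric variable and mode imputation for a categorical one follows. Missing values are assumed to be represented by `None`, and the `impute` helper is illustrative.

```python
from statistics import mean, mode


def impute(values, strategy="mean"):
    """Replace None with the mean (numeric) or mode (categorical) of the rest."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if strategy == "mean" else mode(observed)
    return [fill if v is None else v for v in values]


ages = [20, None, 40, 30, None]
colors = ["red", None, "blue", "red"]

ages_filled = impute(ages)
colors_filled = impute(colors, strategy="mode")
print(ages_filled)
print(colors_filled)
```

Mean and mode are only the simplest choices; which strategy is appropriate depends on why the values are missing.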
Noise Identification
Noise identification is the process of detecting random errors or variance in a measured variable.
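One simple way to flag suspicious measurements is to compare each value's distance from the median with the median absolute deviation (MAD). This is only one of many possible rules, and the threshold of 5 used below is an arbitrary assumption.

```python
from statistics import median


def flag_noise(values, threshold=5.0):
    """Return values whose distance from the median exceeds threshold * MAD."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread at all, so nothing can be flagged by this rule
    return [v for v in values if abs(v - med) / mad > threshold]


readings = [10.1, 9.9, 10.0, 10.2, 9.8, 25.0]
noisy = flag_noise(readings)
print(noisy)
```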
The following are the forms of Data Reduction:
Feature Selection
Feature selection reduces the dataset by removing irrelevant or redundant features (or dimensions).
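As a rough sketch, the filter below treats a (nearly) constant feature as irrelevant and a feature that is almost perfectly correlated with an already kept feature as redundant. The thresholds, column names, and the `select_features` helper are assumptions for illustration.

```python
from statistics import mean, pvariance


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    norm = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return cov / norm if norm else 0.0


def select_features(columns, min_variance=1e-9, max_corr=0.95):
    """Keep features that vary and are not near-duplicates of a kept feature."""
    kept = {}
    for name, values in columns.items():
        if pvariance(values) <= min_variance:
            continue  # irrelevant: carries no information
        if any(abs(pearson(values, other)) >= max_corr for other in kept.values()):
            continue  # redundant: duplicates an earlier feature
        kept[name] = values
    return list(kept)


columns = {
    "height_cm": [150, 160, 170, 180],
    "height_in": [59.1, 63.0, 66.9, 70.9],  # same information as height_cm
    "country": [1, 1, 1, 1],                # constant
    "weight_kg": [70, 52, 85, 60],
}
selected = select_features(columns)
print(selected)
```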
Instance Selection
Instance selection consists of choosing a subset of the total available data that achieves the original purpose of the application as if the whole dataset had been used.
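Stratified random sampling is one simple instance selection strategy: it keeps a fraction of the rows while preserving the class proportions. The sketch below assumes rows are `(id, label)` tuples; the helper name and the 10% fraction are illustrative.

```python
import random
from collections import defaultdict


def stratified_sample(rows, label_of, fraction, seed=0):
    """Sample the same fraction of rows from every class."""
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    groups = defaultdict(list)
    for row in rows:
        groups[label_of(row)].append(row)

    sample = []
    for members in groups.values():
        k = max(1, round(len(members) * fraction))  # at least one row per class
        sample.extend(rng.sample(members, k))
    return sample


# 100 rows: 20 labelled "spam" and 80 labelled "ham".
rows = [(i, "spam" if i % 5 == 0 else "ham") for i in range(100)]
subset = stratified_sample(rows, label_of=lambda r: r[1], fraction=0.1)
print(len(subset))
```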
Discretization
Discretization transforms quantitative data into qualitative data, that is, numerical attributes into nominal attributes with a finite number of intervals.
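Equal-width binning is perhaps the simplest discretization scheme: the value range is split into intervals of the same width and each value is replaced by the label of its interval. The labels and sample ages below are assumptions for illustration.

```python
def equal_width_bins(values, labels):
    """Map each number to the label of its equal-width interval."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / len(labels)
    result = []
    for v in values:
        idx = int((v - lo) / width) if width else 0
        idx = min(idx, len(labels) - 1)  # the maximum falls into the last bin
        result.append(labels[idx])
    return result


ages = [5, 17, 30, 45, 62, 80]
age_groups = equal_width_bins(ages, ["young", "middle", "old"])
print(age_groups)
```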
Feature Extraction/Instance Generation
Extends both feature selection and instance selection by allowing the modification of the internal values that represent each example or attribute.
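Both ideas can be sketched briefly. For feature extraction, a new attribute (body mass index) is computed from two existing ones; for instance generation, each class is replaced by a single artificial prototype, its centroid. The helper names and the data are hypothetical, and many other extraction and generation methods exist.

```python
from collections import defaultdict
from statistics import mean


def extract_bmi(height_m, weight_kg):
    """Feature extraction: derive one new feature from two original ones."""
    return weight_kg / height_m ** 2


def class_centroids(points, labels):
    """Instance generation: replace each class by its mean point (centroid)."""
    groups = defaultdict(list)
    for point, label in zip(points, labels):
        groups[label].append(point)
    return {
        label: tuple(mean(dim) for dim in zip(*members))
        for label, members in groups.items()
    }


bmi = extract_bmi(2.0, 80.0)

points = [(1.0, 2.0), (3.0, 4.0), (10.0, 10.0), (12.0, 14.0)]
labels = ["a", "a", "b", "b"]
centroids = class_centroids(points, labels)
print(bmi, centroids)
```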
Summary
This article introduced Data Preprocessing – Feature Engineering and Feature Selection in Data Mining.