Data Exploration in Data Mining

Introduction to Data Exploration – Feature Engineering and Feature Selection in Data Mining

In this article, I will discuss:

How to read the dataset
How to find the data types of the columns
General data description
Univariate analysis and bi-variate analysis

This demonstration uses the titanic.csv dataset.

First, we import the required libraries: pandas, numpy, seaborn, and matplotlib, along with the explore module from data_exploration.

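import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import os

# On matplotlib 3.6 and later this style is named 'seaborn-v0_8-colorblind'
plt.style.use('seaborn-colorblind')
# Jupyter magic: render plots inline in the notebook
%matplotlib inline
# explore is the helper module used throughout this tutorial
from data_exploration import explore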

Next, we use the read_csv() function from the pandas library to read the dataset. We are interested in only a few columns, so we list them in use_cols and pass that list through the usecols parameter.

use_cols = ['Pclass', 'Sex', 'Age', 'Fare', 'SibSp', 'Survived']

data = pd.read_csv('./data/titanic.csv', usecols=use_cols)
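One detail of usecols is worth seeing in isolation: it filters columns but does not reorder them. The sketch below is only an illustration, assuming a small in-memory CSV in place of titanic.csv; the Name column and its values are invented for the example.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for titanic.csv.
# The 'Name' column and its values are made up for illustration.
csv_text = """Survived,Pclass,Name,Sex,Age,SibSp,Fare
0,3,A,male,22.0,1,7.25
1,1,B,female,38.0,1,71.2833
1,3,C,female,26.0,0,7.925
"""

use_cols = ['Pclass', 'Sex', 'Age', 'Fare', 'SibSp', 'Survived']

# usecols keeps only the listed columns; the resulting column order
# follows the file, not the order of the list.
sample = pd.read_csv(io.StringIO(csv_text), usecols=use_cols)
print(sample.columns.tolist())
```

This is why Survived appears as the first column in the output of head(), even though it is last in use_cols.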

Now we display the first five rows with data.head(5) to confirm that the dataset was read successfully.

	Survived	Pclass	Sex	Age	SibSp	Fare
0	0	3	male	22.0	1	7.2500
1	1	1	female	38.0	1	71.2833
2	1	3	female	26.0	0	7.9250
3	1	1	female	35.0	1	53.1000
4	0	3	male	35.0	0	8.0500

First 5 rows of the dataset

Univariate Analysis

Below are some methods that give us basic statistics on a single variable:

pandas.DataFrame.dtypes
pandas.DataFrame.describe()
Barplot
Countplot
Boxplot
Distplot

pandas.DataFrame.dtypes

Next, we call the get_dtypes() function from explore to get the type of each column and display the results.

str_var_list, num_var_list, all_var_list = explore.get_dtypes(data=data)
print(str_var_list) # string type
print(num_var_list) # numeric type
print(all_var_list) # all

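Output: three lists containing the names of the string columns, the numeric columns, and all columns, respectively.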
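The same split can be approximated with plain pandas. This is a minimal sketch, not the implementation of explore.get_dtypes() (which the article does not show); it uses the five rows displayed earlier as sample data.

```python
import pandas as pd

# The first five rows of the dataset, as shown by data.head(5).
sample = pd.DataFrame({
    'Survived': [0, 1, 1, 1, 0],
    'Pclass': [3, 1, 3, 1, 3],
    'Sex': ['male', 'female', 'female', 'female', 'male'],
    'Age': [22.0, 38.0, 26.0, 35.0, 35.0],
    'SibSp': [1, 1, 0, 1, 0],
    'Fare': [7.25, 71.2833, 7.925, 53.1, 8.05],
})

# dtypes lists the type of every column.
print(sample.dtypes)

# select_dtypes separates numeric columns from everything else.
num_cols = sample.select_dtypes(include='number').columns.tolist()
str_cols = sample.select_dtypes(exclude='number').columns.tolist()
print(str_cols)
print(num_cols)
```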

pandas.DataFrame.describe()

Next, we use the describe() function to get a general description of the dataset. It reports statistics such as the count, the number of unique values, the most frequent value (top) and its frequency, the mean, the standard deviation, the minimum, the maximum, and the 25th, 50th, and 75th percentiles.

explore.describe(data=data,output_path=r'./output/')

Output of the describe() function:

	Survived	Pclass	Sex	Age	SibSp	Fare
count	891.000000	891.000000	891	714.000000	891.000000	891.000000
unique	NaN	NaN	2	NaN	NaN	NaN
top	NaN	NaN	male	NaN	NaN	NaN
freq	NaN	NaN	577	NaN	NaN	NaN
mean	0.383838	2.308642	NaN	29.699118	0.523008	32.204208
std	0.486592	0.836071	NaN	14.526497	1.102743	49.693429
min	0.000000	1.000000	NaN	0.420000	0.000000	0.000000
25%	0.000000	2.000000	NaN	20.125000	0.000000	7.910400
50%	0.000000	3.000000	NaN	28.000000	0.000000	14.454200
75%	1.000000	3.000000	NaN	38.000000	1.000000	31.000000
max	1.000000	3.000000	NaN	80.000000	8.000000	512.329200

Data Description
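A table with the same rows can be produced with pandas directly by passing include='all' to DataFrame.describe(). The sketch below assumes this is close to what explore.describe() does internally, which the article does not confirm, and it runs on the five sample rows rather than the full dataset.

```python
import pandas as pd

# The first five rows of the dataset, as shown by data.head(5).
sample = pd.DataFrame({
    'Survived': [0, 1, 1, 1, 0],
    'Pclass': [3, 1, 3, 1, 3],
    'Sex': ['male', 'female', 'female', 'female', 'male'],
    'Age': [22.0, 38.0, 26.0, 35.0, 35.0],
    'SibSp': [1, 1, 0, 1, 0],
    'Fare': [7.25, 71.2833, 7.925, 53.1, 8.05],
})

# include='all' describes string and numeric columns together.
# Statistics that do not apply to a column are filled with NaN.
desc = sample.describe(include='all')
print(desc)
```

The unique, top, and freq rows are filled only for string columns such as Sex, while mean, std, and the percentiles are filled only for numeric columns. That is why the table above contains so many NaN entries.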

Discrete variable barplot

The discrete_var_barplot() function draws a bar plot of a discrete variable x against y (the target variable). By default, each bar shows the mean value of y for that category of x.

explore.discrete_var_barplot(x='Pclass',y='Survived',data=data,output_path='./output/')
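To see what each bar represents, we can compute the per-category mean by hand and draw a comparable plot with seaborn. This is a rough sketch that assumes the helper wraps something like sns.barplot(); the article does not show its implementation. It uses the five sample rows only.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs outside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Pclass and Survived from the first five rows of the dataset.
sample = pd.DataFrame({
    'Pclass': [3, 1, 3, 1, 3],
    'Survived': [0, 1, 1, 1, 0],
})

# Mean of y for each level of x: this is the height of each bar.
mean_by_class = sample.groupby('Pclass')['Survived'].mean()
print(mean_by_class)

# seaborn's barplot uses the mean as its default estimator.
fig, ax = plt.subplots()
sns.barplot(x='Pclass', y='Survived', data=sample, ax=ax)
```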

Discrete variable countplot

The discrete_var_countplot() function draws a count plot of a discrete variable x, showing the number of observations in each category.

explore.discrete_var_countplot(x='Pclass',data=data,output_path='./output/')
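A count plot is a picture of value_counts(). The sketch below shows both side by side on the five sample rows; using sns.countplot() here is an assumption about how the helper is built, not something the article states.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs outside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Pclass from the first five rows of the dataset.
sample = pd.DataFrame({'Pclass': [3, 1, 3, 1, 3]})

# Number of rows in each category, sorted by category label.
counts = sample['Pclass'].value_counts().sort_index()
print(counts)

# The count plot draws one bar per category with these heights.
fig, ax = plt.subplots()
sns.countplot(x='Pclass', data=sample, ax=ax)
```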

Discrete variable boxplot

The discrete_var_boxplot() function draws a box plot of y for each category of a discrete variable x.

explore.discrete_var_boxplot(x='Pclass',y='Fare',data=data,output_path='./output/')
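The center line of each box is the median of y within that category. As a hedged illustration (again assuming a seaborn-based helper, which the article does not confirm), we can compute the medians directly and draw the equivalent plot on the five sample rows.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs outside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Pclass and Fare from the first five rows of the dataset.
sample = pd.DataFrame({
    'Pclass': [3, 1, 3, 1, 3],
    'Fare': [7.25, 71.2833, 7.925, 53.1, 8.05],
})

# Median fare per class: the line drawn inside each box.
medians = sample.groupby('Pclass')['Fare'].median()
print(medians)

fig, ax = plt.subplots()
sns.boxplot(x='Pclass', y='Fare', data=sample, ax=ax)
```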

Bi-variate Analysis

Bi-variate analysis is performed to understand the relationship between two variables. Common tools include:

Scatter Plot
Correlation Plot
Heat Map

Continuous variable distplot

The continuous_var_distplot() function is used to draw the distribution plot of a continuous variable x.

explore.continuous_var_distplot(x=data['Fare'],output_path='./output/')

Correlation plot

The correlation_plot() function is used to draw the correlation plot between the variables.

explore.correlation_plot(data=data,output_path='./output/')
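A correlation plot is usually a heat map of the pairwise correlation matrix. The following is a minimal sketch of that idea with pandas and seaborn on the five sample rows; the color map and annotation settings are illustrative choices, and the helper's real implementation may differ.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs outside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# The first five rows of the dataset, as shown by data.head(5).
sample = pd.DataFrame({
    'Survived': [0, 1, 1, 1, 0],
    'Pclass': [3, 1, 3, 1, 3],
    'Sex': ['male', 'female', 'female', 'female', 'male'],
    'Age': [22.0, 38.0, 26.0, 35.0, 35.0],
    'SibSp': [1, 1, 0, 1, 0],
    'Fare': [7.25, 71.2833, 7.925, 53.1, 8.05],
})

# Pearson correlation is defined only for numeric columns,
# so the string column 'Sex' is dropped first.
corr = sample.select_dtypes(include='number').corr()
print(corr.round(2))

# Heat map of the correlation matrix, with the color scale fixed to [-1, 1].
fig, ax = plt.subplots()
sns.heatmap(corr, annot=True, vmin=-1, vmax=1, cmap='coolwarm', ax=ax)
```

In these sample rows, Pclass and Fare are negatively correlated, since first-class tickets (Pclass 1) carry the highest fares.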

Summary

This article introduced data exploration as a first step toward feature engineering and feature selection in data mining: reading the dataset, inspecting the data types of the columns, describing the data, and performing univariate and bi-variate analysis.
