
Machine Learning: Data Processing

Hello Everyone! 🙏

Warm Up!

The amount of data we produce every day is truly mind-boggling. There are 2.5 quintillion bytes of data created each day at our current pace.

This is mind-blowing, isn't it?


Have we ever thought about where all this data gets generated?
  • Internet
  • Social Media
  • Communication
  • Digital Photos
  • Running services such as weather channels, etc.
  • Internet of Things (IoT)
And there are many more sources besides.

Data remains just raw data until it is processed. Raw data cannot be used to predict anything or convey a meaningful message until it is converted into information.
So how is information developed?
In simple words, data is processed to form information. This information is useful for making decisions and predicting values.
So now we see how raw data becomes useful. Data processing is a very basic and essential step in Machine Learning. The diagram below outlines the steps in building predictive models:
So in this section we will try to cover Step 1.


Data Processing

Data Processing Workflow

START

🔰

Importing libraries

🔰

Importing Dataset

🔰

Taking care of missing data

🔰

Encoding categorical data

🔰

Encoding Independent Variable

🔰

Encoding Dependent Variable

🔰

Splitting Dataset into Training & Test Set

🔰

Feature Scaling

🔰

END

Importing Libraries

  1. Numpy

  2. Pandas

  3. Matplotlib

import numpy as np

import matplotlib.pyplot as plt

import pandas as pd

Importing Datasets

Using the pandas library, we can import datasets for further operations and computation.

 

dataset = pd.read_csv('Data.csv')

X = dataset.iloc[:, :-1].values

y = dataset.iloc[:, -1].values

The iloc indexer from pandas selects rows and columns by position. For X, it loads all rows and all columns except the last one (the -1 in the slice :-1 excludes the last column), while for y it loads all rows but only the last column (a single index, -1, is given instead of a range).
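To make the slicing concrete, here is a minimal sketch on a tiny made-up frame standing in for Data.csv (the column names and values here are purely illustrative assumptions):

import pandas as pd

# hypothetical stand-in for Data.csv
dataset = pd.DataFrame({'Country': ['France', 'Spain', 'Germany'],
                        'Age': [44, 27, 30],
                        'Salary': [72000, 48000, 54000],
                        'Purchased': ['No', 'Yes', 'No']})

X = dataset.iloc[:, :-1].values  # every column except the last ('Purchased')
y = dataset.iloc[:, -1].values   # only the last column

print(X.shape)  # (3, 3)
print(y)        # ['No' 'Yes' 'No']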


Taking care of missing data 

 

It is very common in real-world applications for our training examples to be missing one or more values. Unfortunately, most computational tools are unable to handle such missing values, or will produce unpredictable results if we simply ignore them. Therefore, it is crucial that we take care of those missing values before we proceed with further analyses.

How to correct data or eliminate training examples with missing values:

  • Easiest way - simply remove the samples (rows) that contain missing values.

  • One disadvantage - we may end up removing too many samples, which would make reliable analysis impossible.

  • Another way - imputing missing values: in this case, we can use different interpolation techniques to estimate the missing values from the other training examples in our dataset. One of the most common interpolation techniques is mean imputation, where we simply replace the missing value with the mean value of the entire feature column.

  • Mean imputation - achieved by using the SimpleImputer class from scikit-learn:

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

imputer.fit(X[:, 1:3])

X[:, 1:3] = imputer.transform(X[:, 1:3])

  • Other options for the strategy parameter are 'median' and 'most_frequent', where the latter replaces the missing values with the most frequent value in the column.

The SimpleImputer class belongs to the so-called transformer classes in scikit-learn, which are used for data transformation. The two essential methods of those estimators are fit and transform. The fit method is used to learn the parameters from the training data, and the transform method uses those parameters to transform the data.
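As a minimal sketch of that fit/transform pattern (the toy array below is made up purely for illustration):

import numpy as np
from sklearn.impute import SimpleImputer

toy = np.array([[1.0, 2.0],
                [np.nan, 3.0],
                [7.0, np.nan]])

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(toy)                # learns the column means: 4.0 and 2.5
print(imputer.transform(toy))   # NaNs replaced by those learned means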

Encoding Categorical Data

Why ? 

Typically, any structured dataset includes multiple columns – a combination of numerical and categorical variables. A machine learning algorithm can only work with numbers; it cannot understand text directly. That is primarily the reason we need to convert categorical columns into numeric columns, so that the algorithm can understand them. This process is called categorical encoding.

When we are talking about categorical data, we have to further distinguish between ordinal and nominal features. 

Ordinal features can be understood as categorical values that can be sorted or ordered. For example, t-shirt size would be an ordinal feature, because we can define an order: XL > L > M. 

Nominal features don't imply any order and, to continue with the previous example, we could think of t-shirt color as a nominal feature since it typically doesn't make sense to say that, for example, red is larger than blue.

   color size  price classlabel
0  green    M   10.1     class2
1    red    L   13.5     class1
2   blue   XL   15.3     class2


Size       - Ordinal feature

Color      - Nominal feature

Classlabel - Nominal feature (the class labels)
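As a rough sketch, an ordinal feature such as size can also be encoded with an explicit mapping built from the order above (the integer values chosen here are an assumption; only the order XL > L > M comes from the example):

import pandas as pd

df = pd.DataFrame([['green', 'M', 10.1, 'class2'],
                   ['red', 'L', 13.5, 'class1'],
                   ['blue', 'XL', 15.3, 'class2']],
                  columns=['color', 'size', 'price', 'classlabel'])

size_mapping = {'M': 1, 'L': 2, 'XL': 3}   # encodes XL > L > M
df['size'] = df['size'].map(size_mapping)
print(df)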


Techniques used for encoding categorical data :

  • Label Encoding

  • One-Hot Encoding

Encoding Dependent Variable

Label Encoding

Label Encoding is a popular encoding technique for handling categorical variables. In this technique, each label is assigned a unique integer based on alphabetical ordering.

For example, in the dataset imported above, the first column, Country, is a categorical feature (it is represented by the object data type), while the rest of the columns are numerical features.


Implementation of Label Encoding :

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

y = le.fit_transform(y)
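For intuition, here is a minimal sketch of what fit_transform returns for a hypothetical dependent-variable column (the labels are made up):

from sklearn.preprocessing import LabelEncoder

labels = ['No', 'Yes', 'Yes', 'No']   # hypothetical y values
le = LabelEncoder()
print(le.fit_transform(labels))       # -> [0 1 1 0], integers assigned alphabetically
print(le.classes_)                    # -> ['No' 'Yes']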

 


One-Hot Encoding

One-Hot Encoding is another popular technique for treating categorical variables. It simply creates additional features based on the number of unique values in the categorical feature. Every unique value in the category will be added as a feature.

One-Hot Encoding is the process of creating dummy variables.

Challenges of One-Hot Encoding: Dummy Variable Trap

One-Hot Encoding can result in the Dummy Variable Trap, because the value of one dummy variable can easily be predicted from the values of the remaining ones.

The Dummy Variable Trap leads to the problem known as multicollinearity. Multicollinearity occurs when there is a dependency between the independent features. Multicollinearity is a serious issue in machine learning models like Linear Regression and Logistic Regression.

So, in order to overcome the problem of multicollinearity, one of the dummy variables has to be dropped (see the sketch after the source code below).

Source Code :

from sklearn.compose import ColumnTransformer

from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])],
                       remainder='passthrough')

X = np.array(ct.fit_transform(X))
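To actually drop one of the dummy variables, as suggested above, one possible variation of the block above is to use OneHotEncoder's drop option (a sketch; whether you need this depends on the model you plan to train):

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
import numpy as np

# same transformer as above, but the first dummy column of each category is dropped
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(drop='first'), [0])],
                       remainder='passthrough')
X = np.array(ct.fit_transform(X))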

 

When to use Label Encoding vs. One-Hot Encoding

This generally depends on your dataset and the model you wish to apply, but here are a few points to note before choosing the right encoding technique:

We apply One-Hot Encoding when:

  1. The categorical feature is not ordinal (like the color feature above)

  2. The number of categories is small, so one-hot encoding can be applied effectively


We apply Label Encoding when:
  1. The categorical feature is ordinal (like Jr. kg, Sr. kg, primary school, high school)
  2. The number of categories is quite large, as one-hot encoding would then lead to high memory consumption


Splitting into Training & Test set

A convenient way to randomly partition this dataset into separate test and training datasets is to use the train_test_split function from scikit-learn's model_selection submodule: 

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)
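As a quick sanity sketch on made-up data: test_size=0.2 keeps roughly 80% of the samples for training and 20% for testing, and random_state fixes the shuffle so the split is reproducible.

import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(20).reshape(10, 2)   # 10 made-up samples, 2 features
y_toy = np.arange(10)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=1)
print(X_tr.shape, X_te.shape)          # -> (8, 2) (2, 2)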


Feature Scaling

Feature Scaling is a technique to standardize the independent features present in the data to a fixed range. It is performed during data pre-processing to handle features with highly varying magnitudes, values, or units.

  • One of the crucial steps in data preprocessing.

  • Decision Trees and Random Forests are two of the very few machine learning algorithms where we don't need to worry about feature scaling, because these algorithms are scale-invariant.

  • Two approaches :

    • Normalization

    • Standardization

Normalization

It refers to the re-scaling of features to a range of [0, 1], which is a special case of min-max scaling. To normalize our data, we can simply apply min-max scaling to each feature column, where the new value, x_norm^(i), of an example, x^(i), is calculated as follows:

x_norm^(i) = (x^(i) - x_min) / (x_max - x_min)

Here, x^(i) is a particular example, x_min is the smallest value in a feature column, and x_max is the largest value. The min-max scaling procedure is implemented in scikit-learn and can be used as follows:

from sklearn.preprocessing import MinMaxScaler

mms = MinMaxScaler()

X_train_norm = mms.fit_transform(X_train)

X_test_norm = mms.transform(X_test)
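As a quick numeric sketch (the toy column below is made up just to show the effect of min-max scaling):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

toy = np.array([[1.0], [5.0], [9.0]])             # one feature column
print(MinMaxScaler().fit_transform(toy).ravel())  # -> [0.  0.5 1. ]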

Standardization

Using standardization, we center the feature columns at mean 0 with standard deviation 1 so that the feature columns have the same parameters as a standard normal distribution (zero mean and unit variance), which makes it easier to learn the weights. 

Furthermore, standardization maintains useful information about outliers and makes the algorithm less sensitive to them in contrast to min-max scaling, which scales the data to a limited range of values.  

The procedure for standardization can be expressed by the following equation:

x_std^(i) = (x^(i) - μ_x) / σ_x

Here, μ_x is the sample mean of a particular feature column, and σ_x is the corresponding standard deviation.

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

# fit the scaler on the training set only, then reuse the same learned parameters for the test set
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])

X_test[:, 3:] = sc.transform(X_test[:, 3:])
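And a parallel numeric sketch for standardization, using the same made-up column as before; after scaling, the column has zero mean and unit variance:

import numpy as np
from sklearn.preprocessing import StandardScaler

toy = np.array([[1.0], [5.0], [9.0]])
print(StandardScaler().fit_transform(toy).ravel())  # -> approximately [-1.22  0.  1.22]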

Selecting Meaningful Features

If we notice that a model performs much better on the training dataset than on the test dataset, this observation is a strong indicator of overfitting.

Overfitting means the model fits its parameters too closely to the particular observations in the training dataset, but does not generalize well to new data; we say that the model has high variance. The reason for the overfitting is that our model is too complex for the given training data. Common solutions to reduce the generalization error are as follows:

• Collect more training data

• Introduce a penalty for complexity via regularization

• Choose a simpler model with fewer parameters

• Reduce the dimensionality of the data 

Wrap-Up!

So now we are all set for experiments. In the next session, we will discuss further operations on the processed data.


