Hello Everyone! 🙏
Warm Up!
The amount of data we produce every day is truly mind-boggling. There are 2.5 quintillion bytes of data created each day at our current pace.
This is mind-blowing, isn't it?
- Internet
- Social Media
- Communication
- Digital Photos
- Always-on services like the weather channel, etc.
- Internet of Things (IoT)
Data Processing
The Data Preprocessing Workflow
Importing Libraries
Numpy
Pandas
Matplotlib
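These three libraries are typically imported at the top of the script. A minimal sketch using the conventional aliases:

import numpy as np                # numerical arrays and linear algebra
import pandas as pd               # tabular data loading and manipulation
import matplotlib.pyplot as plt   # plotting and visualization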
Importing Datasets
Using the pandas library, we can import datasets for further operations and computation.
The iloc indexer in pandas selects rows and columns by integer position. For the feature matrix X, it loads all the rows and all the columns except the last one (the -1 index excludes it), while for the variable y it loads all the rows but only the last column (a single column index is given instead of a range).
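A minimal sketch of this step; the file name 'Data.csv' is only a placeholder for your own dataset:

import pandas as pd

dataset = pd.read_csv('Data.csv')    # load the dataset into a DataFrame
X = dataset.iloc[:, :-1].values      # all rows, every column except the last (features)
y = dataset.iloc[:, -1].values       # all rows, only the last column (target)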
Taking care of missing data
It is very common in real-world applications for training examples to be missing one or more values. Unfortunately, most computational tools are unable to handle such missing values or will produce unpredictable results if we simply ignore them. Therefore, it is crucial that we take care of those missing values before we proceed with further analyses.
How to handle missing values or eliminate training data that contains them:
Easiest way - just remove the samples (rows) with missing values.
One disadvantage - we may end up removing too many samples, which will make reliable analysis impossible.
Another way - imputing missing values: in this case, we can use different interpolation techniques to estimate the missing values from the other training examples in our dataset. One of the most common interpolation techniques is mean imputation, where we simply replace the missing value with the mean value of the entire feature column.
Mean imputation - achieved by using the SimpleImputer class from scikit-learn.
Other options for the strategy parameter are median or most_frequent, where the latter replaces the missing values with the most frequent values.
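A minimal sketch of mean imputation with SimpleImputer; the assumption here is that the numerical features containing missing values sit in columns 1 and 2 of X:

import numpy as np
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')   # or strategy='median' / 'most_frequent'
imputer.fit(X[:, 1:3])                    # learn the column means from the numerical columns (assumed indices)
X[:, 1:3] = imputer.transform(X[:, 1:3])  # replace each NaN with its column mean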
Why Encode Categorical Data?
Typically, any structured dataset includes multiple columns: a combination of numerical as well as categorical variables. A machine can only understand numbers; it cannot understand text. That is primarily the reason we need to convert categorical columns to numeric columns so that a machine learning algorithm can understand them. This process is called categorical encoding.
When we are talking about categorical data, we have to further distinguish between ordinal and nominal features.
Ordinal features can be understood as categorical values that can be sorted or ordered. For example, t-shirt size would be an ordinal feature, because we can define an order: XL > L > M.
Nominal features don't imply any order and, to continue with the previous example, we could think of t-shirt color as a nominal feature since it typically doesn't make sense to say that, for example, red is larger than blue.
   color size  price classlabel
0  green    M   10.1     class2
1    red    L   13.5     class1
2   blue   XL   15.3     class2
Size - Ordinal Feature (an order can be defined: XL > L > M)
Color - Nominal Feature (no order implied)
Classlabel - Nominal Feature (the class labels themselves carry no order)
Techniques used for encoding categorical data :
Label Encoding
One-Hot Encoding
Encoding Dependent Variable
Label Encoding
Label Encoding is a popular encoding technique for handling categorical variables. In this technique, each label is assigned a unique integer based on alphabetical ordering.
For Example :
Data :
As you can see here, the first column, Country, is the categorical feature, since it is represented by the object data type, while the rest are numerical features.
Implementation of Label Encoding :
Output :
Source Code:
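The original listing is not reproduced here; a minimal sketch using scikit-learn's LabelEncoder on a hypothetical DataFrame with a Country column like the one described above:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'Country': ['France', 'Spain', 'Germany', 'Spain'],
                   'Age': [44, 27, 30, 38]})       # hypothetical data for illustration only

le = LabelEncoder()
df['Country'] = le.fit_transform(df['Country'])    # integers assigned by alphabetical order
print(df['Country'].values)                        # [0 2 1 2] -> France=0, Germany=1, Spain=2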
Encoding the dependent variable
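A minimal sketch, assuming the dependent variable y holds categorical class labels (for example 'No'/'Yes'):

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y = le.fit_transform(y)    # e.g. ['No', 'Yes', 'No', ...] becomes [0, 1, 0, ...]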
One-Hot Encoding
One-Hot Encoding is another popular technique for treating categorical variables. It simply creates additional features based on the number of unique values in the categorical feature. Every unique value in the category will be added as a feature.
One-Hot Encoding is the process of creating dummy variables.
Challenges of One-Hot Encoding: Dummy Variable Trap
One-Hot Encoding results in a Dummy Variable Trap as the outcome of one variable can easily be predicted with the help of the remaining variables.
The Dummy Variable Trap leads to the problem known as multicollinearity. Multicollinearity occurs when there is a dependency between the independent features, and it is a serious issue in machine learning models like Linear Regression and Logistic Regression.
So, in order to overcome the problem of multicollinearity, one of the dummy variables has to be dropped.
Source Code :
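The original listing is not shown here; a minimal sketch using OneHotEncoder inside a ColumnTransformer, assuming the categorical column sits at index 0 of X. Passing drop='first' removes one dummy column per feature, which is one way to avoid the dummy variable trap mentioned above:

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(drop='first'), [0])],  # one-hot encode column 0, drop one dummy
    remainder='passthrough')                                       # keep the remaining columns as they are
X = np.array(ct.fit_transform(X))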
Which encoding technique should we use? This generally depends on your dataset and the model you wish to apply. Still, here are a few points to note before choosing the right encoding technique for your model:
We apply One-Hot Encoding when:
- The categorical feature is not ordinal (like the color feature above)
- The number of categories is small, so one-hot encoding can be applied effectively
We apply Label Encoding when:
- The categorical feature is ordinal (like Jr. kg, Sr. kg, Primary school, high school)
- The number of categories is quite large as one-hot encoding can lead to high memory consumption
Splitting into Training & Test set
A convenient way to randomly partition this dataset into separate test and training datasets is to use the train_test_split function from scikit-learn's model_selection submodule:
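A minimal sketch; the test_size of 20% and the random_state are arbitrary example values:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)   # 80% training, 20% test, with a fixed seed for reproducibility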
Feature Scaling
Feature Scaling is a technique to standardize the independent features present in the data in a fixed range. It is performed during the data pre-processing to handle highly varying magnitudes or values or units.
One of the crucial steps in data preprocessing.
Decision Trees and Random Forests are two of the very few machine learning algorithms where we don't need to worry about feature scaling, because these algorithms are scale-invariant.
Two approaches :
Normalization
Standardization
Normalization
It refers to the rescaling of the features to a range of [0, 1], which is a special case of min-max scaling. To normalize our data, we can simply apply min-max scaling to each feature column, where the new value, x(i)_norm, of an example, x(i), can be calculated as follows:

x(i)_norm = (x(i) - x_min) / (x_max - x_min)

Here, x(i) is a particular example, x_min is the smallest value in a feature column, and x_max is the largest value. The min-max scaling procedure is implemented in scikit-learn and can be used as follows:
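A minimal sketch with scikit-learn's MinMaxScaler (its default feature_range is [0, 1]):

from sklearn.preprocessing import MinMaxScaler

mms = MinMaxScaler()
X_train_norm = mms.fit_transform(X_train)   # learn min and max from the training set only
X_test_norm = mms.transform(X_test)         # apply the same scaling parameters to the test set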
Standardization
Using standardization, we center the feature columns at mean 0 with standard deviation 1 so that the feature columns have the same parameters as a standard normal distribution (zero mean and unit variance), which makes it easier to learn the weights.
Furthermore, standardization maintains useful information about outliers and makes the algorithm less sensitive to them in contrast to min-max scaling, which scales the data to a limited range of values.
The procedure for standardization can be expressed by the following equation:

x(i)_std = (x(i) - μ_x) / σ_x

Here, μ_x is the sample mean of a particular feature column, and σ_x is the corresponding standard deviation.
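A minimal sketch with scikit-learn's StandardScaler:

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)   # learn the mean and standard deviation from the training set
X_test_std = sc.transform(X_test)         # standardize the test set with the same parameters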