Dummy Variables
What – A dummy variable is a numerical variable that represents a
categorical variable.
Why – Many machine learning algorithms cannot work with
categorical variables directly, so the categories need to be converted to numbers.
How – There are multiple ways of handling categorical variables:
1. Label Encoding
2. One-Hot Encoding
Label Encoding
Each categorical label is simply assigned a unique integer.
Before encoding              After label encoding
Country   Age   Salary       Country   Age   Salary
India     44    32000        0         44    32000
US        34    33400        2         34    33400
Japan     43    45000        1         43    45000
US        23    23000        2         23    23000
Japan     23    67000        1         23    67000
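The mapping in the table above can be reproduced with a minimal sketch like the following. It assumes pandas is available and uses the categorical codes of a column as the label encoding; the variable names `df` and `encoded` are illustrative only and do not come from the slides.

```python
import pandas as pd

# Example data copied from the table above
df = pd.DataFrame({
    "Country": ["India", "US", "Japan", "US", "Japan"],
    "Age": [44, 34, 43, 23, 23],
    "Salary": [32000, 33400, 45000, 23000, 67000],
})

# Converting to a categorical sorts the labels alphabetically,
# so India -> 0, Japan -> 1, US -> 2
encoded = df.copy()
encoded["Country"] = df["Country"].astype("category").cat.codes

print(encoded)
```

The integer assigned to each label depends only on the alphabetical order of the labels, which is why the resulting numbers carry no real meaning for a nominal variable such as Country.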
An effective technique when categorical data is ordinal.
Challenge – Country is a nominal variable with no inherent ordering, but label encoding imposes a ranking on the
countries. For example, here: India < Japan < US.
A model can read these integers as a ranking, which affects model interpretation.
We can use one-hot encoding to overcome this.
One-Hot Encoding
One hot encoding is a representation of categorical variables as binary vectors.
It creates one new binary feature for each unique label in the categorical feature.
Before encoding
Country   Age   Salary
India     44    32000
US        34    33400
Japan     43    45000
US        23    23000
Japan     23    67000

After one-hot encoding
Country.India   Country.Japan   Country.US   Age   Salary
1               0               0            44    32000
0               0               1            34    33400
0               1               0            43    45000
0               0               1            23    23000
0               1               0            23    67000
Three new features are added in place of Country.
The ranking problem is solved, because each category is represented by its own binary vector.
Apply this technique when the categorical data is not ordinal.
Challenge – If the number of categories is high, it can lead to high dimensionality.
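One possible way to produce the one-hot columns for the Country example is sketched below. It assumes pandas and its `get_dummies` function; the `prefix_sep="."` argument is chosen only to match the `Country.India` style of column names used here, and `df` and `one_hot` are illustrative names.

```python
import pandas as pd

# Example data from the Country / Age / Salary table
df = pd.DataFrame({
    "Country": ["India", "US", "Japan", "US", "Japan"],
    "Age": [44, 34, 43, 23, 23],
    "Salary": [32000, 33400, 45000, 23000, 67000],
})

# Replace Country with one 0/1 column per unique label.
# pandas appends the new columns after the untouched ones.
one_hot = pd.get_dummies(df, columns=["Country"], prefix_sep=".", dtype=int)

print(one_hot)
```

Note that pandas places the new columns after Age and Salary rather than before them; the column order does not change the encoding itself.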
Note: For one-hot encoding
The regression model does not actually need all the dummy variables.
It does not need the final dummy variable, because that information can be deduced from the combination of
all the other dummy variables.
To avoid multicollinearity, drop one dummy variable (for n categories, use n-1 dummy variables for model building).
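As a rough sketch of the n-1 rule, pandas' `get_dummies` accepts `drop_first=True`, which drops the dummy column for the first category in sorted order. This assumes pandas; `df` and `dummies` are illustrative names, and which category gets dropped is a modelling choice, not something the slides prescribe.

```python
import pandas as pd

# Example data from the Country / Age / Salary table
df = pd.DataFrame({
    "Country": ["India", "US", "Japan", "US", "Japan"],
    "Age": [44, 34, 43, 23, 23],
    "Salary": [32000, 33400, 45000, 23000, 67000],
})

# drop_first=True removes the first category's column (India here),
# leaving n-1 = 2 dummy variables for the 3 countries
dummies = pd.get_dummies(
    df, columns=["Country"], prefix_sep=".", drop_first=True, dtype=int
)

print(dummies)
```

A row with 0 in both Country.Japan and Country.US can only be India, so no information is lost by dropping the third column.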