Categorical Data Encoding Techniques Explained

To analyze categorical data, we often need to convert it into a numerical format. In this article, we will discuss different encoding techniques for converting categorical data into a numeric format.

Table of Contents

How to Convert Categorical Data into Numerical Data?
Label Encoding
One Hot Encoding
Target Encoding
Entity Embedding
Conclusion

How to Convert Categorical Data into Numerical Data?

You can use the following encoding techniques to convert categorical data into numeric data.

Label encoding
One-hot encoding
Target encoding
Entity embedding

Let us discuss all the categorical data encoding techniques one by one.

Label Encoding

Label encoding is one of the easiest techniques for converting categorical data into numeric format. In label encoding, we just assign a unique integer to each categorical value. For example, consider that we have the following data containing city names.

Sl. No.	City
1	London
2	Paris
3	New York
4	Mumbai
5	New Delhi

Data For Label Encoding

Now, we will assign a unique integer starting from 0 to each city name as shown below.

Sl. No.	City	Numeric Value
1	London	0
2	Paris	1
3	New York	2
4	Mumbai	3
5	New Delhi	4

Encoded Data
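
The assignment shown in the table can be sketched in a few lines of Python. This is a minimal illustration, assuming integers are handed out in order of first appearance; the helper name label_encode is hypothetical and not part of any library.

```python
# Label encoding sketch: give each distinct value the next unused integer.
cities = ["London", "Paris", "New York", "Mumbai", "New Delhi"]


def label_encode(values):
    """Return a dict mapping each distinct value to an integer, starting at 0.

    Integers are assigned in order of first appearance (an assumption made
    for this sketch; any other one-to-one assignment is equally valid).
    """
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)
    return mapping


city_to_label = label_encode(cities)
encoded_cities = [city_to_label[city] for city in cities]

print(city_to_label)
print(encoded_cities)
```

Because the mapping depends only on the order in which the values happen to appear, shuffling the input would produce different labels for the same cities.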

In the above table, we have used label encoding to convert the categorical data to a numeric format. Here, you can observe that the numeric labels have no inherent meaning. For instance, we assigned the value 0 to London and 4 to New Delhi arbitrarily. Even if we assign 0 to New Delhi and 4 to London, the meaning of the data won’t change. However, statistical and machine-learning algorithms that treat the labels as numbers might misinterpret them as a ranking, with London lowest at 0 and New Delhi highest at 4. Hence, label encoding is generally not suitable for nominal data types.

We can use label encoding for ordinal data types. Ordinal categorical data has an intrinsic order. Due to this, if we assign numeric labels in order of the rankings of the categorical labels, we will get meaningful numeric labels.

For example, consider that we have the following categories of customer reviews.

Sl. No.	Review Label
1	Very Poor
2	Poor
3	Average
4	Good
5	Very Good

Ordinal Data for label encoding

Now, let us use label encoding to convert the categorical labels to numeric format as shown below.

Sl. No.	Review Label	Numeric Value
1	Very Poor	0
2	Poor	1
3	Average	2
4	Good	3
5	Very Good	4

Encoded data

In the above table, we have labeled the categorical values in increasing order of the review label. Here, the numeric values have a specific meaning as the worst review has been assigned the value 0 and the best review has been assigned the value 4.

Hence, we can say that a review with a value of 3 is better than a review with a value of 1. Thus, we can use label encoding to convert ordinal categorical data into numeric form without losing the meaning of the data labels.
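
For ordinal data, the key difference is that the order of the labels is specified explicitly instead of being left to chance. The following sketch assumes the ranking from the table above; the sample reviews list is hypothetical and only there to show the mapping in use.

```python
# Ordinal label encoding sketch: the rank of each label is fixed up front.
review_order = ["Very Poor", "Poor", "Average", "Good", "Very Good"]

# Position in review_order becomes the numeric value (worst = 0, best = 4).
review_to_rank = {label: rank for rank, label in enumerate(review_order)}

# Hypothetical sample of reviews to encode.
reviews = ["Good", "Very Poor", "Average", "Very Good", "Poor", "Good"]
encoded_reviews = [review_to_rank[review] for review in reviews]

print(review_to_rank)
print(encoded_reviews)
```

With this mapping, comparing two encoded values with < or > gives the same answer as comparing the original review labels by quality.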

One Hot Encoding

For encoding nominal data, one hot encoding is a better technique than label encoding. In one hot encoding, we transform the categories into an array of 0s and 1s.

In the array, the number of columns is equal to the number of unique values in the categorical data.
Each column in the array corresponds to a unique categorical value and acts as a new variable.
Each row in the array corresponds to a data point.
To populate the cell at a particular row and column, we check if the value corresponding to the current column is the value originally present in the current row. If yes, we set the current cell to 1. Otherwise, it is set to 0.

To understand this, consider that we have the following data containing 10 rows and 5 unique values.

Sl. No.	City
1	London
2	Mumbai
3	New York
4	New Delhi
5	Mumbai
6	Paris
7	New York
8	Mumbai
9	New Delhi
10	London

Data For One-Hot Encoding

As there are 5 unique values in the City column, we will add 5 new columns to the dataset. Here, each column will correspond to one particular city as shown below.

Sl. No.	City	City_London	City_Mumbai	City_NewYork	City_Paris	City_NewDelhi
1	London
2	Mumbai
3	New York
4	New Delhi
5	Mumbai
6	Paris
7	New York
8	Mumbai
9	New Delhi
10	London

Intermediate Table in One-Hot Encoding

In the above table, we will fill each cell. For this, we will set each cell to 1 if the city corresponding to the particular column is the same as the city given in the same row. Otherwise, we will fill the value 0 in the given cell. After this, we will get the following table.

Sl. No.	City	City_London	City_Mumbai	City_NewYork	City_Paris	City_NewDelhi
1	London	1	0	0	0	0
2	Mumbai	0	1	0	0	0
3	New York	0	0	1	0	0
4	New Delhi	0	0	0	0	1
5	Mumbai	0	1	0	0	0
6	Paris	0	0	0	1	0
7	New York	0	0	1	0	0
8	Mumbai	0	1	0	0	0
9	New Delhi	0	0	0	0	1
10	London	1	0	0	0	0

One-Hot encoded data

In the above table, we have converted the categorical values into 5 columns with numerical values. Hence, each city is represented using a vector of 0s and 1s. For example, the categorical value London is represented using the vector [1,0,0,0,0].
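
The cell-filling rule described above can be sketched in plain Python. This is a minimal illustration, assuming the column order used in the table; the helper name one_hot is hypothetical.

```python
# One-hot encoding sketch: one 0/1 column per unique category.
cities = [
    "London", "Mumbai", "New York", "New Delhi", "Mumbai",
    "Paris", "New York", "Mumbai", "New Delhi", "London",
]

# Column order follows the table above; any fixed order would work.
columns = ["London", "Mumbai", "New York", "Paris", "New Delhi"]


def one_hot(values, categories):
    """Return one row per value, with 1 in the matching category column."""
    return [
        [1 if value == category else 0 for category in categories]
        for value in values
    ]


encoded = one_hot(cities, columns)
for city, row in zip(cities, encoded):
    print(city, row)
```

In practice, a library helper such as get_dummies in pandas produces a similar table, although its default column order and data type may differ from this sketch.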

Again, the new columns can have any order, and the vector corresponding to a particular categorical value can be different too. However, one-hot encoding doesn’t misrepresent the data by introducing any order in the numeric values, unlike the label encoding approach.

Although one hot encoding solves the problem of misrepresentation of the values, it runs into another major problem. If there are a lot of unique values for a particular categorical attribute, the dataset will become sparse as we need to add as many columns as the number of unique categorical values.

For example, if a categorical attribute has 30 unique values, we will have to add 30 columns to the dataset. Also, if there are 5 categorical attributes with 30 unique values each, we need to add 30*5 i.e. 150 new columns in the dataset. Due to this, the dataset will become very sparse. Thus, one hot encoding introduces sparsity in the dataset, which is its major drawback.
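
The arithmetic above can be checked with a short sketch. The function names here are hypothetical; the second function relies on the fact that each one-hot row contains exactly one 1 among the columns created for an attribute.

```python
# Sketch of how one-hot encoding widens a dataset and fills it with zeros.

def one_hot_column_count(unique_counts):
    """Total new columns: one per unique value of every categorical attribute."""
    return sum(unique_counts)


def zero_fraction(num_unique):
    """Fraction of zeros among the columns created for a single attribute.

    Every row has exactly one 1 across those columns, so the remaining
    (num_unique - 1) cells in that row are 0.
    """
    return (num_unique - 1) / num_unique


# 5 categorical attributes with 30 unique values each, as in the text.
print(one_hot_column_count([30] * 5))
print(zero_fraction(30))
```

For the example in the text, this gives 150 new columns, and 29 out of every 30 cells in each attribute's block are zeros.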


Target Encoding

As the name suggests, target encoding replaces each categorical value with the mean or median of a numeric target variable, computed over the rows that contain that value. To understand this, consider the following dataset.

Grade	Marks
A	86
B	75
A	91
C	65
A	90
B	71
A	89

Data For target encoding

If we want to encode the categorical attribute Grade using Target Encoding, we will take the mean of Marks where the grade is A, B, and C separately.

The mean of the rows in the Marks column with grade A is 89.
The mean of the rows in the Marks column with grade B is 73.
The mean of the rows in the Marks column with grade C is 65.

Hence, we will substitute the mean values in the place of grades as shown below.

Grade	Marks
89	86
73	75
89	91
65	65
89	90
73	71
89	89

Target encoded data
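
The per-grade means and the substitution step can be sketched as follows. This is a minimal illustration using the mean only; the helper name target_means is hypothetical.

```python
# Target (mean) encoding sketch: replace each grade with the mean of Marks
# over all rows that share that grade.
from collections import defaultdict

grades = ["A", "B", "A", "C", "A", "B", "A"]
marks = [86, 75, 91, 65, 90, 71, 89]


def target_means(categories, targets):
    """Return a dict mapping each category to the mean of its target values."""
    groups = defaultdict(list)
    for category, target in zip(categories, targets):
        groups[category].append(target)
    return {category: sum(values) / len(values) for category, values in groups.items()}


grade_to_mean = target_means(grades, marks)
encoded_grades = [grade_to_mean[grade] for grade in grades]

print(grade_to_mean)
print(encoded_grades)
```

Swapping the mean for the median only requires changing the expression inside the dictionary comprehension, for example by using statistics.median from the standard library.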

Target encoding enables us to perform categorical data encoding easily if there is a numeric target attribute. However, if we don’t have a numeric target attribute, we can’t perform target encoding.

Entity Embedding

Entity embedding is one of the more recent and advanced techniques for encoding categorical data. In entity embedding, we use a neural network to learn a numerical embedding for each categorical value. Here, each unique value in the categorical column gets its own embedding vector consisting of one or more columns. The number of embedding columns that replace the categorical column is chosen based on the number of unique values present in the categorical column.

For example, consider that we have the following dataset.

Grade	Marks
A	86
B	75
A	91
C	65
A	90
B	71
A	89

Data for entity embedding

Now, there are 3 unique values in the Grade column. So, we can replace it with a numerical column using entity embedding as shown below.

Grade	Marks
1.3000624	86
-0.518414	75
1.3000624	91
-0.5756403	65
1.3000624	90
-0.518414	71
1.3000624	89

Entity embedded data
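
Once an embedding has been learned, applying it is a simple lookup. The sketch below does not train a neural network; it assumes the embedding values from the table above are already available and only shows the replacement step. The name grade_embedding is hypothetical.

```python
# Entity embedding lookup sketch. The vectors are NOT learned here; they are
# copied from the table above and treated as an already-trained embedding.
grade_embedding = {
    "A": [1.3000624],
    "B": [-0.518414],
    "C": [-0.5756403],
}

grades = ["A", "B", "A", "C", "A", "B", "A"]
marks = [86, 75, 91, 65, 90, 71, 89]

# Replace each grade with its embedding vector (one column in this example).
embedded_grades = [grade_embedding[grade] for grade in grades]

for vector, mark in zip(embedded_grades, marks):
    print(vector, mark)
```

Each vector is stored as a list so that the same lookup works unchanged when an embedding has more than one column.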

You might be wondering how we obtained the numeric values. To understand this, you can read this article on entity embedding in Python.

Conclusion

In this article, we discussed different categorical data encoding techniques. To learn more about categorical data processing, you can read this article on KModes clustering in Python. You might also like this article on the apriori algorithm numerical example.

I hope you enjoyed reading this article. Stay tuned for more informative articles.

Happy learning!
