Categorical Data Encoding Techniques Explained

To analyze categorical data with most statistical and machine-learning algorithms, we first need to convert it into a numerical format. In this article, we will discuss different encoding techniques for converting categorical data into numeric format.

How to Convert Categorical Data into Numerical Data?

You can use the following encoding techniques to convert categorical data into numeric data.

  1. Label encoding
  2. One-hot encoding
  3. Target encoding
  4. Entity embedding

Let us discuss all the categorical data encoding techniques one by one.

Label Encoding

Label encoding is one of the easiest techniques for converting categorical data into numeric format. In label encoding, we just assign a unique integer to each categorical value. For example, consider that we have the following data containing city names. 

| Sl. No. | City |
|---|---|
| 1 | London |
| 2 | Paris |
| 3 | New York |
| 4 | Mumbai |
| 5 | New Delhi |

Data for label encoding

Now, we will assign a unique integer starting from 0 to each city name as shown below.

| Sl. No. | City | Numeric Value |
|---|---|---|
| 1 | London | 0 |
| 2 | Paris | 1 |
| 3 | New York | 2 |
| 4 | Mumbai | 3 |
| 5 | New Delhi | 4 |

Encoded data

In the above table, we have used label encoding to convert the categorical data to a numeric format. Here, you can observe that the numeric labels carry no meaning of their own. For instance, we have assigned the value 0 to London and 4 to New Delhi arbitrarily. Even if we assign 0 to New Delhi and 4 to London, the meaning of the data won’t change. However, statistical and machine-learning algorithms that treat the numbers as magnitudes might misinterpret them as a ranking in which New Delhi (4) is greater than London (0). Hence, label encoding is generally unsuitable for nominal data.
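The assignment described above can be sketched in a few lines of Python. This is a minimal illustration, not a library implementation: the helper name `label_encode` is hypothetical, and it numbers the values in order of first appearance so that the result matches the table. Library encoders may number the values differently; for example, scikit-learn's `LabelEncoder` assigns integers in sorted order of the labels.

```python
# Cities from the example table.
cities = ["London", "Paris", "New York", "Mumbai", "New Delhi"]


def label_encode(values):
    """Map each distinct value to an integer, in order of first appearance."""
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)
    return [mapping[value] for value in values], mapping


codes, mapping = label_encode(cities)
print(mapping)  # London -> 0, Paris -> 1, New York -> 2, Mumbai -> 3, New Delhi -> 4
print(codes)    # [0, 1, 2, 3, 4]
```

Reordering the input list changes which integer each city receives, which is exactly why these numbers should not be read as a ranking.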

We can use label encoding for ordinal data types. Ordinal categorical data has an intrinsic order. Due to this, if we assign numeric labels in order of the rankings of the categorical labels, we will get meaningful numeric labels.

For example, consider that we have the following categories of customer reviews.

| Sl. No. | Review Label |
|---|---|
| 1 | Very Poor |
| 2 | Poor |
| 3 | Average |
| 4 | Good |
| 5 | Very Good |

Ordinal data for label encoding

Now, let us use label encoding to convert the categorical labels to numeric format as shown below.

| Sl. No. | Review Label | Numeric Value |
|---|---|---|
| 1 | Very Poor | 0 |
| 2 | Poor | 1 |
| 3 | Average | 2 |
| 4 | Good | 3 |
| 5 | Very Good | 4 |

Encoded data

In the above table, we have numbered the categorical values in increasing order of review quality. Here, the numeric values have a specific meaning, as the worst review has been assigned the value 0 and the best review has been assigned the value 4.

Hence, we can say that a review with a value of 3 is better than a review with a value of 1. Thus, we can use label encoding to convert ordinal categorical data into numeric form without losing the order of the data labels.
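For ordinal data, the important detail is that the order of the categories is stated explicitly rather than left to chance. The sketch below assumes the five review labels from the table; the function name `encode_ordinal` and the sample list `reviews` are made up for illustration.

```python
# Review labels listed from worst to best; a label's position is its code.
REVIEW_ORDER = ["Very Poor", "Poor", "Average", "Good", "Very Good"]


def encode_ordinal(values, order):
    """Replace each label with its position in the explicitly given order."""
    rank = {label: position for position, label in enumerate(order)}
    return [rank[label] for label in values]


# Hypothetical sample of reviews, in no particular order.
reviews = ["Good", "Very Poor", "Average", "Very Good", "Poor"]
encoded = encode_ordinal(reviews, REVIEW_ORDER)
print(encoded)  # [3, 0, 2, 4, 1]
```

Because the codes follow the stated order, comparisons such as 3 > 1 (Good is better than Poor) remain meaningful after encoding.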

One-Hot Encoding

For encoding nominal data, one-hot encoding is a better technique than label encoding. In one-hot encoding, we transform the categories into an array of 0s and 1s.

  • In the array, the number of columns is equal to the number of unique values in the categorical data.
  • Each column in the array corresponds to one unique categorical value and acts as a new variable.
  • Each row in the array corresponds to a data point.
  • To populate the cell at a particular row and column, we check whether the value corresponding to the current column is the value originally present in the current row. If yes, we set the current cell to 1. Otherwise, it is set to 0.

To understand this, consider that we have the following data containing 10 rows and 5 unique values. 

| Sl. No. | City |
|---|---|
| 1 | London |
| 2 | Mumbai |
| 3 | New York |
| 4 | New Delhi |
| 5 | Mumbai |
| 6 | Paris |
| 7 | New York |
| 8 | Mumbai |
| 9 | New Delhi |
| 10 | London |

Data for one-hot encoding

As there are 5 unique values in the City column, we will add 5 new columns to the dataset. Here, each column will correspond to a particular categorical value as shown below.

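| Sl. No. | City | City_London | City_Mumbai | City_NewYork | City_Paris | City_NewDelhi |
|---|---|---|---|---|---|---|
| 1 | London | | | | | |
| 2 | Mumbai | | | | | |
| 3 | New York | | | | | |
| 4 | New Delhi | | | | | |
| 5 | Mumbai | | | | | |
| 6 | Paris | | | | | |
| 7 | New York | | | | | |
| 8 | Mumbai | | | | | |
| 9 | New Delhi | | | | | |
| 10 | London | | | | | |

Intermediate table in one-hot encoding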

Next, we will fill each cell in the above table. We set a cell to 1 if the city corresponding to its column is the same as the city given in its row. Otherwise, we set the cell to 0. After this, we will get the following table.

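| Sl. No. | City | City_London | City_Mumbai | City_NewYork | City_Paris | City_NewDelhi |
|---|---|---|---|---|---|---|
| 1 | London | 1 | 0 | 0 | 0 | 0 |
| 2 | Mumbai | 0 | 1 | 0 | 0 | 0 |
| 3 | New York | 0 | 0 | 1 | 0 | 0 |
| 4 | New Delhi | 0 | 0 | 0 | 0 | 1 |
| 5 | Mumbai | 0 | 1 | 0 | 0 | 0 |
| 6 | Paris | 0 | 0 | 0 | 1 | 0 |
| 7 | New York | 0 | 0 | 1 | 0 | 0 |
| 8 | Mumbai | 0 | 1 | 0 | 0 | 0 |
| 9 | New Delhi | 0 | 0 | 0 | 0 | 1 |
| 10 | London | 1 | 0 | 0 | 0 | 0 |

One-hot encoded data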

In the above table, we have converted the categorical column into 5 columns with numerical values. Hence, each city, i.e. each categorical value, is represented using a vector of 0s and 1s. For example, the categorical value London is represented using the vector [1,0,0,0,0].

As with label encoding, the choice is arbitrary: the new columns can be placed in any order, and the vector corresponding to a particular categorical value changes accordingly. However, unlike label encoding, one-hot encoding doesn’t misrepresent the data by introducing an order among the numeric values.
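The cell-filling rule can be sketched in plain Python as follows. The function name `one_hot` is hypothetical, and the column order is simply copied from the table; in practice, library helpers such as `pandas.get_dummies` or scikit-learn's `OneHotEncoder` are normally used for this step.

```python
cities = ["London", "Mumbai", "New York", "New Delhi", "Mumbai",
          "Paris", "New York", "Mumbai", "New Delhi", "London"]

# Column order copied from the table; any fixed order would work.
categories = ["London", "Mumbai", "New York", "Paris", "New Delhi"]


def one_hot(values, categories):
    """Return one row of 0s and 1s per value, with one column per category."""
    return [[1 if value == category else 0 for category in categories]
            for value in values]


encoded = one_hot(cities, categories)
for city, row in zip(cities, encoded):
    print(f"{city:<10}", row)
```

Every row contains exactly one 1, in the column that matches the city of that row.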

Although one-hot encoding solves the problem of misrepresentation of the values, it runs into another major problem. If a categorical attribute has a lot of unique values, the dataset becomes wide and sparse, because we need to add as many columns as there are unique categorical values, and each row has a 1 in only one of them.

For example, if a categorical attribute has 30 unique values, we will have to add 30 columns to the dataset. Also, if there are 5 categorical attributes with 30 unique values each, we need to add 30*5, i.e. 150, new columns to the dataset, and almost all of the cells in these columns will be 0. Thus, one-hot encoding introduces sparsity in the dataset, which is its major drawback.
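The arithmetic in this example can be checked with a short sketch. It assumes that every row has exactly one value per attribute (no missing values), so each row contributes one 1 per attribute; the function names are hypothetical.

```python
def one_hot_width(unique_counts):
    """Total number of one-hot columns, given the unique-value count per attribute."""
    return sum(unique_counts)


def zero_fraction(unique_counts):
    """Fraction of one-hot cells that are 0.

    Each row holds exactly one 1 per attribute, so the number of 1s per row
    equals the number of attributes.
    """
    return 1 - len(unique_counts) / sum(unique_counts)


counts = [30] * 5  # 5 categorical attributes with 30 unique values each
print(one_hot_width(counts))            # 150
print(round(zero_fraction(counts), 3))  # 0.967
```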

Target Encoding

As the name suggests, target encoding replaces each value of a categorical variable with the mean or median of a numeric target variable, computed over the rows that share that value. To understand this, consider the following dataset.

| Grade | Marks |
|---|---|
| A | 86 |
| B | 75 |
| A | 91 |
| C | 65 |
| A | 90 |
| B | 71 |
| A | 89 |

Data for target encoding

If we want to encode the categorical attribute Grade using Target Encoding, we will take the mean of Marks where the grade is A, B, and C separately.

  • The mean of the rows in the Marks column with grade A is 89.
  • The mean of the rows in the Marks column with grade B is 73.
  • The mean of the rows in the Marks column with grade C is 65.

Hence, we will substitute the mean values in place of the grades as shown below.

| Grade | Marks |
|---|---|
| 89 | 86 |
| 73 | 75 |
| 89 | 91 |
| 65 | 65 |
| 89 | 90 |
| 73 | 71 |
| 89 | 89 |

Target-encoded data

Target encoding enables us to perform categorical data encoding easily if there is a numeric target attribute. However, if we don’t have a numeric target attribute, we can’t perform target encoding. 
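The per-grade means used in this example can be reproduced with the standard library alone. This is a minimal sketch of mean target encoding on the example data; the function name `target_encode` is hypothetical.

```python
from collections import defaultdict
from statistics import mean

grades = ["A", "B", "A", "C", "A", "B", "A"]
marks = [86, 75, 91, 65, 90, 71, 89]


def target_encode(categories, targets):
    """Replace each category with the mean target of the rows in that category."""
    groups = defaultdict(list)
    for category, target in zip(categories, targets):
        groups[category].append(target)
    means = {category: mean(values) for category, values in groups.items()}
    return [means[category] for category in categories], means


encoded, means = target_encode(grades, marks)
print(means)    # per-grade means: A -> 89, B -> 73, C -> 65
print(encoded)  # one mean per row, in the original row order
```

Swapping `mean` for `statistics.median` would give the median-based variant mentioned at the start of this section.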

Entity Embedding

Entity embedding is one of the more recent and advanced techniques for encoding categorical data. In entity embedding, we use neural networks to create numerical embeddings for categorical values. Here, each unique value in the categorical column gets its own numerical embedding consisting of one or more columns. The number of embedding columns that replace the categorical column is chosen based on the number of unique values present in the categorical column.

For example, consider that we have the following dataset. 

| Grade | Marks |
|---|---|
| A | 86 |
| B | 75 |
| A | 91 |
| C | 65 |
| A | 90 |
| B | 71 |
| A | 89 |

Data for entity embedding

Now, there are 3 unique values in the Grade column. So, we can replace it with a numerical column using entity embedding as shown below.

| Grade | Marks |
|---|---|
| 1.3000624 | 86 |
| -0.518414 | 75 |
| 1.3000624 | 91 |
| -0.5756403 | 65 |
| 1.3000624 | 90 |
| -0.518414 | 71 |
| 1.3000624 | 89 |

Entity-embedded data

You might be wondering how we obtained these numeric values. To understand this, you can read this article on entity embedding in Python.
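To give a rough feel for the idea, the sketch below learns a one-dimensional embedding for each grade by gradient descent. It is a toy illustration under several assumptions that are not from this article: the "network" is a single linear unit on top of the embedding, the target is the standardized Marks column, and everything is written in plain Python instead of a deep learning framework (in practice, one would typically use an embedding layer such as `torch.nn.Embedding` or `keras.layers.Embedding`). The learned numbers depend on the random initialisation and will not match the values in the table above.

```python
import random

grades = ["A", "B", "A", "C", "A", "B", "A"]
marks = [86, 75, 91, 65, 90, 71, 89]

# Standardize the target so that plain gradient descent behaves well.
n = len(marks)
mean_marks = sum(marks) / n
std_marks = (sum((m - mean_marks) ** 2 for m in marks) / n) ** 0.5
targets = [(m - mean_marks) / std_marks for m in marks]


def mse(embedding, weight, bias, grades, targets):
    """Mean squared error of the prediction weight * embedding[grade] + bias."""
    errors = [weight * embedding[g] + bias - t for g, t in zip(grades, targets)]
    return sum(e * e for e in errors) / len(errors)


# One trainable number per grade (a 1-dimensional embedding), randomly initialised.
random.seed(0)
embedding = {}
for grade in sorted(set(grades)):
    embedding[grade] = random.uniform(-0.5, 0.5)
weight, bias = 1.0, 0.0  # a single linear unit on top of the embedding

initial_loss = mse(embedding, weight, bias, grades, targets)
learning_rate = 0.05
for _ in range(2000):
    # Accumulate full-batch gradients of the mean squared error.
    grad_w, grad_b = 0.0, 0.0
    grad_e = dict.fromkeys(embedding, 0.0)
    for g, t in zip(grades, targets):
        error = weight * embedding[g] + bias - t
        grad_w += 2 * error * embedding[g] / n
        grad_b += 2 * error / n
        grad_e[g] += 2 * error * weight / n
    # Update all parameters only after the gradients have been computed.
    weight -= learning_rate * grad_w
    bias -= learning_rate * grad_b
    for g in embedding:
        embedding[g] -= learning_rate * grad_e[g]
final_loss = mse(embedding, weight, bias, grades, targets)

# Replace each grade with its learned embedding value.
encoded = [embedding[g] for g in grades]
print(embedding)
print(initial_loss, final_loss)
```

The point of the sketch is that the embedding values are parameters fitted to a prediction task, so grades that behave similarly with respect to the target tend to end up with similar numbers.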

Conclusion

In this article, we discussed different categorical data encoding techniques. To learn more about categorical data processing, you can read this article on KModes clustering in Python. You might also like this article on the Apriori algorithm numerical example.

I hope you enjoyed reading this article. Stay tuned for more informative articles. 

Happy learning!