Categorical Data Encoding Techniques Explained
To analyze categorical data, we often need to convert it into a numerical format. In this article, we will discuss different encoding techniques for converting categorical data into a numeric format.
How to Convert Categorical Data into Numerical Data?
You can use the following encoding techniques to convert categorical data into numeric data.
- Label encoding
- One-hot encoding
- Target encoding
- Entity embedding
Let us discuss all the categorical data encoding techniques one by one.
Label Encoding
Label encoding is one of the easiest techniques for converting categorical data into numeric format. In label encoding, we just assign a unique integer to each categorical value. For example, consider that we have the following data containing city names.
Sl. No. | City |
1 | London |
2 | Paris |
3 | New York |
4 | Mumbai |
5 | New Delhi |
Now, we will assign a unique integer starting from 0 to each city name as shown below.
Sl. No. | City | Numeric Value |
1 | London | 0 |
2 | Paris | 1 |
3 | New York | 2 |
4 | Mumbai | 3 |
5 | New Delhi | 4 |
In the above table, we have used label encoding to convert the categorical data to a numeric format. Here, you can observe that the numeric labels have no meaning. For instance, we have assigned the value 0 to London and 4 to New Delhi arbitrarily. Even if we assign 0 to New Delhi and 4 to London, the meaning of the data won’t change. However, statistical and machine-learning algorithms might misinterpret these values as a ranking, treating New Delhi (4) as greater than London (0). Hence, label encoding is generally unsuitable for nominal data types.
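To make the idea concrete, here is a minimal sketch of label encoding in plain Python. The helper name `label_encode` is our own illustration and not part of any library, and it assigns integers in order of first appearance, which is an assumption that matches the table above. Library encoders may assign the integers in a different order, which makes no difference for nominal data.

```python
def label_encode(values):
    """Map each distinct value to an integer, in order of first appearance."""
    mapping = {}
    for value in values:
        if value not in mapping:
            # The next unused integer becomes the label for a new value.
            mapping[value] = len(mapping)
    return [mapping[value] for value in values], mapping


cities = ["London", "Paris", "New York", "Mumbai", "New Delhi"]
codes, mapping = label_encode(cities)
print(codes)    # [0, 1, 2, 3, 4]
print(mapping)
```

Repeated values receive the same integer, so a column with duplicate city names is encoded consistently.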
We can use label encoding for ordinal data types. Ordinal categorical data has an intrinsic order. Due to this, if we assign numeric labels in order of the rankings of the categorical labels, we will get meaningful numeric labels.
For example, consider that we have the following categories of customer reviews.
Sl. No. | Review Label |
1 | Very Poor |
2 | Poor |
3 | Average |
4 | Good |
5 | Very Good |
Now, let us use label encoding to convert the categorical labels to numeric format as shown below.
Sl. No. | Review Label | Numeric Value |
1 | Very Poor | 0 |
2 | Poor | 1 |
3 | Average | 2 |
4 | Good | 3 |
5 | Very Good | 4 |
In the above table, we have labeled the categorical values in increasing order of the review label. Here, the numeric values have a specific meaning as the worst review has been assigned the value 0 and the best review has been assigned the value 4.
Hence, we can say that a review with a value of 3 is better than a review with a value of 1. Thus, we can use label encoding to convert ordinal categorical data into numeric form without losing the meaning of the data labels.
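For ordinal data, the order of the labels has to be supplied explicitly instead of being inferred from the data. The following sketch assumes the five review labels shown above. The names `REVIEW_ORDER`, `REVIEW_RANK`, and `encode_reviews` are hypothetical and used only for illustration.

```python
# Labels listed from worst to best; the list position becomes the numeric value.
REVIEW_ORDER = ["Very Poor", "Poor", "Average", "Good", "Very Good"]
REVIEW_RANK = {label: rank for rank, label in enumerate(REVIEW_ORDER)}


def encode_reviews(labels):
    """Replace each review label with its rank in REVIEW_ORDER."""
    return [REVIEW_RANK[label] for label in labels]


reviews = ["Good", "Very Poor", "Average", "Very Good", "Poor"]
encoded = encode_reviews(reviews)
print(encoded)  # [3, 0, 2, 4, 1]
```

Because the mapping follows the ranking, comparisons such as 3 > 1 carry the same meaning as "Good is better than Poor".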
One Hot Encoding
For encoding nominal data, one hot encoding is a better technique than label encoding. In one hot encoding, we transform the categories into an array of 0s and 1s.
- In the array, the number of columns is equal to the number of unique values in the categorical data.
- Each column in the array corresponds to a unique categorical value and acts as a new variable.
- Each row in the array corresponds to a data point.
- To populate the cell at a particular row and column, we check whether the value corresponding to the current column is the one originally present in the current row. If yes, we set the current cell to 1. Otherwise, it is set to 0.
To understand this, consider that we have the following data containing 10 rows and 5 unique values.
Sl. No. | City |
1 | London |
2 | Mumbai |
3 | New York |
4 | New Delhi |
5 | Mumbai |
6 | Paris |
7 | New York |
8 | Mumbai |
9 | New Delhi |
10 | London |
As there are 5 unique values in the City column, we will add 5 new columns to the dataset. Here, each column will correspond to a particular categorical value as shown below.
Sl. No. | City | City_London | City_Mumbai | City_NewYork | City_Paris | City_NewDelhi |
1 | London | | | | | |
2 | Mumbai | | | | | |
3 | New York | | | | | |
4 | New Delhi | | | | | |
5 | Mumbai | | | | | |
6 | Paris | | | | | |
7 | New York | | | | | |
8 | Mumbai | | | | | |
9 | New Delhi | | | | | |
10 | London | | | | | |
In the above table, we will fill each cell. For this, we will set each cell to 1 if the city corresponding to the particular column is the same as the city given in the same row. Otherwise, we will fill the value 0 in the given cell. After this, we will get the following table.
Sl. No. | City | City_London | City_Mumbai | City_NewYork | City_Paris | City_NewDelhi |
1 | London | 1 | 0 | 0 | 0 | 0 |
2 | Mumbai | 0 | 1 | 0 | 0 | 0 |
3 | New York | 0 | 0 | 1 | 0 | 0 |
4 | New Delhi | 0 | 0 | 0 | 0 | 1 |
5 | Mumbai | 0 | 1 | 0 | 0 | 0 |
6 | Paris | 0 | 0 | 0 | 1 | 0 |
7 | New York | 0 | 0 | 1 | 0 | 0 |
8 | Mumbai | 0 | 1 | 0 | 0 | 0 |
9 | New Delhi | 0 | 0 | 0 | 0 | 1 |
10 | London | 1 | 0 | 0 | 0 | 0 |
In the above table, we have converted the categorical column into 5 columns with numerical values. Hence, each city, i.e. each categorical value, is represented using a vector of 0s and 1s. For example, the categorical value London is represented using the vector [1,0,0,0,0].
Again, the new columns can be arranged in any order, so the vector corresponding to a particular categorical value depends on the column order we choose. However, one-hot encoding doesn’t misrepresent the data by introducing any order in the numeric values, unlike the label encoding approach.
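The procedure described in this section can be sketched in a few lines of plain Python. The function name `one_hot_encode` is hypothetical, and the column order is passed in explicitly so that it matches the table above.

```python
def one_hot_encode(values, categories):
    """Return one row of 0s and 1s per value, with one column per category."""
    return [
        [1 if value == category else 0 for category in categories]
        for value in values
    ]


# Column order chosen to match the table: London, Mumbai, New York, Paris, New Delhi.
categories = ["London", "Mumbai", "New York", "Paris", "New Delhi"]
cities = ["London", "Mumbai", "New York", "New Delhi", "Mumbai",
          "Paris", "New York", "Mumbai", "New Delhi", "London"]

rows = one_hot_encode(cities, categories)
for city, row in zip(cities, rows):
    print(city, row)
```

Each row contains exactly one 1, in the column of the city present in that row.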
Although one hot encoding solves the problem of misrepresentation of the values, it runs into another major problem. If there are a lot of unique values for a particular categorical attribute, the dataset will become wide and sparse, as we need to add as many columns as the number of unique categorical values, and every row has a 1 in only one of them.
For example, if a categorical attribute has 30 unique values, we will have to add 30 columns to the dataset. Also, if there are 5 categorical attributes with 30 unique values each, we need to add 30 × 5 = 150 new columns to the dataset. Due to this, the dataset will become very sparse. Thus, one hot encoding introduces sparsity in the dataset, which is its major drawback.
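The arithmetic above can be checked with a small sketch. The helper `one_hot_stats` is our own illustrative name. It assumes every row has exactly one value per categorical attribute, so each row contributes exactly one 1 per attribute and the remaining new cells are 0.

```python
def one_hot_stats(unique_counts):
    """Return (number of new columns, fraction of zero cells) after one-hot encoding."""
    total_columns = sum(unique_counts)
    # Each row has exactly one 1 for every encoded attribute.
    ones_per_row = len(unique_counts)
    zero_fraction = (total_columns - ones_per_row) / total_columns
    return total_columns, zero_fraction


# 5 categorical attributes with 30 unique values each.
columns, zero_fraction = one_hot_stats([30] * 5)
print(columns)                  # 150
print(round(zero_fraction, 3))  # 0.967
```

In this example, roughly 97% of the cells in the new columns are zeros.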
Target Encoding
As the name suggests, target encoding replaces each categorical value with the mean or median of a numeric target variable, computed over the rows that share that categorical value. To understand this, consider the following dataset.
Grade | Marks |
A | 86 |
B | 75 |
A | 91 |
C | 65 |
A | 90 |
B | 71 |
A | 89 |
If we want to encode the categorical attribute Grade using Target Encoding, we will take the mean of Marks where the grade is A, B, and C separately.
- The mean of the rows in the Marks column with grade A is 89.
- The mean of the rows in the Marks column with grade B is 73.
- The mean of the rows in the Marks column with grade C is 65.
Hence, we will substitute the mean values in place of the grades as shown below.
Grade | Marks |
89 | 86 |
73 | 75 |
89 | 91 |
65 | 65 |
89 | 90 |
73 | 71 |
89 | 89 |
Target encoding enables us to perform categorical data encoding easily if there is a numeric target attribute. However, if we don’t have a numeric target attribute, we can’t perform target encoding.
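The calculation above can be reproduced with a short sketch in plain Python. The function name `target_encode` is hypothetical, and the sketch uses the mean, although the median could be substituted in the same way.

```python
from collections import defaultdict


def target_encode(categories, targets):
    """Replace each category with the mean target value of its rows."""
    grouped = defaultdict(list)
    for category, target in zip(categories, targets):
        grouped[category].append(target)
    # Mean of the target values for each category.
    means = {category: sum(vals) / len(vals) for category, vals in grouped.items()}
    return [means[category] for category in categories], means


grades = ["A", "B", "A", "C", "A", "B", "A"]
marks = [86, 75, 91, 65, 90, 71, 89]

encoded, means = target_encode(grades, marks)
print(means)    # {'A': 89.0, 'B': 73.0, 'C': 65.0}
print(encoded)  # [89.0, 73.0, 89.0, 65.0, 89.0, 73.0, 89.0]
```

The values match the means of 89, 73, and 65 computed by hand for grades A, B, and C.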
Entity Embedding
Entity embedding is one of the more recent and advanced techniques for encoding categorical data. In entity embedding, we use neural networks to create numerical embeddings for categorical values. Here, we first create a unique numerical embedding consisting of one or more columns for each unique value in the categorical column. The number of embedding columns that replace the categorical column is decided using the number of unique values present in the categorical column.
For example, consider that we have the following dataset.
Grade | Marks |
A | 86 |
B | 75 |
A | 91 |
C | 65 |
A | 90 |
B | 71 |
A | 89 |
Now, there are 3 unique values in the Grade column. So, we can replace it with a numerical column using entity embedding as shown below.
Grade | Marks |
1.3000624 | 86 |
-0.518414 | 75 |
1.3000624 | 91 |
-0.5756403 | 65 |
1.3000624 | 90 |
-0.518414 | 71 |
1.3000624 | 89 |
I understand that you might be wondering how we obtained the numeric values. To understand this, you can read this article on entity embedding in Python.
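Training the neural network is outside the scope of this article, but once the embedding has been learned, applying it is just a lookup. The sketch below shows only that lookup step. The numbers are copied from the table above and are not computed by this code, and the names `GRADE_EMBEDDING` and `embed` are hypothetical.

```python
# One learned vector per category; here each vector has a single dimension.
# These values are copied from the table above, not learned by this code.
GRADE_EMBEDDING = {
    "A": [1.3000624],
    "B": [-0.518414],
    "C": [-0.5756403],
}


def embed(values, table):
    """Replace each categorical value with its embedding vector."""
    return [table[value] for value in values]


grades = ["A", "B", "A", "C", "A", "B", "A"]
embedded = embed(grades, GRADE_EMBEDDING)
for grade, vector in zip(grades, embedded):
    print(grade, vector)
```

With a larger embedding size, each vector would simply hold more numbers, and the categorical column would be replaced by that many numeric columns.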
Conclusion
In this article, we discussed different categorical data encoding techniques. To learn more about categorical data processing, you can read this article on KModes clustering in Python. You might also like this article on the apriori algorithm numerical example.
I hope you enjoyed reading this article. Stay tuned for more informative articles.
Happy learning!