Entity Embedding in Python

We often use categorical data encoding techniques such as label encoding and one hot encoding during data preprocessing. While these techniques offer an easy way to convert categorical data to a numeric format, the resulting representations often fail to reflect how the categories relate to each other: label encoding implies an order that may not exist, and one hot encoding treats every pair of values as equally different. In this article, we will discuss how to perform entity embedding to convert categorical data into a numeric format that can preserve these relationships. We will also implement entity embedding in Python using the TensorFlow and Keras modules.

What is Entity Embedding?

Entity embedding is a technique in which we use neural networks to convert categorical data to a numerical format. In entity embedding, we represent categorical values in a tabular dataset using continuous numeric values in multiple dimensions.

For example, consider that we have the following data. 

Name           City
John Smith     New York
Aditya Raj     Mumbai
Will Smith     London
Harsh Aryan    London
Joel Harrison  Mumbai
Bill Warner    Paris
Chris Kite     New York
Sam Altman     London
Joe            London
Data For Entity Embedding

If we convert the City column into a numerical format using entity embedding, we will get an output as follows.

Name           City       City_1     City_2
John Smith     New York   0.721373   0.392310
Aditya Raj     Mumbai    -1.045558  -0.285206
Will Smith     London    -1.100385   0.259384
Harsh Aryan    London    -1.100385   0.259384
Joel Harrison  Mumbai    -1.045558  -0.285206
Bill Warner    Paris     -0.260839   1.056758
Chris Kite     New York   0.721373   0.392310
Sam Altman     London    -1.100385   0.259384
Joe            London    -1.100385   0.259384
Data Encoded Using Entity Embedding

In the above table, you can observe that we have created two new columns City_1 and City_2. These columns represent the categorical values in the City column. But, how did we get these values? Let’s see.

What Are Embeddings, Really?

Embeddings are continuous vector representations assigned to categorical variables. When we train a neural network on categorical values, the embedding vectors are learned as weights of the network during training. After training, these vectors capture the underlying relationships and similarities between different categorical values.

By representing categorical variables as continuous embedding vectors, we can effectively capture complex relationships and similarities between the values in a column. After creating the embeddings, we can use them as input to machine learning models to perform tasks like classification, regression, or recommendation.
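To make the idea concrete, here is a minimal sketch of what an embedding does once its values exist: it is a lookup table with one row of numbers per category. The function names are hypothetical, and the weights are simply copied from the example table above for illustration. The sketch also shows that the lookup gives the same result as multiplying a one hot vector by the weight matrix.

```python
vocabulary = ["New York", "Mumbai", "London", "Paris"]

# One row of weights per category; values copied from the example table above.
weights = [
    [0.721373, 0.392310],    # New York
    [-1.045558, -0.285206],  # Mumbai
    [-1.100385, 0.259384],   # London
    [-0.260839, 1.056758],   # Paris
]


def lookup(value):
    """Embedding lookup: pick the weight row that belongs to the category."""
    return weights[vocabulary.index(value)]


def one_hot_times_weights(value):
    """The same result written as a one hot vector multiplied by the weights."""
    one_hot = [1.0 if item == value else 0.0 for item in vocabulary]
    n_dims = len(weights[0])
    return [
        sum(one_hot[row] * weights[row][col] for row in range(len(vocabulary)))
        for col in range(n_dims)
    ]


print(lookup("London"))
print(one_hot_times_weights("London") == lookup("London"))
```

In a real model, the rows of the weight matrix are not typed in by hand. They start as random numbers and are adjusted during training.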

How Many Dimensions Should We Create For a Column During Entity Embedding?

The vectors created using entity embedding are typically low-dimensional and dense. This is in contrast to the high-dimensional and sparse representations produced by traditional methods like one hot encoding. In entity embedding, each categorical value is mapped to a fixed-size vector, where each element of the vector represents a learned feature or attribute of the category. Here, you need to keep in mind that higher-dimensional embeddings can represent the relationships between values in a column more accurately.

However, increasing the number of dimensions in the embedding vectors increases the chance of overfitting and slows down the training of the model. Hence, we use an empirical rule of thumb and set the number of dimensions in the embedding vector to roughly the fourth root of the number of unique values in the column, i.e. ∜(unique values in a column).
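The rule of thumb can be sketched as a small helper. The function name is hypothetical, and rounding the fourth root up to the next whole number is an assumption, since the rule itself does not say how to round. The helper uses integer arithmetic so that exact fourth powers are not affected by floating-point error.

```python
def embedding_dim(n_unique):
    """Smallest whole number d with d**4 >= n_unique (fourth root, rounded up)."""
    if n_unique < 1:
        raise ValueError("n_unique must be at least 1")
    d = 1
    while d ** 4 < n_unique:
        d += 1
    return d


for n in (4, 50, 1000, 10000):
    print(n, "unique values ->", embedding_dim(n), "dimensions")
```

With this rounding, a column with four unique values gets two dimensions, and a column with 10000 unique values gets ten.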

Why Should We Use Entity Embedding to Convert Categorical Data into Numerical Format?

We already have simpler techniques like label encoding and one hot encoding to convert categorical data into numerical format. Then, why should we use entity embeddings?

Following are some of the reasons why we should use entity embedding instead of one hot encoding while converting categorical data to a numerical format.

  • Entity embedding produces a more compact numerical representation than one hot encoding. If there are N unique values in a column, one hot encoding generates N new columns while converting the data into a numerical format. Entity embedding, on the other hand, can represent the same data in roughly ∜N features without losing much information. Hence, entity embedding reduces sparsity in the data to a large extent.
  • In one hot encoding, the numeric values are 0s and 1s stored in a sparse manner. Entity embedding produces dense, continuous values instead. Because of this, models trained on entity embeddings often perform better, as the values can reflect the relationships between the data points.
  • One hot encoding ignores the relations between different values in a column. In contrast, entity embeddings can map related values closer together in the embedding space, which preserves the inherent continuity of the data.

Looking at the above benefits, you can see that entity embeddings are often a better option than one hot encoding, especially for columns with many unique values. Hence, entity embedding is worth considering whenever we convert categorical data to a numerical format during data preprocessing.
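The difference in how the two encodings treat relationships can be illustrated with a short sketch. The embedding values are copied from the example table earlier in the article and are only illustrative. The point is that all one hot vectors are equally far apart, while embedding vectors can sit at different distances from each other.

```python
import math

cities = ["New York", "Mumbai", "London", "Paris"]

# One hot encoding: one column per unique value.
one_hot_vectors = {
    city: [1.0 if other == city else 0.0 for other in cities]
    for city in cities
}

# Illustrative 2-D embeddings copied from the example table.
embedding_vectors = {
    "New York": [0.721373, 0.392310],
    "Mumbai": [-1.045558, -0.285206],
    "London": [-1.100385, 0.259384],
    "Paris": [-0.260839, 1.056758],
}


def pairwise_distances(vectors):
    """Set of rounded Euclidean distances between all pairs of distinct cities."""
    return {
        round(math.dist(vectors[a], vectors[b]), 6)
        for a in cities
        for b in cities
        if a != b
    }


one_hot_distances = pairwise_distances(one_hot_vectors)
embedding_distances = pairwise_distances(embedding_vectors)

print(len(one_hot_vectors["London"]), "columns for one hot encoding")
print(len(embedding_vectors["London"]), "columns for the embedding")
print("Distinct one hot distances:", len(one_hot_distances))
print("Distinct embedding distances:", len(embedding_distances))
```

Every pair of one hot vectors is the square root of 2 apart, so the encoding cannot express that two cities are more alike than two others. The embedding space has room for such differences.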

How to Perform Entity Embedding in Python?

To perform entity embedding in Python, we will use the TensorFlow and Keras modules. For this, we will use different functions as discussed below.

The categorical_column_with_vocabulary_list() Function

We use the categorical_column_with_vocabulary_list() function to create a VocabularyListCategoricalColumn object. It takes the column name as its first input argument and a list of unique values in the particular column as its second input argument. After execution, it returns a VocabularyListCategoricalColumn object. You can observe this in the following example.

import pandas as pd
import tensorflow as tf
from tensorflow.keras import callbacks, layers
vocab_list=tf.feature_column.categorical_column_with_vocabulary_list("Cities", ["New York", "Mumbai", "London", "Paris"])
print(vocab_list)

Output:

VocabularyListCategoricalColumn(key='Cities', vocabulary_list=('New York', 'Mumbai', 'London', 'Paris'), dtype=tf.string, default_value=-1, num_oov_buckets=0)

In the above code, we have passed “Cities” as the feature name and four values in the vocabulary list. Here, you need to make sure that the values in the list are unique. Otherwise, the program will raise an error.
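If the values come from a column that contains repeats, one simple way to build a duplicate-free vocabulary while keeping the original order is sketched below. The variable names are hypothetical.

```python
city_column = ["New York", "Mumbai", "London", "London", "Mumbai",
               "Paris", "New York", "London", "London"]

# dict.fromkeys() keeps the first occurrence of each value and preserves order.
vocabulary = list(dict.fromkeys(city_column))

print(vocabulary)  # ['New York', 'Mumbai', 'London', 'Paris']
```

The resulting list can be passed as the second argument to categorical_column_with_vocabulary_list().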

After creating the VocabularyListCategoricalColumn, we can use the embedding_column() function to define an embedding for the column.

The embedding_column() Function

The embedding_column() function takes the VocabularyListCategoricalColumn object as its first input argument and the desired number of features in the embedded data as its second input argument. After execution, it returns an EmbeddingColumn object as shown below. This object only describes the embedding. Its weights are created, with random initial values, when a layer that uses the column is built.

import pandas as pd
import tensorflow as tf
from tensorflow.keras import callbacks, layers
vocab_list=tf.feature_column.categorical_column_with_vocabulary_list("Cities", ["New York", "Mumbai", "London", "Paris"])
embedding_column=tf.feature_column.embedding_column(vocab_list,dimension=2)
print(embedding_column)

Output:

EmbeddingColumn(categorical_column=VocabularyListCategoricalColumn(key='Cities', vocabulary_list=('New York', 'Mumbai', 'London', 'Paris'), dtype=tf.string, default_value=-1, num_oov_buckets=0), dimension=2, combiner='mean', initializer=<tensorflow.python.ops.init_ops.TruncatedNormal object at 0x7f3c77a21b70>, ckpt_to_load_from=None, tensor_name_in_ckpt=None, max_norm=None, trainable=True, use_safe_embedding_lookup=True)

In the above code, we have created an EmbeddingColumn by specifying the dimensions of the embeddings as 2. We can use this EmbeddingColumn object to generate entity embeddings using the DenseFeatures() function.

The DenseFeatures() Function

The DenseFeatures() function takes the EmbeddingColumn object as its input argument and returns a DenseFeatures layer object as shown below.

import pandas as pd
import tensorflow as tf
from tensorflow.keras import callbacks, layers
vocab_list=tf.feature_column.categorical_column_with_vocabulary_list("Cities", ["New York", "Mumbai", "London", "Paris"])
embedding_column=tf.feature_column.embedding_column(vocab_list,dimension=2)
feature_layer=layers.DenseFeatures(embedding_column)
print(feature_layer)

Output:

<keras.feature_column.dense_features_v2.DenseFeatures object at 0x7f3c34723820>

Create Entity Embeddings Using The DenseFeatures() Function

We can use the DenseFeatures layer to generate entity embeddings for a column in our data. For this, we call the layer with a dictionary that maps the column name we passed to the categorical_column_with_vocabulary_list() function to a list of values in that column. After execution, the call returns a Tensor object with the embeddings as shown below.

import pandas as pd
import tensorflow as tf
from tensorflow.keras import callbacks, layers
vocab_list=tf.feature_column.categorical_column_with_vocabulary_list("Cities", ["New York", "Mumbai", "London", "Paris"])
embedding_column=tf.feature_column.embedding_column(vocab_list,dimension=2)
feature_layer=layers.DenseFeatures(embedding_column)
value_dict={"Cities": ["New York", "Mumbai", "London", "London","Mumbai","Paris","New York","London", "London"]}
tensor_obj=feature_layer(value_dict)
print(tensor_obj)

Output:

tf.Tensor(
[[-1.001625   -0.76165915]
 [ 0.25127193 -0.481     ]
 [ 0.5141091   0.18663265]
 [ 0.5141091   0.18663265]
 [ 0.25127193 -0.481     ]
 [-0.13489066 -0.5079209 ]
 [-1.001625   -0.76165915]
 [ 0.5141091   0.18663265]
 [ 0.5141091   0.18663265]], shape=(9, 2), dtype=float32)

In this code, we have passed a dictionary containing “Cities” as its key and a list containing different city names as its associated value to the DenseFeatures layer. After execution, we get a Tensor object containing nine 2-D vectors. Here, each vector represents one categorical value from the list given in the dictionary. You can observe that the same values get the same embedding vector as the output.

You can convert the above embeddings into a numpy array by invoking the numpy() method on the Tensor object. 

import pandas as pd
import tensorflow as tf
from tensorflow.keras import callbacks, layers
vocab_list=tf.feature_column.categorical_column_with_vocabulary_list("Cities", ["New York", "Mumbai", "London", "Paris"])
embedding_column=tf.feature_column.embedding_column(vocab_list,dimension=2)
feature_layer=layers.DenseFeatures(embedding_column)
value_dict={"Cities": ["New York", "Mumbai", "London", "London","Mumbai","Paris","New York","London", "London"]}
tensor_obj=feature_layer(value_dict)
feature_matrix=tensor_obj.numpy()
print(feature_matrix)

Output:

[[-0.23178566  1.0528516 ]
 [-1.3448706   0.08130983]
 [ 0.6036284   0.01220271]
 [ 0.6036284   0.01220271]
 [-1.3448706   0.08130983]
 [ 0.00780506  0.10220684]
 [-0.23178566  1.0528516 ]
 [ 0.6036284   0.01220271]
 [ 0.6036284   0.01220271]]

Finally, you can convert the numpy array to dataframe columns for representing the values given in the input as shown below. 

import pandas as pd
import tensorflow as tf
from tensorflow.keras import callbacks, layers
vocab_list=tf.feature_column.categorical_column_with_vocabulary_list("Cities", ["New York", "Mumbai", "London", "Paris"])
embedding_column=tf.feature_column.embedding_column(vocab_list,dimension=2)
feature_layer=layers.DenseFeatures(embedding_column)
value_dict={"Cities": ["New York", "Mumbai", "London", "London","Mumbai","Paris","New York","London", "London"]}
tensor_obj=feature_layer(value_dict)
feature_matrix=tensor_obj.numpy()
df=pd.DataFrame(feature_matrix,columns=["City_1","City_2"])
print("The dataframe with embeddings is:")
print(df)

Output:

The dataframe with embeddings is:
     City_1    City_2
0  0.758967 -0.290070
1 -0.756442 -0.193602
2  1.143431  0.574248
3  1.143431  0.574248
4 -0.756442 -0.193602
5  0.210023 -0.441719
6  0.758967 -0.290070
7  1.143431  0.574248
8  1.143431  0.574248
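TensorFlow's documentation marks the tf.feature_column functions as not recommended for new code and points to Keras preprocessing layers instead. As a hedged sketch of the same lookup with that approach, and not the method used in the rest of this article, the following code combines the StringLookup and Embedding layers. It assumes a TensorFlow version that provides both layers under tf.keras.layers.

```python
import tensorflow as tf

cities = ["New York", "Mumbai", "London", "Paris"]
values = tf.constant(["New York", "Mumbai", "London", "London", "Mumbai",
                      "Paris", "New York", "London", "London"])

# StringLookup maps each string to an integer index. By default, index 0
# is reserved for values that are not in the vocabulary.
lookup = tf.keras.layers.StringLookup(vocabulary=cities)

# Embedding holds one randomly initialized, trainable vector per index.
embedding = tf.keras.layers.Embedding(
    input_dim=lookup.vocabulary_size(), output_dim=2
)

vectors = embedding(lookup(values))
print(vectors.shape)
```

As with the DenseFeatures layer, the vectors are random until the Embedding layer is trained as part of a model.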

Entity Embedding on a Pandas DataFrame in Python

To perform entity embedding on a column in a pandas dataframe, we will first obtain the unique values in the given column. Then, we will create embeddings for the values as discussed in the previous sections. Finally, we will add the columns containing the embedding values to the original dataframe as shown in the following example.

import pandas as pd
import tensorflow as tf
from tensorflow.keras import layers

# Load the input data.
df = pd.read_csv("sample_file.csv")
print("The input dataframe is:")
print(df)

# The vocabulary must contain each city only once.
column_values = df["City"].unique()
vocab_list = tf.feature_column.categorical_column_with_vocabulary_list("City", column_values)
embedding_column = tf.feature_column.embedding_column(vocab_list, dimension=2)
feature_layer = layers.DenseFeatures(embedding_column)

# Look up one embedding vector for each row of the "City" column.
value_dict = {"City": df["City"].values}
tensor_obj = feature_layer(value_dict)
feature_matrix = tensor_obj.numpy()

# Add the two embedding dimensions to the original dataframe.
df_columns = pd.DataFrame(feature_matrix, columns=["City_1", "City_2"])
df[["City_1", "City_2"]] = df_columns
print("The dataframe with embeddings is:")
print(df)

Output:

The input dataframe is:
            Name      City
0     John Smith  New York
1     Aditya Raj    Mumbai
2     Will Smith    London
3    Harsh Aryan    London
4  Joel Harrison    Mumbai
5    Bill Warner     Paris
6     Chris Kite  New York
7     Sam Altman    London
8            Joe    London
The dataframe with embeddings is:
            Name      City    City_1    City_2
0     John Smith  New York  0.969665 -0.645429
1     Aditya Raj    Mumbai -0.320367  0.248256
2     Will Smith    London  0.710551 -0.027302
3    Harsh Aryan    London  0.710551 -0.027302
4  Joel Harrison    Mumbai -0.320367  0.248256
5    Bill Warner     Paris  0.177772 -0.151322
6     Chris Kite  New York  0.969665 -0.645429
7     Sam Altman    London  0.710551 -0.027302
8            Joe    London  0.710551 -0.027302

In the above example, we first loaded the data given in the previous table into a pandas dataframe. Then, we extracted the unique values in the "City" column using the unique() method. After this, we created embeddings of the data using the functions discussed in the previous sections. Finally, we created a new dataframe from the embedding values and added its columns to the original dataframe.
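The last step, attaching the embedding values to the dataframe, can also be sketched without TensorFlow. The example below assumes hypothetical per-city vectors and a dataframe whose index does not start at 0. It generates the column names from the number of dimensions and builds the embedding frame on the original index so that the rows stay aligned.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["John Smith", "Aditya Raj", "Will Smith", "Bill Warner"],
        "City": ["New York", "Mumbai", "London", "Paris"],
    },
    index=[10, 11, 12, 13],  # deliberately not the default 0..n-1 index
)

# Hypothetical per-city vectors; in practice they come from the embedding layer.
city_vectors = {
    "New York": [0.721373, 0.392310],
    "Mumbai": [-1.045558, -0.285206],
    "London": [-1.100385, 0.259384],
    "Paris": [-0.260839, 1.056758],
}

# One row of embedding values per dataframe row, built on the original index.
embedded = pd.DataFrame(
    [city_vectors[city] for city in df["City"]],
    index=df.index,
)
embedded.columns = [f"City_{i + 1}" for i in range(embedded.shape[1])]

result = pd.concat([df, embedded], axis=1)
print(result)
```

Generating the names this way means the same code works when the embedding has more than two dimensions.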

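In the outputs, you can observe that we get different embeddings for the same values every time we run the code. This happens because the embedding weights are initialized randomly, and in these examples we only read the initial weights without training the layer on any target. Hence, it is important to store the embeddings, or at least the DenseFeatures layer that produced them, so that you can reproduce the results during data preprocessing.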
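One simple way to store the embeddings is to save the mapping from each category to its vector in a JSON file and load it again when needed. The following is a minimal sketch that assumes the vectors have already been collected into a dictionary. The dictionary contents and the file name are hypothetical.

```python
import json
import os
import tempfile

# Hypothetical embeddings collected from one run of the embedding layer.
city_vectors = {
    "New York": [0.969665, -0.645429],
    "Mumbai": [-0.320367, 0.248256],
    "London": [0.710551, -0.027302],
    "Paris": [0.177772, -0.151322],
}

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "city_embeddings.json")

    # Save the mapping once, right after the embeddings are created.
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(city_vectors, handle, indent=2)

    # Later runs load the stored vectors instead of creating new ones.
    with open(path, encoding="utf-8") as handle:
        restored = json.load(handle)

print(restored == city_vectors)  # True
```

In a real project, you would write the file to a permanent location instead of a temporary directory.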

Conclusion

In this article, we discussed the basics of entity embedding. We also discussed how to implement entity embedding in Python. To learn more about machine learning topics, you can read this article on fp growth algorithm numerical example. You might also like this article on linear regression vs logistic regression.

I hope you enjoyed reading this article. Stay tuned for more informative articles.

Happy Learning!
