Implement Label Encoding in Python and PySpark

To analyze categorical data, we often need to convert it into numerical values. Label encoding is one of the most straightforward preprocessing techniques for doing so. This article discusses different ways to perform label encoding in Python and PySpark.

How to Perform Label Encoding?

To perform label encoding, we assign a unique numeric value to each distinct value in a column of the dataset. For instance, consider the following dataset.

Name           City
John Smith     New York
Aditya Raj     Mumbai
Will Smith     London
Harsh Aryan    London
Joel Harrison  Mumbai
Bill Warner    Paris
Chris Kite     New York
Sam Altman     London
Joe            London

Input Data for label encoding

Now, if we have to perform label encoding on the City column in the above table, we will assign a unique numeric value to each city name. For example, we can assign the value 0 to New York, 1 to Mumbai, 2 to London, and 3 to Paris. After this, we will replace the City names with the numeric values as shown below.

Name           City
John Smith     0
Aditya Raj     1
Will Smith     2
Harsh Aryan    2
Joel Harrison  1
Bill Warner    3
Chris Kite     0
Sam Altman     2
Joe            2

Label Encoded Data

Thus, we have assigned numeric labels to each city name using label encoding. Now, let us discuss different ways to perform label encoding in Python.

Label Encoding in Python Using the Sklearn Module

The sklearn module provides the LabelEncoder class in its preprocessing submodule to perform label encoding in Python. To perform label encoding using the sklearn module, we will use the following steps.

  • First, we will create an untrained LabelEncoder object by calling the LabelEncoder() constructor.
  • Then, we will train the LabelEncoder object using the fit() method. The fit() method takes the list containing categorical values and learns all the unique values. After execution, it returns the same LabelEncoder object, which is now trained.
  • Next, we can perform label encoding by invoking the transform() method on the trained LabelEncoder object. The transform() method takes the input column of categorical values as its input argument and returns a numpy array containing a numeric label for each value in the input.

You can observe this in the following example.

from sklearn import preprocessing
cities=["New York", "Mumbai", "London", "London","Mumbai","Paris","New York","London", "London"]
print("The input list of categorical values is:")
print(cities)
untrained_encoder_object = preprocessing.LabelEncoder()
trained_encoder_object=untrained_encoder_object.fit(cities)
encoded_values=trained_encoder_object.transform(cities)
print("The label encoded values are:")
print(encoded_values)

Output:

The input list of categorical values is:
['New York', 'Mumbai', 'London', 'London', 'Mumbai', 'Paris', 'New York', 'London', 'London']
The label encoded values are:
[2 1 0 0 1 3 2 0 0]

In the output, you can observe that the categorical values have been assigned numerical labels in sorted (alphabetical) order. Hence, London is assigned the value 0, Mumbai is assigned the value 1, New York has the value 2, and Paris is assigned the value 3.
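
The same labels can be reproduced without sklearn, assuming the encoder sorts the unique values and uses each value's position as its label. The following sketch mimics that behavior in plain Python. It is not sklearn's actual code, and the names sorted_classes, label_of, and encoded are hypothetical.

```python
# Mimic sorted-order label assignment in plain Python.
cities = ["New York", "Mumbai", "London", "London", "Mumbai",
          "Paris", "New York", "London", "London"]

# Unique values in sorted order; the position in this list is the label.
sorted_classes = sorted(set(cities))
label_of = {city: index for index, city in enumerate(sorted_classes)}

encoded = [label_of[city] for city in cities]

print(sorted_classes)
print(encoded)
```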

Instead of using fit() and transform() methods separately, you can also use the fit_transform() method on the untrained LabelEncoder object to perform label encoding as shown below.

from sklearn import preprocessing
cities=["New York", "Mumbai", "London", "London","Mumbai","Paris","New York","London", "London"]
print("The input list of categorical values is:")
print(cities)
untrained_encoder_object = preprocessing.LabelEncoder()
encoded_values=untrained_encoder_object.fit_transform(cities)
print("The label encoded values are:")
print(encoded_values)

Output:

The input list of categorical values is:
['New York', 'Mumbai', 'London', 'London', 'Mumbai', 'Paris', 'New York', 'London', 'London']
The label encoded values are:
[2 1 0 0 1 3 2 0 0]

In this example, we have directly generated label encoding using the fit_transform() method.
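
As a quick check, the sketch below compares the two approaches on the same input. It assumes numpy is installed alongside sklearn, and the variable names two_step and one_step are chosen only for this illustration.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

cities = ["New York", "Mumbai", "London", "London", "Mumbai",
          "Paris", "New York", "London", "London"]

# Two-step approach: train first, then encode.
two_step = LabelEncoder().fit(cities).transform(cities)

# One-step approach: train and encode in a single call.
one_step = LabelEncoder().fit_transform(cities)

# Both approaches should give identical labels for the same input.
print(np.array_equal(two_step, one_step))
```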

Generate Categorical Values From Label Encoded Data

You can also extract the original categorical data from the label-encoded values. For this, you can use the inverse_transform() method. The inverse_transform() method, when invoked on a trained LabelEncoder object, takes a list of numeric values as its input. After execution, it returns the original categorical values corresponding to the numeric values. You can observe this in the following example.

from sklearn import preprocessing
cities=["New York", "Mumbai", "London", "London","Mumbai","Paris","New York","London", "London"]
print("The input list of categorical values is:")
print(cities)
untrained_encoder_object = preprocessing.LabelEncoder()
trained_encoder_object=untrained_encoder_object.fit(cities)
encoded_values=trained_encoder_object.transform(cities)
print("The label encoded values are:")
print(encoded_values)
new_codes=[1,1,1,1,2,0,1]
print("The input coded values are:")
print(new_codes)
original_values=trained_encoder_object.inverse_transform(new_codes)
print("The original values corresponding to the codes are:")
print(original_values)

Output:

The input list of categorical values is:
['New York', 'Mumbai', 'London', 'London', 'Mumbai', 'Paris', 'New York', 'London', 'London']
The label encoded values are:
[2 1 0 0 1 3 2 0 0]
The input coded values are:
[1, 1, 1, 1, 2, 0, 1]
The original values corresponding to the codes are:
['Mumbai' 'Mumbai' 'Mumbai' 'Mumbai' 'New York' 'London' 'Mumbai']

In this example, we first trained the LabelEncoder object. After this, when we pass numeric values to the inverse_transform() method, it returns a numpy array containing the original value for each code, based on the values we used while training the encoder.
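
The following sketch checks that encoding and then decoding gives back the original list. It also shows what to expect for a category the encoder never saw during training: in that case transform() raises a ValueError. The value "Tokyo" and the variable names are assumptions made for this example.

```python
from sklearn.preprocessing import LabelEncoder

cities = ["New York", "Mumbai", "London", "London", "Mumbai",
          "Paris", "New York", "London", "London"]

encoder = LabelEncoder().fit(cities)

# Encode, then decode: the result should match the original input.
round_trip = encoder.inverse_transform(encoder.transform(cities)).tolist()
print(round_trip == cities)

# A category that was not present during training cannot be encoded.
try:
    encoder.transform(["Tokyo"])
    unseen_rejected = False
except ValueError as error:
    unseen_rejected = True
    print("Could not encode unseen value:", error)
```

If new categories can appear later, one option is to refit the encoder on the full set of values before encoding.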

You can also find all the unique categorical values in the input data using the classes_ attribute of the trained LabelEncoder object as shown in the following example.

from sklearn import preprocessing
cities=["New York", "Mumbai", "London", "London","Mumbai","Paris","New York","London", "London"]
print("The input list of categorical values is:")
print(cities)
untrained_encoder_object = preprocessing.LabelEncoder()
trained_encoder_object=untrained_encoder_object.fit(cities)
encoded_values=trained_encoder_object.transform(cities)
print("The label encoded values are:")
print(encoded_values)
print("The unique categorical values in the input are:")
print(trained_encoder_object.classes_)

Output:

The input list of categorical values is:
['New York', 'Mumbai', 'London', 'London', 'Mumbai', 'Paris', 'New York', 'London', 'London']
The label encoded values are:
[2 1 0 0 1 3 2 0 0]
The unique categorical values in the input are:
['London' 'Mumbai' 'New York' 'Paris']
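
Because the position of each value in classes_ is its numeric label, the attribute can be turned into an explicit lookup table. The sketch below builds such a dictionary; the name label_map is hypothetical.

```python
from sklearn.preprocessing import LabelEncoder

cities = ["New York", "Mumbai", "London", "London", "Mumbai",
          "Paris", "New York", "London", "London"]

encoder = LabelEncoder().fit(cities)

# The index of each category in classes_ is the label assigned to it.
label_map = {category: index
             for index, category in enumerate(encoder.classes_.tolist())}

print(label_map)
```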

In practice, we usually apply label encoding to a column of a pandas dataframe. To perform label encoding on a dataframe column, we will first generate label-encoded values by passing the column as input to the fit_transform() method. Then, we will assign the encoded values to the column in the dataframe as shown below.

import pandas as pd
df=pd.read_csv("sample_file .csv")
print("The dataframe is:")
print(df)
from sklearn import preprocessing
untrained_encoder_object = preprocessing.LabelEncoder()
encoded_values=untrained_encoder_object.fit_transform(df["City"])
df["City"]=encoded_values
print("The output dataframe is:")
print(df)

Output:

The dataframe is:
            Name      City
0     John Smith  New York
1     Aditya Raj    Mumbai
2     Will Smith    London
3    Harsh Aryan    London
4  Joel Harrison    Mumbai
5    Bill Warner     Paris
6     Chris Kite  New York
7     Sam Altman    London
8            Joe    London
The output dataframe is:
            Name  City
0     John Smith     2
1     Aditya Raj     1
2     Will Smith     0
3    Harsh Aryan     0
4  Joel Harrison     1
5    Bill Warner     3
6     Chris Kite     2
7     Sam Altman     0
8            Joe     0

In the above example, the fit_transform() method returns a numpy array of numeric labels. When we assign the array to the dataframe column, the categorical values are replaced with numeric values.
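
Overwriting the column discards the original city names. A variation, sketched below, writes the labels to a separate column and keeps the trained encoder so the labels can be decoded later. The dataframe is built in memory here instead of being read from the CSV file, and the column name City_label is an assumption for this example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Same data as the sample file, built in memory for a self-contained example.
df = pd.DataFrame({
    "Name": ["John Smith", "Aditya Raj", "Will Smith", "Harsh Aryan",
             "Joel Harrison", "Bill Warner", "Chris Kite", "Sam Altman", "Joe"],
    "City": ["New York", "Mumbai", "London", "London", "Mumbai",
             "Paris", "New York", "London", "London"],
})

encoder = LabelEncoder()

# Store the labels in a new column so the original values are preserved.
df["City_label"] = encoder.fit_transform(df["City"])

# The trained encoder can map the labels back to city names.
decoded = encoder.inverse_transform(df["City_label"])

print(df)
```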

Implement Label Encoding in PySpark

PySpark does not have a function named after label encoding. However, we can use the StringIndexer class from the pyspark.ml.feature module to perform label encoding using the following steps.

  • First, we will create a StringIndexer object using the StringIndexer() constructor. It takes the name of the column that we want to encode in its inputCol parameter and the name of the new column that will hold the encoded values in its outputCol parameter. Here, we will pass “City” to the inputCol parameter and “City_label” to the outputCol parameter.
  • Next, we will train the StringIndexer object using the fit() method. The fit() method takes the dataframe as its input and returns a trained model (a StringIndexerModel object).
  • Next, we will use the transform() method to perform label encoding. For this, we will invoke the transform() method on the trained model and pass the dataframe as its input. The output dataframe contains the label-encoded values in the new column. By default, StringIndexer assigns labels in descending order of frequency, so the most frequent value gets 0.0, and the labels are stored as floating-point numbers.
  • Finally, we will drop the original City column using the drop() method and rename the City_label column to City using the withColumnRenamed() method.

After executing the above steps, we will get the output data frame with label-encoded values as shown below.

from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer
spark = SparkSession.builder \
      .master("local[1]") \
      .appName("label_encoding_example") \
      .getOrCreate() 
dfs=spark.read.csv("sample_file .csv",header=True)
print("The input dataframe is:")
dfs.show()
indexer = StringIndexer(inputCol="City", outputCol="City_label") 
indexed_df = indexer.fit(dfs).transform(dfs)
indexed_df=indexed_df.drop("City").withColumnRenamed("City_label","City")
print("The output dataframe is:")
indexed_df.show()
spark.stop()

Output:

The input dataframe is:
+-------------+--------+
|         Name|    City|
+-------------+--------+
|   John Smith|New York|
|   Aditya Raj|  Mumbai|
|   Will Smith|  London|
|  Harsh Aryan|  London|
|Joel Harrison|  Mumbai|
|  Bill Warner|   Paris|
|   Chris Kite|New York|
|   Sam Altman|  London|
|          Joe|  London|
+-------------+--------+

The output dataframe is:
+-------------+----+
|         Name|City|
+-------------+----+
|   John Smith| 2.0|
|   Aditya Raj| 1.0|
|   Will Smith| 0.0|
|  Harsh Aryan| 0.0|
|Joel Harrison| 1.0|
|  Bill Warner| 3.0|
|   Chris Kite| 2.0|
|   Sam Altman| 0.0|
|          Joe| 0.0|
+-------------+----+
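
The labels in this output happen to match the sklearn results, but they are derived differently: by default, StringIndexer orders values by descending frequency rather than alphabetically. The plain-Python sketch below mimics that ordering to show where each label comes from. It is not PySpark code, the names are hypothetical, and it assumes that values with equal frequency are ordered alphabetically, as documented for recent Spark versions.

```python
from collections import Counter

cities = ["New York", "Mumbai", "London", "London", "Mumbai",
          "Paris", "New York", "London", "London"]

# Count how often each city occurs.
counts = Counter(cities)

# Most frequent first; ties are broken alphabetically (assumed behavior).
ordered = sorted(counts, key=lambda city: (-counts[city], city))

# Labels are floats to mirror the values shown in the output dataframe.
frequency_labels = {city: float(index) for index, city in enumerate(ordered)}
encoded_by_frequency = [frequency_labels[city] for city in cities]

print(frequency_labels)
print(encoded_by_frequency)
```

Here London occurs four times and gets 0.0, Mumbai and New York occur twice each and get 1.0 and 2.0, and Paris occurs once and gets 3.0.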

Conclusion

In this article, we have discussed how to implement label encoding in Python using the sklearn module. We also discussed how to implement label encoding in PySpark. To learn more about machine learning topics, you can read this article on how to implement the fp-growth algorithm in Python. You might also like this article on the ECLAT algorithm numerical example.

I hope you enjoyed reading this article. Stay tuned for more informative articles. 

Happy learning!
