Implement Label Encoding in Python and PySpark
To analyze categorical data, we often need to convert it into numerical values. Label encoding is one of the most straightforward data preprocessing techniques for encoding categorical data into numeric values. This article discusses different ways to perform label encoding in Python and PySpark.
How to Perform Label Encoding?
To perform label encoding, we just need to assign a unique numeric value to each distinct value in the dataset. For instance, consider the following dataset.
Name | City |
John Smith | New York |
Aditya Raj | Mumbai |
Will Smith | London |
Harsh Aryan | London |
Joel Harrison | Mumbai |
Bill Warner | Paris |
Chris Kite | New York |
Sam Altman | London |
Joe | London |
Now, if we have to perform label encoding on the City column in the above table, we will assign a unique numeric value to each city name. For example, we can assign the value 0 to New York, 1 to Mumbai, 2 to London, and 3 to Paris. After this, we will replace the city names with the numeric values as shown below.
Name | City |
John Smith | 0 |
Aditya Raj | 1 |
Will Smith | 2 |
Harsh Aryan | 2 |
Joel Harrison | 1 |
Bill Warner | 3 |
Chris Kite | 0 |
Sam Altman | 2 |
Joe | 2 |
Thus, we have assigned numeric labels to each city name using label encoding. Now, let us discuss different ways to perform label encoding in Python.
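Before turning to library functions, the manual assignment above can be sketched in plain Python. This is only an illustration: the helper name encode_by_first_appearance is hypothetical, and it numbers the cities in the order in which they first appear, which is how the labels in the table above were chosen.

```python
# City column from the example table.
cities = ["New York", "Mumbai", "London", "London", "Mumbai",
          "Paris", "New York", "London", "London"]


def encode_by_first_appearance(values):
    """Hypothetical helper: give each new value the next unused integer."""
    mapping = {}
    for value in values:
        if value not in mapping:
            # len(mapping) is the next free label: 0, 1, 2, ...
            mapping[value] = len(mapping)
    return [mapping[value] for value in values], mapping


encoded_cities, city_to_label = encode_by_first_appearance(cities)
print("The mapping is:")
print(city_to_label)
print("The label encoded values are:")
print(encoded_cities)
```

Any one-to-one assignment of integers to the distinct values is a valid label encoding. The order of first appearance is just one convenient choice.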
Label Encoding in Python Using the Sklearn Module
The sklearn module provides us with the LabelEncoder()
function to perform label encoding in Python. To perform label encoding using the sklearn module in Python, we will use the following steps.
- First, we will create an untrained LabelEncoder object by calling LabelEncoder().
- Then, we will train the LabelEncoder object using the fit() method. The fit() method takes the list containing categorical values and learns all the unique values. After execution, it returns the trained LabelEncoder object.
- Next, we can perform label encoding by invoking the transform() method on the trained LabelEncoder object. The transform() method takes the column of categorical values as its input argument and returns a numpy array containing a numeric label for each value in the input.
You can observe this in the following example.
from sklearn import preprocessing
cities=["New York", "Mumbai", "London", "London","Mumbai","Paris","New York","London", "London"]
print("The input list of categorical values is:")
print(cities)
untrained_encoder_object = preprocessing.LabelEncoder()
trained_encoder_object=untrained_encoder_object.fit(cities)
encoded_values=trained_encoder_object.transform(cities)
print("The label encoded values are:")
print(encoded_values)
Output:
The input list of categorical values is:
['New York', 'Mumbai', 'London', 'London', 'Mumbai', 'Paris', 'New York', 'London', 'London']
The label encoded values are:
[2 1 0 0 1 3 2 0 0]
In the output, you can observe that the categorical values have been assigned numerical labels in sorted (alphabetical) order. Hence, London is assigned the value 0, Mumbai is assigned the value 1, New York is assigned the value 2, and Paris is assigned the value 3.
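The sorted ordering of the labels can be reproduced without sklearn. The following is a minimal sketch that mimics the observed behavior rather than sklearn's actual implementation. The names sorted_classes and class_to_label are made up for this illustration.

```python
cities = ["New York", "Mumbai", "London", "London", "Mumbai",
          "Paris", "New York", "London", "London"]

# Collect the distinct values and sort them; the position of each
# value in the sorted list becomes its label.
sorted_classes = sorted(set(cities))
class_to_label = {value: label for label, value in enumerate(sorted_classes)}

encoded = [class_to_label[city] for city in cities]
print("The sorted unique values are:")
print(sorted_classes)
print("The label encoded values are:")
print(encoded)
```

The resulting labels match the output of the LabelEncoder example above, which makes it easy to predict the label for any value once you know the full set of categories.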
Instead of using the fit() and transform() methods separately, you can also use the fit_transform() method on the untrained LabelEncoder object to perform label encoding as shown below.
from sklearn import preprocessing
cities=["New York", "Mumbai", "London", "London","Mumbai","Paris","New York","London", "London"]
print("The input list of categorical values is:")
print(cities)
untrained_encoder_object = preprocessing.LabelEncoder()
encoded_values=untrained_encoder_object.fit_transform(cities)
print("The label encoded values are:")
print(encoded_values)
Output:
The input list of categorical values is:
['New York', 'Mumbai', 'London', 'London', 'Mumbai', 'Paris', 'New York', 'London', 'London']
The label encoded values are:
[2 1 0 0 1 3 2 0 0]
In this example, we have directly generated the label-encoded values using the fit_transform() method.
Generate Categorical Values From Label Encoded Data
You can also extract the original categorical data from the label-encoded values. For this, you can use the inverse_transform() method. The inverse_transform() method, when invoked on a trained LabelEncoder object, takes a list of numeric values as its input. After execution, it returns the original categorical values corresponding to the numeric values. You can observe this in the following example.
from sklearn import preprocessing
cities=["New York", "Mumbai", "London", "London","Mumbai","Paris","New York","London", "London"]
print("The input list of categorical values is:")
print(cities)
untrained_encoder_object = preprocessing.LabelEncoder()
trained_encoder_object=untrained_encoder_object.fit(cities)
encoded_values=trained_encoder_object.transform(cities)
print("The label encoded values are:")
print(encoded_values)
new_codes=[1,1,1,1,2,0,1]
print("The input coded values are:")
print(new_codes)
original_values=trained_encoder_object.inverse_transform(new_codes)
print("The original values corresponding to the codes are:")
print(original_values)
Output:
The input list of categorical values is:
['New York', 'Mumbai', 'London', 'London', 'Mumbai', 'Paris', 'New York', 'London', 'London']
The label encoded values are:
[2 1 0 0 1 3 2 0 0]
The input coded values are:
[1, 1, 1, 1, 2, 0, 1]
The original values corresponding to the codes are:
['Mumbai' 'Mumbai' 'Mumbai' 'Mumbai' 'New York' 'London' 'Mumbai']
In this example, we first trained the LabelEncoder object. After this, when we pass numeric values to the inverse_transform() method, it returns a numpy array containing the original values that the encoder learned during training.
You can also find all the unique categorical values in the input data using the classes_ attribute of the trained LabelEncoder object as shown in the following example.
from sklearn import preprocessing
cities=["New York", "Mumbai", "London", "London","Mumbai","Paris","New York","London", "London"]
print("The input list of categorical values is:")
print(cities)
untrained_encoder_object = preprocessing.LabelEncoder()
trained_encoder_object=untrained_encoder_object.fit(cities)
encoded_values=trained_encoder_object.transform(cities)
print("The label encoded values are:")
print(encoded_values)
print("The unique categorical values in the input are:")
print(trained_encoder_object.classes_)
Output:
The input list of categorical values is:
['New York', 'Mumbai', 'London', 'London', 'Mumbai', 'Paris', 'New York', 'London', 'London']
The label encoded values are:
[2 1 0 0 1 3 2 0 0]
The unique categorical values in the input are:
['London' 'Mumbai' 'New York' 'Paris']
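In the outputs above, the position of each value in classes_ is the same as its numeric label. Under that observation, decoding amounts to indexing into the list of classes. The sketch below uses a plain Python list as a stand-in for classes_, so the variable names are illustrative and not part of sklearn.

```python
# Stand-in for trained_encoder_object.classes_ from the example above.
classes = ["London", "Mumbai", "New York", "Paris"]

# Decoding: each code is an index into the list of classes.
new_codes = [1, 1, 1, 1, 2, 0, 1]
decoded = [classes[code] for code in new_codes]
print("The decoded values are:")
print(decoded)

# Encoding: the reverse lookup from value to position.
label_of = {value: position for position, value in enumerate(classes)}
print("The label for Paris is:")
print(label_of["Paris"])
```

The decoded values agree with the inverse_transform() output shown earlier for the same codes.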
Normally, we use label encoding on a column of a dataframe in Python. To perform label encoding on a dataframe column, we will first generate label-encoded values by passing the column as input to the fit_transform() method. Then, we will assign the encoded values to the column in the dataframe as shown below.
import pandas as pd
df=pd.read_csv("sample_file .csv")
print("The dataframe is:")
print(df)
from sklearn import preprocessing
untrained_encoder_object = preprocessing.LabelEncoder()
encoded_values=untrained_encoder_object.fit_transform(df["City"])
df["City"]=encoded_values
print("The output dataframe is:")
print(df)
Output:
The dataframe is:
Name City
0 John Smith New York
1 Aditya Raj Mumbai
2 Will Smith London
3 Harsh Aryan London
4 Joel Harrison Mumbai
5 Bill Warner Paris
6 Chris Kite New York
7 Sam Altman London
8 Joe London
The output dataframe is:
Name City
0 John Smith 2
1 Aditya Raj 1
2 Will Smith 0
3 Harsh Aryan 0
4 Joel Harrison 1
5 Bill Warner 3
6 Chris Kite 2
7 Sam Altman 0
8 Joe 0
In the above example, the fit_transform() method returns a numpy array of numeric labels. When we assign the array to the dataframe column, the categorical values are replaced with numeric values.
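Overwriting the column discards the original city names. If you want to keep them, one option is to write the labels to a new column and hold on to the trained encoder. The following is a small sketch under a few assumptions: the dataframe is built inline instead of being read from the CSV file, it contains only four of the rows, and the column name City_label is chosen just for this illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# A small inline dataframe standing in for the CSV file used above.
df = pd.DataFrame({
    "Name": ["John Smith", "Aditya Raj", "Will Smith", "Bill Warner"],
    "City": ["New York", "Mumbai", "London", "Paris"],
})

encoder = LabelEncoder()
# Store the labels in a new column so that the City column is preserved.
df["City_label"] = encoder.fit_transform(df["City"])

# The trained encoder can map the labels back to the city names.
decoded = encoder.inverse_transform(df["City_label"])

print("The output dataframe is:")
print(df)
print("The decoded values are:")
print(list(decoded))
```

Keeping the encoder object around also lets you apply the same mapping to new data with the transform() method, as long as the new data contains only values that the encoder has already seen.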
Implement Label Encoding in PySpark
PySpark does not have a function that is named after label encoding. However, we can use the StringIndexer class defined in the pyspark.ml.feature module to perform label encoding using the following steps.
- First, we will create a StringIndexer object by calling StringIndexer(). It takes the name of the column that we want to encode as its input argument for the inputCol parameter. It also takes the name of the new column to be created using the encoded values in its outputCol parameter. Here, we will pass "City" as input to the inputCol parameter and "City_label" as input to the outputCol parameter.
- Next, we will train the StringIndexer object using the fit() method. The fit() method takes the dataframe as its input and returns a trained StringIndexerModel object.
- Next, we will use the transform() method to perform label encoding. For this, we will invoke the transform() method on the trained StringIndexerModel object and pass the dataframe as its input. After this, we will get label-encoded values in our new column.
- Finally, we will drop the original City column using the drop() method and rename the City_label column to City using the withColumnRenamed() method.
After executing the above steps, we will get the output dataframe with label-encoded values as shown below. Note that StringIndexer stores the labels as double values, so they appear as 0.0, 1.0, and so on.
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer
spark = SparkSession.builder \
.master("local[1]") \
.appName("label_encoding_example") \
.getOrCreate()
dfs=spark.read.csv("sample_file .csv",header=True)
print("The input dataframe is:")
dfs.show()
indexer = StringIndexer(inputCol="City", outputCol="City_label")
indexed_df = indexer.fit(dfs).transform(dfs)
indexed_df=indexed_df.drop("City").withColumnRenamed("City_label","City")
print("The output dataframe is:")
indexed_df.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------------+--------+
| Name| City|
+-------------+--------+
| John Smith|New York|
| Aditya Raj| Mumbai|
| Will Smith| London|
| Harsh Aryan| London|
|Joel Harrison| Mumbai|
| Bill Warner| Paris|
| Chris Kite|New York|
| Sam Altman| London|
| Joe| London|
+-------------+--------+
The output dataframe is:
+-------------+----+
| Name|City|
+-------------+----+
| John Smith| 2.0|
| Aditya Raj| 1.0|
| Will Smith| 0.0|
| Harsh Aryan| 0.0|
|Joel Harrison| 1.0|
| Bill Warner| 3.0|
| Chris Kite| 2.0|
| Sam Altman| 0.0|
| Joe| 0.0|
+-------------+----+
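Unlike LabelEncoder, StringIndexer does not sort the values alphabetically by default. To my understanding, its default ordering (stringOrderType="frequencyDesc") gives the most frequent value the index 0, and in Spark 3.0 and later it breaks ties alphabetically. The following plain-Python sketch mimics that rule for the City column. It is an illustration of the ordering only and not Spark's implementation, and the variable names are made up for this example.

```python
from collections import Counter

cities = ["New York", "Mumbai", "London", "London", "Mumbai",
          "Paris", "New York", "London", "London"]

# Count how often each city occurs.
counts = Counter(cities)

# Sort by descending frequency, then alphabetically to break ties.
ordered = sorted(counts, key=lambda value: (-counts[value], value))

# Use float indices to mirror the double values in the output above.
value_to_index = {value: float(index) for index, value in enumerate(ordered)}
encoded = [value_to_index[city] for city in cities]

print("The values ordered by frequency are:")
print(ordered)
print("The label encoded values are:")
print(encoded)
```

For this dataset, the frequency-based order happens to coincide with the alphabetical order, which is why the PySpark labels match the sklearn labels. On other datasets, the two approaches can assign different labels to the same value.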
Conclusion
In this article, we have discussed how to implement label encoding in Python using the sklearn module. We also discussed how to implement label encoding in PySpark. To learn more about machine learning topics, you can read this article on how to implement the fp-growth algorithm in Python. You might also like this article on the ECLAT algorithm numerical example.
I hope you enjoyed reading this article. Stay tuned for more informative articles.
Happy learning!