One Hot Encoding in Python
We use different categorical data encoding techniques during data analysis and machine learning tasks. In this article, we will discuss the basics of one hot encoding. We will also discuss implementing one hot encoding in Python.
What is One Hot Encoding?
One hot encoding is an encoding technique in which we represent categorical values with numeric arrays of 0s and 1s. In one hot encoding, we use the following steps to encode categorical variables.
- First, we find the number of unique values for a given categorical variable. The length of the array containing one-hot encoded values is equal to the total number of unique values for a given categorical variable.
- Next, we assign an index in the array to each unique value.
- For the one-hot array to represent a categorical value, we set the value in the array to 1 at the index associated with the categorical value. The rest of the values in the array remain 0.
- We create a one-hot encoded array for each value in the categorical variable and assign them to the values.
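The steps above can be sketched in plain Python. This is only a minimal illustration: the function name one_hot_encode and the sample colour values are made up for this sketch, and sorting the unique values is just one convenient way to give each value a stable index.

```python
def one_hot_encode(values):
    """Return (categories, encoded) for a list of categorical values."""
    # Step 1: find the unique values; sorting gives each one a stable position.
    categories = sorted(set(values))
    # Step 2: assign an index in the array to each unique value.
    index_of = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        # Step 3: start from all zeros and set a single 1 at the value's index.
        vector = [0] * len(categories)
        vector[index_of[value]] = 1
        # Step 4: keep one encoded array per input value.
        encoded.append(vector)
    return categories, encoded


# Hypothetical sample data, used only for illustration.
categories, encoded = one_hot_encode(["red", "green", "red", "blue"])
print(categories)
print(encoded)
```

Each encoded array has as many elements as there are unique values, and exactly one of them is 1.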
One Hot Encoding Numerical Example
To understand how the above algorithm works, let us discuss a numerical example of one hot encoding. For this, we will use the following dataset.
Name | City |
---|---|
John Smith | New York |
Aditya Raj | Mumbai |
Will Smith | London |
Harsh Aryan | London |
Joel Harrison | Mumbai |
Bill Warner | Paris |
Chris Kite | New York |
Sam Altman | London |
Joe | London |
In the above table, suppose that we want to perform one-hot encoding on the City column. For this, we will use the following steps.
- First, we will find the unique values in the given column. As there are four unique values (London, Mumbai, New York, and Paris), the value will be 4.
- Next, we will create an array of length 4 with all 0s for each unique categorical value. So, the one-hot encoded arrays right now are as follows.
- London=[0,0,0,0]
- Mumbai=[0,0,0,0]
- New York=[0,0,0,0]
- Paris=[0,0,0,0]
- After this, we will decide on the index associated with each categorical value in the array. Let us assign index 0 to London, 1 to Mumbai, 2 to New York, and 3 to Paris.
- Next, we will set the element at the associated index of each categorical value to 1 in the one-hot encoded array. Hence, the one-hot encoded arrays will look as follows.
- London=[1, 0, 0, 0]
- Mumbai=[0, 1, 0, 0]
- New York=[0, 0, 1, 0]
- Paris=[0, 0, 0, 1]
The above one-hot encoded arrays represent the associated categorical value. For example, the array [1, 0, 0, 0] represents the value London, [0, 1, 0, 0] represents the value Mumbai, and so on.
In most cases, these one-hot encoded arrays are split into different columns in the dataset. Here, each column represents a unique categorical value as shown below.
Name | City | City_London | City_Mumbai | City_New York | City_Paris |
---|---|---|---|---|---|
John Smith | New York | 0 | 0 | 1 | 0 |
Aditya Raj | Mumbai | 0 | 1 | 0 | 0 |
Will Smith | London | 1 | 0 | 0 | 0 |
Harsh Aryan | London | 1 | 0 | 0 | 0 |
Joel Harrison | Mumbai | 0 | 1 | 0 | 0 |
Bill Warner | Paris | 0 | 0 | 0 | 1 |
Chris Kite | New York | 0 | 0 | 1 | 0 |
Sam Altman | London | 1 | 0 | 0 | 0 |
Joe | London | 1 | 0 | 0 | 0 |
In the above table, you can observe that we have split the one-hot encoded arrays into columns. In the new columns, the value is set to 1 if the row represents a particular value. Otherwise, it is set to 0. For instance, the City_London column for the rows in which City is London is set to 1 and all the other columns are 0.
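As a quick way to reproduce this column layout, the following sketch uses the get_dummies() function from pandas instead of the sklearn encoder discussed later in this article. It uses only the first few rows of the table above, and the variable names are chosen just for this illustration.

```python
import pandas as pd

# A few rows taken from the example table above.
df = pd.DataFrame({
    "Name": ["John Smith", "Aditya Raj", "Will Smith", "Bill Warner"],
    "City": ["New York", "Mumbai", "London", "Paris"],
})

# Create one indicator column per unique city, named City_<value>.
indicators = pd.get_dummies(df["City"], prefix="City", dtype=int)

# Place the indicator columns next to the original columns.
wide = pd.concat([df, indicators], axis=1)
print(wide)
```

Every row has exactly one indicator column set to 1, which is the column of its own city.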
One Hot Encoding in Python Using The sklearn Module
Now that we have discussed how to perform one hot encoding, we will implement it in Python. For this, we will use the OneHotEncoder() function defined in the sklearn.preprocessing module.
The OneHotEncoder() Function
The OneHotEncoder() function has the following syntax.
OneHotEncoder(*, categories='auto', drop=None, sparse='deprecated', sparse_output=True, dtype=<class 'numpy.float64'>, handle_unknown='error', min_frequency=None, max_categories=None)
Here,
- The categories parameter is used to specify the unique values in the input data. By default, it is set to 'auto'. Hence, the encoder finds all the unique values itself. If you want to specify the unique values manually, you can pass a list containing one list of unique values for each feature to the categories parameter. The passed values should not mix strings and numeric values within a single feature and should be sorted in the case of numeric values.
- We use the drop parameter to reduce the length of one-hot encoded vectors. We can represent one unique value from the input data with a vector containing all 0s. By this, we can reduce the size of the one-hot encoded vector by one. By default, the drop parameter is set to None. Hence, all the values are retained.
  - You can set the drop parameter to 'first' to drop the first categorical value. The first categorical value will then be represented by a vector containing all zeros. If only one category is present, the feature will be dropped entirely.
  - If we set the drop parameter to 'if_binary', the encoder drops the first value in the case of binary variables. Features with 1 or more than 2 categories are left intact.
- The sparse parameter has been deprecated and will be removed in a future version of sklearn. The sparse_output parameter is the new name for the sparse parameter. When we set it to True, the one-hot encoded values are generated in the form of a sparse matrix. Otherwise, we get an array.
- The dtype parameter is used to specify the desired data type of the output. By default, it is set to float64. You can change it to any number type such as int32, int64, etc.
- The min_frequency parameter is used to specify the minimum frequency below which a category will be considered infrequent. You can pass an integer as the absolute count, or a floating point number as the fraction of the total number of samples, to decide the infrequent values.
- The max_categories parameter specifies an upper limit to the number of output features for each input feature when considering infrequent categories. If there are infrequent categories, the limit includes the category representing the infrequent categories along with the frequent categories. If we set the max_categories parameter to None, there is no limit to the number of output features.
- The handle_unknown parameter is used to handle unknown values while generating one-hot encoding using the transform() method.
  - By default, the handle_unknown parameter is set to 'error'. Hence, if the data given to the transform() method contains new values compared to the data given to the fit() method, the program runs into an error.
  - You can set the handle_unknown parameter to 'ignore'. After this, if an unknown value is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros. In the inverse transform, an unknown category will be denoted as None.
  - You can also map new values to the existing infrequent category. For this, you can set the handle_unknown parameter to 'infrequent_if_exist'. After this, if an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will map to the infrequent category if it exists. The infrequent category will be mapped to the last position in the encoding. During inverse transform, an unknown category will be mapped to the category denoted 'infrequent' if it exists.
  - If the 'infrequent' category does not exist, the transform() method and the inverse_transform() method will handle an unknown category as with handle_unknown='ignore'. Infrequent categories exist based on min_frequency and max_categories.
After execution, the OneHotEncoder() function returns an untrained one-hot encoder created using the sklearn module in Python. We can then train the encoder using the fit() method. If we want to encode values from a single attribute, the fit() method takes a two-dimensional numpy array with a single column, which we can obtain by calling reshape(-1, 1) on a one-dimensional array. After execution, it returns a trained OneHotEncoder object.
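The shape requirement of the fit() method can be checked with a short sketch. The variable names are chosen only for this illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

cities = np.array(["New York", "Mumbai", "London", "Paris"])
print(cities.shape)  # a one-dimensional array

# reshape(-1, 1) turns the values into a single column with one row per value.
column = cities.reshape(-1, 1)
print(column.shape)

# fit() learns the unique values of the column and returns the trained encoder.
encoder = OneHotEncoder().fit(column)
print(encoder.categories_)
```

The categories_ attribute holds one array of learned unique values per input column.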
We can use the transform() method to generate one hot encoded values using the trained OneHotEncoder object. The transform() method takes the array containing the values that we need to encode and, by default, returns a sparse matrix. You can convert the sparse matrix to an array of one-hot encoded values using the toarray() method as shown below.
from sklearn.preprocessing import OneHotEncoder
import numpy as np
untrained_encoder = OneHotEncoder(handle_unknown='ignore')
cities=np.array(["New York", "Mumbai", "London", "Paris"]).reshape(-1, 1)
print("The training set is:")
print(cities)
trained_encoder=untrained_encoder.fit(cities)
input_values=np.array(["New York", "Mumbai", "London", "London","Mumbai","Paris","New York","London", "London"]).reshape(-1, 1)
output=trained_encoder.transform(input_values).toarray()
print("The input values are:")
print(input_values)
print("The output is:")
print(output)
Output:
The training set is:
[['New York']
['Mumbai']
['London']
['Paris']]
The input values are:
[['New York']
['Mumbai']
['London']
['London']
['Mumbai']
['Paris']
['New York']
['London']
['London']]
The output is:
[[0. 0. 1. 0.]
[0. 1. 0. 0.]
[1. 0. 0. 0.]
[1. 0. 0. 0.]
[0. 1. 0. 0.]
[0. 0. 0. 1.]
[0. 0. 1. 0.]
[1. 0. 0. 0.]
[1. 0. 0. 0.]]
In the above example, we have trained a one hot encoder using four values. Then, we passed an array of values to generate the one hot encoded arrays. In the output, you can observe that the arrays are in the same format we discussed in the numerical example.
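The encoding can also be reversed. The following sketch, which assumes scipy is available (it is installed together with sklearn), checks that transform() returns a sparse result by default and then recovers the original values with the inverse_transform() method.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

cities = np.array(["New York", "Mumbai", "London", "Paris"]).reshape(-1, 1)
encoder = OneHotEncoder().fit(cities)

# By default, transform() returns a sparse result rather than a numpy array.
encoded = encoder.transform(cities)
print(sparse.issparse(encoded))

# toarray() converts the sparse result to a regular numpy array.
dense = encoded.toarray()

# inverse_transform() maps each one-hot encoded row back to its category.
decoded = encoder.inverse_transform(dense)
print(decoded.ravel())
```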
You can also perform one hot encoding on multiple features using a single OneHotEncoder object. For this, you can simply pass the 2-D list containing all the rows and columns as input to the fit() method as shown below.
from sklearn.preprocessing import OneHotEncoder
import numpy as np
untrained_encoder = OneHotEncoder(handle_unknown='ignore')
cities=[[0,"Mumbai"],[1,"London"],[2,"Paris"],[3,"New York"]]
print("The training set is:")
print(cities)
trained_encoder=untrained_encoder.fit(cities)
print(trained_encoder.categories_)
input_values=[[1,"Mumbai"],[2,"New York"]]
output=trained_encoder.transform(input_values).toarray()
print("The input values are:")
print(input_values)
print("The output is:")
print(output)
Output:
The training set is:
[[0, 'Mumbai'], [1, 'London'], [2, 'Paris'], [3, 'New York']]
[array([0, 1, 2, 3], dtype=object), array(['London', 'Mumbai', 'New York', 'Paris'], dtype=object)]
The input values are:
[[1, 'Mumbai'], [2, 'New York']]
The output is:
[[0. 1. 0. 0. 0. 1. 0. 0.]
[0. 0. 1. 0. 0. 0. 1. 0.]]
In this example, we passed a two-dimensional list to the fit() method to perform one hot encoding in Python. Here, the first element of each internal list is considered to belong to one feature and the second element of each internal list belongs to another feature.
The number of elements in the output array depends on the number of unique values in both features. As there are four unique values in the first feature and four unique values in the second feature, the one hot encoded arrays contain eight elements.
One Hot Encoding on a Pandas DataFrame in Python
In the previous examples, we discussed how to perform one hot encoding on 1-d and 2-d arrays containing standalone values. This is of limited use to us, as we handle most of the data using pandas dataframes while creating machine learning applications. Hence, let us discuss how to perform one hot encoding on a pandas dataframe in Python.
The process to train the one hot encoder is the same as discussed in the previous examples. We can extract a column from the dataframe and train the one hot encoder using the fit() method. After creating the encoder, we need to create a column transformer to generate one-hot encoded columns in the output dataframe. For this, we will use the make_column_transformer() function.
The make_column_transformer() Function
The make_column_transformer() function has the following syntax.
make_column_transformer(*transformers, remainder='drop', sparse_threshold=0.3, n_jobs=None, verbose=False, verbose_feature_names_out=True)
Here,
- The transformers parameter takes one or more tuples, each containing a OneHotEncoder object and a list of column names on which we want to perform one hot encoding.
- By default, the remainder parameter is set to 'drop'. Hence, only the columns specified in the transformers parameter are encoded and produced in the output. The columns that we don't specify in the transformers parameter are dropped from the output. To avoid this, we can set the remainder parameter to 'passthrough'. After this, all remaining columns that are not specified in the transformers parameter will be automatically passed through and included in the output. This subset of columns is concatenated with the output of the encoders.
- You can also pass an untrained OneHotEncoder to the remainder parameter. By setting the remainder parameter to be an encoder, the columns that are not specified in the transformers parameter are encoded using the remainder estimator. Here, the encoder that we pass to the remainder parameter must support the fit() and transform() methods.
- The sparse_threshold parameter decides whether the output is sparse or dense. If the transformed output consists of a mix of sparse and dense data, it will be stacked as a sparse matrix if the density is lower than this value. We can set the sparse_threshold parameter to 0 to always return dense data. When the transformed output consists of all sparse or all dense data, the stacked result will be sparse or dense, respectively, and the sparse_threshold parameter is ignored.
- The n_jobs parameter is used to run the transformers in parallel. By default, it is set to None. It means that only one job will run. You can set it to -1 to use all the processors in your machine.
- The verbose parameter is used to print the time elapsed while fitting each encoder. By default, it is set to False.
- The verbose_feature_names_out parameter is used to prefix all feature names with the name of the transformer that generated that feature in the one-hot encoded dataframe. By default, it is set to True. If we set it to False, the column transformer will not prefix any feature names and the program will run into an error if the feature names are not unique.
To perform one-hot encoding on the columns of a pandas dataframe, we can create a transformer using the make_column_transformer() function. Then, we will invoke the fit() method on the transformer and pass the input dataframe to the fit() method. After this, we can use the transform() method to generate the array containing the one hot encoded values.
To convert the array into a dataframe, we will use the DataFrame() function defined in the pandas module. We will also use the get_feature_names_out() method on the trained transformer to get the column names for the one-hot encoded data. You can observe this in the following example.
import pandas as pd
import numpy as np
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder
df=pd.read_csv("sample_file .csv")
print("The dataframe is:")
print(df)
values=np.array(df["City"]).reshape(-1,1)
untrained_encoder_object = OneHotEncoder()
trained_encoder=untrained_encoder_object.fit(values)
untrained_transformer = make_column_transformer((trained_encoder, ["City"]), remainder='passthrough')
trained_transformer=untrained_transformer.fit(df)
transformed_data=trained_transformer.transform(df)
output=pd.DataFrame(transformed_data, columns=trained_transformer.get_feature_names_out())
print("The output dataframe is:")
print(output)
Output:
The dataframe is:
Name City
0 John Smith New York
1 Aditya Raj Mumbai
2 Will Smith London
3 Harsh Aryan London
4 Joel Harrison Mumbai
5 Bill Warner Paris
6 Chris Kite New York
7 Sam Altman London
8 Joe London
The output dataframe is:
onehotencoder__City_London onehotencoder__City_Mumbai \
0 0.0 0.0
1 0.0 1.0
2 1.0 0.0
3 1.0 0.0
4 0.0 1.0
5 0.0 0.0
6 0.0 0.0
7 1.0 0.0
8 1.0 0.0
onehotencoder__City_New York onehotencoder__City_Paris remainder__Name
0 1.0 0.0 John Smith
1 0.0 0.0 Aditya Raj
2 0.0 0.0 Will Smith
3 0.0 0.0 Harsh Aryan
4 0.0 0.0 Joel Harrison
5 0.0 1.0 Bill Warner
6 1.0 0.0 Chris Kite
7 0.0 0.0 Sam Altman
8 0.0 0.0 Joe
In the above example, we have encoded the City column of the input dataframe using one-hot encoding in Python. For this, we first trained the OneHotEncoder using the column data and then we transformed the input dataframe using the make_column_transformer() function. Here, we passed a tuple containing the trained OneHotEncoder object and a list of column names as the first input argument to the make_column_transformer() function. In the above output, you can observe that the column names in the output dataframe are cluttered, as they all contain the transformer names as prefixes.
You can set the verbose_feature_names_out parameter to False to generate clean output column names as shown below.
import pandas as pd
import numpy as np
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder
df=pd.read_csv("sample_file .csv")
print("The dataframe is:")
print(df)
values=np.array(df["City"]).reshape(-1,1)
untrained_encoder_object = OneHotEncoder()
trained_encoder=untrained_encoder_object.fit(values)
untrained_transformer = make_column_transformer((trained_encoder, ["City"]), remainder='passthrough',verbose_feature_names_out=False)
trained_transformer=untrained_transformer.fit(df)
transformed_data=trained_transformer.transform(df)
output=pd.DataFrame(transformed_data, columns=trained_transformer.get_feature_names_out())
print("The output dataframe is:")
print(output)
Output:
The dataframe is:
Name City
0 John Smith New York
1 Aditya Raj Mumbai
2 Will Smith London
3 Harsh Aryan London
4 Joel Harrison Mumbai
5 Bill Warner Paris
6 Chris Kite New York
7 Sam Altman London
8 Joe London
The output dataframe is:
City_London City_Mumbai City_New York City_Paris Name
0 0.0 0.0 1.0 0.0 John Smith
1 0.0 1.0 0.0 0.0 Aditya Raj
2 1.0 0.0 0.0 0.0 Will Smith
3 1.0 0.0 0.0 0.0 Harsh Aryan
4 0.0 1.0 0.0 0.0 Joel Harrison
5 0.0 0.0 0.0 1.0 Bill Warner
6 0.0 0.0 1.0 0.0 Chris Kite
7 1.0 0.0 0.0 0.0 Sam Altman
8 1.0 0.0 0.0 0.0 Joe
In this example, we have set the verbose_feature_names_out parameter to False in the make_column_transformer() function. Hence, we get the output dataframe with the desired column names.
One Hot Encoding With Multiple Columns of the Pandas Dataframe
To perform one hot encoding on multiple columns in the pandas dataframe at once, we will first obtain values from all the columns and train the one hot encoder. Then, we will pass multiple column names in the list of column names passed to the transformers parameter in the make_column_transformer() function. After this, we can train the column transformer and perform one hot encoding on multiple columns in the pandas dataframe as shown below.
import pandas as pd
import numpy as np
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder
df=pd.read_csv("sample_file .csv")
print("The dataframe is:")
df["Grades"]=["A","C", "B", "A", "A","B","B","C","D"]
print(df)
values=df[["City", "Grades"]]
untrained_encoder_object = OneHotEncoder()
trained_encoder=untrained_encoder_object.fit(values)
untrained_transformer = make_column_transformer((trained_encoder, ["City", "Grades"]), remainder='passthrough',verbose_feature_names_out=False)
trained_transformer=untrained_transformer.fit(df)
transformed_data=trained_transformer.transform(df)
output=pd.DataFrame(transformed_data, columns=trained_transformer.get_feature_names_out())
print("The output dataframe is:")
print(output)
Output:
The dataframe is:
Name City Grades
0 John Smith New York A
1 Aditya Raj Mumbai C
2 Will Smith London B
3 Harsh Aryan London A
4 Joel Harrison Mumbai A
5 Bill Warner Paris B
6 Chris Kite New York B
7 Sam Altman London C
8 Joe London D
The output dataframe is:
City_London City_Mumbai City_New York City_Paris Grades_A Grades_B Grades_C \
0 0.0 0.0 1.0 0.0 1.0 0.0 0.0
1 0.0 1.0 0.0 0.0 0.0 0.0 1.0
2 1.0 0.0 0.0 0.0 0.0 1.0 0.0
3 1.0 0.0 0.0 0.0 1.0 0.0 0.0
4 0.0 1.0 0.0 0.0 1.0 0.0 0.0
5 0.0 0.0 0.0 1.0 0.0 1.0 0.0
6 0.0 0.0 1.0 0.0 0.0 1.0 0.0
7 1.0 0.0 0.0 0.0 0.0 0.0 1.0
8 1.0 0.0 0.0 0.0 0.0 0.0 0.0
Grades_D Name
0 0.0 John Smith
1 0.0 Aditya Raj
2 0.0 Will Smith
3 0.0 Harsh Aryan
4 0.0 Joel Harrison
5 0.0 Bill Warner
6 0.0 Chris Kite
7 0.0 Sam Altman
8 1.0 Joe
In the above output, you can observe that the City and Grades columns are encoded using one-hot encoding in Python in a single execution.
Conclusion
In this article, we discussed one hot encoding in Python. We also discussed different implementations of the one hot encoding process using the sklearn module. To learn more about encoding techniques, you can read this article on label encoding in Python. You might also like this article on k-means clustering in Python.
I hope you enjoyed reading this article. Stay tuned for more informative articles.
Happy Learning!