One Hot Encoding in Python

We use different categorical data encoding techniques during data analysis and machine learning tasks. In this article, we will discuss the basics of one hot encoding. We will also discuss how to implement one hot encoding in Python.

What is One Hot Encoding?

One hot encoding is an encoding technique in which we represent categorical values with numeric arrays of 0s and 1s. In one hot encoding, we use the following steps to encode categorical variables.

  • First, we find the number of unique values for a given categorical variable. The length of the array containing one-hot encoded values is equal to the total number of unique values for a given categorical variable.
  • Next, we assign an index in the array to each unique value.
  • For the one-hot array to represent a categorical value, we set the value in the array to 1 at the index associated with the categorical value. The rest of the values in the array remain 0.
  • We create a one-hot encoded array for each value in the categorical variable and assign them to the values.
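The steps above can be sketched in plain Python. The one_hot_encode() helper below is hypothetical, written only to illustrate the algorithm, not part of any library.

```python
def one_hot_encode(values):
    # Step 1: find the unique values; their count fixes the array length.
    categories = sorted(set(values))
    # Step 2: assign an index in the array to each unique value.
    index = {category: i for i, category in enumerate(categories)}
    # Steps 3-4: for each value, build an array of 0s with a single 1
    # at the index associated with that value.
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        encoded.append(row)
    return categories, encoded

categories, encoded = one_hot_encode(["London", "Mumbai", "London", "Paris"])
print(categories)  # ['London', 'Mumbai', 'Paris']
print(encoded)     # [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]
```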

One Hot Encoding Numerical Example

To understand how the above algorithm works, let us discuss a numerical example of one hot encoding. For this, we will use the following dataset.

Name            City
John Smith      New York
Aditya Raj      Mumbai
Will Smith      London
Harsh Aryan     London
Joel Harrison   Mumbai
Bill Warner     Paris
Chris Kite      New York
Sam Altman      London
Joe             London

Dataset for one hot encoding

In the above table, suppose that we want to perform one-hot encoding on the City column. For this, we will use the following steps.

  • First, we will find the unique values in the given column. As there are four unique values, namely London, Mumbai, New York, and Paris, the length of each one-hot encoded array will be 4.
  • Next, we will create an array of length 4 with all 0s for each unique categorical value. So, the one-hot encoded arrays right now are as follows.
    • London=[0,0,0,0]
    • Mumbai=[0,0,0,0]
    • New York=[0,0,0,0]
    • Paris=[0,0,0,0]
  • After this, we will decide on the index associated with each categorical value in the array. Let us assign index 0 to London, 1 to Mumbai, 2 to New York, and 3 to Paris.
  • Next, we will set the element at the associated index of each categorical value to 1 in the one-hot encoded array. Hence, the one-hot encoded arrays will look as follows.
    • London=[1, 0, 0, 0]
    • Mumbai=[0, 1, 0, 0]
    • New York=[0, 0, 1, 0]
    • Paris=[0, 0, 0, 1]

The above one-hot encoded arrays represent the associated categorical value. For example, the array [1, 0, 0, 0] represents the value London, [0, 1, 0, 0] represents the value Mumbai, and so on.

In most cases, these one-hot encoded arrays are split into different columns in the dataset. Here, each column represents a unique categorical value as shown below.

Name            City       City_London   City_Mumbai   City_New York   City_Paris
John Smith      New York   0             0             1               0
Aditya Raj      Mumbai     0             1             0               0
Will Smith      London     1             0             0               0
Harsh Aryan     London     1             0             0               0
Joel Harrison   Mumbai     0             1             0               0
Bill Warner     Paris      0             0             0               1
Chris Kite      New York   0             0             1               0
Sam Altman      London     1             0             0               0
Joe             London     1             0             0               0

One-hot encoded data

In the above table, you can observe that we have split the one-hot encoded arrays into columns. In the new columns, the value is set to 1 if the row represents a particular value. Otherwise, it is set to 0. For instance, the City_London column for the rows in which City is London is set to 1 and all the other columns are 0.
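The column-per-category layout described above can be produced directly with the pandas get_dummies() function; a short sketch using a subset of the dataset follows. The prefix parameter gives columns named City_London, City_Mumbai, and so on.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["John Smith", "Aditya Raj", "Will Smith", "Bill Warner"],
    "City": ["New York", "Mumbai", "London", "Paris"],
})
# Replace the City column with one 0/1 column per unique city.
encoded = pd.get_dummies(df, columns=["City"], prefix="City", dtype=int)
print(encoded)
```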

One Hot Encoding in Python Using The sklearn Module

Now that we have discussed how to perform one hot encoding, let us implement it in Python. For this, we will use the OneHotEncoder() function defined in the sklearn.preprocessing module.

The OneHotEncoder() Function

The OneHotEncoder() function has the following syntax.

OneHotEncoder(*, categories='auto', drop=None, sparse='deprecated', sparse_output=True, dtype=<class 'numpy.float64'>, handle_unknown='error', min_frequency=None, max_categories=None)

Here, 

  • The categories parameter is used to specify the unique values in the input data. By default, it is set to “auto”. Hence, the encoder finds all the unique values itself. If you want to specify the unique values manually, you can pass a list of all the unique values in the categorical data to the categories parameter. The passed values should not mix strings and numeric values within a single feature and should be sorted in the case of numeric values.
  • We use the drop parameter to reduce the length of the one-hot encoded vectors. One unique value from the input data can be represented by a vector containing all 0s, which reduces the length of each one-hot encoded vector by one. By default, the drop parameter is set to None. Hence, all the values are retained.
    • You can set the drop parameter to ‘first’ to drop the first categorical value. The first categorical value will then be represented by a vector containing all zeros. If a feature has only one category, it will be dropped entirely.
    • If we set the drop parameter to ‘if_binary’, the encoder drops the first value only in the case of binary features. Features with 1 or more than 2 categories are left intact.
  • The sparse parameter has been deprecated and will be removed in a future version of sklearn. When we set the sparse parameter to True, the one-hot encoded values are generated in the form of a sparse matrix. Otherwise, we get a dense array. The sparse_output parameter is the new name for the sparse parameter.
  • The dtype parameter is used to specify the desired data type of the output. By default, it is set to float64. You can change it to any number type such as int32, int64, etc.
  • The min_frequency parameter is used to specify the minimum frequency below which a category will be considered infrequent. You can pass an integer as the absolute support count or a floating point number to specify the minimum support to decide the infrequent values.
  • The max_categories parameter specifies an upper limit to the number of output features for each input feature when considering infrequent categories. If there are infrequent categories, the max_categories parameter includes the category representing the infrequent categories along with the frequent categories. If we set the max_categories parameter to None, there is no limit to the number of output features.
  • The handle_unknown parameter is used to handle unknown values while generating one-hot encoding using the transform() method.
    • By default, the handle_unknown parameter is set to ‘error’. Hence, if the data given to the transform() method contains new values compared to the data given to the fit() method, the program runs into an error.
    • You can set the handle_unknown parameter to “ignore”. After this, if an unknown value is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros. In the inverse transform, an unknown category will be denoted as None.
    • You can also map unknown values to the existing infrequent category. For this, you can set the handle_unknown parameter to ‘infrequent_if_exist’. After this, if an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will map to the infrequent category if it exists. The infrequent category is mapped to the last position in the encoding. During inverse transform, an unknown category will be mapped to the category denoted ‘infrequent’ if it exists.
    • If the ‘infrequent’ category does not exist, the transform() and inverse_transform() methods will handle an unknown category just as with handle_unknown='ignore'. Infrequent categories exist based on the min_frequency and max_categories parameters.

After execution, the OneHotEncoder() function returns an untrained one-hot encoder. We can then train the encoder using the fit() method. If we want to encode values from a single attribute, the fit() method takes a numpy array of shape (-1, 1). After execution, it returns a trained OneHotEncoder object.

We can use the transform() method to generate one-hot encoded values using the trained OneHotEncoder object. The transform() method takes an array containing the values we need to encode and returns a sparse matrix. You can convert the sparse matrix to a dense one-hot encoded array using the toarray() method as shown below.

from sklearn.preprocessing import OneHotEncoder
import numpy as np
untrained_encoder = OneHotEncoder(handle_unknown='ignore')
cities=np.array(["New York", "Mumbai", "London", "Paris"]).reshape(-1, 1)
print("The training set is:")
print(cities)
trained_encoder=untrained_encoder.fit(cities)
input_values=np.array(["New York", "Mumbai", "London", "London","Mumbai","Paris","New York","London", "London"]).reshape(-1, 1)
output=trained_encoder.transform(input_values).toarray()
print("The input values are:")
print(input_values)
print("The output is:")
print(output)

Output:

The training set is:
[['New York']
 ['Mumbai']
 ['London']
 ['Paris']]
The input values are:
[['New York']
 ['Mumbai']
 ['London']
 ['London']
 ['Mumbai']
 ['Paris']
 ['New York']
 ['London']
 ['London']]
The output is:
[[0. 0. 1. 0.]
 [0. 1. 0. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 0. 1.]
 [0. 0. 1. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]]

In the above example, we have trained a one-hot encoder using four values. Then, we passed a list of values to generate the one-hot encoded arrays. In the output, you can observe that the arrays are in the same format we discussed in the numerical example.

You can also perform one hot encoding on multiple features using a single OneHotEncoder object. For this, you can simply pass a 2-D list containing all the rows and columns as input to the fit() method as shown below.

from sklearn.preprocessing import OneHotEncoder
import numpy as np
untrained_encoder = OneHotEncoder(handle_unknown='ignore')
cities=[[0,"Mumbai"],[1,"London"],[2,"Paris"],[3,"New York"]]
print("The training set is:")
print(cities)
trained_encoder=untrained_encoder.fit(cities)
print(trained_encoder.categories_)
input_values=[[1,"Mumbai"],[2,"New York"]]
output=trained_encoder.transform(input_values).toarray()
print("The input values are:")
print(input_values)
print("The output is:")
print(output)

Output:

The training set is:
[[0, 'Mumbai'], [1, 'London'], [2, 'Paris'], [3, 'New York']]
[array([0, 1, 2, 3], dtype=object), array(['London', 'Mumbai', 'New York', 'Paris'], dtype=object)]
The input values are:
[[1, 'Mumbai'], [2, 'New York']]
The output is:
[[0. 1. 0. 0. 0. 1. 0. 0.]
 [0. 0. 1. 0. 0. 0. 1. 0.]]

In this example, we passed a two-dimensional array to the fit() method to perform one hot encoding in Python. Here, the first element of each inner array belongs to one feature, and the second element of each inner array belongs to another feature.

The number of elements in the output array depends on the number of unique values in both features. As there are four unique values in the first feature and four unique values in the second feature, the one-hot encoded arrays contain eight elements.

One Hot Encoding on a Pandas DataFrame in Python

In the previous examples, we discussed how to perform one hot encoding on 1-D and 2-D arrays containing standalone values. This is of limited use to us, as we handle most of the data using pandas dataframes while creating machine learning applications. Hence, let us discuss how to perform one hot encoding on a pandas dataframe in Python.

The process to train the one hot encoder is the same as discussed in the previous examples. We can extract a column from the dataframe and train the one hot encoder using the fit() method. After creating the encoder, we need to create a column transformer to generate one-hot encoded columns in the output dataframe. For this, we will use the make_column_transformer() function.

The make_column_transformer() Function

The make_column_transformer() function has the following syntax.

make_column_transformer(*transformers, remainder='drop', sparse_threshold=0.3, n_jobs=None, verbose=False, verbose_feature_names_out=True)

Here, 

  • The transformers parameter takes a tuple containing the trained OneHotEncoder object and a list of column names on which we want to perform one hot encoding.
  • By default, the remainder parameter is set to ‘drop’. Hence, only the columns specified in the transformers parameter are encoded and produced in the output; columns not specified in the transformers parameter are dropped from the output. To avoid this, we can set the remainder parameter to ‘passthrough’. After this, all remaining columns that are not specified in the transformers parameter will be passed through and included in the output. This subset of columns is concatenated with the output of the encoders.
  • You can also pass an untrained OneHotEncoder to the remainder parameter. By setting the remainder parameter to be an encoder, the columns that are not specified in the transformers parameter are encoded using the remainder estimator. Here, the encoder that we pass to the remainder parameter must support fit() and transform() methods.
  • The sparse_threshold parameter controls the output format when the transformed output consists of a mix of sparse and dense data: the result is stacked as a sparse matrix if its overall density is lower than this value. We can set the sparse_threshold parameter to 0 to always return dense data. When the transformed output consists of all sparse or all dense data, the stacked result will be sparse or dense, respectively, and the sparse_threshold parameter is ignored.
  • The n_jobs parameter is used to run the one hot encoder in parallel. By default, it is set to None. It means that only one job will run. You can set it to -1 to run as many jobs as there are processors in your machine.
  • The verbose parameter is used to print the time elapsed while fitting each encoder. By default, it is set to False.
  • The verbose_feature_names_out parameter is used to prefix all feature names with the name of the transformer that generated the feature in the one-hot encoded dataframe. By default, it is set to True. If we set it to False, get_feature_names_out() will not prefix any feature names, and the program will run into an error if the resulting feature names are not unique.

To perform one-hot encoding on the columns of a pandas dataframe, we can create a transformer using the make_column_transformer() function. Then, we will invoke the fit() method on the transformer and pass the input dataframe to the fit() method. After this, we can use the transform() method to generate the array containing the one hot encoded values.

To convert the array into a dataframe, we will use the DataFrame() function defined in the pandas module.  We will also use the get_feature_names_out() method on the trained transformer to get the column names for the one-hot encoded data. You can observe this in the following example.

import pandas as pd
import numpy as np
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder
df=pd.read_csv("sample_file.csv")
print("The dataframe is:")
print(df)
values=np.array(df["City"]).reshape(-1,1)
untrained_encoder_object = OneHotEncoder()
trained_encoder=untrained_encoder_object.fit(values)
untrained_transformer = make_column_transformer((trained_encoder, ["City"]), remainder='passthrough')
trained_transformer=untrained_transformer.fit(df)
transformed_data=trained_transformer.transform(df)
output=pd.DataFrame(transformed_data, columns=trained_transformer.get_feature_names_out())
print("The output dataframe is:")
print(output)

Output:

The dataframe is:
            Name      City
0     John Smith  New York
1     Aditya Raj    Mumbai
2     Will Smith    London
3    Harsh Aryan    London
4  Joel Harrison    Mumbai
5    Bill Warner     Paris
6     Chris Kite  New York
7     Sam Altman    London
8            Joe    London
The output dataframe is:
  onehotencoder__City_London onehotencoder__City_Mumbai  \
0                        0.0                        0.0   
1                        0.0                        1.0   
2                        1.0                        0.0   
3                        1.0                        0.0   
4                        0.0                        1.0   
5                        0.0                        0.0   
6                        0.0                        0.0   
7                        1.0                        0.0   
8                        1.0                        0.0   

  onehotencoder__City_New York onehotencoder__City_Paris remainder__Name  
0                          1.0                       0.0      John Smith  
1                          0.0                       0.0      Aditya Raj  
2                          0.0                       0.0      Will Smith  
3                          0.0                       0.0     Harsh Aryan  
4                          0.0                       0.0   Joel Harrison  
5                          0.0                       1.0     Bill Warner  
6                          1.0                       0.0      Chris Kite  
7                          0.0                       0.0      Sam Altman  
8                          0.0                       0.0             Joe  

In the above example, we have encoded the City column of the input dataframe using one-hot encoding in Python. For this, we first trained the OneHotEncoder using the column data, and then we transformed the input dataframe using a transformer created with the make_column_transformer() function. Here, we passed a tuple containing the trained OneHotEncoder object and a list of column names as the first input argument to the make_column_transformer() function. In the above output, you can observe that the column names in the output dataframe look cluttered, as they are all prefixed with the transformer names.

You can set the verbose_feature_names_out parameter to False to generate clean output column names as shown below.

import pandas as pd
import numpy as np
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder
df=pd.read_csv("sample_file.csv")
print("The dataframe is:")
print(df)
values=np.array(df["City"]).reshape(-1,1)
untrained_encoder_object = OneHotEncoder()
trained_encoder=untrained_encoder_object.fit(values)
untrained_transformer = make_column_transformer((trained_encoder, ["City"]), remainder='passthrough',verbose_feature_names_out=False)
trained_transformer=untrained_transformer.fit(df)
transformed_data=trained_transformer.transform(df)
output=pd.DataFrame(transformed_data, columns=trained_transformer.get_feature_names_out())
print("The output dataframe is:")
print(output)

Output:

The dataframe is:
            Name      City
0     John Smith  New York
1     Aditya Raj    Mumbai
2     Will Smith    London
3    Harsh Aryan    London
4  Joel Harrison    Mumbai
5    Bill Warner     Paris
6     Chris Kite  New York
7     Sam Altman    London
8            Joe    London
The output dataframe is:
  City_London City_Mumbai City_New York City_Paris           Name
0         0.0         0.0           1.0        0.0     John Smith
1         0.0         1.0           0.0        0.0     Aditya Raj
2         1.0         0.0           0.0        0.0     Will Smith
3         1.0         0.0           0.0        0.0    Harsh Aryan
4         0.0         1.0           0.0        0.0  Joel Harrison
5         0.0         0.0           0.0        1.0    Bill Warner
6         0.0         0.0           1.0        0.0     Chris Kite
7         1.0         0.0           0.0        0.0     Sam Altman
8         1.0         0.0           0.0        0.0            Joe

In this example, we have set the verbose_feature_names_out parameter to False in the make_column_transformer() function. Hence, we get the output dataframe with the desired column names.

One Hot Encoding With Multiple Columns of the Pandas Dataframe

To perform one hot encoding on multiple columns in the pandas dataframe at once, we will first obtain values from all the columns and train the one hot encoder. Then, we will pass multiple column names in the list of column names passed to the transformers parameter in the make_column_transformer() function. After this, we can train the column transformer and perform one hot encoding on multiple columns in the pandas dataframe as shown below.

import pandas as pd
import numpy as np
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder
df=pd.read_csv("sample_file.csv")
print("The dataframe is:")
df["Grades"]=["A","C", "B", "A", "A","B","B","C","D"]
print(df)
values=df[["City", "Grades"]]
untrained_encoder_object = OneHotEncoder()
trained_encoder=untrained_encoder_object.fit(values)
untrained_transformer = make_column_transformer((trained_encoder, ["City", "Grades"]), remainder='passthrough',verbose_feature_names_out=False)
trained_transformer=untrained_transformer.fit(df)
transformed_data=trained_transformer.transform(df)
output=pd.DataFrame(transformed_data, columns=trained_transformer.get_feature_names_out())
print("The output dataframe is:")
print(output)

Output:

The dataframe is:
            Name      City Grades
0     John Smith  New York      A
1     Aditya Raj    Mumbai      C
2     Will Smith    London      B
3    Harsh Aryan    London      A
4  Joel Harrison    Mumbai      A
5    Bill Warner     Paris      B
6     Chris Kite  New York      B
7     Sam Altman    London      C
8            Joe    London      D
The output dataframe is:
  City_London City_Mumbai City_New York City_Paris Grades_A Grades_B Grades_C  \
0         0.0         0.0           1.0        0.0      1.0      0.0      0.0   
1         0.0         1.0           0.0        0.0      0.0      0.0      1.0   
2         1.0         0.0           0.0        0.0      0.0      1.0      0.0   
3         1.0         0.0           0.0        0.0      1.0      0.0      0.0   
4         0.0         1.0           0.0        0.0      1.0      0.0      0.0   
5         0.0         0.0           0.0        1.0      0.0      1.0      0.0   
6         0.0         0.0           1.0        0.0      0.0      1.0      0.0   
7         1.0         0.0           0.0        0.0      0.0      0.0      1.0   
8         1.0         0.0           0.0        0.0      0.0      0.0      0.0   

  Grades_D           Name  
0      0.0     John Smith  
1      0.0     Aditya Raj  
2      0.0     Will Smith  
3      0.0    Harsh Aryan  
4      0.0  Joel Harrison  
5      0.0    Bill Warner  
6      0.0     Chris Kite  
7      0.0     Sam Altman  
8      1.0            Joe  

In the above output, you can observe that the City and Grades columns are encoded using one-hot encoding in Python in a single execution.

Conclusion

In this article, we discussed one hot encoding in Python. We also discussed different implementations of the one hot encoding process using the sklearn module. To learn more about encoding techniques, you can read this article on label encoding in Python. You might also like this article on k-means clustering in Python.

I hope you enjoyed reading this article. Stay tuned for more informative articles.

Happy Learning!
