# One Hot Encoding in Python

We use different categorical data encoding techniques during data analysis and machine learning tasks. In this article, we will discuss the basics of one hot encoding and how to implement it in Python.

## What is One Hot Encoding?

One hot encoding is an encoding technique in which we represent categorical values with numeric arrays of 0s and 1s. In one hot encoding, we use the following steps to encode categorical variables.

• First, we find the number of unique values for a given categorical variable. The length of the array containing one-hot encoded values is equal to the total number of unique values for a given categorical variable.
• Next, we assign an index in the array to each unique value.
• For the one-hot array to represent a categorical value, we set the value in the array to 1 at the index associated with the categorical value. The rest of the values in the array remain 0.
• We create a one-hot encoded array for each unique value in the categorical variable and use it to represent that value.
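The steps above can be sketched in a few lines of plain Python; the city values here are just illustrative data.

```python
cities = ["New York", "Mumbai", "London", "London", "Paris"]

# Steps 1-2: find the unique values and assign each one an index.
unique_values = sorted(set(cities))   # ['London', 'Mumbai', 'New York', 'Paris']
index_of = {value: i for i, value in enumerate(unique_values)}

# Steps 3-4: build a one-hot array for every value in the variable.
encoded = []
for city in cities:
    one_hot = [0] * len(unique_values)   # start with all zeros
    one_hot[index_of[city]] = 1          # set the value's own index to 1
    encoded.append(one_hot)

print(encoded[0])  # "New York" -> [0, 0, 1, 0]
```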

## One Hot Encoding Numerical Example

To understand how the above algorithm works, let us discuss a numerical example of one hot encoding. For this, we will use the following dataset.

| Name          | City     |
| ------------- | -------- |
| John Smith    | New York |
| Aditya Raj    | Mumbai   |
| Will Smith    | London   |
| Harsh Aryan   | London   |
| Joel Harrison | Mumbai   |
| Bill Warner   | Paris    |
| Chris Kite    | New York |
| Sam Altman    | London   |
| Joe           | London   |

In the above table, suppose that we want to perform one-hot encoding on the `City` column. For this, we will use the following steps.

• First, we will find the unique values in the given column. As the column contains four unique values (`London`, `Mumbai`, `New York`, and `Paris`), the length of each one-hot encoded array will be 4.
• Next, we will create an array of length 4 with all 0s for each unique categorical value. So, the one-hot encoded arrays right now are as follows.
• London=[0, 0, 0, 0]
• Mumbai=[0, 0, 0, 0]
• New York=[0, 0, 0, 0]
• Paris=[0, 0, 0, 0]
• After this, we will decide on the index associated with each categorical value in the array. Let us assign index 0 to `London`, 1 to `Mumbai`, 2 to `New York`, and 3 to `Paris`.
• Next, we will set the element at the associated index of each categorical value to 1 in the one-hot encoded array. Hence, the one-hot encoded arrays will look as follows.
• London=[1, 0, 0, 0]
• Mumbai=[0, 1, 0, 0]
• New York=[0, 0, 1, 0]
• Paris=[0, 0, 0, 1]

The above one-hot encoded arrays represent the associated categorical value. For example, the array [1, 0, 0, 0] represents the value `London`, [0, 1, 0, 0] represents the value `Mumbai`, and so on.

In most cases, these one-hot encoded arrays are split into different columns of the dataset, with each column representing one unique categorical value, as shown below.

| Name          | City_London | City_Mumbai | City_New York | City_Paris |
| ------------- | ----------- | ----------- | ------------- | ---------- |
| John Smith    | 0           | 0           | 1             | 0          |
| Aditya Raj    | 0           | 1           | 0             | 0          |
| Will Smith    | 1           | 0           | 0             | 0          |
| Harsh Aryan   | 1           | 0           | 0             | 0          |
| Joel Harrison | 0           | 1           | 0             | 0          |
| Bill Warner   | 0           | 0           | 0             | 1          |
| Chris Kite    | 0           | 0           | 1             | 0          |
| Sam Altman    | 1           | 0           | 0             | 0          |
| Joe           | 1           | 0           | 0             | 0          |

In the above table, you can observe that we have split the one-hot encoded arrays into columns. In each new column, the value is set to 1 if the row represents that particular value and to 0 otherwise. For instance, the `City_London` column is set to 1 for the rows in which `City` is `London` and to 0 for all other rows.
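As an aside, pandas can produce this column-per-value layout directly through its `get_dummies()` function. A quick sketch, using an illustrative dataframe:

```python
import pandas as pd

# Illustrative dataframe with one categorical column.
df = pd.DataFrame({"City": ["New York", "Mumbai", "London", "Paris"]})

# get_dummies() splits the one-hot arrays into one column per unique value.
dummies = pd.get_dummies(df, columns=["City"])
print(dummies.columns.tolist())
# ['City_London', 'City_Mumbai', 'City_New York', 'City_Paris']
```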

## One Hot Encoding in Python Using The sklearn Module

Now that we have discussed how to perform one hot encoding, we will implement it in Python. For this, we will use the `OneHotEncoder()` function defined in the `sklearn.preprocessing` module.

### The OneHotEncoder() Function

The `OneHotEncoder()` function has the following syntax.

``OneHotEncoder(*, categories='auto', drop=None, sparse='deprecated', sparse_output=True, dtype=<class 'numpy.float64'>, handle_unknown='error', min_frequency=None, max_categories=None)``

Here,

• The `categories` parameter is used to specify the unique values in the input data. By default, it is set to `'auto'`. Hence, the encoder finds all the unique values itself. If you want to specify the unique values manually, you can pass a list of arrays containing the unique values for each feature to the `categories` parameter. The passed values should not mix strings and numeric values within a single feature, and numeric values should be sorted.
• We use the `drop` parameter to reduce the length of the one-hot encoded vectors. One unique value can be represented by a vector containing all 0s, which shortens each one-hot encoded vector by one element. By default, the `drop` parameter is set to `None`. Hence, all the values are retained.
• You can set the `drop` parameter to `'first'` to drop the first categorical value. The first categorical value will then be represented by a vector containing all zeros. If only one category is present, the feature will be dropped entirely.
• If we set the `drop` parameter to `'if_binary'`, the encoder drops the first value only for binary features. Features with one or more than two categories are left intact.
• The `sparse` parameter has been deprecated and will be removed in a future version of sklearn; the `sparse_output` parameter is its new name. When we set the `sparse_output` parameter to True, the one-hot encoded values are generated in the form of a sparse matrix. Otherwise, we get a dense array.
• The `dtype` parameter is used to specify the desired data type of the output. By default, it is set to `float64`. You can change it to any numeric type such as `int32`, `int64`, etc.
• The `min_frequency` parameter is used to specify the minimum frequency below which a category is considered infrequent. You can pass an integer to specify an absolute count or a floating-point number to specify a fraction of the number of samples.
• The `max_categories` parameter specifies an upper limit to the number of output features for each input feature when considering infrequent categories. If there are infrequent categories, the `max_categories` parameter includes the category representing the infrequent categories along with the frequent categories. If we set the `max_categories` parameter to None, there is no limit to the number of output features.
• The `handle_unknown` parameter is used to handle unknown values while generating one-hot encodings using the `transform()` method.
• By default, the `handle_unknown` parameter is set to `'error'`. Hence, if the data given to the `transform()` method contains new values compared to the data given to the `fit()` method, the program runs into an error.
• You can set the `handle_unknown` parameter to `'ignore'`. After this, if an unknown value is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros. In the inverse transform, an unknown category will be denoted as None.
• You can also map new values to existing infrequent values by setting the `handle_unknown` parameter to `'infrequent_if_exist'`. After this, if an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will map to the infrequent category if it exists. The infrequent category will be mapped to the last position in the encoding. During inverse transform, an unknown category will be mapped to the category denoted `'infrequent'` if it exists.
• If the `'infrequent'` category does not exist, the `transform()` and `inverse_transform()` methods will handle an unknown category as with `handle_unknown='ignore'`. Infrequent categories exist based on `min_frequency` and `max_categories`.

After execution, the `OneHotEncoder()` function returns an untrained one-hot encoder. We can then train the encoder using the `fit()` method. If we want to encode values from a single attribute, the `fit()` method takes a 2-D numpy array with a single column; you can obtain one from a 1-D array using `reshape(-1, 1)`. After execution, it returns a trained `OneHotEncoder` object.

We can use the `transform()` method to generate one-hot encoded values using the trained `OneHotEncoder` object. The `transform()` method takes the array containing the values that we need to encode and returns a sparse matrix. You can convert the sparse matrix to a one-hot encoded array using the `toarray()` method as shown below.

```python
from sklearn.preprocessing import OneHotEncoder
import numpy as np
untrained_encoder = OneHotEncoder(handle_unknown='ignore')
cities=np.array(["New York", "Mumbai", "London", "Paris"]).reshape(-1, 1)
print("The training set is:")
print(cities)
trained_encoder=untrained_encoder.fit(cities)
input_values=np.array(["New York", "Mumbai", "London", "London","Mumbai","Paris","New York","London", "London"]).reshape(-1, 1)
output=trained_encoder.transform(input_values).toarray()
print("The input values are:")
print(input_values)
print("The output is:")
print(output)
```

Output:

```
The training set is:
[['New York']
['Mumbai']
['London']
['Paris']]
The input values are:
[['New York']
['Mumbai']
['London']
['London']
['Mumbai']
['Paris']
['New York']
['London']
['London']]
The output is:
[[0. 0. 1. 0.]
[0. 1. 0. 0.]
[1. 0. 0. 0.]
[1. 0. 0. 0.]
[0. 1. 0. 0.]
[0. 0. 0. 1.]
[0. 0. 1. 0.]
[1. 0. 0. 0.]
[1. 0. 0. 0.]]
```

In the above example, we have trained a one-hot encoder using four values. Then, we passed an array of values to encode. In the output, you can observe that the arrays are in the same format we discussed in the numerical example.
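The trained encoder can also decode one-hot arrays back into the original categories using the `inverse_transform()` method. A short sketch based on the same training data:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

cities = np.array(["New York", "Mumbai", "London", "Paris"]).reshape(-1, 1)
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(cities)

# Decode one-hot arrays back to the original categorical values.
one_hot = [[0, 0, 1, 0], [1, 0, 0, 0]]
print(encoder.inverse_transform(one_hot))
# [['New York']
#  ['London']]
```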

You can also perform one hot encoding on multiple features using a single `OneHotEncoder` object. For this, you can simply pass the 2-D list containing all the rows and columns as input to the `fit()` method as shown below.

```python
from sklearn.preprocessing import OneHotEncoder
import numpy as np
untrained_encoder = OneHotEncoder(handle_unknown='ignore')
cities=[[0,"Mumbai"],[1,"London"],[2,"Paris"],[3,"New York"]]
print("The training set is:")
print(cities)
trained_encoder=untrained_encoder.fit(cities)
print(trained_encoder.categories_)
input_values=[[1,"Mumbai"],[2,"New York"]]
output=trained_encoder.transform(input_values).toarray()
print("The input values are:")
print(input_values)
print("The output is:")
print(output)
```

Output:

```
The training set is:
[[0, 'Mumbai'], [1, 'London'], [2, 'Paris'], [3, 'New York']]
[array([0, 1, 2, 3], dtype=object), array(['London', 'Mumbai', 'New York', 'Paris'], dtype=object)]
The input values are:
[[1, 'Mumbai'], [2, 'New York']]
The output is:
[[0. 1. 0. 0. 0. 1. 0. 0.]
[0. 0. 1. 0. 0. 0. 1. 0.]]
```

In this example, we passed a two-dimensional array to the `fit()` method to perform one hot encoding in Python. Here, the first element of each inner array is treated as one feature and the second element as another feature.

The number of elements in the output array depends on the number of unique values in both features. As there are four unique values in the first feature and four unique values in the second feature, the one-hot encoded arrays contain eight elements.

## One Hot Encoding on a Pandas DataFrame in Python

In the previous examples, we discussed how to perform one hot encoding on 1-D and 2-D arrays containing standalone values. This is of limited use on its own, as we handle most of the data using pandas dataframes while creating machine learning applications. Hence, let us discuss how to perform one hot encoding on a pandas dataframe in Python.

The process to train the one hot encoder is the same as discussed in the previous examples. We can extract a column from the dataframe and train the one hot encoder using the `fit()` method. After creating the encoder, we need to create a column transformer to generate one-hot encoded columns in the output dataframe. For this, we will use the `make_column_transformer()` function.

### The make_column_transformer() Function

The `make_column_transformer()` function has the following syntax.

``make_column_transformer(*transformers, remainder='drop', sparse_threshold=0.3, n_jobs=None, verbose=False, verbose_feature_names_out=True)``

Here,

• The `transformers` parameter takes one or more tuples, each containing a transformer (such as a trained `OneHotEncoder` object) and a list of column names on which we want to perform one hot encoding.
• By default, the `remainder` parameter is set to `'drop'`. Hence, only the columns specified in the `transformers` parameter are encoded and produced in the output; columns not specified there are dropped from the output. To avoid this, we can set the `remainder` parameter to `'passthrough'`. After this, all remaining columns that are not specified in the `transformers` parameter will be passed through and included in the output. This subset of columns is concatenated with the output of the encoders.
• You can also pass an untrained `OneHotEncoder` to the `remainder` parameter. By setting the `remainder` parameter to be an encoder, the columns that are not specified in the `transformers` parameter are encoded using the remainder estimator. Here, the encoder that we pass to the `remainder` parameter must support `fit()` and `transform()` methods.
• The `sparse_threshold` parameter controls the density of the output. If the transformed output consists of a mix of sparse and dense data, it will be stacked as a sparse matrix if the overall density is lower than this value. We can set the `sparse_threshold` parameter to 0 to always return dense data. When the transformed output consists of all sparse or all dense data, the stacked result will be sparse or dense, respectively, and the `sparse_threshold` parameter is ignored.
• The `n_jobs` parameter is used to run the transformers in parallel. By default, it is set to None, meaning that only one job will run. You can set it to -1 to run as many jobs as there are processors on your machine.
• The `verbose` parameter is used to print the time elapsed while fitting each encoder. By default, it is set to False.
• The `verbose_feature_names_out` parameter is used to prefix each output feature name with the name of the transformer that generated it. By default, it is set to True. If we set it to False, `get_feature_names_out()` will not prefix any feature names, and the program will run into an error if the resulting feature names are not unique.

To perform one-hot encoding on the columns of a pandas dataframe, we can create a transformer using the `make_column_transformer()` function. Then, we will invoke the `fit()` method on the transformer and pass the input dataframe to it. After this, we can use the `transform()` method to generate the array containing the one-hot encoded values.

To convert the array into a dataframe, we will use the `DataFrame()` function defined in the pandas module. We will also use the `get_feature_names_out()` method on the trained transformer to get the column names for the one-hot encoded data. You can observe this in the following example.

```python
import pandas as pd
import numpy as np
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder
df=pd.DataFrame({"Name":["John Smith", "Aditya Raj", "Will Smith", "Harsh Aryan", "Joel Harrison", "Bill Warner", "Chris Kite", "Sam Altman", "Joe"],
                 "City":["New York", "Mumbai", "London", "London", "Mumbai", "Paris", "New York", "London", "London"]})
print("The dataframe is:")
print(df)
values=np.array(df["City"]).reshape(-1,1)
untrained_encoder_object = OneHotEncoder()
trained_encoder=untrained_encoder_object.fit(values)
untrained_transformer = make_column_transformer((trained_encoder, ["City"]), remainder='passthrough')
trained_transformer=untrained_transformer.fit(df)
transformed_data=trained_transformer.transform(df)
output=pd.DataFrame(transformed_data, columns=trained_transformer.get_feature_names_out())
print("The output dataframe is:")
print(output)
```

Output:

```
The dataframe is:
            Name      City
0     John Smith  New York
1     Aditya Raj    Mumbai
2     Will Smith    London
3    Harsh Aryan    London
4  Joel Harrison    Mumbai
5    Bill Warner     Paris
6     Chris Kite  New York
7     Sam Altman    London
8            Joe    London
The output dataframe is:
onehotencoder__City_London onehotencoder__City_Mumbai  \
0                        0.0                        0.0
1                        0.0                        1.0
2                        1.0                        0.0
3                        1.0                        0.0
4                        0.0                        1.0
5                        0.0                        0.0
6                        0.0                        0.0
7                        1.0                        0.0
8                        1.0                        0.0

onehotencoder__City_New York onehotencoder__City_Paris remainder__Name
0                          1.0                       0.0      John Smith
1                          0.0                       0.0      Aditya Raj
2                          0.0                       0.0      Will Smith
3                          0.0                       0.0     Harsh Aryan
4                          0.0                       0.0   Joel Harrison
5                          0.0                       1.0     Bill Warner
6                          1.0                       0.0      Chris Kite
7                          0.0                       0.0      Sam Altman
8                          0.0                       0.0             Joe
```

In the above example, we have encoded the `City` column of the input dataframe using one-hot encoding in Python. For this, we first trained the `OneHotEncoder` using the column data, and then we transformed the input dataframe using a transformer created with the `make_column_transformer()` function. Here, we passed a tuple containing the trained `OneHotEncoder` object and a list of column names as the first input argument to the `make_column_transformer()` function. In the above output, you can observe that the column names in the output dataframe look cluttered, as they all contain the transformer names.

You can set the `verbose_feature_names_out` parameter to False to generate clean output column names as shown below.

```python
import pandas as pd
import numpy as np
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder
df=pd.DataFrame({"Name":["John Smith", "Aditya Raj", "Will Smith", "Harsh Aryan", "Joel Harrison", "Bill Warner", "Chris Kite", "Sam Altman", "Joe"],
                 "City":["New York", "Mumbai", "London", "London", "Mumbai", "Paris", "New York", "London", "London"]})
print("The dataframe is:")
print(df)
values=np.array(df["City"]).reshape(-1,1)
untrained_encoder_object = OneHotEncoder()
trained_encoder=untrained_encoder_object.fit(values)
untrained_transformer = make_column_transformer((trained_encoder, ["City"]), remainder='passthrough',verbose_feature_names_out=False)
trained_transformer=untrained_transformer.fit(df)
transformed_data=trained_transformer.transform(df)
output=pd.DataFrame(transformed_data, columns=trained_transformer.get_feature_names_out())
print("The output dataframe is:")
print(output)
```

Output:

```
The dataframe is:
            Name      City
0     John Smith  New York
1     Aditya Raj    Mumbai
2     Will Smith    London
3    Harsh Aryan    London
4  Joel Harrison    Mumbai
5    Bill Warner     Paris
6     Chris Kite  New York
7     Sam Altman    London
8            Joe    London
The output dataframe is:
City_London City_Mumbai City_New York City_Paris           Name
0         0.0         0.0           1.0        0.0     John Smith
1         0.0         1.0           0.0        0.0     Aditya Raj
2         1.0         0.0           0.0        0.0     Will Smith
3         1.0         0.0           0.0        0.0    Harsh Aryan
4         0.0         1.0           0.0        0.0  Joel Harrison
5         0.0         0.0           0.0        1.0    Bill Warner
6         0.0         0.0           1.0        0.0     Chris Kite
7         1.0         0.0           0.0        0.0     Sam Altman
8         1.0         0.0           0.0        0.0            Joe
```

In this example, we have set the `verbose_feature_names_out` parameter to False in the `make_column_transformer()` function. Hence, we get the output dataframe with the desired column names.

## One Hot Encoding With Multiple Columns of the Pandas Dataframe

To perform one hot encoding on multiple columns of a pandas dataframe at once, we can pass multiple column names in the list of column names given to the `transformers` parameter of the `make_column_transformer()` function. After this, we can train the column transformer and perform one hot encoding on multiple columns of the pandas dataframe as shown below.

```python
import pandas as pd
import numpy as np
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder
df=pd.DataFrame({"Name":["John Smith", "Aditya Raj", "Will Smith", "Harsh Aryan", "Joel Harrison", "Bill Warner", "Chris Kite", "Sam Altman", "Joe"],
                 "City":["New York", "Mumbai", "London", "London", "Mumbai", "Paris", "New York", "London", "London"]})
df["Grades"]=["A","C", "B", "A", "A","B","B","C","D"]
print("The dataframe is:")
print(df)
untrained_encoder_object = OneHotEncoder()
untrained_transformer = make_column_transformer((untrained_encoder_object, ["City", "Grades"]), remainder='passthrough',verbose_feature_names_out=False)
trained_transformer=untrained_transformer.fit(df)
transformed_data=trained_transformer.transform(df)
output=pd.DataFrame(transformed_data, columns=trained_transformer.get_feature_names_out())
print("The output dataframe is:")
print(output)
```

Output:

```
The dataframe is:
            Name      City Grades
0     John Smith  New York      A
1     Aditya Raj    Mumbai      C
2     Will Smith    London      B
3    Harsh Aryan    London      A
4  Joel Harrison    Mumbai      A
5    Bill Warner     Paris      B
6     Chris Kite  New York      B
7     Sam Altman    London      C
8            Joe    London      D
The output dataframe is:
  City_London City_Mumbai City_New York City_Paris Grades_A Grades_B Grades_C  \
0         0.0         0.0           1.0        0.0      1.0      0.0      0.0
1         0.0         1.0           0.0        0.0      0.0      0.0      1.0
2         1.0         0.0           0.0        0.0      0.0      1.0      0.0
3         1.0         0.0           0.0        0.0      1.0      0.0      0.0
4         0.0         1.0           0.0        0.0      1.0      0.0      0.0
5         0.0         0.0           0.0        1.0      0.0      1.0      0.0
6         0.0         0.0           1.0        0.0      0.0      1.0      0.0
7         1.0         0.0           0.0        0.0      0.0      0.0      1.0
8         1.0         0.0           0.0        0.0      0.0      0.0      0.0

  Grades_D           Name
0      0.0     John Smith
1      0.0     Aditya Raj
2      0.0     Will Smith
3      0.0    Harsh Aryan
4      0.0  Joel Harrison
5      0.0    Bill Warner
6      0.0     Chris Kite
7      0.0     Sam Altman
8      1.0            Joe
```

In the above output, you can observe that the `City` and `Grades` columns are encoded using one-hot encoding in Python in a single execution.

## Conclusion

In this article, we discussed one hot encoding in Python. We also discussed different implementations of the one hot encoding process using the sklearn module. To learn more about encoding techniques, you can read this article on label encoding in Python. You might also like this article on k-means clustering in Python.