Implement FP Growth Algorithm in Python

Like the apriori algorithm, the fp-growth algorithm is used to generate frequent itemsets from a transaction dataset in market basket analysis. This article discusses how to implement the fp-growth algorithm in Python.

How to Implement The FP Growth Algorithm in Python?

We will use the mlxtend module in Python to implement the fp growth algorithm. It provides us with the fpgrowth() function to calculate the frequent itemsets and the association_rules() function for association rule mining.

Before implementing the fp growth algorithm, I suggest you read this article on the fp growth algorithm numerical example. This will help you understand how the algorithm actually works. 

Now, let us proceed with the implementation of the fp growth algorithm in Python. For this, we will use the following steps. 

  • First, we will obtain a list containing the lists of items in each transaction from the transaction dataset. 
  • Next, we will use the TransactionEncoder() function to create a transaction array. 
  • Once we get the transaction array, we will use the fpgrowth() function to generate frequent itemsets. 
  • Finally, we will use the association_rules() function to generate association rules. (A compact sketch of the whole pipeline follows this list.)
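
Before walking through each step in detail, here is that compact sketch of the whole pipeline on a small example transaction list. It only combines the mlxtend functions discussed below; the same data is worked through step by step in the following sections.

from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth, association_rules
import pandas as pd
# Step 1: a list of lists of items in each transaction
transactions = [["I1", "I3", "I4"], ["I2", "I3", "I5", "I6"],
                ["I1", "I2", "I3", "I5"], ["I2", "I5"], ["I1", "I3", "I5"]]
# Step 2: encode the transactions into a boolean transaction array
encoder = TransactionEncoder()
transaction_array = encoder.fit(transactions).transform(transactions)
transaction_df = pd.DataFrame(transaction_array, columns=encoder.columns_)
# Step 3: generate frequent itemsets with the fp growth algorithm
frequent_itemsets = fpgrowth(transaction_df, min_support=0.4, use_colnames=True)
# Step 4: mine association rules from the frequent itemsets
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)
print(rules)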

Create a List of Lists of Items From The Transaction Dataset

For implementing the fp-growth algorithm in Python, we will use the following dataset. 

Transaction ID    Items
T1                I1, I3, I4
T2                I2, I3, I5, I6
T3                I1, I2, I3, I5
T4                I2, I5
T5                I1, I3, I5
FP-Growth Algorithm Dataset

The above transaction dataset contains five transactions with six unique items. We will convert the above table into a list of lists of items as shown below.

transactions=[["I1", "I3", "I4"],
            	 ["I2", "I3", "I5", "I6"],
             	["I1", "I2", "I3", "I5"],
             	["I2", "I5"],
             	["I1", "I3", "I5"]]

In the above list, the items in each transaction constitute an inner list. We will use this list of lists to create the transaction array.

Create Transaction Array Using TransactionEncoder() Function

The fpgrowth() function takes a transaction array as its input. Hence, we will convert the list of items in the transactions to a transaction array. The transaction array has the following features. 

  • Each row in the transaction array represents a transaction, and each column represents an item.
  • If an item is present in a transaction, the element at the corresponding row and column will be set to True.
  • If an item isn’t present in a transaction, the element corresponding to the particular row and column is set to False. 

We will use the TransactionEncoder class defined in the mlxtend module to generate the transaction array. Calling TransactionEncoder() returns a TransactionEncoder object. After creating the TransactionEncoder object, we will use its fit() and transform() methods to create the transaction array.

The fit() method takes the transaction data in the form of a list of lists. Then, the TransactionEncoder object learns all the unique labels in the dataset. Next, we use the transform() method to transform the input dataset into a one-hot encoded boolean array as shown in the following example.

from mlxtend.preprocessing import TransactionEncoder
transactions=[["I1", "I3", "I4"],
             ["I2", "I3", "I5", "I6"],
             ["I1", "I2", "I3", "I5"],
             ["I2", "I5"],
             ["I1", "I3", "I5"]]
print("The list of transactions is:")
print(transactions)
transaction_encoder = TransactionEncoder()
transaction_array = transaction_encoder.fit(transactions).transform(transactions)
print("The transaction array is:")
print(transaction_array)

Output:

The list of transactions is:
[['I1', 'I3', 'I4'], ['I2', 'I3', 'I5', 'I6'], ['I1', 'I2', 'I3', 'I5'], ['I2', 'I5'], ['I1', 'I3', 'I5']]
The transaction array is:
[[ True False  True  True False False]
 [False  True  True False  True  True]
 [ True  True  True False  True False]
 [False  True False False  True False]
 [ True False  True False  True False]]

We will convert the transaction array into a dataframe using the DataFrame() function defined in the pandas module in Python. Here, we will set the item names as the column names in the dataframe. The transaction IDs will constitute the index of the dataframe. You can obtain all the item names using the columns_ attribute of the TransactionEncoder object and create the dataframe as shown below.

from mlxtend.preprocessing import TransactionEncoder
import pandas as pd
transactions=[["I1", "I3", "I4"],
             ["I2", "I3", "I5", "I6"],
             ["I1", "I2", "I3", "I5"],
             ["I2", "I5"],
             ["I1", "I3", "I5"]]
transaction_encoder = TransactionEncoder()
transaction_array = transaction_encoder.fit(transactions).transform(transactions)
transaction_dataframe = pd.DataFrame(transaction_array, columns=transaction_encoder.columns_,index=["T1","T2","T3","T4","T5"])
print("The transaction dataframe is:")
print(transaction_dataframe)

Output:

The transaction dataframe is:
       I1     I2     I3     I4     I5     I6
T1   True  False   True   True  False  False
T2  False   True   True  False   True   True
T3   True   True   True  False   True  False
T4  False   True  False  False   True  False
T5   True  False   True  False   True  False

Once we get the transaction array in the form of a dataframe, we will use it to implement the fp growth algorithm in Python.

Generate Frequent Itemsets Using the fpgrowth() Function

After generating the transaction array using the transaction encoder, we will use the fpgrowth() function to implement the fp growth algorithm in Python. The fpgrowth() function has the following syntax.

fpgrowth(df, min_support=0.5, use_colnames=False, max_len=None, verbose=0)

Here, 

  • The df parameter takes the dataframe containing the transaction matrix as its input. 
  • The min_support parameter takes the minimum support that we want to specify for each item set. It should be a floating point number between 0 and 1. By default, it has a value of 0.5.
  • The use_colnames parameter is used to specify whether we want to use the column names of the input df as the item names. By default, use_colnames is set to False. Due to this, the fpgrowth() function uses the column indices instead of the column names as item names. To use the column names of the input df as item names, we will set the use_colnames parameter to True.
  • We use the max_len parameter to define the maximum number of items in an itemset. By default, it is set to None, denoting that all possible itemset lengths are evaluated (a short sketch below illustrates the effect of max_len).
  • We can use the verbose parameter to show the execution stages of the fp growth algorithm. Setting verbose to a value greater than 0 prints the stages of conditional tree generation while the function runs.

After execution, the fpgrowth() function returns a dataframe with the columns 'support' and 'itemsets', containing all the itemsets whose support is greater than or equal to min_support and whose length is at most max_len (if max_len is not None).
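
For instance, the following small sketch passes max_len=1 so that only single-item frequent itemsets are returned; this is just an illustration of the parameter, and the main example below uses the default max_len=None.

from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth
import pandas as pd
transactions=[["I1", "I3", "I4"],
             ["I2", "I3", "I5", "I6"],
             ["I1", "I2", "I3", "I5"],
             ["I2", "I5"],
             ["I1", "I3", "I5"]]
transaction_encoder = TransactionEncoder()
transaction_array = transaction_encoder.fit(transactions).transform(transactions)
transaction_dataframe = pd.DataFrame(transaction_array, columns=transaction_encoder.columns_)
# max_len=1 keeps only itemsets of length 1; larger itemsets are not reported
print(fpgrowth(transaction_dataframe, min_support=0.4, use_colnames=True, max_len=1))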

To calculate the frequent itemsets, we will use a minimum support of 0.4 and set the use_colnames parameter to True to use the column names of the input dataframe as item names, as shown below.

from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth
import pandas as pd
transactions=[["I1", "I3", "I4"],
             ["I2", "I3", "I5", "I6"],
             ["I1", "I2", "I3", "I5"],
             ["I2", "I5"],
             ["I1", "I3", "I5"]]
transaction_encoder = TransactionEncoder()
transaction_array = transaction_encoder.fit(transactions).transform(transactions)
transaction_dataframe = pd.DataFrame(transaction_array, columns=transaction_encoder.columns_,index=["T1","T2","T3","T4","T5"])
frequent_itemsets=fpgrowth(transaction_dataframe,min_support=0.4, use_colnames=True)
print("The frequent itemsets are:")
print(frequent_itemsets)

Output:

The frequent itemsets are:
    support      itemsets
0       0.8          (I3)
1       0.6          (I1)
2       0.8          (I5)
3       0.6          (I2)
4       0.6      (I5, I3)
5       0.6      (I1, I3)
6       0.4      (I1, I5)
7       0.4  (I1, I5, I3)
8       0.6      (I5, I2)
9       0.4      (I3, I2)
10      0.4  (I5, I3, I2)

In the above output, you can observe that the fpgrowth() function returns a dataframe containing the frequent itemsets and their support.

Generate Association Rules Using The association_rules() Function

After generating the frequent itemsets using the fpgrowth() function, we can use the association_rules() function to find association rules in the dataset. The association_rules() function has the following syntax. 

association_rules(df, metric='confidence', min_threshold=0.8, support_only=False)

Here, 

  • The df parameter takes the dataframe returned from the fpgrowth() function as its input. The dataframe must contain the columns 'support' and 'itemsets' containing frequent itemsets and their support.
  • The metric parameter defines the metric used to select the association rules. We can specify any of the following metrics.
    • “support”: The support of an association rule is calculated as the support of the itemset containing both the antecedent and the consequent together. It has a range of [0,1].
    • “confidence”: The confidence of an association rule is calculated as the support of the antecedent and consequent combined divided by the support of the antecedent.  It has a range of [0,1].
    • “lift”: The lift for an association rule is defined as the confidence of the association rule divided by the support of the consequent. It has a range of [0, infinity].
    • “leverage”: The leverage of an association rule is defined as the difference between the support of the rule and the product of the supports of the antecedent and the consequent. It has a range of [-1,1].
    • “conviction”: The conviction of an association rule is defined as (1-support of consequent) divided by (1- confidence of the association rule). It has a range of [0, infinity].
    • “zhangs_metric”: It is calculated as leverage of the association rule/max (support of the association rule*(1-support of the antecedent), support of the antecedent*(support of the consequent-support of the association rule)). It has a range of [-1,1].
  • We use the min_threshold parameter to specify the minimum value of the metric defined in the metric parameter to filter the useful association rules. By default, it has a value of 0.8.
  • We use the  support_only parameter to specify if we only want to compute the support of the association rules and fill the other metric columns with NaNs. You can use this parameter if the input dataframe is incomplete and does not contain support values for all rule antecedents and consequents. By setting the support_only parameter to True, you can also speed up the computation because you don’t calculate the other metrics for the association rules.

After execution, the association_rules() function returns a dataframe containing the ‘antecedents’, ‘consequents’, ‘antecedent support’, ‘consequent support’, ‘support’, ‘confidence’, ‘lift’, ‘leverage’, and ‘conviction’ for all the generated association rules.  You can observe this in the following example.

from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth,association_rules
import pandas as pd
transactions=[["I1", "I3", "I4"],
             ["I2", "I3", "I5", "I6"],
             ["I1", "I2", "I3", "I5"],
             ["I2", "I5"],
             ["I1", "I3", "I5"]]
transaction_encoder = TransactionEncoder()
transaction_array = transaction_encoder.fit(transactions).transform(transactions)
transaction_dataframe = pd.DataFrame(transaction_array, columns=transaction_encoder.columns_,index=["T1","T2","T3","T4","T5"])
frequent_itemsets=fpgrowth(transaction_dataframe,min_support=0.4, use_colnames=True)
association_rules_df=association_rules(frequent_itemsets, metric="confidence", min_threshold=.7)
print("The association rules are:")
print(association_rules_df)

Output:

The association rules are:
  antecedents consequents  antecedent support  consequent support  support  \
0        (I5)        (I3)                 0.8                 0.8      0.6   
1        (I3)        (I5)                 0.8                 0.8      0.6   
2        (I1)        (I3)                 0.6                 0.8      0.6   
3        (I3)        (I1)                 0.8                 0.6      0.6   
4    (I1, I5)        (I3)                 0.4                 0.8      0.4   
5        (I5)        (I2)                 0.8                 0.6      0.6   
6        (I2)        (I5)                 0.6                 0.8      0.6   
7    (I3, I2)        (I5)                 0.4                 0.8      0.4   

   confidence    lift  leverage  conviction  
0        0.75  0.9375     -0.04         0.8  
1        0.75  0.9375     -0.04         0.8  
2        1.00  1.2500      0.12         inf  
3        0.75  1.2500      0.12         1.6  
4        1.00  1.2500      0.08         inf  
5        0.75  1.2500      0.12         1.6  
6        1.00  1.2500      0.12         inf  
7        1.00  1.2500      0.08         inf  
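
To see how these metrics relate to the support values, we can verify the first rule in the above output, (I5) -> (I3), by hand. The following is only a small sketch that recomputes its metrics from the support values reported in the frequent itemsets; the variable names are used purely for illustration.

# Support values taken from the frequent itemsets generated earlier
antecedent_support = 0.8   # support of (I5)
consequent_support = 0.8   # support of (I3)
rule_support = 0.6         # support of (I5, I3)
confidence = rule_support / antecedent_support                      # 0.75
lift = confidence / consequent_support                              # 0.9375
leverage = rule_support - antecedent_support * consequent_support   # -0.04
conviction = (1 - consequent_support) / (1 - confidence)            # 0.8
print(confidence, lift, leverage, conviction)

Up to floating point rounding, these values match the first row of the dataframe returned by the association_rules() function.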

Now that we have discussed each step for implementing the fp growth algorithm in Python, let us use a real dataset for the implementation. You can download the dataset using this link. For this article, I have downloaded and renamed the dataset to ecommerce_transaction_dataset.csv.

Implement FP Growth Algorithm in Python on Real Data

To implement the fp growth algorithm in Python on a real-world dataset, we will first load the dataset into our program using the read_csv() function defined in the pandas module. The read_csv() function takes the filename as its input argument and returns a dataframe containing the data in the file as shown below.

import pandas as pd
dataset=pd.read_csv("ecommerce_transaction_dataset.csv")
print("The dataset is:")
print(dataset.head())

Output:

The dataset is:
  InvoiceNo StockCode                          Description  Quantity  \
0    536365    85123A   WHITE HANGING HEART T-LIGHT HOLDER         6   
1    536365     71053                  WHITE METAL LANTERN         6   
2    536365    84406B       CREAM CUPID HEARTS COAT HANGER         8   
3    536365    84029G  KNITTED UNION FLAG HOT WATER BOTTLE         6   
4    536365    84029E       RED WOOLLY HOTTIE WHITE HEART.         6   

      InvoiceDate  UnitPrice  CustomerID         Country  
0  12/1/2010 8:26       2.55     17850.0  United Kingdom  
1  12/1/2010 8:26       3.39     17850.0  United Kingdom  
2  12/1/2010 8:26       2.75     17850.0  United Kingdom  
3  12/1/2010 8:26       3.39     17850.0  United Kingdom  
4  12/1/2010 8:26       3.39     17850.0  United Kingdom  

You can observe that the input dataset contains eight columns.

Preprocess Data to Get the Appropriate Dataset

We can’t directly implement the fp-growth algorithm on the above dataset. Hence, we need to perform data preprocessing to get a list of lists of items in the transactions.

From the above dataset, we only need the transaction ID, i.e. the “InvoiceNo” column, and the item ID, i.e. the “StockCode” column. Hence, we will drop the other columns in the dataframe using the drop() method. We will also drop the rows in which InvoiceNo or StockCode contain null values.

import pandas as pd
dataset=pd.read_csv("ecommerce_transaction_dataset.csv")
dataset=dataset.drop(["Description","Quantity","InvoiceDate","UnitPrice","CustomerID","Country"],axis=1)
dataset=dataset.dropna()
print("The dataset is:")
print(dataset.head())

Output:

The dataset is:
  InvoiceNo StockCode
0    536365    85123A
1    536365     71053
2    536365    84406B
3    536365    84029G
4    536365    84029E

In the above output, you can observe that each row contains only one item. Hence, we will group all the items of a particular transaction in a single row. For this, we will use the groupby() method, the apply() method, and the list() function.

  • The groupby() method, when invoked on the dataframe, takes the column name i.e. “InvoiceNo” as its input argument. After execution, it groups together the rows having the same InvoiceNo.
  • Next, we will make a list of all items in each transaction by applying the list() function to the “StockCode” column of each group using the apply() method. Calling reset_index(name='Items') then converts the result back into a dataframe with the item lists in an 'Items' column.

After executing the above methods, we will get a dataframe containing the transaction id and the corresponding items as shown below.

import pandas as pd
dataset=pd.read_csv("ecommerce_transaction_dataset.csv")
dataset=dataset.drop(["Description","Quantity","InvoiceDate","UnitPrice","CustomerID","Country"],axis=1)
dataset=dataset.dropna()
transaction_data=dataset.groupby("InvoiceNo")["StockCode"].apply(list).reset_index(name='Items')
print("The transaction data is:")
print(transaction_data.head())

Output:

The transaction data is:
  InvoiceNo                                              Items
0    536365  [85123A, 71053, 84406B, 84029G, 84029E, 22752,...
1    536366                                     [22633, 22632]
2    536367  [84879, 22745, 22748, 22749, 22310, 84969, 226...
3    536368                       [22960, 22913, 22912, 22914]
4    536369                                            [21756]

Finally, we will select the Items column from the dataframe to create a list of lists of items in each transaction using the tolist() method as shown below.

import pandas as pd
dataset=pd.read_csv("ecommerce_transaction_dataset.csv")
dataset=dataset.drop(["Description","Quantity","InvoiceDate","UnitPrice","CustomerID","Country"],axis=1)
dataset=dataset.dropna()
transaction_data=dataset.groupby("InvoiceNo")["StockCode"].apply(list).reset_index(name='Items')
transactions=transaction_data["Items"].tolist()
print("The transactions are:")
print(transactions[0:10])

Output:

The transactions are:
[['85123A', '71053', '84406B', '84029G', '84029E', '22752', '21730'], ['22633', '22632'], ['84879', '22745', '22748', '22749', '22310', '84969', '22623', '22622', '21754', '21755', '21777', '48187'], ['22960', '22913', '22912', '22914'], ['21756'], ['22728', '22727', '22726', '21724', '21883', '10002', '21791', '21035', '22326', '22629', '22659', '22631', '22661', '21731', '22900', '21913', '22540', '22544', '22492', 'POST'], ['22086'], ['22632', '22633'], ['85123A', '71053', '84406B', '20679', '37370', '21871', '21071', '21068', '82483', '82486', '82482', '82494L', '84029G', '84029E', '22752', '21730'], ['21258']]

At this point, we have obtained the list of lists that we can use to implement the fp growth algorithm in Python.

Generate Frequent Itemsets and Association Rules

To generate the frequent itemsets using the list of transactions, we will use the following steps.

  • First, we will use the TransactionEncoder() function to generate the transaction array.
  • Next, we will use the fpgrowth() function to obtain the frequent itemsets. Here, we will use a minimum support of 0.02 and set the use_colnames parameter to True to use the column names of the input dataframe as item names.

After execution of the fpgrowth() function, we will get the frequent itemsets with their support as shown below.

from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth
import pandas as pd
dataset=pd.read_csv("ecommerce_transaction_dataset.csv")
dataset=dataset.drop(["Description","Quantity","InvoiceDate","UnitPrice","CustomerID","Country"],axis=1)
dataset=dataset.dropna()
transaction_data=dataset.groupby("InvoiceNo")["StockCode"].apply(list).reset_index(name='Items')
transactions=transaction_data["Items"].tolist()
transaction_encoder = TransactionEncoder()
transaction_array = transaction_encoder.fit(transactions).transform(transactions)
transaction_dataframe = pd.DataFrame(transaction_array, columns=transaction_encoder.columns_)
frequent_itemsets=fpgrowth(transaction_dataframe,min_support=0.02, use_colnames=True)
print("The frequent itemsets are:")
print(frequent_itemsets)

Output:

The frequent itemsets are:
      support               itemsets
0    0.086718               (85123A)
1    0.056680                (84879)
2    0.030386                (21754)
3    0.024363                (21755)
4    0.023243                (48187)
..        ...                    ...
215  0.021197  (22697, 22699, 22698)
216  0.021429        (23199, 85099B)
217  0.020000         (23203, 23202)
218  0.022471        (23203, 85099B)
219  0.021197         (23300, 23301)

[220 rows x 2 columns]

You can also find all the association rules using the association_rules() function as shown below.

from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth,association_rules
import pandas as pd
dataset=pd.read_csv("ecommerce_transaction_dataset.csv")
dataset=dataset.drop(["Description","Quantity","InvoiceDate","UnitPrice","CustomerID","Country"],axis=1)
dataset=dataset.dropna()
transaction_data=dataset.groupby("InvoiceNo")["StockCode"].apply(list).reset_index(name='Items')
transactions=transaction_data["Items"].tolist()
transaction_encoder = TransactionEncoder()
transaction_array = transaction_encoder.fit(transactions).transform(transactions)
transaction_dataframe = pd.DataFrame(transaction_array, columns=transaction_encoder.columns_)
frequent_itemsets=fpgrowth(transaction_dataframe,min_support=0.02, use_colnames=True)
association_rules_df=association_rules(frequent_itemsets, metric="confidence", min_threshold=.50)
print("The association rules are:")
print(association_rules_df.head())

Output:

The association rules are:
  antecedents consequents  antecedent support  consequent support   support  \
0     (22727)     (22726)            0.041737            0.038726  0.024942   
1     (22726)     (22727)            0.038726            0.041737  0.024942   
2     (22386)    (85099B)            0.047529            0.082432  0.032162   
3     (21931)    (85099B)            0.046371            0.082432  0.028301   
4    (85099C)    (85099B)            0.036564            0.082432  0.022896   

   confidence       lift  leverage  conviction  
0    0.597595  15.431412  0.023326    2.388821  
1    0.644068  15.431412  0.023326    2.692261  
2    0.676686   8.208973  0.028244    2.838004  
3    0.610325   7.403939  0.024479    2.354698  
4    0.626188   7.596379  0.019882    2.454623  
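
Since the association_rules() function returns an ordinary pandas dataframe, you can also sort the rules to inspect the strongest associations first. The following short sketch continues from the previous example and sorts the rules by lift in descending order.

# Sort the mined rules by lift so that the strongest associations appear first
strongest_rules = association_rules_df.sort_values("lift", ascending=False)
print(strongest_rules.head())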

Conclusion

In this article, we have discussed how to implement the fp growth algorithm in Python. To know more about data mining, you can read this article on the apriori algorithm in Python. You might also like this article on categorical data encoding techniques.

I hope you enjoyed this article. Stay tuned for more informative articles. 

Happy Learning!
