Implement FP Growth Algorithm in Python
Like the apriori algorithm, the fp-growth algorithm is used to generate frequent itemsets from a transaction dataset in market basket analysis. This article will discuss how to implement the fp growth algorithm in Python.
How to Implement The FP Growth Algorithm in Python?
We will use the mlxtend module in Python to implement the fp growth algorithm. It provides us with the fpgrowth() function to calculate the frequent itemsets and the association_rules() function for association rule mining.
Before implementing the fp growth algorithm, I suggest you read this article on the fp growth algorithm numerical example. This will help you understand how the algorithm actually works.
Now, let us proceed with the implementation of the fp growth algorithm in Python. For this, we will use the following steps.
- First, we will obtain a list containing the lists of items in each transaction from the transaction dataset.
- Next, we will use the TransactionEncoder() function to create a transaction array.
- Once we get the transaction array, we will use the fpgrowth() function to generate frequent itemsets.
- Finally, we will use the association_rules() function to generate association rules.
Create a List of Lists of Items From The Transaction Dataset
For implementing the fp-growth algorithm in Python, we will use the following dataset.
| Transaction ID | Items |
|---|---|
| T1 | I1, I3, I4 |
| T2 | I2, I3, I5, I6 |
| T3 | I1, I2, I3, I5 |
| T4 | I2, I5 |
| T5 | I1, I3, I5 |
The above transaction dataset contains five transactions with six unique items. We will convert the above table into a list of lists of items as shown below.
transactions=[["I1", "I3", "I4"],
["I2", "I3", "I5", "I6"],
["I1", "I2", "I3", "I5"],
["I2", "I5"],
["I1", "I3", "I5"]]
In the above list, the items in each transaction constitute an inner list. We will use this list of lists to create the transaction array.
Create Transaction Array Using TransactionEncoder() Function
The fpgrowth() function takes a transaction array as its input. Hence, we will convert the list of items in the transactions to a transaction array. The transaction array has the following features.
- Each row in the transaction array represents a transaction, and each column represents an item.
- If an item is present in a transaction, the element at the corresponding row and column will be set to True.
- If an item isn’t present in a transaction, the element corresponding to the particular row and column is set to False.
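To make this layout concrete, here is a plain-Python sketch that builds the same boolean array by hand. This is only an illustration of the format; the TransactionEncoder() function discussed below does this for us.

```python
# Hand-rolled illustration of the transaction array layout
transactions = [["I1", "I3", "I4"],
                ["I2", "I3", "I5", "I6"],
                ["I1", "I2", "I3", "I5"],
                ["I2", "I5"],
                ["I1", "I3", "I5"]]

# Columns: the unique items, sorted; rows: one boolean row per transaction
items = sorted({item for t in transactions for item in t})
array = [[item in t for item in items] for t in transactions]

print(items)
for row in array:
    print(row)
```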
We will use the TransactionEncoder() function defined in the mlxtend module to generate the transaction array. The TransactionEncoder() function returns a TransactionEncoder object. After creating the TransactionEncoder object, we will use the fit() and transform() methods to create the transaction array.
The fit() method takes the transaction data in the form of a list of lists. The TransactionEncoder object then learns all the unique labels in the dataset. Next, we use the transform() method to transform the input dataset into a one-hot encoded boolean array, as shown in the following example.
from mlxtend.preprocessing import TransactionEncoder
transactions=[["I1", "I3", "I4"],
["I2", "I3", "I5", "I6"],
["I1", "I2", "I3", "I5"],
["I2", "I5"],
["I1", "I3", "I5"]]
print("The list of transactions is:")
print(transactions)
transaction_encoder = TransactionEncoder()
transaction_array = transaction_encoder.fit(transactions).transform(transactions)
print("The transaction array is:")
print(transaction_array)
Output:
The list of transactions is:
[['I1', 'I3', 'I4'], ['I2', 'I3', 'I5', 'I6'], ['I1', 'I2', 'I3', 'I5'], ['I2', 'I5'], ['I1', 'I3', 'I5']]
The transaction array is:
[[ True False True True False False]
[False True True False True True]
[ True True True False True False]
[False True False False True False]
[ True False True False True False]]
We will convert the transaction array into a dataframe using the DataFrame() function defined in the pandas module in Python. Here, we will set the item names as the column names in the dataframe. The transaction IDs will constitute the index of the dataframe. You can obtain all the item names using the columns_ attribute of the TransactionEncoder object and create the dataframe as shown below.
from mlxtend.preprocessing import TransactionEncoder
import pandas as pd
transactions=[["I1", "I3", "I4"],
["I2", "I3", "I5", "I6"],
["I1", "I2", "I3", "I5"],
["I2", "I5"],
["I1", "I3", "I5"]]
transaction_encoder = TransactionEncoder()
transaction_array = transaction_encoder.fit(transactions).transform(transactions)
transaction_dataframe = pd.DataFrame(transaction_array, columns=transaction_encoder.columns_,index=["T1","T2","T3","T4","T5"])
print("The transaction dataframe is:")
print(transaction_dataframe)
Output:
The transaction dataframe is:
I1 I2 I3 I4 I5 I6
T1 True False True True False False
T2 False True True False True True
T3 True True True False True False
T4 False True False False True False
T5 True False True False True False
Once we get the transaction array in the form of a dataframe, we will use it to implement the fp growth algorithm in Python.
Generate Frequent Itemsets Using the fpgrowth() Function
After generating the transaction array using the transaction encoder, we will use the fpgrowth() function to implement the fp growth algorithm in Python. The fpgrowth() function has the following syntax.
fpgrowth(df, min_support=0.5, use_colnames=False, max_len=None, verbose=0)
Here,
- The df parameter takes the dataframe containing the transaction matrix as its input.
- The min_support parameter takes the minimum support that we want to specify for each itemset. It should be a floating point number between 0 and 1. By default, it has a value of 0.5.
- The use_colnames parameter specifies whether we want to use the column names of the input df as the item names. By default, use_colnames is set to False. Due to this, the fpgrowth() function uses the indices of the columns instead of the column names as item names. To use the column names of the input df as item names, we will set the use_colnames parameter to True.
- We use the max_len parameter to define the maximum number of items in an itemset. By default, it is set to None, denoting that all possible itemset lengths are evaluated.
- We can use the verbose parameter to show the stages of conditional tree generation while the algorithm executes. By default, it is set to 0.
After execution, the fpgrowth() function returns a dataframe with the columns 'support' and 'itemsets', containing all the itemsets that have a support greater than or equal to min_support and at most max_len items if max_len is not None.
To calculate the frequent itemsets, we will use a minimum support of 0.4 and set the use_colnames parameter to True to use the column names of the input dataframe as item names, as shown below.
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth
import pandas as pd
transactions=[["I1", "I3", "I4"],
["I2", "I3", "I5", "I6"],
["I1", "I2", "I3", "I5"],
["I2", "I5"],
["I1", "I3", "I5"]]
transaction_encoder = TransactionEncoder()
transaction_array = transaction_encoder.fit(transactions).transform(transactions)
transaction_dataframe = pd.DataFrame(transaction_array, columns=transaction_encoder.columns_,index=["T1","T2","T3","T4","T5"])
frequent_itemsets=fpgrowth(transaction_dataframe,min_support=0.4, use_colnames=True)
print("The frequent itemsets are:")
print(frequent_itemsets)
Output:
The frequent itemsets are:
support itemsets
0 0.8 (I3)
1 0.6 (I1)
2 0.8 (I5)
3 0.6 (I2)
4 0.6 (I5, I3)
5 0.6 (I1, I3)
6 0.4 (I1, I5)
7 0.4 (I1, I5, I3)
8 0.6 (I5, I2)
9 0.4 (I3, I2)
10 0.4 (I5, I3, I2)
In the above output, you can observe that the fpgrowth() function returns a dataframe containing the frequent itemsets and their support.
Generate Association Rules Using The association_rules() Function
After generating the frequent itemsets using the fpgrowth() function, we can use the association_rules() function to find association rules in the dataset. The association_rules() function has the following syntax.
association_rules(df, metric='confidence', min_threshold=0.8, support_only=False)
Here,
- The df parameter takes the dataframe returned from the fpgrowth() function as its input. The dataframe must contain the columns 'support' and 'itemsets' containing the frequent itemsets and their support.
- The metric parameter defines the metric used to select the association rules. We can specify any of the following metrics.
  - "support": The support of an association rule is the fraction of transactions that contain both the antecedent and the consequent. It has a range of [0, 1].
  - "confidence": The confidence of an association rule is calculated as the support of the antecedent and consequent combined, divided by the support of the antecedent. It has a range of [0, 1].
  - "lift": The lift of an association rule is defined as the confidence of the association rule divided by the support of the consequent. It has a range of [0, infinity].
  - "leverage": The leverage of an association rule is defined as the support of the association rule minus the product of the supports of the antecedent and the consequent. It has a range of [-1, 1].
  - "conviction": The conviction of an association rule is defined as (1 - support of the consequent) divided by (1 - confidence of the association rule). It has a range of [0, infinity].
  - "zhangs_metric": It is calculated as the leverage of the association rule divided by max(support of the rule * (1 - support of the antecedent), support of the antecedent * (support of the consequent - support of the rule)). It has a range of [-1, 1].
- We use the min_threshold parameter to specify the minimum value of the metric defined in the metric parameter to filter the useful association rules. By default, it has a value of 0.8.
- We use the support_only parameter to specify whether we only want to compute the support of the association rules and fill the other metric columns with NaNs. You can use this parameter if the input dataframe is incomplete and does not contain support values for all rule antecedents and consequents. By setting the support_only parameter to True, you can also speed up the computation because the other metrics are not calculated.
After execution, the association_rules() function returns a dataframe containing the 'antecedents', 'consequents', 'antecedent support', 'consequent support', 'support', 'confidence', 'lift', 'leverage', and 'conviction' columns for all the generated association rules. You can observe this in the following example.
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth,association_rules
import pandas as pd
transactions=[["I1", "I3", "I4"],
["I2", "I3", "I5", "I6"],
["I1", "I2", "I3", "I5"],
["I2", "I5"],
["I1", "I3", "I5"]]
transaction_encoder = TransactionEncoder()
transaction_array = transaction_encoder.fit(transactions).transform(transactions)
transaction_dataframe = pd.DataFrame(transaction_array, columns=transaction_encoder.columns_,index=["T1","T2","T3","T4","T5"])
frequent_itemsets=fpgrowth(transaction_dataframe,min_support=0.4, use_colnames=True)
association_rules_df=association_rules(frequent_itemsets, metric="confidence", min_threshold=.7)
print("The association rules are:")
print(association_rules_df)
Output:
The association rules are:
antecedents consequents antecedent support consequent support support \
0 (I5) (I3) 0.8 0.8 0.6
1 (I3) (I5) 0.8 0.8 0.6
2 (I1) (I3) 0.6 0.8 0.6
3 (I3) (I1) 0.8 0.6 0.6
4 (I1, I5) (I3) 0.4 0.8 0.4
5 (I5) (I2) 0.8 0.6 0.6
6 (I2) (I5) 0.6 0.8 0.6
7 (I3, I2) (I5) 0.4 0.8 0.4
confidence lift leverage conviction
0 0.75 0.9375 -0.04 0.8
1 0.75 0.9375 -0.04 0.8
2 1.00 1.2500 0.12 inf
3 0.75 1.2500 0.12 1.6
4 1.00 1.2500 0.08 inf
5 0.75 1.2500 0.12 1.6
6 1.00 1.2500 0.12 inf
7 1.00 1.2500 0.08 inf
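These numbers can be verified by hand from the five transactions. The following plain-Python check recomputes the metrics of the first rule, (I5) -> (I3), using the definitions given above:

```python
# Recompute the metrics of the rule (I5) -> (I3) from the raw transactions
transactions = [["I1", "I3", "I4"],
                ["I2", "I3", "I5", "I6"],
                ["I1", "I2", "I3", "I5"],
                ["I2", "I5"],
                ["I1", "I3", "I5"]]
n = len(transactions)

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(all(item in t for item in itemset) for t in transactions) / n

antecedent, consequent = {"I5"}, {"I3"}
sup_ant = support(antecedent)                  # 0.8
sup_con = support(consequent)                  # 0.8
sup_rule = support(antecedent | consequent)    # 0.6
confidence = sup_rule / sup_ant                # 0.6 / 0.8 = 0.75
lift = confidence / sup_con                    # 0.75 / 0.8 = 0.9375
leverage = sup_rule - sup_ant * sup_con        # 0.6 - 0.64 = -0.04
conviction = (1 - sup_con) / (1 - confidence)  # 0.2 / 0.25 = 0.8
print(confidence, lift, leverage, conviction)
```

These values match the first row of the output above.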
Now that we have discussed each step for implementing the fp growth algorithm in Python, let us use a real dataset for the implementation. You can download the dataset using this link. For this article, I have downloaded and renamed the dataset to ecommerce_transaction_dataset.csv.
Implement FP Growth Algorithm in Python on Real Data
To implement the fp growth algorithm in Python on a real-world dataset, we will first load the dataset into our program using the read_csv() function defined in the pandas module. The read_csv() function takes the filename as its input argument and returns a dataframe containing the data in the file, as shown below.
import pandas as pd
dataset=pd.read_csv("ecommerce_transaction_dataset.csv")
print("The dataset is:")
print(dataset.head())
Output:
The dataset is:
InvoiceNo StockCode Description Quantity \
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6
1 536365 71053 WHITE METAL LANTERN 6
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6
InvoiceDate UnitPrice CustomerID Country
0 12/1/2010 8:26 2.55 17850.0 United Kingdom
1 12/1/2010 8:26 3.39 17850.0 United Kingdom
2 12/1/2010 8:26 2.75 17850.0 United Kingdom
3 12/1/2010 8:26 3.39 17850.0 United Kingdom
4 12/1/2010 8:26 3.39 17850.0 United Kingdom
You can observe that the input dataset contains eight columns.
Preprocess data to get the appropriate dataset
We can’t directly implement the fp-growth algorithm on the above dataset. Hence, we need to perform data preprocessing to get a list of lists of items in the transactions.
From the above dataset, we only need the transaction id, i.e. the "InvoiceNo" column, and the item id, i.e. the "StockCode" column. Hence, we will drop the other columns in the dataframe using the drop() method. We will also drop the rows in which InvoiceNo or StockCode contains a null value.
import pandas as pd
dataset=pd.read_csv("ecommerce_transaction_dataset.csv")
dataset=dataset.drop(["Description","Quantity","InvoiceDate","UnitPrice","CustomerID","Country"],axis=1)
dataset=dataset.dropna()
print("The dataset is:")
print(dataset.head())
Output:
The dataset is:
InvoiceNo StockCode
0 536365 85123A
1 536365 71053
2 536365 84406B
3 536365 84029G
4 536365 84029E
In the above output, you can observe that each row contains only one item. Hence, we will group all the items of a particular transaction in a single row. For this, we will use the groupby() method, the apply() method, and the list() function.
- The groupby() method, when invoked on a dataframe, takes the column name, i.e. "InvoiceNo", as its input argument. After execution, it groups the rows for a particular InvoiceNo into small dataframes.
- Next, we will make a list of all the items in each transaction by applying the list() function on the "StockCode" column of each grouped dataframe using the apply() method.
After executing the above methods, we will get a dataframe containing the transaction id and the corresponding items as shown below.
import pandas as pd
dataset=pd.read_csv("ecommerce_transaction_dataset.csv")
dataset=dataset.drop(["Description","Quantity","InvoiceDate","UnitPrice","CustomerID","Country"],axis=1)
dataset=dataset.dropna()
transaction_data=dataset.groupby("InvoiceNo")["StockCode"].apply(list).reset_index(name='Items')
print("The transaction data is:")
print(transaction_data.head())
Output:
The transaction data is:
InvoiceNo Items
0 536365 [85123A, 71053, 84406B, 84029G, 84029E, 22752,...
1 536366 [22633, 22632]
2 536367 [84879, 22745, 22748, 22749, 22310, 84969, 226...
3 536368 [22960, 22913, 22912, 22914]
4 536369 [21756]
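If you don't have the CSV file at hand, you can try the same grouping on a tiny inline dataframe. The invoice numbers and stock codes below are made up for illustration:

```python
import pandas as pd

# Made-up invoice numbers and stock codes, just to show the mechanics
dataset = pd.DataFrame({
    "InvoiceNo": ["A1", "A1", "A2", "A2", "A2", "A3"],
    "StockCode": ["X", "Y", "X", "Z", "W", "Y"],
})

transaction_data = (dataset.groupby("InvoiceNo")["StockCode"]
                    .apply(list)
                    .reset_index(name="Items"))
print(transaction_data)
```

Each row of the result holds the full item list for one invoice, exactly as in the real dataset above.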
Finally, we will select the Items column from the dataframe to create a list of lists of items in each transaction using the tolist() method, as shown below.
import pandas as pd
dataset=pd.read_csv("ecommerce_transaction_dataset.csv")
dataset=dataset.drop(["Description","Quantity","InvoiceDate","UnitPrice","CustomerID","Country"],axis=1)
dataset=dataset.dropna()
transaction_data=dataset.groupby("InvoiceNo")["StockCode"].apply(list).reset_index(name='Items')
transactions=transaction_data["Items"].tolist()
print("The transactions are:")
print(transactions[0:10])
Output:
The transactions are:
[['85123A', '71053', '84406B', '84029G', '84029E', '22752', '21730'], ['22633', '22632'], ['84879', '22745', '22748', '22749', '22310', '84969', '22623', '22622', '21754', '21755', '21777', '48187'], ['22960', '22913', '22912', '22914'], ['21756'], ['22728', '22727', '22726', '21724', '21883', '10002', '21791', '21035', '22326', '22629', '22659', '22631', '22661', '21731', '22900', '21913', '22540', '22544', '22492', 'POST'], ['22086'], ['22632', '22633'], ['85123A', '71053', '84406B', '20679', '37370', '21871', '21071', '21068', '82483', '82486', '82482', '82494L', '84029G', '84029E', '22752', '21730'], ['21258']]
At this point, we have obtained the list of lists that we can use to implement the fp growth algorithm in Python.
Generate Frequent Itemsets and Association Rules
To generate the frequent itemsets using the list of transactions, we will use the following steps.
- First, we will use the TransactionEncoder() function to generate the transaction array.
- Next, we will use the fpgrowth() function to obtain the frequent itemsets. Here, we will use a minimum support of 0.02 and set the use_colnames parameter to True to use the column names of the input dataframe.
After execution of the fpgrowth() function, we will get the frequent itemsets with their support, as shown below.
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth
import pandas as pd
dataset=pd.read_csv("ecommerce_transaction_dataset.csv")
dataset=dataset.drop(["Description","Quantity","InvoiceDate","UnitPrice","CustomerID","Country"],axis=1)
dataset=dataset.dropna()
transaction_data=dataset.groupby("InvoiceNo")["StockCode"].apply(list).reset_index(name='Items')
transactions=transaction_data["Items"].tolist()
transaction_encoder = TransactionEncoder()
transaction_array = transaction_encoder.fit(transactions).transform(transactions)
transaction_dataframe = pd.DataFrame(transaction_array, columns=transaction_encoder.columns_)
frequent_itemsets=fpgrowth(transaction_dataframe,min_support=0.02, use_colnames=True)
print("The frequent itemsets are:")
print(frequent_itemsets)
Output:
The frequent itemsets are:
support itemsets
0 0.086718 (85123A)
1 0.056680 (84879)
2 0.030386 (21754)
3 0.024363 (21755)
4 0.023243 (48187)
.. ... ...
215 0.021197 (22697, 22699, 22698)
216 0.021429 (23199, 85099B)
217 0.020000 (23203, 23202)
218 0.022471 (23203, 85099B)
219 0.021197 (23300, 23301)
[220 rows x 2 columns]
You can also find all the association rules using the association_rules() function, as shown below.
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth,association_rules
import pandas as pd
dataset=pd.read_csv("ecommerce_transaction_dataset.csv")
dataset=dataset.drop(["Description","Quantity","InvoiceDate","UnitPrice","CustomerID","Country"],axis=1)
dataset=dataset.dropna()
transaction_data=dataset.groupby("InvoiceNo")["StockCode"].apply(list).reset_index(name='Items')
transactions=transaction_data["Items"].tolist()
transaction_encoder = TransactionEncoder()
transaction_array = transaction_encoder.fit(transactions).transform(transactions)
transaction_dataframe = pd.DataFrame(transaction_array, columns=transaction_encoder.columns_)
frequent_itemsets=fpgrowth(transaction_dataframe,min_support=0.02, use_colnames=True)
association_rules_df=association_rules(frequent_itemsets, metric="confidence", min_threshold=.50)
print("The association rules are:")
print(association_rules_df.head())
Output:
The association rules are:
antecedents consequents antecedent support consequent support support \
0 (22727) (22726) 0.041737 0.038726 0.024942
1 (22726) (22727) 0.038726 0.041737 0.024942
2 (22386) (85099B) 0.047529 0.082432 0.032162
3 (21931) (85099B) 0.046371 0.082432 0.028301
4 (85099C) (85099B) 0.036564 0.082432 0.022896
confidence lift leverage conviction
0 0.597595 15.431412 0.023326 2.388821
1 0.644068 15.431412 0.023326 2.692261
2 0.676686 8.208973 0.028244 2.838004
3 0.610325 7.403939 0.024479 2.354698
4 0.626188 7.596379 0.019882 2.454623
Conclusion
In this article, we have discussed how to implement the fp growth algorithm in Python. To know more about data mining, you can read this article on the apriori algorithm in Python. You might also like this article on categorical data encoding techniques.
I hope you enjoyed this article. Stay tuned for more informative articles.
Happy Learning!