Market Basket Analysis using the Apriori Algorithm in Python

example of market basket analysis


Introduction :

You are at a market and you see several items placed together, like soaps kept alongside shampoos. Then you visit another market and see the same combination of items. Coincidence? Not exactly.

Consider this the work of the market basket analysis technique.

What is Market Basket Analysis?

Market Basket Analysis, also known as Affinity Analysis, is a modelling technique based on the theory that if you buy a certain group of items, you are more (or less) likely to buy another group of items.
It uses the association rule mining technique to find relevant associations between items.

Association rule:

Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large databases.
This technique isn't limited to shopping carts. Other areas where it is used include the analysis of fraudulent insurance claims and credit card purchases.

Another real-life scenario of association rules that we come across is when we buy a product from Amazon.



In the example above, when I try to order a torch, we can clearly see what people have frequently bought while ordering a torch: batteries. This is quite obvious, since a torch needs batteries to work. Amazon has an algorithm that has found how frequently these two products are bought together, so that we don't have to search for batteries separately.
This association between torch and batteries is one of the many that can be formed.


About the Apriori algorithm :

The Apriori algorithm is a classic algorithm for mining frequent itemsets and the relevant association rules. It was proposed by R. Agrawal and R. Srikant in 1994; the algorithm is named Apriori because it uses prior knowledge of frequent itemset properties.
To improve the efficiency of the level-wise generation of frequent itemsets, an important property called the Apriori property is used.

Apriori Property:

All subsets of a frequent itemset must also be frequent; equivalently, if an itemset is infrequent, every superset of it is infrequent too. This lets the algorithm prune large parts of the search space level by level.

Assume you have 10k records in your dataset and you were to find associations between them; just imagine how many rules would be created among these products.
There are three metrics used to measure association and to filter out the less frequent items and their rules.
 

Say you have two items, (A) and (B),
  • Support : the number of transactions containing both (A) and (B), divided by the total number of transactions. Using support we can filter out the items that are bought infrequently.
  • Confidence : how often (A) and (B) occur together, given the number of times (A) occurs, i.e. Freq(A, B) / Freq(A).

  • Lift : indicates the strength of a rule; it compares the actual confidence with the expected confidence. A lift above 1 means (A) and (B) appear together more often than chance would predict.
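These three metrics can be sketched as small Python functions (a minimal illustration, not library code; transactions are represented as Python sets):

```python
def support(itemset, transactions):
    """Fraction of all transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)  # <= is the subset test
    return hits / len(transactions)

def confidence(a, b, transactions):
    """How often B is bought given A: support(A and B) / support(A)."""
    return support(a | b, transactions) / support(a, transactions)

def lift(a, b, transactions):
    """Observed co-occurrence vs. what independence would predict."""
    return confidence(a, b, transactions) / support(b, transactions)

# Illustration on four made-up transactions:
transactions = [{"A", "B"}, {"A", "B", "C"}, {"D"}, {"A"}]
print(support({"A"}, transactions))            # 0.75
print(confidence({"A"}, {"B"}, transactions))  # 0.666... (B in 2 of 3 A-baskets)
print(lift({"A"}, {"B"}, transactions))        # 1.333... (above 1: positive association)
```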

Let's understand this algorithm better with a fun-size example:

Consider the set of transactions below,


Now select a support value and a confidence threshold to filter out less frequent itemsets. For this example we take,
  1. Support = 50%
  2. Threshold Confidence = 70%
Create a new table containing the distinct itemsets and calculate the support for each of them.
Note:- The support of the item Bread is calculated as the number of transactions containing Bread in the above table, i.e. (2), divided by the total number of transactions, (4). 2/4 gives us 50% support for Bread, and we calculate the other itemsets similarly.

Now filter out all those itemsets that do not have the minimum support value of 50% that we set earlier. That removes the itemsets Beer, Egg and Ketchup, leaving a new table of itemsets that satisfy the minimum support value.



Now create another table with the itemsets (Bread, Jam, Milk), but this time we will create pairs of these items.

Note:- We do not consider duplicate pairs, since {Bread, Jam} is the same as {Jam, Bread}.
Calculate the support values for these itemsets. For {Bread, Jam} the count is (2), as they have appeared twice as a pair in our original dataset; dividing their count (2) by the total number of transactions (4) gives 50% support, and we calculate the other itemsets similarly.

Filter out the itemsets that do not satisfy the 50% support value; this removes {Bread, Milk} and {Jam, Milk}, so the only itemset left is {Bread, Jam}.

Since no more combinations can be formed, we stop here and find the confidence.
For the itemset {Bread, Jam}, the formula for the confidence of the rule Bread → Jam is,

Confidence(Bread → Jam) = Freq(Bread, Jam) / Freq(Bread)

Freq(Bread, Jam) is 2, i.e. the number of times both items appeared together in the dataset, and Freq(Bread) is 2, so that gives us 2/2 = 100%.

The threshold confidence we kept was 70%, and for the itemset {Bread, Jam} we got a confidence value of 100%, which clearly states that whenever someone purchased bread they also purchased jam along with it.
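This hand computation can be reproduced in a few lines of Python. The original transaction table is in the accompanying image; the four transactions below are an assumed reconstruction that is consistent with every count used in the text (Bread, Jam, Milk and the pair {Bread, Jam} each appear twice; Beer, Egg and Ketchup once each):

```python
from itertools import combinations

# Assumed reconstruction of the 4-transaction example (consistent with the
# counts used in the text; the original table is shown in the image).
transactions = [
    {"Bread", "Jam"},
    {"Bread", "Jam", "Beer"},
    {"Milk", "Egg"},
    {"Milk", "Ketchup"},
]
n = len(transactions)

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / n

# Level 1: keep single items with support >= 50%.
items = {i for t in transactions for i in t}
frequent1 = {i for i in items if support({i}) >= 0.5}
print(sorted(frequent1))  # ['Bread', 'Jam', 'Milk']

# Level 2: form pairs from the survivors and filter on support again.
frequent2 = [set(p) for p in combinations(sorted(frequent1), 2)
             if support(set(p)) >= 0.5]
print(frequent2)  # [{'Bread', 'Jam'}] (set element order may vary)

# Confidence of the rule Bread -> Jam.
conf = support({"Bread", "Jam"}) / support({"Bread"})
print(conf)  # 1.0, i.e. 100% -- above the 70% threshold
```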

Limitations of Apriori Algorithm :

The Apriori algorithm can be very slow on large datasets, as it finds association rules among all items, so try to find good values for the metrics (support, confidence, lift) to filter out only the appropriate associations.

This concludes our explanation on Apriori algorithm.


I completed my internship on "Market Basket Analysis using Apriori algorithm" at Suven Consultancy & Technology Pvt. Ltd. under the guidance of Rocky Jagtiani.

Implementation :

Now we will implement this algorithm on the very popular Groceries dataset.

  • Import all the necessary modules; the apriori and association_rules modules are in the mlxtend package.
 
  • Read the dataset into a dataframe named groceries.

Note:- We have added the column names item1 ... item32 separately, as this dataset has no header.
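The loading step might look like the sketch below. The file name and exact call are my assumptions; a tiny inline sample (3 columns instead of 32) stands in for the real CSV so the snippet is self-contained:

```python
import io
import pandas as pd

# Stand-in for the real groceries file: no header row, rows of varying length.
sample = io.StringIO(
    "whole milk,rolls/buns\n"
    "soda\n"
    "whole milk,yogurt,soda\n"
)

# The dataset has no header, so column names item1 ... itemN are supplied
# explicitly (item1..item3 here; item1..item32 for the real file).
cols = ["item" + str(i) for i in range(1, 4)]
groceries = pd.read_csv(sample, names=cols)

print(groceries.shape)  # (3, 3) for this sample; (9835, 32) for the real dataset
print(groceries)        # missing trailing values show up as NaN
```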

A quick glance at the dataframe: I have displayed the dataframe below up to column item8, but there are more columns up to item32. Cells that have no value (null) are displayed as NaN.


  • Getting the shape of the dataframe shows that we have 9835 rows/records and 32 columns, where each column holds one item.
 
  • Now we will find the top 10 sold items that occur in the dataset. For that,
    • Iterate through all the rows and columns of the dataframe and store the values of each row as a list, so we end up with one list, trans, consisting of 9835 sub-lists.
    • After that I have used the Counter module to count the occurrence of each item in the trans list, skipping the NaN values while counting, and converted this counter object c into a dictionary object dc. This dc object holds each item name and its total occurrence count as a key-value pair.
    • As we have the item names and their total counts, we can convert this dc object into a dataframe. I have passed the column names 'Item_name' and 'Item_count', then sorted this dataframe df_Per in descending order and generated a new index. Lastly I have added a new column, 'Item_percent', which holds the percentage of each item's count out of the total count of all items.
    • Now that we have sorted this dataframe, we can get the top 10 items sold by,
 

    • From the output below we can see that the top 5 items alone (whole milk, other vegetables, rolls/buns, soda, yogurt) are responsible for 21.47% of the entire sales.
    • After getting the cumulative percentage of the top 20 items, we can conclude that they are responsible for 50.38% of the entire sales. So in our further analysis we will include only these top 20 items, which is sufficient for the Apriori algorithm, and by doing this we will increase the efficiency of our model.
    • Below is a visual representation of the top 5 items sold.
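The counting steps described above can be sketched as follows. The names trans, c, dc and df_Per, and the columns 'Item_name', 'Item_count' and 'Item_percent', come from the text; three stand-in rows replace the real 9835-row dataframe:

```python
from collections import Counter

import numpy as np
import pandas as pd

# Stand-in for the list built by iterating over the groceries dataframe:
# one sub-list per transaction, padded with NaN.
trans = [
    ["whole milk", "rolls/buns", np.nan],
    ["soda", np.nan, np.nan],
    ["whole milk", "yogurt", "soda"],
]

# Count each item's occurrences, skipping the NaN placeholders.
c = Counter(item for row in trans for item in row if isinstance(item, str))
dc = dict(c)  # item name -> total occurrence count

# Build a dataframe, sort by count descending, re-index, add a percent column.
df_Per = pd.DataFrame(list(dc.items()), columns=["Item_name", "Item_count"])
df_Per = df_Per.sort_values("Item_count", ascending=False).reset_index(drop=True)
df_Per["Item_percent"] = df_Per["Item_count"] / df_Per["Item_count"].sum() * 100

print(df_Per.head(10))  # the top 10 items by count
```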
  • Now we will create a function prune_dataset that prunes the dataset based on some parameters.
    • The parameters for this function are,
      • The dataframe to be pruned, i.e. groceries.
      • The minimum number of items that must be present in each transaction/row, i.e. if we pass length_trans=2 then at least 2 items should be present in a transaction.
      • The cumulative sales percentage the kept items should account for, i.e. if we pass total_sales_perc=0.5 then we get only the top 20 items, which we saw earlier account for 50% of the entire sales.
    • In the function below, we have used TransactionEncoder(), then filtered items based on the length_trans and total_sales_perc parameters.
    • From the code below we can see that our dataframe (groceries) has been greatly reduced, and this will improve the efficiency of our algorithm.

  • Applying the Apriori algorithm and obtaining rules:
    • In the code below we have passed output_df (the pruned dataset) with a support of 0.4%; after that we create rules based on the lift metric with a minimum threshold of 1.
  • Decoding the output
    • We will try to decode the output by picking one rule, like the one below,

      • Support = 0.04: the itemset {whole milk, whipped/sour cream} showed up together in 4% of transactions.
      • Confidence = 0.46: if someone buys {whole milk}, then they also buy {whipped/sour cream} 46% of the time.
      • Lift = 1.49: {whipped/sour cream} is 1.49 times more likely to be bought together with {whole milk} than would be expected if the two were independent.
  • Finally, we can change/tune the parameters to get different sets of rules.
  

Conclusion:

  1. We can use the Apriori algorithm to get associations between items.
  2. Once we have these associations, we can place offers on strongly associated items accordingly.






























