Clustering: Apriori

2020-02-11 | Blogs

Association Rule

Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large databases. It is intended to identify strong rules in databases using some measures of interestingness. For example, we may want to find a one-to-one association rule between product categories: product category 1 -> product category 2

This approach is often used to discover regularities between products in large-scale transaction data recorded by point-of-sale (POS) systems in supermarkets. Because the data contains no labelled associations to begin with, this is an unsupervised learning problem. The discovered rules can inform marketing activities such as promotional pricing or product placement. In contrast with sequence mining, association rule learning typically does not consider the order of items either within a transaction or across transactions. [Wikipedia]

Evaluation Metrics¹

Support

: Percentage of transactions in which the items in X and Y are bought together
Support has the downward-closure property: all subsets of a frequent itemset (support > minimum support threshold) are also frequent
Cons: Items that occur very infrequently in the data set are pruned, even though they might still produce interesting and potentially valuable rules.
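As a minimal sketch of the definition above, support can be computed by counting the baskets that contain every item of an itemset. The `transactions` list and the `support` helper are hypothetical names made up for illustration; they do not come from the original post.

```python
# Hypothetical toy data: one set of items per transaction (basket).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

pair = support({"diapers", "beer"}, transactions)  # 3 of the 5 baskets

# Downward closure: every subset is at least as frequent as its superset.
assert support({"diapers"}, transactions) >= pair
assert support({"beer"}, transactions) >= pair

# A rare item such as "eggs" falls below most thresholds and is pruned.
print(pair, support({"eggs"}, transactions))
```

With a minimum support of, say, 0.3, the rare item "eggs" would be dropped before any rule involving it is ever examined, which is the drawback noted above.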

Confidence

: Percentage of the transactions containing X that also contain Y, i.e. how often customers bought Y given that they bought X
While support is used to prune the search space and leave only potentially interesting rules, confidence is used in a second step to keep the rules that exceed a minimum confidence threshold
Cons: Sensitive to the frequency of the consequent (Y) in the data set. Because of the way confidence is calculated, consequents with higher support automatically produce higher confidence values, even if there is no association between the items.
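The sketch below computes confidence as support(X and Y) / support(X), which is the usual formula behind the definition above, and illustrates the drawback on a made-up data set. All names and data here are hypothetical.

```python
# Hypothetical toy data: one set of items per transaction (basket).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(x, y, transactions):
    """Share of the transactions containing X that also contain Y."""
    both = frozenset(x) | frozenset(y)
    return support(both, transactions) / support(x, transactions)

# milk -> bread looks strong (about 0.75), but bread is in 80% of all
# baskets anyway, so knowing about milk does not raise the chance of bread.
conf_milk_bread = confidence({"milk"}, {"bread"}, transactions)
baseline_bread = support({"bread"}, transactions)
print(conf_milk_bread, baseline_bread)
```

In this toy data the rule milk -> bread passes a confidence threshold of 0.7 even though its confidence is below the plain frequency of bread, which is exactly the sensitivity to the consequent described above.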

Lift

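: Ratio of how often X and Y are bought together to how often they would be bought together if X and Y were independent
An association rule X -> Y is only useful if its lift value is > 1
Lift also takes into account how often Y is bought independently, without knowledge about X
This largely solves the problem with the confidence threshold, namely its sensitivity to the frequency of the consequent (Y)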
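A small sketch of lift, assuming the common formula lift(X -> Y) = support(X and Y) / (support(X) * support(Y)); the formula is not spelled out in the original post, and the data and function names are hypothetical.

```python
# Hypothetical toy data: one set of items per transaction (basket).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def lift(x, y, transactions):
    """Observed joint support divided by the support expected under independence."""
    both = frozenset(x) | frozenset(y)
    expected = support(x, transactions) * support(y, transactions)
    return support(both, transactions) / expected

# Both rules have a confidence of about 0.75, but only one has lift > 1.
print(lift({"diapers"}, {"beer"}, transactions))  # above 1: positive association
print(lift({"milk"}, {"bread"}, transactions))    # below 1: no useful association
```

Dividing by the support of Y is what removes the advantage that a very frequent consequent enjoys under confidence alone.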

Conviction

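: Measures how poor the association can be: how often the rule X -> Y would be wrong if X and Y were independent, relative to how often it is actually wrong.
A directed measure that is monotone in confidence and lift.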
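The original post gives no formula for conviction. The sketch below assumes the commonly cited definition conviction(X -> Y) = (1 - support(Y)) / (1 - confidence(X -> Y)), with the value treated as infinite when the rule is never wrong. Data and names are hypothetical.

```python
import math

# Hypothetical toy data: one set of items per transaction (basket).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def conviction(x, y, transactions):
    """(1 - support(Y)) / (1 - confidence(X -> Y)); inf if the rule never fails."""
    both = frozenset(x) | frozenset(y)
    conf = support(both, transactions) / support(x, transactions)
    if conf == 1:
        return math.inf
    return (1 - support(y, transactions)) / (1 - conf)

# Conviction is directed: swapping X and Y changes the value.
print(conviction({"diapers"}, {"beer"}, transactions))
print(conviction({"beer"}, {"diapers"}, transactions))  # inf: beer always comes with diapers here
```

A value of 1 corresponds to independence, and values above 1 mean the rule fails less often than independence would predict.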

Leverage

: Difference between how often X and Y appear together in the data set and how often they would be expected to appear together if X and Y were statistically independent.
The rationale in a sales setting is to find out how many more units (items X and Y together) are sold than expected from the independent sales.
Cons: Suffers from the rare item problem.
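Read as a formula, the description above becomes leverage(X -> Y) = support(X and Y) - support(X) * support(Y). The following is a minimal sketch under that reading, with hypothetical data and names.

```python
# Hypothetical toy data: one set of items per transaction (basket).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def leverage(x, y, transactions):
    """Observed joint support minus the joint support expected under independence."""
    both = frozenset(x) | frozenset(y)
    expected = support(x, transactions) * support(y, transactions)
    return support(both, transactions) - expected

# Positive: the pair sells together more often than independence predicts.
print(leverage({"diapers"}, {"beer"}, transactions))
# Negative: the pair sells together less often than independence predicts.
print(leverage({"milk"}, {"bread"}, transactions))
```

Because leverage is an absolute difference of supports, a pair of rare items can never reach a large value, however strongly the two items are associated; this is the rare item problem mentioned above.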

Apriori Property

All subsets of a frequent itemset must be frequent (Apriori property). Equivalently, if an itemset is infrequent, all of its supersets are infrequent too.
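The property is what allows candidates to be discarded without counting them. As a rough sketch (the function name and the example itemsets are made up for illustration), a size-k candidate is kept only if every one of its (k-1)-item subsets is already known to be frequent:

```python
from itertools import combinations

def prune_candidates(frequent_prev, k):
    """Build size-k candidates from frequent (k-1)-itemsets and keep only
    those whose every (k-1)-subset is itself frequent."""
    frequent_prev = set(frequent_prev)
    items = sorted(set().union(*frequent_prev)) if frequent_prev else []
    candidates = set()
    for combo in combinations(items, k):
        subsets = combinations(combo, k - 1)
        if all(frozenset(sub) in frequent_prev for sub in subsets):
            candidates.add(frozenset(combo))
    return candidates

# Hypothetical frequent two-itemsets over the items a, b, c, d.
frequent_2 = {frozenset("ab"), frozenset("ac"), frozenset("bc"), frozenset("bd")}

# {a, b, c} survives; {a, b, d} is dropped because {a, d} is not frequent,
# and {b, c, d} is dropped because {c, d} is not frequent.
print(prune_candidates(frequent_2, 3))
```

Enumerating all combinations is fine for a toy example; practical implementations usually join pairs of frequent (k-1)-itemsets instead, which avoids generating most of the hopeless candidates in the first place.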

Applying the Apriori property, we get the following algorithm.

Algorithm

Compute the support value of every itemset containing one item (one-itemsets)
Using a predefined support threshold, identify the one-itemsets worth exploring
From the shortlisted one-itemsets that are above the support threshold, generate itemsets containing two items (two-itemsets)
Using the same predefined support threshold, identify the two-itemsets worth exploring
For each shortlisted two-itemset, generate the association rules between its two items
Compute the confidence value of each association rule
Using a predefined confidence threshold, shortlist the association rules
Compute the lift value of each shortlisted association rule
Only association rules with a lift value > 1 are considered meaningful associations
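The steps above can be sketched in a few lines of Python. This is an illustrative sketch for rules between single items only, not a full Apriori implementation; the function name, the toy data and the threshold values are assumptions, and the thresholds are applied with `>=`.

```python
from itertools import combinations

def pair_rules(transactions, min_support, min_confidence):
    """Return (X, Y, support, confidence, lift) for one-to-one rules X -> Y."""
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Support of every one-itemset, then shortlist by threshold.
    items = sorted(set().union(*transactions))
    frequent_1 = {i: support(frozenset([i])) for i in items}
    frequent_1 = {i: s for i, s in frequent_1.items() if s >= min_support}

    # Two-itemsets built only from shortlisted items, same threshold.
    frequent_2 = {}
    for a, b in combinations(sorted(frequent_1), 2):
        s = support(frozenset([a, b]))
        if s >= min_support:
            frequent_2[(a, b)] = s

    # Rules in both directions, filtered by confidence and then by lift > 1.
    rules = []
    for (a, b), s in frequent_2.items():
        for x, y in ((a, b), (b, a)):
            conf = s / frequent_1[x]
            if conf < min_confidence:
                continue
            lift = conf / frequent_1[y]
            if lift > 1:
                rules.append((x, y, s, conf, lift))
    return rules

# Hypothetical toy data: one set of items per transaction (basket).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

for x, y, s, conf, lift in pair_rules(transactions, 0.6, 0.7):
    print(f"{x} -> {y}: support={s:.2f}, confidence={conf:.2f}, lift={lift:.2f}")
```

On this toy data only the rules between beer and diapers survive all three filters. Rules such as milk -> bread pass the support and confidence thresholds but are removed by the lift check.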

https://criss-wang.github.io/post/blogs/unsupervised/clustering-5/

Author

Zhenlin Wang

Posted on

2020-02-11

Updated on

2021-09-20
