Association Analysis Package
apriori
A pip-installable association analysis package.
Our dataset is a list of transactions: individual people who 'purchased' items together. We are trying to find items that are associated with each other.
Installation

```
pip install kiwi-apriori
```

It seems someone already has `apriori` on PyPI, so the package is installed as `kiwi-apriori` but imported with `import apriori`.
Quick example
This dataset looks at the purchases of 4 customers:
```python
transactions = [
    ['fish', 'white wine', 'cheese', 'bread'],
    ['beer', 'nachos', 'cheese', 'peanuts'],
    ['white wine', 'cheese'],
    ['white wine', 'cheese', 'bread'],
]
```
We are interested in which items are associated with one another. It seems that people who bought white wine also bought cheese. Here are some "association rules" we find:
| If you buy ... | ... we think you'd like | Support | Confidence | Lift |
|---|---|---|---|---|
| white wine | bread | 0.5000 | 0.6667 | 1.3333 |
| bread | white wine | 0.5000 | 1.0000 | 1.3333 |
| white wine | bread AND cheese | 0.5000 | 0.6667 | 1.3333 |
| bread | cheese AND white wine | 0.5000 | 1.0000 | 1.3333 |
| cheese AND white wine | bread | 0.5000 | 0.6667 | 1.3333 |
| bread AND cheese | white wine | 0.5000 | 1.0000 | 1.3333 |
Let's look at the first two rows to illustrate support, confidence, and lift.

- Support: Fraction of transactions containing these items, i.e. the probability that a randomly chosen transaction contains all the items.
  - The first two rows are about `{white wine}` and `{bread}`. Half of our transactions contain both these items (`transactions[0]` and `transactions[3]`).
  - Order doesn't matter: the support of `{white wine} --> {bread}` and the support of `{bread} --> {white wine}` are the same.
  - Can be applied to single items (e.g. the support of `{white wine}` is 0.75, as 3/4 transactions contain white wine).
- Confidence of A --> B: Fraction of transactions containing A that also contain B, i.e. the probability that a randomly chosen transaction containing A also contains B.
  - 2 of the 3 transactions containing `{white wine}` also contain `{bread}`, so the confidence of `white wine --> bread` is 0.667.
  - 2 of the 2 transactions containing `{bread}` also contain `{white wine}`, so the confidence of `bread --> white wine` is 1.000.
  - Can be calculated as `support(A AND B) / support(A)`.
- Lift of A --> B: The support of A AND B divided by the product of the support of A and the support of B.
  - Measures how much information knowing A gives you about B.
  - A lift of 1.3333 for `white wine --> bread` tells us that someone buying white wine is 1.3333 times as likely as the average person to also buy bread.
  - If A and B were independent, the lift would be 1.
  - A lift of less than one tells us that someone buying A is less likely than the average person to buy B.
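These definitions can be checked by hand. Below is a minimal sketch (plain Python, no package required; the `support` helper is my own construction, not part of this package) that recomputes the numbers for `white wine --> bread`:

```python
# Verify support, confidence, and lift for {white wine} --> {bread} by hand.
transactions = [
    ['fish', 'white wine', 'cheese', 'bread'],
    ['beer', 'nachos', 'cheese', 'peanuts'],
    ['white wine', 'cheese'],
    ['white wine', 'cheese', 'bread'],
]

def support(items):
    """Fraction of transactions containing every item in `items`."""
    hits = sum(1 for t in transactions if set(items) <= set(t))
    return hits / len(transactions)

support_ab = support(['white wine', 'bread'])           # 2/4 = 0.5
confidence = support_ab / support(['white wine'])       # 0.5 / 0.75 = 0.6667
lift = support_ab / (support(['white wine']) * support(['bread']))  # 0.5 / (0.75 * 0.5) = 1.3333

print(f'support={support_ab:.4f}, confidence={confidence:.4f}, lift={lift:.4f}')
```

These match the first row of the table above.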
Quickstart
Taken from `example.py`:

```python
import apriori

sample_transactions = [
    ['fish', 'white wine', 'cheese', 'bread'],
    ['beer', 'nachos', 'cheese', 'peanuts'],
    ['white wine', 'cheese'],
    ['white wine', 'cheese', 'bread'],
]

for rule in apriori.generate_rules(sample_transactions, min_support=0.5):
    msg = (f'{rule.format_rule():20s}\t\t'
           f'(support={rule.support:0.4f}, confidence={rule.confidence:0.4f}, lift={rule.lift:0.4f})')
    print(msg)
```
Output:

```
{white wine} ---> {cheese}      (support=0.7500, confidence=1.0000, lift=1.0000)
{cheese} ---> {white wine}      (support=0.7500, confidence=0.7500, lift=1.0000)
{white wine} ---> {bread}       (support=0.5000, confidence=0.6667, lift=1.3333)
{bread} ---> {white wine}       (support=0.5000, confidence=1.0000, lift=1.3333)
{cheese} ---> {bread}           (support=0.5000, confidence=0.5000, lift=1.0000)
{bread} ---> {cheese}           (support=0.5000, confidence=1.0000, lift=1.0000)
{white wine} ---> {bread,cheese}        (support=0.5000, confidence=0.6667, lift=1.3333)
{cheese} ---> {bread,white wine}        (support=0.5000, confidence=0.5000, lift=1.0000)
{bread} ---> {cheese,white wine}        (support=0.5000, confidence=1.0000, lift=1.3333)
{cheese,white wine} ---> {bread}        (support=0.5000, confidence=0.6667, lift=1.3333)
{bread,white wine} ---> {cheese}        (support=0.5000, confidence=1.0000, lift=1.0000)
{bread,cheese} ---> {white wine}        (support=0.5000, confidence=1.0000, lift=1.3333)
```
Alternative approaches
This is one approach to determining "this-goes-with-that". Here are comparisons with some alternative approaches.
Graph / Network models
We could model each item as a node in a graph, and put a weighted edge between items that appear in the same transaction. In our example, "white wine" and "cheese" would both be nodes, and be joined by a weight of 3. We could then do a community detection algorithm to find things that are associated with one another, or look for hubs (items that are associated with a lot of different purchases).
Pros
- Algorithm is a lot faster (only deal with $N$ items, and the $O(N^2)$ relationships between items, not all possible subsets of items).
Cons
- Edges are formed between pairs of nodes, so this approach tells you about pairs of items that are associated. It doesn't find cases like "everyone who bought all three of A, B, and C also bought D".
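The weighted co-occurrence graph described above can be built in a few lines. This is an illustrative sketch (the `edges` counter is my own construction, not part of this package); a library such as networkx could then run community detection on the resulting edge list:

```python
from itertools import combinations
from collections import Counter

transactions = [
    ['fish', 'white wine', 'cheese', 'bread'],
    ['beer', 'nachos', 'cheese', 'peanuts'],
    ['white wine', 'cheese'],
    ['white wine', 'cheese', 'bread'],
]

# Weighted edge list: each edge's weight is the number of transactions
# in which both items appear together.
edges = Counter()
for t in transactions:
    for pair in combinations(sorted(set(t)), 2):
        edges[pair] += 1

print(edges[('cheese', 'white wine')])  # 3, as in the example above
```

Note this only ever records pairs, which is exactly the limitation listed under Cons.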
Matrix Factorization
We are focusing on items that are associated with each other; we can score pairs by one-hot encoding the transactions and taking the dot product between item vectors. This score is the number of transactions in which both items appear.
Pros
- A lot faster
- Can be trained online / streaming
- Well supported by existing libraries
Cons
- Popular items have higher scores (can mitigate this using cosine distance)
- Limited to comparing pairs of items (although you can try and make a lower dimensional representation and see which items have a similar representation in this lower dimensional space)
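As a sketch of the one-hot/dot-product scoring described above (using NumPy; the variable names are illustrative):

```python
import numpy as np

transactions = [
    ['fish', 'white wine', 'cheese', 'bread'],
    ['beer', 'nachos', 'cheese', 'peanuts'],
    ['white wine', 'cheese'],
    ['white wine', 'cheese', 'bread'],
]
items = sorted({item for t in transactions for item in t})

# One-hot encode: rows are transactions, columns are items.
X = np.array([[item in t for item in items] for t in transactions], dtype=float)

# scores[i, j] = number of transactions containing both items[i] and items[j].
scores = X.T @ X

i, j = items.index('white wine'), items.index('cheese')
print(scores[i, j])  # white wine and cheese appear together in 3 transactions

# Mitigate the popular-item bias noted above by normalising to cosine similarity.
norms = np.linalg.norm(X, axis=0)
cosine = scores / np.outer(norms, norms)
```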
Max-Miner
Unlike the other two alternatives, Max-Miner is specifically designed for this problem. It deals with sets of items, not just pairs of items, and addresses apriori's main drawback: speed.
Pros
- A lot faster than apriori
- Deals with sets of items, rather than pairs of items
Cons
- Finds most, but not all, association rules
Other applications
Although the examples have all been framed in terms of items bought together in transactions, this algorithm can be used to find which sets of things "belong together", similar to clustering. Clustering works by looking at features of individual points and grouping items with similar features together. Association analysis doesn't give the items any individual features; instead it groups items by the relationships between them.
Other examples:
- Finding voting patterns (if you vote for A, B, and C, you are likely to vote for D). Transactions are representatives, items are bills with "yea" votes.
- Finding associated medical conditions that may have an underlying root cause. Transactions are patients, items are treatments or symptoms.
- Finding outfits. Transactions are things that were worn together, items are individual pieces of clothing. (This is an interesting problem, as two shirts have more in common than a shirt and trousers based on features, but you wouldn't wear two shirts and no trousers).
References
- Wikipedia has great pseudo-code for the apriori algorithm
- There are other pip-installable packages, apyori and efficient-apriori