mtm_stats

Highly efficient set statistics about many-to-many relationships

Project description

mtm_stats
Highly efficient set statistics about many-to-many relationships

This module computes statistics (counts) in the common case where two sets
(A and B) have a many-to-many relationship. Specifically, it computes:

* *connection count* (or just *count*): the number of connections a member of A has to B
* *intersection count*: the number of connections to B that two members of A share in common

This module aims to compute these for every possible combination of
members in A as fast as possible and without approximation.

The input for this module is a list of A/B pairs, i.e.:
[(a1, b1), (a1, b2), (a2, b2), ...]

If N_A is the length of A, the connection count will have size N_A and
the intersection count will have size N_A * (N_A - 1) / 2.

Any other set property can be easily derived from these.

* union_count(A1, A2) = count(A1) + count(A2) - intersection_count(A1, A2)
* single_sided_difference_count(A1, A2) = count(A1) - intersection(A1, A2)
* symmetric_difference_count(A1, A2) = union_count(A1, A2) - intersection_count(A1, A2)

The union is actually computed in the main "mtm_stats" function in "get_dicts_from_array_outputs"

This approach will be the most effective when the connections between A and B are sparse.

In addition, there is one optional approximation that can be used to enhance performance:
a cutoff filter for small intersections (default is 0).
If cutoff > 0, any result with intersection count <= cutoff will be approximated as 0.
The union count is then assumed to be count(A1) + count(A2).

Implementation details:
This module uses a combination of techniques to achieve this goal:
* Core implemented in C and Cython
* Bit packing (members of B are each assigned one bit)
* A sparse block array compression scheme that ignores large blocks of zeros in the bit-packed sets
* Efficiently applied bit operations (&, |, popcount)
* Only storing nonzero intersections using a 2-index sparse array: [(A1, A2, intersection_count), ...]
* The optional cutoff described above

These techniques not only give huge time savings over a brute-force
implementation, but also cut down on memory usage
(which could easily crash a machine with a sets as small as 100K elements).

💥

This technique will work well until the size of N_A^2 becomes unmanagely large.
In a case like that, you should try a probabalistic algorithm like HyperLogLog, Count Min Sketch or MinHash:
* http://bravenewgeek.com/tag/count-min-sketch/
* https://redislabs.com/blog/count-min-sketch-the-art-and-science-of-estimating-stuff#.V-6Mctz3aaM
* https://tech.shareaholic.com/2012/12/03/the-count-min-sketch-how-to-count-over-large-keyspaces-when-about-right-is-good-enough/

Some sample packages to do these kinds of operations are algebird and datasketch:
* https://github.com/twitter/algebird
* https://github.com/ekzhu/datasketch

Project details

Release history Release notifications | RSS feed

0.4.4.3

Oct 19, 2018

0.4.4.2

Oct 16, 2018

0.4.4.1

Apr 28, 2017

0.4.4.0

Apr 27, 2017

0.4.3.0

Feb 23, 2017

0.4.2.1

Feb 21, 2017

This version

0.4.1.0

Feb 13, 2017

0.4.0.0

Feb 8, 2017

0.3.4.1

Feb 7, 2017

0.3.4.0

Feb 7, 2017

0.3.3.0

Jan 15, 2017

0.3.2.0

Jan 15, 2017

0.3.1.3

Dec 13, 2016

0.3.1.2

Dec 9, 2016

0.3.1.1

Nov 1, 2016

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

mtm_stats-0.4.1.0.tar.gz (71.3 kB view details)

Uploaded Feb 13, 2017 Source

File details

Details for the file mtm_stats-0.4.1.0.tar.gz.

File metadata

Download URL: mtm_stats-0.4.1.0.tar.gz
Upload date: Feb 13, 2017
Size: 71.3 kB
Tags: Source
Uploaded using Trusted Publishing? No

File hashes

Hashes for mtm_stats-0.4.1.0.tar.gz
Algorithm	Hash digest
SHA256	`98084cdf1f1ae2b980666020d2e1ebfc4bb5f11fe6c58cc12f5148ec92d878e4`
MD5	`b29a78f493f2b6d53705f86e73e970c7`
BLAKE2b-256	`016c21d8699a77d90604ea710877beda548a557e7f046454ae2ce5b77fc14d32`

See more details on using hashes here.

mtm_stats 0.4.1.0

Navigation

Verified details

Maintainers

Unverified details

Meta