Highly efficient set statistics about many-to-many relationships
Project description
mtm_stats
Highly efficient set statistics about many-to-many relationships
This module computes statistics (counts) in the common case where two sets
(A and B) have a many-to-many relationship. Specifically, it computes:
* *connection count* (or just *count*): the number of connections a member of A has to B
* *intersection count*: the number of connections to B that two members of A share in common
This module aims to compute these for every possible combination of
members in A as fast as possible and without approximation.
The input for this module is a list of A/B pairs, i.e.:
[(a1, b1), (a1, b2), (a2, b2), ...]
If N_A is the length of A, the connection count will have size N_A and
the intersection count will have size N_A * (N_A - 1) / 2.
Any other set property can be easily derived from these.
* union_count(A1, A2) = count(A1) + count(A2) - intersection_count(A1, A2)
* single_sided_difference_count(A1, A2) = count(A1) - intersection(A1, A2)
* symmetric_difference_count(A1, A2) = union_count(A1, A2) - intersection_count(A1, A2)
The union is actually computed in the main "mtm_stats" function in "get_dicts_from_array_outputs"
This approach will be the most effective when the connections between A and B are sparse.
In addition, there is one optional approximation that can be used to enhance performance:
a cutoff filter for small intersections (default is 0).
If cutoff > 0, any result with intersection count <= cutoff will be approximated as 0.
The union count is then assumed to be count(A1) + count(A2).
### Implementation details:
This module uses a combination of techniques to achieve this goal:
* Core implemented in C and Cython
* Bit packing (members of B are each assigned one bit)
* A sparse block array compression scheme that ignores large blocks of zeros in the bit-packed sets
* Efficiently applied bit operations (&, |, popcount)
* Only storing nonzero intersections using a 2-index sparse array: [(A1, A2, intersection_count), ...]
* The optional cutoff described above
These techniques not only give huge time savings over a brute-force
implementation, but also cut down on memory usage
(which could easily crash a machine with a sets as small as 100K elements).
### Alternative techniques
This technique will work well until the size of N_A^2 becomes unmanagely large.
In a case like that, you should try a probabalistic algorithm like HyperLogLog, Count Min Sketch or MinHash:
* http://bravenewgeek.com/tag/count-min-sketch/
* https://redislabs.com/blog/count-min-sketch-the-art-and-science-of-estimating-stuff#.V-6Mctz3aaM
* https://tech.shareaholic.com/2012/12/03/the-count-min-sketch-how-to-count-over-large-keyspaces-when-about-right-is-good-enough/
Some sample packages to do these kinds of operations are algebird and datasketch:
* https://github.com/twitter/algebird
* https://github.com/ekzhu/datasketch
### Installation instructions:
Under Debian systems, all you need to do is the following (or similar):
* sudo apt-get install -y build-essential cmake g++ python python-dev python-setuptools python-numpy
* sudo easy_install pip
* sudo pip install numpy Cython
* sudo pip install mtm_stats
### Special instructions for installation on Mac OS X
This packages uses OpenMP to automatically utilize all available processing cores.
For this reason, you will need to get gcc (and not clang which lacks OpenMP support).
##### Install Homebrew from https://brew.sh/
/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
##### Install gcc and cmake
brew install gcc cmake
##### Install numpy and Cython
pip install numpy Cython
##### Set compiler commands and include path environment variables (versions may need to be updated)
export CC=/usr/local/Cellar/gcc/6.3.0_1/bin/gcc-6
export CXX=/usr/local/Cellar/gcc/6.3.0_1/bin/g++-6
export C_INCLUDE_PATH=/Users/developers/.virtualenvs/learning_intelligence_data/lib/python2.7/site-packages/numpy/core/include
##### Install the library
pip install mtm_stats
Python 3 is supported, provided the same dependencies are met.
Highly efficient set statistics about many-to-many relationships
This module computes statistics (counts) in the common case where two sets
(A and B) have a many-to-many relationship. Specifically, it computes:
* *connection count* (or just *count*): the number of connections a member of A has to B
* *intersection count*: the number of connections to B that two members of A share in common
This module aims to compute these for every possible combination of
members in A as fast as possible and without approximation.
The input for this module is a list of A/B pairs, i.e.:
[(a1, b1), (a1, b2), (a2, b2), ...]
If N_A is the length of A, the connection count will have size N_A and
the intersection count will have size N_A * (N_A - 1) / 2.
Any other set property can be easily derived from these.
* union_count(A1, A2) = count(A1) + count(A2) - intersection_count(A1, A2)
* single_sided_difference_count(A1, A2) = count(A1) - intersection(A1, A2)
* symmetric_difference_count(A1, A2) = union_count(A1, A2) - intersection_count(A1, A2)
The union is actually computed in the main "mtm_stats" function in "get_dicts_from_array_outputs"
This approach will be the most effective when the connections between A and B are sparse.
In addition, there is one optional approximation that can be used to enhance performance:
a cutoff filter for small intersections (default is 0).
If cutoff > 0, any result with intersection count <= cutoff will be approximated as 0.
The union count is then assumed to be count(A1) + count(A2).
### Implementation details:
This module uses a combination of techniques to achieve this goal:
* Core implemented in C and Cython
* Bit packing (members of B are each assigned one bit)
* A sparse block array compression scheme that ignores large blocks of zeros in the bit-packed sets
* Efficiently applied bit operations (&, |, popcount)
* Only storing nonzero intersections using a 2-index sparse array: [(A1, A2, intersection_count), ...]
* The optional cutoff described above
These techniques not only give huge time savings over a brute-force
implementation, but also cut down on memory usage
(which could easily crash a machine with a sets as small as 100K elements).
### Alternative techniques
This technique will work well until the size of N_A^2 becomes unmanagely large.
In a case like that, you should try a probabalistic algorithm like HyperLogLog, Count Min Sketch or MinHash:
* http://bravenewgeek.com/tag/count-min-sketch/
* https://redislabs.com/blog/count-min-sketch-the-art-and-science-of-estimating-stuff#.V-6Mctz3aaM
* https://tech.shareaholic.com/2012/12/03/the-count-min-sketch-how-to-count-over-large-keyspaces-when-about-right-is-good-enough/
Some sample packages to do these kinds of operations are algebird and datasketch:
* https://github.com/twitter/algebird
* https://github.com/ekzhu/datasketch
### Installation instructions:
Under Debian systems, all you need to do is the following (or similar):
* sudo apt-get install -y build-essential cmake g++ python python-dev python-setuptools python-numpy
* sudo easy_install pip
* sudo pip install numpy Cython
* sudo pip install mtm_stats
### Special instructions for installation on Mac OS X
This packages uses OpenMP to automatically utilize all available processing cores.
For this reason, you will need to get gcc (and not clang which lacks OpenMP support).
##### Install Homebrew from https://brew.sh/
/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
##### Install gcc and cmake
brew install gcc cmake
##### Install numpy and Cython
pip install numpy Cython
##### Set compiler commands and include path environment variables (versions may need to be updated)
export CC=/usr/local/Cellar/gcc/6.3.0_1/bin/gcc-6
export CXX=/usr/local/Cellar/gcc/6.3.0_1/bin/g++-6
export C_INCLUDE_PATH=/Users/developers/.virtualenvs/learning_intelligence_data/lib/python2.7/site-packages/numpy/core/include
##### Install the library
pip install mtm_stats
Python 3 is supported, provided the same dependencies are met.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
mtm_stats-0.4.4.3.tar.gz
(80.0 kB
view details)
File details
Details for the file mtm_stats-0.4.4.3.tar.gz
.
File metadata
- Download URL: mtm_stats-0.4.4.3.tar.gz
- Upload date:
- Size: 80.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.9.1 pkginfo/1.4.1 requests/2.11.1 setuptools/40.2.0 requests-toolbelt/0.8.0 tqdm/4.15.0 CPython/2.7.12
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 05ff18c6f1307d22ad718419215b1f8dd011bb23e6e08e646a0a730934394b82 |
|
MD5 | 8f8e2035c84b72967095dbba4a7894f5 |
|
BLAKE2b-256 | e7cdf51b7c3218960f08a5d8dd38b01f7d4c80a2efbcfb59df9ef00231815dc2 |