fastlvm -- fast search, clustering, and mixture modelling
Fast Sampling for Latent Variable Models
We present implementations of the following latent variable models, suitable for large-scale deployment:
- CoverTree - Fast nearest neighbour search
- KMeans - Simple, fast, and distributed clustering with a choice of initializations
- GMM - Fast and distributed inference for Gaussian Mixture Models with diagonal covariance matrices
- LDA - Fast and distributed inference for Latent Dirichlet Allocation
- GLDA - Fast and distributed inference for Gaussian LDA with diagonal covariance matrices
- HDP - Fast inference for Hierarchical Dirichlet Process
Under active development
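The Python wrappers are expected to expose classes matching the model names above. As a minimal sketch (the class names are an assumption here; check the `fastlvm` folder and `test.py` for the actual exports), importing them looks like:

```python
# Hypothetical import of the wrapper classes named above;
# verify the actual class names in the fastlvm package.
from fastlvm import CoverTree, KMeans, GMM, LDA, GLDA, HDP
```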
Organisation
- All C++ code is under the `src` folder, within each model's respective subfolder
- Dependencies are provided under the `lib` folder
- Python wrapper classes reside in the `fastlvm` folder
- Example scripts for running the different models are provided under `scripts`
- `data` is a placeholder folder where the data should be put
- `build` and `dist` folders will be created to hold the executables
Requirements
- gcc >= 5.0 or Intel® C++ Compiler 2017 for using C++14 features
- Python 3.6+
- Mac OS 10.12 or higher (for the Mac version only)
How to use
There are two ways to use the package: through the Python wrapper or directly in C++.
Python
Through PyPI
```pip install fastlvm```
On Mac OS:
```CFLAGS=-mmacosx-version-min=10.12 CXXFLAGS=-mmacosx-version-min=10.12 pip install fastlvm```
Manually
If installing from the source on GitHub, just use

```python setup.py install```

and then in Python you can `import fastlvm`. Example and test code is in `test.py`.
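As a quick sanity check after installation, a minimal sketch like the following (just a plain import, nothing assumed beyond the package name) confirms that the compiled extension can be loaded:

```python
import fastlvm

# Show where the package was loaded from, to confirm the installed
# build is being used rather than a stale source checkout.
print(fastlvm.__file__)
```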
API
The Python API details are provided in `API.pdf`, but all of the models follow the structure below:
```
class LVM:
    init(self, # hyperparameters)
        return model
    fit(self, X, ...):
        return validation score
    predict(self, X):
        return prediction on each test example
    evaluate(self, X):
        return test score
```
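For illustration, here is a minimal usage sketch following this structure. The `KMeans` class name is taken from the model list above, but the constructor arguments shown are assumptions; consult `API.pdf` and `test.py` for the actual signatures and hyperparameters.

```python
import numpy as np
import fastlvm

# Synthetic data: 1000 training points and 10 test points in 100 dimensions.
X_train = np.random.rand(1000, 100)
X_test = np.random.rand(10, 100)

# Hypothetical construction; the real hyperparameter names may differ.
model = fastlvm.KMeans(k=10)

score = model.fit(X_train)            # returns a validation score
labels = model.predict(X_test)        # returns a prediction for each test example
test_score = model.evaluate(X_test)   # returns a test score
```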
If you do not have root privileges, install with

```python setup.py install --user```

and make sure the installation folder is on your path.
C++
We show how to compile the package and run, for example, nearest neighbour search using cover trees on a single machine with a synthetic dataset:
- First of all, compile by hitting `make`
- Generate a synthetic dataset: `python data/generateData.py`
- Run the cover tree: `dist/cover_tree data/train_100d_1000k_1000.dat data/test_100d_1000k_10.dat`
The makefile has some useful features:
- If you have the Intel® C++ Compiler, you can instead run `make intel`
- If you want to use the Intel® C++ Compiler's cross-file optimization (ipo), run `make inteltogether`
- You can selectively compile individual modules by specifying `make <module-name>`
- You can clean individual modules with `make clean-<module-name>`
Performance
Unit testing
```
cd tests
python -m unittest discover  # requires unittest 3.2 and newer
```
Attributions
We use a distributed and parallel extension and implementation of the Cover Tree data structure for nearest neighbour search. The data structure was originally presented in, and later improved in:
- Alina Beygelzimer, Sham Kakade, and John Langford. "Cover trees for nearest neighbor." Proceedings of the 23rd international conference on Machine learning. ACM, 2006.
- Mike Izbicki and Christian Shelton. "Faster cover trees." Proceedings of the 32nd International Conference on Machine Learning (ICML-15). 2015.
We implement a modified inference for Gaussian LDA. The original model was presented in:
- Rajarshi Das, Manzil Zaheer, Chris Dyer. "Gaussian LDA for Topic Models with Word Embeddings." Proceedings of ACL (pp. 795-804) 2015.
We implement a modified inference for Hierarchical Dirichlet Process. The original model and inference methods were presented in:
- Y. Teh, M. Jordan, M. Beal, and D. Blei. "Hierarchical Dirichlet Processes." Journal of the American Statistical Association, 101(476):1566-1581, 2006.
- C. Chen, L. Du, and W.L. Buntine. "Sampling Table Configurations for the Hierarchical Poisson-Dirichlet Process." European Conference on Machine Learning, pages 296-311. Springer, 2011.
Troubleshooting
If the build fails and throws an error like "instruction not found", the system most probably does not support the AVX2 instruction set. To solve this, change `march=core-avx2` to `march=corei7` in `setup.py` and `src/cover_tree/makefile`.