Building flat, tree, and DAG structured clusterings.
Project description
Install
Linux wheels available on pypi:
pip install graphgrove
Building from source:
conda create -n gg python=3.7
conda activate gg
pip install numpy
make
To build your own wheel:
conda create -n gg python=3.7
conda activate gg
pip install numpy
make
pip install build
python -m build --wheel
# which can be used as:
# pip install --force dist/graphgrove-0.0.1-cp37-cp37m-linux_x86_64.whl
Examples
Toy examples of clustering and nearest neighbor search are available.
At a high level, incremental clustering can be done as:
import graphgrove as gg
k = 5
num_rounds = 50
thresholds = np.geomspace(1.0, 0.001, num_rounds).astype(np.float32)
scc = gg.vec_scc.Cosine_SCC(k=k, num_rounds=num_rounds, thresholds=thresholds, index_name='cosine_sgtree', cores=cores, verbosity=0)
# data_batches - generator of numpy matrices mini-batch-size by dim
for batch in data_batches:
scc.partial_fit(batch)
Incremental nearest neighbor search can be done as:
import graphgrove as gg
k=5
cores=4
tree = gg.graph_builder.Cosine_SGTree(k=k, cores=cores)
# data_batches - generator of numpy matrices mini-batch-size by dim
for batch in data_batches:
tree.insert(batch) # or tree.insert_and_knn(batch)
Algorithms Implemented
Clustering:
- Sub-Cluster Component Algorithm (SCC) and its minibatch variant from the paper: Scalable Hierarchical Agglomerative Clustering. Nicholas, Monath, Kumar Avinava Dubey, Guru Guruganesh, Manzil Zaheer, Amr Ahmed, Andrew McCallum, Gokhan Mergen, Marc Najork Mert Terzihan Bryon Tjanaka Yuan Wang Yuchen Wu. KDD. 2021
- DAG Structured clustering (LLama) from DAG-Structured Clustering by Nearest Neighbors. Nicholas Monath, Manzil Zaheer, Kumar Avinava Dubey, Amr Ahmed, Andrew McCallum. AISTATS 2021.
Nearest Neighbor Search:
- CoverTree: Alina Beygelzimer, Sham Kakade, and John Langford. "Cover trees for nearest neighbor." ICML. 2006.
- SGTree: SG-Tree is a new data structure for exact nearest neighbor search inspired from Cover Tree and its improvement, which has been used in the TerraPattern project. At a high level, SG-Tree tries to create a hierarchical tree where each node performs a "coarse" clustering. The centers of these "clusters" become the children and subsequent insertions are recursively performed on these children. When performing the NN query, we prune out solutions based on a subset of the dimensions that are being queried. This is particularly useful when trying to find the nearest neighbor in highly clustered subset of the data, e.g. when the data comes from a recursive mixture of Gaussians or more generally time marginalized coalscent process . The effect of these two optimizations is that our data structure is extremely simple, highly parallelizable and is comparable in performance to existing NN implementations on many data-sets. Manzil Zaheer, Guru Guruganesh, Golan Levin, Alexander Smola. TerraPattern: A Nearest Neighbor Search Service. 2019.
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distributions
Built Distributions
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file graphgrove-0.0.8-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.
File metadata
- Download URL: graphgrove-0.0.8-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
- Upload date:
- Size: 6.4 MB
- Tags: CPython 3.8, manylinux: glibc 2.12+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.2 importlib_metadata/4.8.1 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
a7ccac4b242c9896141695f8b53f9c03548139a150de9e2006700bd4db8fd3ef
|
|
| MD5 |
7c01295944d78259d4f998ccde01ccd8
|
|
| BLAKE2b-256 |
795cb8dc2f13d19b5566c140b42b12997646eafd47ce636ca99172e5c13adcf3
|
File details
Details for the file graphgrove-0.0.8-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.
File metadata
- Download URL: graphgrove-0.0.8-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
- Upload date:
- Size: 6.4 MB
- Tags: CPython 3.7m, manylinux: glibc 2.12+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.2 importlib_metadata/4.8.1 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
35edac1ef992762703a5c656e82b122bf2a8f2de5e258ce31bbff02fb04ecbfe
|
|
| MD5 |
3b9579b433bcd0dfb5e6af01080d26d7
|
|
| BLAKE2b-256 |
436dc4eca9cd7e755c3f0d96ec2cccd5084d05c40cb475994c364cd91963f593
|
File details
Details for the file graphgrove-0.0.8-cp36-cp36m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.
File metadata
- Download URL: graphgrove-0.0.8-cp36-cp36m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
- Upload date:
- Size: 6.4 MB
- Tags: CPython 3.6m, manylinux: glibc 2.12+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.2 importlib_metadata/4.8.1 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
c0eac4f0922ee26c10ff18241e6de6ab3dcc877ec887811d47ec9c919ba5200c
|
|
| MD5 |
3b6527c86bd05163fba31cb71a30d6b7
|
|
| BLAKE2b-256 |
efd67b15cf136487c4be8a67e2fa4beff25540665a273116732bdcbbd1c5f3b4
|