Skip to main content

Building flat, tree, and DAG structured clusterings.

Project description

Install

Linux wheels available on pypi:

pip install graphgrove

Building from source:

conda create -n gg python=3.7
conda activate gg
pip install numpy
make

To build your own wheel:

conda create -n gg python=3.7
conda activate gg
pip install numpy
make
pip install build
python -m build --wheel
# which can be used as:
# pip install --force dist/graphgrove-0.0.1-cp37-cp37m-linux_x86_64.whl 

Examples

Toy examples of clustering and nearest neighbor search are available.

At a high level, incremental clustering can be done as:

import graphgrove as gg
k = 5
num_rounds = 50
thresholds = np.geomspace(1.0, 0.001, num_rounds).astype(np.float32)
scc = gg.vec_scc.Cosine_SCC(k=k, num_rounds=num_rounds, thresholds=thresholds, index_name='cosine_sgtree', cores=cores, verbosity=0)
# data_batches - generator of numpy matrices mini-batch-size by dim
for batch in data_batches:
    scc.partial_fit(batch)

Incremental nearest neighbor search can be done as:

import graphgrove as gg
k=5
cores=4
tree = gg.graph_builder.Cosine_SGTree(k=k, cores=cores)
# data_batches - generator of numpy matrices mini-batch-size by dim
for batch in data_batches:
    tree.insert(batch) # or tree.insert_and_knn(batch) 

Algorithms Implemented

Clustering:

  • Sub-Cluster Component Algorithm (SCC) and its minibatch variant from the paper: Scalable Hierarchical Agglomerative Clustering. Nicholas, Monath, Kumar Avinava Dubey, Guru Guruganesh, Manzil Zaheer, Amr Ahmed, Andrew McCallum, Gokhan Mergen, Marc Najork Mert Terzihan Bryon Tjanaka Yuan Wang Yuchen Wu. KDD. 2021
  • DAG Structured clustering (LLama) from DAG-Structured Clustering by Nearest Neighbors. Nicholas Monath, Manzil Zaheer, Kumar Avinava Dubey, Amr Ahmed, Andrew McCallum. AISTATS 2021.

Nearest Neighbor Search:

  • CoverTree: Alina Beygelzimer, Sham Kakade, and John Langford. "Cover trees for nearest neighbor." ICML. 2006.
  • SGTree: SG-Tree is a new data structure for exact nearest neighbor search inspired from Cover Tree and its improvement, which has been used in the TerraPattern project. At a high level, SG-Tree tries to create a hierarchical tree where each node performs a "coarse" clustering. The centers of these "clusters" become the children and subsequent insertions are recursively performed on these children. When performing the NN query, we prune out solutions based on a subset of the dimensions that are being queried. This is particularly useful when trying to find the nearest neighbor in highly clustered subset of the data, e.g. when the data comes from a recursive mixture of Gaussians or more generally time marginalized coalscent process . The effect of these two optimizations is that our data structure is extremely simple, highly parallelizable and is comparable in performance to existing NN implementations on many data-sets. Manzil Zaheer, Guru Guruganesh, Golan Levin, Alexander Smola. TerraPattern: A Nearest Neighbor Search Service. 2019.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

graphgrove-0.0.8-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.4 MB view details)

Uploaded CPython 3.8manylinux: glibc 2.12+ x86-64

graphgrove-0.0.8-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.4 MB view details)

Uploaded CPython 3.7mmanylinux: glibc 2.12+ x86-64

graphgrove-0.0.8-cp36-cp36m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.4 MB view details)

Uploaded CPython 3.6mmanylinux: glibc 2.12+ x86-64

File details

Details for the file graphgrove-0.0.8-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.

File metadata

File hashes

Hashes for graphgrove-0.0.8-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 a7ccac4b242c9896141695f8b53f9c03548139a150de9e2006700bd4db8fd3ef
MD5 7c01295944d78259d4f998ccde01ccd8
BLAKE2b-256 795cb8dc2f13d19b5566c140b42b12997646eafd47ce636ca99172e5c13adcf3

See more details on using hashes here.

File details

Details for the file graphgrove-0.0.8-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.

File metadata

File hashes

Hashes for graphgrove-0.0.8-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 35edac1ef992762703a5c656e82b122bf2a8f2de5e258ce31bbff02fb04ecbfe
MD5 3b9579b433bcd0dfb5e6af01080d26d7
BLAKE2b-256 436dc4eca9cd7e755c3f0d96ec2cccd5084d05c40cb475994c364cd91963f593

See more details on using hashes here.

File details

Details for the file graphgrove-0.0.8-cp36-cp36m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.

File metadata

File hashes

Hashes for graphgrove-0.0.8-cp36-cp36m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 c0eac4f0922ee26c10ff18241e6de6ab3dcc877ec887811d47ec9c919ba5200c
MD5 3b6527c86bd05163fba31cb71a30d6b7
BLAKE2b-256 efd67b15cf136487c4be8a67e2fa4beff25540665a273116732bdcbbd1c5f3b4

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page