Skip to main content

Building flat, tree, and DAG structured clusterings.

Project description

Install

Linux wheels available (python >=3.8) on pypi:

pip install graphgrove

Building from source:

conda create -n gg python=3.8
conda activate gg
pip install numpy
make

To build your own wheel:

conda create -n gg python=3.8
conda activate gg
pip install numpy
make
pip install build
python -m build --wheel
# which can be used as:
# pip install --force dist/graphgrove-0.0.1-cp37-cp37m-linux_x86_64.whl 

Examples

Toy examples of clustering, DAG-structured clustering, and nearest neighbor search are available.

At a high level, incremental clustering can be done as:

import graphgrove as gg
k = 5
num_rounds = 50
thresholds = np.geomspace(1.0, 0.001, num_rounds).astype(np.float32)
scc = gg.vec_scc.Cosine_SCC(k=k, num_rounds=num_rounds, thresholds=thresholds, index_name='cosine_sgtree', cores=cores, verbosity=0)
# data_batches - generator of numpy matrices mini-batch-size by dim
for batch in data_batches:
    scc.partial_fit(batch)

Incremental nearest neighbor search can be done as:

import graphgrove as gg
k=5
cores=4
tree = gg.graph_builder.Cosine_SGTree(k=k, cores=cores)
# data_batches - generator of numpy matrices mini-batch-size by dim
for batch in data_batches:
    tree.insert(batch) # or tree.insert_and_knn(batch) 

Algorithms Implemented

Clustering:

  • Sub-Cluster Component Algorithm (SCC) and its minibatch variant from the paper: Scalable Hierarchical Agglomerative Clustering. Nicholas, Monath, Kumar Avinava Dubey, Guru Guruganesh, Manzil Zaheer, Amr Ahmed, Andrew McCallum, Gokhan Mergen, Marc Najork Mert Terzihan Bryon Tjanaka Yuan Wang Yuchen Wu. KDD. 2021
  • DAG Structured clustering (LLama) from DAG-Structured Clustering by Nearest Neighbors. Nicholas Monath, Manzil Zaheer, Kumar Avinava Dubey, Amr Ahmed, Andrew McCallum. AISTATS 2021.

Nearest Neighbor Search:

  • CoverTree: Alina Beygelzimer, Sham Kakade, and John Langford. "Cover trees for nearest neighbor." ICML. 2006.
  • SGTree: SG-Tree is a new data structure for exact nearest neighbor search inspired from Cover Tree and its improvement, which has been used in the TerraPattern project. At a high level, SG-Tree tries to create a hierarchical tree where each node performs a "coarse" clustering. The centers of these "clusters" become the children and subsequent insertions are recursively performed on these children. When performing the NN query, we prune out solutions based on a subset of the dimensions that are being queried. This is particularly useful when trying to find the nearest neighbor in highly clustered subset of the data, e.g. when the data comes from a recursive mixture of Gaussians or more generally time marginalized coalscent process . The effect of these two optimizations is that our data structure is extremely simple, highly parallelizable and is comparable in performance to existing NN implementations on many data-sets. Manzil Zaheer, Guru Guruganesh, Golan Levin, Alexander Smola. TerraPattern: A Nearest Neighbor Search Service. 2019.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

graphgrove-0.0.11-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.4 MB view details)

Uploaded CPython 3.8manylinux: glibc 2.12+ x86-64

graphgrove-0.0.11-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.4 MB view details)

Uploaded CPython 3.7mmanylinux: glibc 2.12+ x86-64

graphgrove-0.0.11-cp36-cp36m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.4 MB view details)

Uploaded CPython 3.6mmanylinux: glibc 2.12+ x86-64

File details

Details for the file graphgrove-0.0.11-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.

File metadata

File hashes

Hashes for graphgrove-0.0.11-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 4e1eb4e781d0409073f48449679f2d027050956976e6ef8c4dd76f0979a1bab3
MD5 6648815e45fed5e1b36381b8e6110bcd
BLAKE2b-256 cf0c2f626f1771f07b779e21b37ffb0fe2766baf158f4f39d75593a2d95b4764

See more details on using hashes here.

File details

Details for the file graphgrove-0.0.11-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.

File metadata

File hashes

Hashes for graphgrove-0.0.11-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 d4b4818c7f3ed0fa04d3382843a55e7454f9efc3d7c5c34ac07a5821c363827f
MD5 989b11f67ffe84c8713f6da271a2f114
BLAKE2b-256 2bf235278ac30bb7c377b024790a22edbda5cdf35c04d095fd0e71cbaf3c63a4

See more details on using hashes here.

File details

Details for the file graphgrove-0.0.11-cp36-cp36m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.

File metadata

File hashes

Hashes for graphgrove-0.0.11-cp36-cp36m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 7d919b71493375868ce96b00cddf126350f53c1c388884bcaf2221123e16af18
MD5 e76d17fc400bc300108a041ba9665586
BLAKE2b-256 6024c3e00d3f61155e92a7615a9945b47a2995ee1d711b0a26fecf78296d9f6c

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page