
Chelombus

Billion-scale molecular clustering and visualization on commodity hardware.

Chelombus enables interactive exploration of ultra-large chemical datasets (up to billions of molecules) using Product Quantization and nested TMAPs. Process the entire Enamine REAL database (9.6B molecules) on a single workstation.

Live Demo: https://chelombus.gdb.tools

Overview

Chelombus implements the "Nested TMAP" framework for visualizing billion-sized molecular datasets:

SMILES → MQN Fingerprints → PQ Encoding → PQk-means Clustering → Nested TMAPs

Key Features:

  • Scalability: Stream billions of molecules without loading everything into memory
  • Efficiency: Compress 42-dimensional MQN vectors to 6-byte PQ codes (28x compression; see the sketch after this list)
  • GPU acceleration: Optional CUDA support for PQ encoding and cluster assignment (~25x speedup)
  • Visualization: Navigate from global overview to individual molecules in two clicks
  • Accessibility: Runs on commodity hardware (tested: AMD Ryzen 7, 64GB RAM)
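
The compression ratio is simple arithmetic; a minimal sketch, assuming the raw MQN fingerprints are held as 4-byte integers before encoding:

n_molecules = 1_000_000_000
raw_gb = n_molecules * 42 * 4 / 1e9   # 42-dim int32 vectors -> 168 GB
pq_gb = n_molecules * 6 / 1e9         # 6-byte PQ codes      ->   6 GB
print(f"{raw_gb:.0f} GB -> {pq_gb:.0f} GB ({raw_gb / pq_gb:.0f}x)")
# 168 GB -> 6 GB (28x)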

Installation

From PyPI (recommended)

pip install chelombus

From Source

git clone https://github.com/afloresep/chelombus.git
cd chelombus
pip install -e .

Platform Notes

Apple Silicon (M1/M2/M3): The pqkmeans library is not currently supported on Apple Silicon Macs. A rewrite with Apple Silicon and GPU support is planned for a future release; for now, clustering requires an x86_64 system.
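
If you are unsure which architecture you are on, a quick standard-library check (the arm64/aarch64 values below are what CPython reports on Apple Silicon and ARM Linux):

import platform

# pqkmeans needs an x86_64 build; Apple Silicon reports 'arm64'
arch = platform.machine()
if arch in ("arm64", "aarch64"):
    print(f"{arch}: pqkmeans clustering is not supported on this machine")
else:
    print(f"{arch}: pqkmeans should install normally")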

GPU Acceleration

Every step in the pipeline supports optional GPU acceleration via the device parameter: encoder training, PQ encoding, cluster training, and label assignment. When a CUDA GPU is available, device='auto' (the default) uses the GPU transparently; otherwise it falls back to CPU.

Requirements: torch and triton (both installed with pip install torch).

encoder = PQEncoder(k=256, m=6, iterations=20)
encoder.fit(training_fps, device='auto')          # GPU KMeans fitting
pq_codes = encoder.transform(fingerprints)        # GPU batch assignment

clusterer = PQKMeans(encoder, k=100000)
clusterer.fit(pq_codes)                           # GPU Triton assign + CPU centroid update
labels = clusterer.predict(pq_codes)              # GPU Triton kernel

# Or force a specific device
labels_cpu = clusterer.predict(pq_codes, device='cpu')
labels_gpu = clusterer.predict(pq_codes, device='gpu')

GPU benchmarks on 1B Enamine REAL molecules (RTX 4070 Ti SUPER 16GB, K=100,000):

| Stage | Description | Time |
| --- | --- | --- |
| Encoder training | Train codebook on 50M MQN fingerprints | 1.8 min |
| PQ encoding | Encode 1B MQN fingerprints → 1B PQ codes (6-dim) | 3.7 min |
| Cluster training | Train PQKMeans on 1B PQ codes (5 iters, tol=0) | 2.3 hrs |
| Label assignment | Assign 1B PQ codes to 100K clusters | 26.2 min |
| Total | | 2.9 hrs |

Cluster training with the default tol=1e-3 converges earlier (changed centers dropped from 11.8% → 0.2% over 5 iterations), reducing training time further. Extrapolated to 10B molecules: ~1.2 days.
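
The tolerance criterion itself is simple; a generic sketch of the stopping rule described here (not the library's internal code):

import numpy as np

def converged(old_centers: np.ndarray, new_centers: np.ndarray,
              tol: float = 1e-3) -> bool:
    """Stop when the fraction of centers whose PQ code changed drops below tol."""
    changed_frac = (old_centers != new_centers).any(axis=1).mean()
    return changed_frac < tol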

To reproduce (requires 1B MQN fingerprints as .npy chunks):

python scripts/benchmark_1B_pipeline.py --chunks /path/to/chunks --n 1000000000

The GPU implementation uses a custom Triton kernel for cluster assignment that tiles over centers with an online argmin, never materializing the N x K distance matrix. The kernel supports any number of subvectors M via compile-time unrolling (tl.static_range). VRAM usage is ~(M+4) bytes/point, so even an 8 GB GPU can process hundreds of millions of points per batch.
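The library's kernel isn't reproduced here, but the core idea is easy to sketch. Below is a minimal, illustrative Triton kernel in the same spirit (not Chelombus's actual implementation; all names and the (M, 256, 256) lookup-table layout are assumptions): each program owns a block of points, loops over all K centers, sums the per-subvector lookup-table distances in a loop unrolled at compile time via tl.static_range, and keeps only a running (best distance, best index) pair.

import triton
import triton.language as tl

@triton.jit
def pq_assign_sketch(
    codes_ptr,    # (N, M) uint8: PQ codes of the points
    centers_ptr,  # (K, M) uint8: PQ codes of the cluster centers
    dtable_ptr,   # (M, 256, 256) float32: distances between subcodewords
    out_ptr,      # (N,) int32: index of the nearest center per point
    N, K,
    M: tl.constexpr,
    BLOCK_N: tl.constexpr,
):
    pid = tl.program_id(0)
    offs = pid * BLOCK_N + tl.arange(0, BLOCK_N)
    mask = offs < N

    # Running (best distance, best center) per point: O(BLOCK_N) state,
    # so the N x K distance matrix is never materialized.
    best_d = tl.full((BLOCK_N,), float("inf"), tl.float32)
    best_k = tl.zeros((BLOCK_N,), tl.int32)

    for k in range(K):                    # tile over centers, online argmin
        d = tl.zeros((BLOCK_N,), tl.float32)
        for m in tl.static_range(M):      # unrolled at compile time
            # A production kernel would cache the point codes across the
            # center loop; the reloads are kept inline here for brevity.
            code = tl.load(codes_ptr + offs * M + m, mask=mask, other=0).to(tl.int32)
            cent = tl.load(centers_ptr + k * M + m).to(tl.int32)
            d += tl.load(dtable_ptr + (m * 256 + code) * 256 + cent,
                         mask=mask, other=0.0)
        closer = d < best_d
        best_d = tl.where(closer, d, best_d)
        best_k = tl.where(closer, k, best_k)

    tl.store(out_ptr + offs, best_k, mask=mask)

# Launch (sketch): one program per block of points, torch tensors as pointers
# grid = (triton.cdiv(N, BLOCK_N),)
# pq_assign_sketch[grid](codes, centers, dtable, out, N, K, M=6, BLOCK_N=1024)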

Quick Start

from chelombus import DataStreamer, FingerprintCalculator, PQEncoder, PQKMeans

# 1. Stream SMILES in chunks
streamer = DataStreamer(path='molecules.smi', chunksize=100000)

# 2. Calculate MQN fingerprints
fp_calc = FingerprintCalculator()
for smiles_chunk in streamer.parse_input():
    fingerprints = fp_calc.FingerprintFromSmiles(smiles_chunk, fp='mqn')
    # Save fingerprints...

# 3. Train PQ encoder on sample
encoder = PQEncoder(k=256, m=6, iterations=20)
encoder.fit(training_fingerprints)

# 4. Transform all fingerprints to PQ codes
pq_codes = encoder.transform(fingerprints)

# 5. Cluster with PQk-means
clusterer = PQKMeans(encoder, k=100000)
labels = clusterer.fit_predict(pq_codes)
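
For datasets that don't fit in memory, steps 2 and 4 run chunk by chunk. A hedged sketch of that loop, reusing the encoder fitted above and assuming fingerprints were staged as .npy chunk files (the directory names are illustrative):

import numpy as np
from pathlib import Path

fp_dir = Path("fingerprint_chunks")    # illustrative paths
code_dir = Path("pq_code_chunks")
code_dir.mkdir(exist_ok=True)

# Encode one chunk at a time; only a single chunk is ever in memory
for fp_file in sorted(fp_dir.glob("*.npy")):
    fps = np.load(fp_file)             # (chunk_size, 42) MQN vectors
    np.save(code_dir / fp_file.name, encoder.transform(fps))

# PQ codes are 6 bytes/molecule, so even 1B codes (~6 GB) fit in RAM
pq_codes = np.concatenate([np.load(f) for f in sorted(code_dir.glob("*.npy"))])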

Project Structure

chelombus/
├── chelombus/
│   ├── encoder/          # Product Quantization encoder
│   ├── clustering/       # PQk-means wrapper
│   ├── streamer/         # Memory-efficient data streaming
│   └── utils/            # Fingerprints, visualization, helpers
├── scripts/              # Pipeline scripts
├── examples/             # Tutorial notebooks
└── tests/                # Unit tests

Choosing k (Number of Clusters)

The scripts/select_k.py script sweeps over k values on a subsample to help pick the right number of clusters. It supports checkpointing: if interrupted, rerun the same command and it resumes from where it left off.

python scripts/select_k.py \
    --pq-codes data/pq_codes.npy \
    --encoder models/encoder.joblib \
    --n-subsample 10000000 \
    --k-values 10000 25000 50000 100000 200000 \
    --iterations 10 \
    --output results/k_selection.csv \
    --plot results/k_selection.png

Results on 100M Enamine REAL molecules (RTX 4070 Ti SUPER 16GB, AMD Ryzen 7 64GB RAM):

| k | Avg Distance | Empty Clusters | Median Cluster Size | Fit Time (GPU) | Fit Time (CPU) |
| --- | --- | --- | --- | --- | --- |
| 10,000 | 3.67 | 7.1% | 9,061 | 1.7 min | 1.3 h |
| 25,000 | 2.74 | 13.5% | 3,680 | 3.3 min | 3.1 h |
| 50,000 | 2.17 | 19.5% | 1,879 | 6.0 min | 6.2 h |
| 100,000 | 1.69 | 26.7% | 960 | 9.1 min | 12.6 h |
| 200,000 | 1.30 | 34.7% | 492 | 17.6 min | 26.4 h |

Guidelines:

  • k = 50,000 is a good default: under 20% empty clusters, a median cluster size of ~1,900, and the average-distance improvement starts to plateau beyond this point.
  • k = 100,000 if you need tighter clusters and can tolerate ~27% empty clusters.
  • At k = 200,000 and beyond, over a third of clusters are empty; larger k yields diminishing returns.

Documentation

  • Full docs: https://chelombus.gdb.tools
  • Tutorial: See examples/tutorial.ipynb for a hands-on introduction
  • Large-scale example: See examples/enamine_1B_clustering.ipynb
  • API Reference: See docs/api.md or the hosted docs

Testing

# Run all tests
pytest tests/

# Run specific test file
pytest tests/test_encoder.py -v

Citation

If you use Chelombus in your research, please cite:

@article{chelombus2025,
  title={Nested TMAPs to visualize Billions of Molecules},
  author={Flores Sepulveda, Alejandro and Reymond, Jean-Louis},
  journal={},
  year={2025}
}

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/my-feature
  3. Write tests for new functionality
  4. Submit a pull request

License

MIT License. See LICENSE for details.

Acknowledgments

  • PQk-means by Matsui et al.
  • TMAP by Probst & Reymond
  • RDKit for cheminformatics functionality
  • Swiss National Science Foundation (grant no. 200020_178998)
