Chelombus

Billion-scale molecular clustering and visualization on commodity hardware.

Chelombus enables interactive exploration of ultra-large chemical datasets (up to billions of molecules) using Product Quantization and nested TMAPs. Process the entire Enamine REAL database (9.6B molecules) on a single workstation.

Live Demo: https://chelombus.gdb.tools

Overview

Chelombus implements the "Nested TMAP" framework for visualizing billion-scale molecular datasets. The pipeline runs in five stages:

SMILES → MQN Fingerprints → PQ Encoding → PQk-means Clustering → Nested TMAPs

Key Features:

  • Scalability: Stream billions of molecules without loading everything into memory
  • Efficiency: Compress 42-dimensional MQN vectors to 6-byte PQ codes (28x compression)
  • Visualization: Navigate from global overview to individual molecules in two clicks
  • Accessibility: Runs on commodity hardware (tested: AMD Ryzen 7, 64GB RAM)
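
The 28x figure is consistent with each of the 42 MQN dimensions being stored as a 4-byte value. That dtype is an assumption here, since it is not stated above. A quick back-of-the-envelope check, which also estimates the storage footprint at the 9.6B-molecule scale quoted earlier:

```python
# Assumption: each MQN dimension is stored as a 4-byte value (e.g. int32 or float32).
MQN_DIMS = 42
BYTES_PER_DIM = 4      # assumed, not stated in this README
PQ_CODE_BYTES = 6      # 6-byte PQ code, as listed under Key Features

raw_bytes = MQN_DIMS * BYTES_PER_DIM         # 168 bytes per molecule
ratio = raw_bytes / PQ_CODE_BYTES            # 28.0

n_molecules = 9_600_000_000                  # Enamine REAL size quoted above
raw_tb = n_molecules * raw_bytes / 1e12      # raw fingerprints, in terabytes
pq_gb = n_molecules * PQ_CODE_BYTES / 1e9    # PQ codes, in gigabytes

print(f"compression ratio: {ratio:.0f}x")
print(f"raw fingerprints:  {raw_tb:.2f} TB")
print(f"PQ codes:          {pq_gb:.1f} GB")
```

Under that assumption, the PQ codes for the full database take tens of gigabytes rather than more than a terabyte.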

Installation

# Clone the repository
git clone https://github.com/afloresep/chelombus.git
cd chelombus

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install with core dependencies
pip install -e .

# Install with clustering support
pip install -e ".[clustering]"

# Install with visualization support
pip install -e ".[visualization]"

# Install everything
pip install -e ".[all]"

Platform Notes

Apple Silicon (M1/M2/M3): The pqkmeans library is not currently supported on Apple Silicon Macs, so clustering functionality requires an x86_64 system for now. I plan to rewrite pqkmeans with Apple Silicon and GPU support in a future release.
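
If a script should degrade gracefully on unsupported machines, one option is to check the CPU architecture before importing the clustering code. The helper below is a hypothetical sketch using only the standard library; `clustering_supported` is not part of Chelombus.

```python
import platform
from typing import Optional


def clustering_supported(machine: Optional[str] = None) -> bool:
    """Return True for x86_64-style architectures (hypothetical helper).

    `platform.machine()` typically reports "x86_64" on Linux and Intel Macs,
    "AMD64" on Windows, and "arm64" on Apple Silicon.
    """
    machine = (machine or platform.machine()).lower()
    return machine in {"x86_64", "amd64"}


if not clustering_supported():
    print("Clustering is unavailable on this architecture; skipping that step.")
```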

Quick Start

import numpy as np

from chelombus import DataStreamer, FingerprintCalculator, PQEncoder, PQKMeans

# 1. Stream SMILES in chunks
streamer = DataStreamer(path='molecules.smi', chunksize=100000)

# 2. Calculate MQN fingerprints chunk by chunk
fp_calc = FingerprintCalculator()
fp_chunks = []
for smiles_chunk in streamer.parse_input():
    fp_chunks.append(fp_calc.FingerprintFromSmiles(smiles_chunk, fp='mqn'))
# Chunks are kept in memory here for brevity (assumes array-like output);
# at billion scale, save each chunk to disk instead.
fingerprints = np.vstack(fp_chunks)

# 3. Train PQ encoder on a sample (here: up to the first 100,000 fingerprints)
training_fingerprints = fingerprints[:100000]
encoder = PQEncoder(k=256, m=6, iterations=20)
encoder.fit(training_fingerprints)

# 4. Transform all fingerprints to PQ codes
pq_codes = encoder.transform(fingerprints)

# 5. Cluster with PQk-means
clusterer = PQKMeans(encoder, k=100000)
labels = clusterer.fit_predict(pq_codes)
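
To make the `k=256, m=6` settings concrete, here is a toy, pure-Python sketch of the general Product Quantization encoding idea. It is not the Chelombus implementation: the codebooks are random rather than learned, and all names (`codebooks`, `encode`, `decode`) are illustrative. A 42-dimensional vector is split into 6 subvectors of 7 dimensions, and each subvector is replaced by the index of its nearest centroid, which fits in one byte because there are 256 centroids.

```python
import random

M = 6            # number of subvectors (matches m=6 above)
K = 256          # centroids per subvector (matches k=256 above)
DIM = 42         # MQN fingerprint length
SUB = DIM // M   # 7 dimensions per subvector

rng = random.Random(0)

# Toy codebooks: M tables of K random centroids each. A real encoder
# would learn these with k-means on a training sample.
codebooks = [
    [[rng.uniform(0, 50) for _ in range(SUB)] for _ in range(K)]
    for _ in range(M)
]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def encode(vector):
    """Map a 42-dim vector to 6 centroid indices, one byte each."""
    code = []
    for j in range(M):
        sub = vector[j * SUB:(j + 1) * SUB]
        nearest = min(range(K), key=lambda c: sq_dist(sub, codebooks[j][c]))
        code.append(nearest)
    return bytes(code)


def decode(code):
    """Approximate the original vector by concatenating the chosen centroids."""
    out = []
    for j, idx in enumerate(code):
        out.extend(codebooks[j][idx])
    return out


fingerprint = [rng.randint(0, 50) for _ in range(DIM)]
code = encode(fingerprint)
print(len(code), list(code))
```

Decoding is lossy: `decode(code)` returns the concatenated centroids, which only approximate the original fingerprint.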

Project Structure

chelombus/
├── chelombus/
│   ├── encoder/          # Product Quantization encoder
│   ├── clustering/       # PQk-means wrapper
│   ├── streamer/         # Memory-efficient data streaming
│   └── utils/            # Fingerprints, visualization, helpers
├── scripts/              # Pipeline scripts
├── examples/             # Tutorial notebooks
└── tests/                # Unit tests
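
The memory-efficient streaming idea behind the streamer module can be illustrated with a plain generator that yields fixed-size chunks of lines, so only one chunk is held in memory at a time. This is a minimal standard-library sketch, not the actual `DataStreamer` code; `iter_chunks` is a hypothetical name.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def iter_chunks(lines: Iterable[str], chunksize: int) -> Iterator[List[str]]:
    """Yield lists of up to `chunksize` non-empty, stripped lines."""
    stripped = (line.strip() for line in lines)
    non_empty = (line for line in stripped if line)
    while True:
        chunk = list(islice(non_empty, chunksize))
        if not chunk:
            return
        yield chunk


# In-memory stand-in for a .smi file; an open file handle works the same
# way because it is also an iterable of lines.
demo = ["CCO\n", "c1ccccc1\n", "\n", "CC(=O)O\n", "CCN\n", "C\n"]
chunks = list(iter_chunks(demo, chunksize=2))
print(chunks)  # [['CCO', 'c1ccccc1'], ['CC(=O)O', 'CCN'], ['C']]
```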

Documentation

  • Tutorial: See examples/tutorial.ipynb for a hands-on introduction
  • Large-scale example: See examples/enamine_1B_clustering.ipynb
  • API Reference: Generated from docstrings using Sphinx (see docs/)

Testing

# Run all tests
pytest tests/

# Run specific test file
pytest tests/test_encoder.py -v

Citation

If you use Chelombus in your research, please cite:

@article{chelombus2025,
  title={Nested TMAPs to visualize Billions of Molecules},
  author={Flores Sepulveda, Alejandro and Reymond, Jean-Louis},
  journal={},
  year={2025}
}

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/my-feature
  3. Write tests for new functionality
  4. Submit a pull request

License

MIT License. See LICENSE for details.

Acknowledgments

  • PQk-means by Matsui et al.
  • TMAP by Probst & Reymond
  • RDKit for cheminformatics functionality
  • Swiss National Science Foundation (grant no. 200020_178998)
