Billion-scale molecular clustering and visualization using Product Quantization and nested TMAPs.
Chelombus
Billion-scale molecular clustering and visualization on commodity hardware.
Chelombus enables interactive exploration of ultra-large chemical datasets (up to billions of molecules) using Product Quantization and nested TMAPs. Process the entire Enamine REAL database (9.6B molecules) on a single workstation.
Live Demo: https://chelombus.gdb.tools
Overview
Chelombus implements the "Nested TMAP" framework for visualizing billion-scale molecular datasets:
SMILES → MQN Fingerprints → PQ Encoding → PQk-means Clustering → Nested TMAPs
Key Features:
- Scalability: Stream billions of molecules without loading everything into memory
- Efficiency: Compress 42-dimensional MQN vectors to 6-byte PQ codes (28x compression)
- Visualization: Navigate from global overview to individual molecules in two clicks
- Accessibility: Runs on commodity hardware (tested: AMD Ryzen 7, 64GB RAM)
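The 28x compression figure can be checked with a little arithmetic. The sketch below assumes each of the 42 MQN values is stored in 4 bytes (for example as float32 or int32). That storage width is an assumption made for illustration, not something this README states. The 9.6B molecule count is the Enamine REAL figure quoted above.

```python
# Back-of-the-envelope check of the compression ratio and memory footprint.
# Assumption: each MQN value occupies 4 bytes before compression.
MQN_DIMS = 42
BYTES_PER_VALUE = 4      # assumed storage width (e.g. float32)
PQ_CODE_BYTES = 6        # size of one PQ code, as stated in the feature list

raw_bytes = MQN_DIMS * BYTES_PER_VALUE          # bytes per uncompressed vector
compression_ratio = raw_bytes / PQ_CODE_BYTES   # raw size relative to PQ code

n_molecules = 9_600_000_000                     # Enamine REAL size quoted above
raw_gb = n_molecules * raw_bytes / 1e9          # decimal gigabytes, uncompressed
pq_gb = n_molecules * PQ_CODE_BYTES / 1e9       # decimal gigabytes, PQ codes

print(f"raw vector: {raw_bytes} bytes, ratio: {compression_ratio:.0f}x")
print(f"9.6B molecules: {raw_gb:.1f} GB raw vs {pq_gb:.1f} GB as PQ codes")
```

Under that assumption the PQ codes for the full dataset take roughly 58 GB, compared with about 1.6 TB for the raw vectors.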
Installation
# Clone the repository
git clone https://github.com/afloresep/chelombus.git
cd chelombus
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install with core dependencies
pip install -e .
# Install with clustering support
pip install -e ".[clustering]"
# Install with visualization support
pip install -e ".[visualization]"
# Install everything
pip install -e ".[all]"
Platform Notes
Apple Silicon (M1/M2/M3): The pqkmeans library is not currently supported on Apple Silicon Macs, so clustering functionality requires an x86_64 system for now. A rewrite of pqkmeans with Apple Silicon and GPU support is planned for a future release.
Quick Start
from chelombus import DataStreamer, FingerprintCalculator, PQEncoder, PQKMeans
# 1. Stream SMILES in chunks
streamer = DataStreamer(path='molecules.smi', chunksize=100000)
# 2. Calculate MQN fingerprints
fp_calc = FingerprintCalculator()
for smiles_chunk in streamer.parse_input():
    fingerprints = fp_calc.FingerprintFromSmiles(smiles_chunk, fp='mqn')
    # Save fingerprints...
# 3. Train PQ encoder on sample
# (training_fingerprints is a sample of the fingerprints computed in step 2)
encoder = PQEncoder(k=256, m=6, iterations=20)
encoder.fit(training_fingerprints)
# 4. Transform all fingerprints to PQ codes
pq_codes = encoder.transform(fingerprints)
# 5. Cluster with PQk-means
clusterer = PQKMeans(encoder, k=100000)
labels = clusterer.fit_predict(pq_codes)
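To make steps 3 and 4 more concrete, here is a minimal pure-Python sketch of Product Quantization in general. It is not Chelombus's PQEncoder and does not use its API: the helpers `kmeans`, `fit_codebooks` and `encode` are hypothetical names written for illustration, and the input is random data standing in for MQN fingerprints. Each 42-dimensional vector is split into m=6 subvectors of 7 values, a small k-means codebook is trained per subvector, and the vector is stored as the 6 indices of its nearest centroids. The sketch uses k=16 to stay fast. With k=256, as in the Quick Start, each index still fits in one byte, which is where a 6-byte code comes from.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations, rng):
    """Plain Lloyd's k-means; returns a list of k centroids."""
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        buckets = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            buckets[nearest].append(p)
        for c, members in enumerate(buckets):
            if members:  # keep the previous centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def fit_codebooks(vectors, m, k, iterations=5, seed=0):
    """Train one codebook of k centroids for each of the m subspaces."""
    rng = random.Random(seed)
    sub_dim = len(vectors[0]) // m
    codebooks = []
    for j in range(m):
        subvectors = [v[j * sub_dim:(j + 1) * sub_dim] for v in vectors]
        codebooks.append(kmeans(subvectors, k, iterations, rng))
    return codebooks

def encode(vector, codebooks):
    """Replace each subvector by the index of its nearest centroid."""
    sub_dim = len(vector) // len(codebooks)
    code = []
    for j, codebook in enumerate(codebooks):
        sub = vector[j * sub_dim:(j + 1) * sub_dim]
        code.append(min(range(len(codebook)),
                        key=lambda c: squared_distance(sub, codebook[c])))
    return bytes(code)  # one byte per subspace, valid while k <= 256

# Random stand-in data: 200 vectors of 42 small non-negative integers.
data_rng = random.Random(42)
vectors = [[data_rng.randint(0, 20) for _ in range(42)] for _ in range(200)]

codebooks = fit_codebooks(vectors, m=6, k=16)
codes = [encode(v, codebooks) for v in vectors]
print(len(codes[0]))  # 6 bytes per vector
```

Training on a sample and then encoding everything, as the Quick Start does, works because the codebooks are fixed once fitted and encoding a vector only needs those codebooks.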
Project Structure
chelombus/
├── chelombus/
│ ├── encoder/ # Product Quantization encoder
│ ├── clustering/ # PQk-means wrapper
│ ├── streamer/ # Memory-efficient data streaming
│ └── utils/ # Fingerprints, visualization, helpers
├── scripts/ # Pipeline scripts
├── examples/ # Tutorial notebooks
└── tests/ # Unit tests
Documentation
- Tutorial: See examples/tutorial.ipynb for a hands-on introduction
- Large-scale example: See examples/enamine_1B_clustering.ipynb
- API Reference: Generated from docstrings using Sphinx (see docs/)
Testing
# Run all tests
pytest tests/
# Run specific test file
pytest tests/test_encoder.py -v
Citation
If you use Chelombus in your research, please cite:
@article{chelombus2025,
title={Nested TMAPs to visualize Billions of Molecules},
author={Flores Sepulveda, Alejandro and Reymond, Jean-Louis},
journal={},
year={2025}
}
Contributing
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch: git checkout -b feature/my-feature
- Write tests for new functionality
- Submit a pull request
License
MIT License. See LICENSE for details.
Acknowledgments