Python bindings for COSMolKit
COSMolKit
COSMolKit is a Python molecular toolkit backed by a Rust core. It provides value-style molecule operations, SMILES and SDF workflows, 2D depiction, fingerprints, batch processing, and protein-focused structural biology APIs.
The library is built around explicit behavior: supported operations return structured results, unsupported behavior fails visibly, and public molecule transforms return new values instead of mutating their inputs.
COSMolKit is designed for array-oriented structural data access, keeping molecular data efficient and natural for NumPy, PyTorch, and model-building workflows.
Documentation
- Python documentation: https://kit.cosmol.org/
- Rust and development notes: crates/cosmolkit/README.md
Installation
pip install cosmolkit
Core Concepts
- Value-style molecules: methods such as with_hydrogens(), without_hydrogens(), with_kekulized_bonds(), and with_2d_coords() return new molecule values.
- Explicit errors: invalid input and unsupported behavior are surfaced as errors instead of silent fallbacks.
- Batch-native processing: MoleculeBatch keeps input order, supports structured per-record failures, and can run batch transforms and exports with configurable parallelism.
- Array-friendly data access: coordinates, bounds matrices, fingerprints, and graph features are exposed in forms that fit Python numerical workflows.
Value-Style Transformations
Normal molecule operations return new objects and do not mutate their inputs. This follows the same explicit-dataflow direction as modern dataframe libraries: users can reason about each transformation as a new value while the Rust core can share unchanged internal storage efficiently.
from cosmolkit import Molecule
mol = Molecule.from_smiles("CCO")
mol_h = mol.with_hydrogens()
assert mol is not mol_h
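To make the value-style contract concrete without relying on COSMolKit internals, here is a minimal pure-Python sketch. `ToyMolecule` and its fields are hypothetical and are not part of the COSMolKit API. The sketch only illustrates how a transform can return a new value while leaving the input untouched and sharing unchanged storage.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ToyMolecule:
    """Hypothetical immutable molecule: element symbols plus bonded index pairs."""

    atoms: tuple
    bonds: tuple

    def with_atom(self, symbol, bonded_to):
        # Build a new value; the original instance is never modified.
        new_index = len(self.atoms)
        return replace(
            self,
            atoms=self.atoms + (symbol,),
            bonds=self.bonds + ((bonded_to, new_index),),
        )

    def with_bonds_sorted(self):
        # Only `bonds` changes, so the `atoms` tuple is shared, not copied.
        return replace(self, bonds=tuple(sorted(self.bonds)))


ethanol = ToyMolecule(atoms=("C", "C", "O"), bonds=((1, 2), (0, 1)))
with_h = ethanol.with_atom("H", bonded_to=2)
tidy = ethanol.with_bonds_sorted()
```

Because the dataclass is frozen, accidental in-place mutation raises an error, and `tidy.atoms is ethanol.atoms` holds since the unchanged field is reused rather than copied.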
Python Quick Start
from cosmolkit import Molecule, MoleculeBatch
mol = Molecule.from_smiles("c1ccccc1O")
mol_2d = mol.with_2d_coords()
print(mol_2d.to_smiles())
print(mol_2d.coords_2d())
svg = mol_2d.to_svg(width=400, height=300)
mol_2d.write_png("phenol.png", width=400, height=300)
fp = mol.fingerprint_morgan(radius=2, n_bits=2048)
print(fp.on_bits())
batch = (
    MoleculeBatch.from_smiles_list(
        ["CCO", "c1ccccc1", "CC(=O)O"],
        sanitize=True,
        errors="keep",
    )
    .with_parallel_jobs(8)
    .with_progress_bar(False)
)
prepared = batch.add_hydrogens(errors="keep").compute_2d_coords(errors="keep")
print(prepared.valid_mask())
print(prepared.to_smiles_list())
prepared.to_images(
    "molecule_images",
    format="png",
    size=(300, 300),
    errors="keep",
    filenames=["ethanol", "benzene", "acetic_acid"],
)
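The batch behavior used above (input order preserved, failures kept as per-record data, a validity mask) can be illustrated with the standard library alone. This is a sketch of the general pattern, not COSMolKit's implementation; `toy_parse` and `run_batch` are hypothetical names introduced only for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def toy_parse(text):
    # Hypothetical stand-in for a real parser: reject empty or unbalanced input.
    if not text or text.count("(") != text.count(")"):
        raise ValueError(f"cannot parse {text!r}")
    return text.strip()


def run_batch(records, func, n_jobs=4):
    """Apply func to every record, keeping input order and capturing failures."""

    def safe(record):
        try:
            return func(record), None
        except Exception as exc:  # keep the failure as data instead of aborting
            return None, str(exc)

    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        # Executor.map yields results in input order, whatever the finish order.
        outcomes = list(pool.map(safe, records))

    values = [value for value, _ in outcomes]
    errors = [error for _, error in outcomes]
    valid_mask = [error is None for error in errors]
    return values, errors, valid_mask


values, errors, valid_mask = run_batch(["CCO", "CC(=O", "c1ccccc1"], toy_parse)
print(valid_mask)
```

Keeping a failed record as a `None` placeholder, rather than dropping it, is what lets the mask and the outputs stay aligned with the original input positions.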
Protein Structures
Use Protein when the workflow is focused on protein chains rather than the
full structural table.
from cosmolkit import Protein
protein = Protein.from_pdb("1crn.pdb")
print(protein.num_chains())
print(protein.num_residues())
print(protein.num_atoms())
for chain in protein.chains():
    print(chain.index(), chain.kind(), len(chain))
    for residue in chain.residues():
        print(residue.name(), residue.kind(), len(residue))
SDF and Dataset Workflows
SdfDataset builds a lightweight index of SDF record byte ranges, so individual
records and chunks can be read without loading an entire file into memory.
from cosmolkit import SdfDataset
dataset = SdfDataset.open("library.sdf")
print(len(dataset))
record = dataset[0]
mol = record.molecule()
for batch in dataset.batches(size=1024, errors="keep", n_jobs=8):
    smiles = batch.to_smiles_list()
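The byte-range idea can be sketched with plain file I/O. The code below is an illustration under stated assumptions, not how SdfDataset is implemented: it relies only on the SDF convention that each record ends with a `$$$$` line, and `build_sdf_index` and `read_record` are hypothetical helper names.

```python
import os
import tempfile


def build_sdf_index(path):
    """Return (start, end) byte ranges, one per record ending in a '$$$$' line."""
    ranges = []
    start = 0
    with open(path, "rb") as handle:
        while True:
            line = handle.readline()
            if not line:
                break
            if line.strip() == b"$$$$":
                end = handle.tell()
                ranges.append((start, end))
                start = end
    return ranges


def read_record(path, byte_range):
    """Read one record by seeking directly to its byte range."""
    start, end = byte_range
    with open(path, "rb") as handle:
        handle.seek(start)
        return handle.read(end - start).decode("utf-8")


# Tiny two-record file; the record bodies are placeholders, not valid MOL blocks.
sample = b"first\n  body A\n$$$$\nsecond\n  body B\n$$$$\n"
with tempfile.NamedTemporaryFile("wb", suffix=".sdf", delete=False) as tmp:
    tmp.write(sample)
    sample_path = tmp.name

index = build_sdf_index(sample_path)
second = read_record(sample_path, index[1])
os.remove(sample_path)
print(len(index), second.splitlines()[0])
```

The index holds only two integers per record, so it stays small even for large files, and any record can be fetched with a single seek and read.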
Feature Areas
- Molecular graph construction and inspection
- SMILES parsing and writing
- MOL/SDF reading and writing
- Hydrogen transforms and Kekulization
- Sanitization and chemistry problem detection
- 2D coordinate generation and SVG/PNG depiction
- Morgan and Avalon fingerprints
- Distance-geometry bounds matrices
- Substructure matching and SMARTS parsing
- Ordered batch transforms and exports
- PDB/mmCIF molecule-block parsing and protein projection APIs
- Support-status metadata for public features
Design Principles
COSMolKit aims to be Python-friendly, batch-friendly, and suitable for model-building workflows.
- Correctness comes before breadth.
- Public transforms use value semantics.
- Mutation-capable workflows are explicit.
- Unsupported chemistry should fail clearly.
- RDKit-parity behavior is the correctness floor for supported cheminformatics features.
- High-throughput APIs should preserve input order and expose per-record failures.
Examples
Python examples live in python/examples/.
Roadmap
Status labels:
- ✅ available in the public Python API
- 🧪 implemented or partially available, still being hardened
- 🚧 planned / not yet public
Chemistry Core
Goal: keep the supported molecular core correct before expanding breadth.
- ✅ Molecule, atom, and bond graph model
- ✅ SMILES parsing
- ✅ SMILES writing with RDKit-style writer options for supported branches
- ✅ Ring perception, valence handling, aromaticity, and Kekulization
- ✅ Hydrogen addition and removal
- ✅ Sanitization for supported chemistry workflows
- ✅ Stereochemistry inspection for supported atom and bond states
- ✅ Distance-geometry bounds matrices
- ✅ Morgan fingerprints and Tanimoto similarity
- 🧪 Avalon fingerprints
- 🧪 Substructure matching and SMARTS parsing
- 🚧 Broader descriptor APIs such as formula, molecular weight, and ring statistics
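For readers unfamiliar with Tanimoto similarity, the following sketch computes it from two collections of on-bit indices using only built-in sets. It assumes fingerprints are available as on-bit indices; the exact return type of `on_bits()` and the name of COSMolKit's own similarity call are not confirmed here, so `tanimoto` is a hypothetical helper and the bit indices are made up for illustration.

```python
def tanimoto(on_bits_a, on_bits_b):
    """Shared on-bits divided by total distinct on-bits across both fingerprints."""
    a, b = set(on_bits_a), set(on_bits_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)


fp_a = [3, 17, 42, 256]  # illustrative on-bit indices, not real fingerprint output
fp_b = [3, 42, 99]
score = tanimoto(fp_a, fp_b)
print(score)
```

Here the two fingerprints share 2 bits out of 5 distinct bits, giving a score of 0.4. Identical non-empty fingerprints score 1.0 and disjoint ones score 0.0.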
File I/O and Depiction
Goal: make common molecule import, export, and visualization workflows usable from Python.
- ✅ MOL/SDF reading
- ✅ SDF dataset indexing for large files
- ✅ SDF writing for supported V2000/V3000 branches
- ✅ PDB block to molecule conversion
- ✅ mmCIF block to molecule conversion through the same molecule-conversion profile
- ✅ 2D coordinate generation
- ✅ SVG drawing
- ✅ PNG export
- 🧪 RDKit-style visual parity testing for supported depiction output
- 🚧 Annotation overlays and richer drawing customization
- 🚧 3D conformer generation and embedding APIs
Batch-Native Workflows
Goal: make high-throughput molecule preparation and export a core product identity.
- ✅ Ordered MoleculeBatch.from_smiles_list()
- ✅ Batch transforms for sanitization, hydrogens, Kekulization, and 2D coordinates
- ✅ Configurable parallelism with with_parallel_jobs()
- ✅ Configurable progress display with with_progress_bar()
- ✅ Per-record errors, valid masks, and error reports
- ✅ Batch SMILES, image, and SDF export paths
- 🧪 Golden parity tests for parallel batch behavior
- 🚧 More streaming and chunked dataset workflows
Protein and Structural Biology
Goal: provide practical Biopython-like structure workflows without forcing users through low-level structural tables.
- ✅ Protein.from_pdb() / Protein.from_mmcif() high-level entry points
- ✅ Protein chain, residue, and atom iteration
- ✅ Protein-only projection from broader structural data
- 🧪 PDB/mmCIF structural parsing
- 🚧 Selection utilities for chains, residues, atoms, and neighborhoods
- 🚧 Ligand, nucleic-acid, and mixed-structure ergonomic APIs
Python API and ML Readiness
Goal: expose verified Rust-backed behavior through a practical Python interface.
- ✅ Value-style molecule transformations
- ✅ Graph, coordinate, fingerprint, and bounds-matrix accessors
- ✅ Python examples for drawing, SDF-to-SMILES, batch processing, and proteins
- 🧪 Type stubs and documentation coverage
- 🚧 Stable model-ready graph exports
- 🚧 NumPy / PyTorch oriented adapters
- 🚧 Molecular tokenization and AI-native geometry helpers
Browser and Deployment
Goal: support lightweight chemistry workflows outside native Python processes.
- 🚧 WASM compilation target
- 🚧 JavaScript bindings
- 🚧 Browser-native SMILES/SDF parsing and depiction
Respect for RDKit
COSMolKit is developed with deep respect for RDKit and the broader open-source cheminformatics community. The goal is an independent Rust-native implementation that preserves interoperability and RDKit-parity behavior where appropriate, while offering a deterministic Python API and AI-native extension surface.
Download files
File details
Details for the file cosmolkit-0.2.1.tar.gz.
File metadata
- Download URL: cosmolkit-0.2.1.tar.gz
- Upload date:
- Size: 1.4 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c8010b27d00ed6e896f6b14928a7adb4db6268684fd38fa83b3a4a956d1ce6af |
| MD5 | d6fe1d6a4ec109a8da921de2c0179540 |
| BLAKE2b-256 | a2537d48d106fe1434c702ea57cda9b7b8f673532492b3bf4720f32cdd07cccc |
File details
Details for the file cosmolkit-0.2.1-cp39-abi3-win_amd64.whl.
File metadata
- Download URL: cosmolkit-0.2.1-cp39-abi3-win_amd64.whl
- Upload date:
- Size: 3.2 MB
- Tags: CPython 3.9+, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 19889a5f919cad65ca1758b8baa2891b87a0c643ee645e0995008706a7ee7396 |
| MD5 | 6c680e3d31d16ed7630ba34221f7c4fc |
| BLAKE2b-256 | 9b7f8ff0a43b9c717906e4ddeeed936876a4109efc7f7697fb1ad85b238d50d5 |
File details
Details for the file cosmolkit-0.2.1-cp39-abi3-manylinux_2_35_x86_64.whl.
File metadata
- Download URL: cosmolkit-0.2.1-cp39-abi3-manylinux_2_35_x86_64.whl
- Upload date:
- Size: 3.4 MB
- Tags: CPython 3.9+, manylinux: glibc 2.35+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8de7a5d7cfdac935ad40ae9aa1502aa17cb2cc54d890fc9938ac5136c61acb43 |
| MD5 | b891c7c5c91eb34ef3de682f20ae50b0 |
| BLAKE2b-256 | fb39bcd05b04786630209bea4eddf3963c12e3a0f72dc5d0fc38a817f412d77c |
File details
Details for the file cosmolkit-0.2.1-cp39-abi3-macosx_11_0_arm64.whl.
File metadata
- Download URL: cosmolkit-0.2.1-cp39-abi3-macosx_11_0_arm64.whl
- Upload date:
- Size: 3.0 MB
- Tags: CPython 3.9+, macOS 11.0+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 89f157042edc5099b104561571592688f33cb62128796aea020c5db8da4d3a15 |
| MD5 | e983829a306a42140790d37c8d36fad1 |
| BLAKE2b-256 | ef559aa6c6488f9315818a542966c4c0cb959b23bea4d7b3b35b2ad07f0c2b72 |