Python bindings for COSMolKit

COSMolKit

COSMolKit is a Python molecular toolkit backed by a Rust core. It provides value-style molecule operations, SMILES and SDF workflows, 2D depiction, fingerprints, batch processing, and protein-focused structural biology APIs.

The library is built around explicit behavior: supported operations return structured results, unsupported behavior fails visibly, and public molecule transforms are explicit about whether they return new values or mutate in place.

COSMolKit is designed for array-oriented structural data access, keeping molecular data efficient and natural for NumPy, PyTorch, and model-building workflows.

Installation

pip install cosmolkit

Core Concepts

  • Value-style molecules: methods such as with_hydrogens(), without_hydrogens(), with_kekulized_bonds(), and with_2d_coords() return new molecule values.
  • Explicit mutation: in-place Molecule operations always end with a trailing underscore (_). The underscore suffix carries no other meaning in the public Molecule API.
  • Explicit errors: invalid input and unsupported behavior are surfaced as errors instead of silent fallbacks.
  • Batch-native processing: MoleculeBatch keeps input order, supports structured per-record failures, and can run batch transforms and exports with configurable parallelism.
  • Array-friendly data access: coordinates, bounds matrices, fingerprints, and graph features are exposed in forms that fit Python numerical workflows.

Value-Style Transformations

Normal molecule operations return new objects and do not mutate their inputs. This follows the same explicit-dataflow direction as modern dataframe libraries: users can reason about each transformation as a new value while COSMolKit can share unchanged internal storage efficiently.

from cosmolkit import Molecule

mol = Molecule.from_smiles("CCO")
mol_h = mol.with_hydrogens()

assert mol is not mol_h
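
The naming convention can be illustrated without COSMolKit at all. The sketch below uses a hypothetical ToyMolecule class (not the library's Molecule, and the method names with_atom and add_atom_ are invented for illustration) to contrast a value-style method, which returns a new object, with an underscore-suffixed method that mutates in place.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMolecule:
    """Hypothetical stand-in for a molecule; not the COSMolKit class."""

    atoms: list[str] = field(default_factory=list)

    def with_atom(self, symbol: str) -> "ToyMolecule":
        # Value-style: build and return a new object, leave self untouched.
        return ToyMolecule(self.atoms + [symbol])

    def add_atom_(self, symbol: str) -> None:
        # In-place: the trailing underscore signals that self is mutated.
        self.atoms.append(symbol)


mol = ToyMolecule(["C", "C", "O"])
mol_h = mol.with_atom("H")

# The value-style call produced a new object and left the input alone.
assert mol is not mol_h
assert mol.atoms == ["C", "C", "O"]

# The underscore method changes the original object.
mol.add_atom_("H")
```

Reading a call site is then enough to know whether an object may have changed: only names ending in _ can mutate their receiver.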

Python Quick Start

from cosmolkit import Molecule, MoleculeBatch

mol = Molecule.from_smiles("c1ccccc1O")
mol_2d = mol.with_2d_coords()

print(mol_2d.to_smiles())
print(mol_2d.coords_2d())

svg = mol_2d.to_svg(width=400, height=300)
mol_2d.write_png("phenol.png", width=400, height=300)

fp = mol.fingerprint_morgan(radius=2, n_bits=2048)
print(fp.on_bits())

batch = (
    MoleculeBatch.from_smiles_list(
        ["CCO", "c1ccccc1", "CC(=O)O"],
        sanitize=True,
        errors="keep",
    )
    .with_parallel_jobs(8)
    .with_progress_bar(False)
)

prepared = batch.add_hydrogens(errors="keep").compute_2d_coords(errors="keep")
print(prepared.valid_mask())
print(prepared.to_smiles_list())

prepared.to_images(
    "molecule_images",
    format="png",
    size=(300, 300),
    errors="keep",
    filenames=["ethanol", "benzene", "acetic_acid"],
)
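
To make the errors="keep" idea concrete, here is a minimal pure-Python sketch of an ordered batch with per-record failures. It is not COSMolKit's implementation: RecordResult, run_batch, and toy_parse are hypothetical names, and the "parser" only checks balanced parentheses. It assumes that keeping an error means storing it in the failing record's slot so that output positions still line up with input positions.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RecordResult:
    """One slot per input record: either a value or an error message."""

    value: Optional[str] = None
    error: Optional[str] = None


def run_batch(
    inputs: list[str], func: Callable[[str], str], n_jobs: int = 4
) -> list[RecordResult]:
    def safe(item: str) -> RecordResult:
        # Capture the failure in the record's own slot instead of raising.
        try:
            return RecordResult(value=func(item))
        except ValueError as exc:
            return RecordResult(error=str(exc))

    # Executor.map yields results in input order, whatever the completion order.
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        return list(pool.map(safe, inputs))


def toy_parse(smiles: str) -> str:
    # Hypothetical validity check: only balanced parentheses are accepted.
    if smiles.count("(") != smiles.count(")"):
        raise ValueError(f"unbalanced parentheses in {smiles!r}")
    return smiles


results = run_batch(["CCO", "CC(=O", "c1ccccc1"], toy_parse)
valid_mask = [r.error is None for r in results]
print(valid_mask)  # [True, False, True]
```

Because every input keeps its slot, a downstream step can use the mask to filter values while still being able to report which original records failed and why.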

Protein Structures

Use Protein when the workflow is focused on protein chains rather than the full structural table.

from cosmolkit import Protein

protein = Protein.from_pdb("1crn.pdb")

print(protein.num_chains())
print(protein.num_residues())
print(protein.num_atoms())

for chain in protein.chains():
    print(chain.index(), chain.kind(), len(chain))
    for residue in chain.residues():
        print(residue.name(), residue.kind(), len(residue))

SDF and Dataset Workflows

SdfDataset builds a lightweight index of SDF record byte ranges, so individual records and chunks can be read without loading an entire file into memory.

from cosmolkit import SdfDataset

dataset = SdfDataset.open("library.sdf")
print(len(dataset))

record = dataset[0]
mol = record.molecule()

for batch in dataset.batches(size=1024, errors="keep", n_jobs=8):
    smiles = batch.to_smiles_list()
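
The byte-range indexing idea can be sketched in plain Python. The helpers below (build_sdf_index and read_record, both hypothetical names) are an illustration of the general technique, not SdfDataset's actual code. They assume each record ends with a line containing only $$$$, as in the SDF format, and the demo file holds placeholder text rather than real molfile blocks.

```python
import os
import tempfile


def build_sdf_index(path: str) -> list[tuple[int, int]]:
    """Return (start, end) byte offsets for each record closed by a $$$$ line."""
    index = []
    start = 0
    with open(path, "rb") as handle:
        while True:
            line = handle.readline()
            if not line:
                break  # trailing text without a $$$$ terminator is ignored
            if line.strip() == b"$$$$":
                end = handle.tell()
                index.append((start, end))
                start = end
    return index


def read_record(path: str, span: tuple[int, int]) -> str:
    """Read one record by seeking to its byte range; the rest is never loaded."""
    start, end = span
    with open(path, "rb") as handle:
        handle.seek(start)
        return handle.read(end - start).decode("utf-8")


# Placeholder records for the demo; real SDF records contain molfile blocks.
demo = b"record one\nM  END\n$$$$\nrecord two\nM  END\n$$$$\n"
with tempfile.NamedTemporaryFile(delete=False, suffix=".sdf") as tmp:
    tmp.write(demo)
    demo_path = tmp.name

index = build_sdf_index(demo_path)
second = read_record(demo_path, index[1])
os.remove(demo_path)
```

The index costs two integers per record, so random access and chunked reads stay cheap even when the file itself is far larger than memory.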

Feature Areas

  • Molecular graph construction and inspection
  • SMILES parsing and writing
  • MOL/SDF reading and writing
  • XYZ block reading
  • Hydrogen transforms and Kekulization
  • Sanitization and chemistry problem detection
  • 2D coordinate generation and SVG/PNG depiction
  • Morgan and Avalon fingerprints
  • Distance-geometry bounds matrices
  • Substructure matching and SMARTS parsing
  • Ordered batch transforms and exports
  • PDB/mmCIF molecule-block parsing and protein projection APIs
  • Support-status metadata for public features

Design Principles

COSMolKit aims to be Python-friendly, batch-friendly, and suitable for model-building workflows.

  • Correctness comes before breadth.
  • Public transforms use value semantics.
  • Mutation-capable workflows are explicit.
  • Unsupported chemistry should fail clearly.
  • RDKit-parity behavior is the correctness floor for supported cheminformatics features.
  • High-throughput APIs should preserve input order and expose per-record failures.

Examples

Python examples live in python/examples/.

Roadmap

Status labels:

  • ✅ available in the public Python API
  • 🧪 implemented or partially available, still being hardened
  • 🚧 planned / not yet public

Chemistry Core

Goal: keep the supported molecular core correct before expanding breadth.

  • ✅ Molecule, atom, and bond graph model
  • ✅ SMILES parsing
  • ✅ SMILES writing with RDKit-style writer options for supported branches
  • ✅ Ring perception, valence handling, aromaticity, and Kekulization
  • ✅ Hydrogen addition and removal
  • ✅ Sanitization for supported chemistry workflows
  • ✅ Stereochemistry inspection for supported atom and bond states
  • ✅ Distance-geometry bounds matrices
  • ✅ Morgan fingerprints and Tanimoto similarity
  • 🧪 Avalon fingerprints
  • 🧪 Substructure matching and SMARTS parsing
  • 🚧 Broader descriptor APIs such as formula, molecular weight, and ring statistics
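
For readers unfamiliar with the Tanimoto similarity mentioned above, the following sketch computes it from two sets of on-bit indices. The tanimoto function is a hypothetical helper written for this illustration, not a COSMolKit API, and returning 0.0 for two empty fingerprints is a convention chosen here.

```python
def tanimoto(on_bits_a: set[int], on_bits_b: set[int]) -> float:
    """Tanimoto similarity: shared on bits divided by all on bits."""
    union = len(on_bits_a | on_bits_b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(on_bits_a & on_bits_b) / union


fp_a = {1, 5, 9, 42}
fp_b = {5, 9, 77}

# Two shared bits (5 and 9) out of five distinct bits overall.
similarity = tanimoto(fp_a, fp_b)
print(similarity)  # 0.4
```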

File I/O and Depiction

Goal: make common molecule import, export, and visualization workflows usable from Python.

  • ✅ MOL/SDF reading
  • ✅ XYZ block reading
  • ✅ SDF dataset indexing for large files
  • ✅ SDF writing for supported V2000/V3000 branches
  • ✅ PDB block to molecule conversion
  • ✅ mmCIF block to molecule conversion through the same molecule-conversion profile
  • ✅ 2D coordinate generation
  • ✅ SVG drawing
  • ✅ PNG export
  • 🧪 RDKit-style visual parity testing for supported depiction output
  • 🚧 Annotation overlays and richer drawing customization
  • 🚧 3D conformer generation and embedding APIs

Batch-Native Workflows

Goal: make high-throughput molecule preparation and export a core strength of the library.

  • ✅ Ordered MoleculeBatch.from_smiles_list()
  • ✅ Batch transforms for sanitization, hydrogens, Kekulization, and 2D coordinates
  • ✅ Configurable parallelism with with_parallel_jobs()
  • ✅ Configurable progress display with with_progress_bar()
  • ✅ Per-record errors, valid masks, and error reports
  • ✅ Batch SMILES, image, and SDF export paths
  • 🧪 Golden parity tests for parallel batch behavior
  • 🚧 More streaming and chunked dataset workflows

Protein and Structural Biology

Goal: provide practical Biopython-like structure workflows without forcing users through low-level structural tables.

  • Protein.from_pdb() / Protein.from_mmcif() high-level entry points
  • ✅ Protein chain, residue, and atom iteration
  • ✅ Protein-only projection from broader structural data
  • 🧪 PDB/mmCIF structural parsing
  • 🚧 Selection utilities for chains, residues, atoms, and neighborhoods
  • 🚧 Ligand, nucleic-acid, and mixed-structure ergonomic APIs

Python API and ML Readiness

Goal: expose verified molecular behavior through a practical Python interface.

  • ✅ Value-style molecule transformations
  • ✅ Graph, coordinate, fingerprint, and bounds-matrix accessors
  • ✅ Python examples for drawing, SDF-to-SMILES, batch processing, and proteins
  • 🧪 Type stubs and documentation coverage
  • 🚧 Stable model-ready graph exports
  • 🚧 NumPy / PyTorch oriented adapters
  • 🚧 Molecular tokenization and AI-native geometry helpers

Browser and Deployment

Goal: support lightweight chemistry workflows outside native Python processes.

  • 🚧 WASM compilation target
  • 🚧 JavaScript bindings
  • 🚧 Browser-native SMILES/SDF parsing and depiction

Respect for RDKit

COSMolKit is developed with deep respect for RDKit and the broader open-source cheminformatics community. The goal is an independent Rust-native implementation that preserves interoperability and RDKit-parity behavior where appropriate, while offering a deterministic Python API and AI-native extension surface.

Download files

Source Distribution

  • cosmolkit-0.2.3.tar.gz (1.4 MB)

Built Distributions

  • cosmolkit-0.2.3-cp39-abi3-win_amd64.whl (3.2 MB): CPython 3.9+, Windows x86-64
  • cosmolkit-0.2.3-cp39-abi3-manylinux_2_35_x86_64.whl (3.4 MB): CPython 3.9+, manylinux (glibc 2.35+), x86-64
  • cosmolkit-0.2.3-cp39-abi3-macosx_11_0_arm64.whl (3.0 MB): CPython 3.9+, macOS 11.0+, ARM64
