Python bindings for COSMolKit

COSMolKit

COSMolKit is a Python molecular toolkit backed by a Rust core. It provides value-style molecule operations, SMILES/SDF/MOL2/XYZ workflows, 2D depiction, native 3D conformer generation, UFF/MMFF optimization, fingerprints, batch processing, and protein-focused structural biology APIs.

The library is built around explicit behavior: supported operations return structured results, unsupported behavior fails visibly, and public molecule transforms are explicit about whether they return new values or mutate in place.

COSMolKit is designed for array-oriented structural data access, keeping molecular data efficient and natural for NumPy, PyTorch, and model-building workflows.

Documentation

Installation

pip install cosmolkit

Core Concepts

  • Value-style molecules: methods such as with_hydrogens(), without_hydrogens(), with_kekulized_bonds(), and with_2d_coordinates() return new molecule values.
  • Explicit mutation: in-place Molecule operations always end with a trailing underscore (_), and the public Molecule API reserves that suffix for in-place operations.
  • Explicit errors: invalid input and unsupported behavior are surfaced as errors instead of silent fallbacks.
  • Batch-native processing: MoleculeBatch keeps input order, supports structured per-record failures, and can run batch transforms and exports with configurable parallelism.
  • Array-friendly data access: coordinates, bounds matrices, fingerprints, and graph features are exposed in forms that fit Python numerical workflows.
  • Source-backed 3D workflows: conformer generation and UFF/MMFF optimization are available through the public Python API.

Value-Style Transformations

Normal molecule operations return new objects and do not mutate their inputs. This follows the same explicit-dataflow direction as modern dataframe libraries: users can reason about each transformation as a new value while COSMolKit can share unchanged internal storage efficiently.

from cosmolkit import Molecule

mol = Molecule.from_smiles("CCO")
mol_h = mol.with_hydrogens()

assert mol is not mol_h
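
The distinction between value-style methods and trailing-underscore mutators can be illustrated without COSMolKit itself. The sketch below is a toy model only: ToyMolecule, its fields, and add_hydrogens_() are hypothetical names invented for this illustration and are not part of the COSMolKit API.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass
class ToyMolecule:
    """Hypothetical stand-in for a molecule; NOT a COSMolKit class."""

    atoms: Tuple[str, ...]
    explicit_h: int = 0

    def with_hydrogens(self, count: int) -> "ToyMolecule":
        # Value-style: return a new object and leave self untouched.
        return replace(self, explicit_h=self.explicit_h + count)

    def add_hydrogens_(self, count: int) -> None:
        # In-place (hypothetical name): the trailing underscore marks mutation.
        self.explicit_h += count


mol = ToyMolecule(atoms=("C", "C", "O"))
mol_h = mol.with_hydrogens(6)

# The input is unchanged after the value-style call.
input_unchanged = mol.explicit_h == 0

# The atom tuple is immutable, so both values can share it without copying.
shares_atoms = mol_h.atoms is mol.atoms

# The in-place variant changes the object it is called on.
mol.add_hydrogens_(6)
```

Because unchanged fields are immutable, the new value can reuse them directly, which is the general idea behind sharing internal storage between values.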

Python Quick Start

from cosmolkit import Molecule, MoleculeBatch

mol = Molecule.from_smiles("c1ccccc1O")
mol_2d = mol.with_2d_coordinates()

print(mol_2d.to_smiles())
print(mol_2d.coords_2d())

mol_3d = mol.with_hydrogens().with_3d_conformer()
print(mol_3d.coords_3d().shape)

svg = mol_2d.to_svg(width=400, height=300)
mol_2d.write_png("phenol.png", width=400, height=300)

fp = mol.fingerprint_morgan(radius=2, n_bits=2048)
print(fp.on_bits())

batch = (
    MoleculeBatch.from_smiles_list(
        ["CCO", "c1ccccc1", "CC(=O)O"],
        sanitize=True,
        errors="keep",
    )
    .with_parallel_jobs(8)
    .with_progress_bar(False)
)

prepared = batch.with_hydrogens(errors="keep").with_2d_coordinates(errors="keep")
print(prepared.valid_mask())
print(prepared.to_smiles_list())

prepared.to_images(
    "molecule_images",
    format="png",
    size=(300, 300),
    errors="keep",
    filenames=["ethanol", "benzene", "acetic_acid"],
)
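
The errors="keep" pattern used above (input order preserved, failures recorded per record, a validity mask on the result) can be sketched with the standard library. This is an illustration of the pattern under stated assumptions, not COSMolKit's implementation: toy_parse and run_batch are hypothetical helpers, and the "parser" only checks parenthesis balance.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional, Tuple


def toy_parse(text: str) -> str:
    """Hypothetical parser: rejects empty input and unbalanced parentheses."""
    if not text or text.count("(") != text.count(")"):
        raise ValueError(f"cannot parse {text!r}")
    return text.upper()


def run_batch(
    records: List[str],
    func: Callable[[str], str],
    n_jobs: int = 4,
) -> Tuple[List[Optional[str]], List[Optional[str]]]:
    """Apply func to every record, keeping input order and per-record errors."""

    def safe(record: str) -> Tuple[Optional[str], Optional[str]]:
        try:
            return func(record), None
        except ValueError as exc:
            return None, str(exc)

    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        # Executor.map yields results in input order, whatever the finish order.
        pairs = list(pool.map(safe, records))

    values = [value for value, _ in pairs]
    errors = [error for _, error in pairs]
    return values, errors


values, errors = run_batch(["cco", "c(c", "cc(=o)o"], toy_parse)
valid_mask = [error is None for error in errors]
print(valid_mask)  # [True, False, True]
```

Keeping one slot per input record, even for failures, means downstream arrays stay aligned with the original list.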

Protein Structures

Use Protein when the workflow is focused on protein chains rather than the full structural table.

from cosmolkit import Protein

protein = Protein.from_pdb("1crn.pdb")

print(protein.num_chains())
print(protein.num_residues())
print(protein.num_atoms())

for chain in protein.chains():
    print(chain.index(), chain.kind(), len(chain))
    for residue in chain.residues():
        print(residue.name(), residue.kind(), len(residue))

SDF and Dataset Workflows

SdfDataset builds a lightweight index of SDF record byte ranges, so individual records and chunks can be read without loading the entire file into memory. Molfile-only readers such as Molecule.read_mol() follow the same boundaries as RDKit's MolFromMolBlock: they stop after the first "M  END" line and leave trailing SDF data fields to the SDF APIs.

from cosmolkit import SdfDataset

dataset = SdfDataset.open("library.sdf")
print(len(dataset))

record = dataset[0]
mol = record.molecule()

for batch in dataset.batches(size=1024, errors="keep", n_jobs=8):
    smiles = batch.to_smiles_list()
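
To make the byte-range idea concrete, here is a minimal standard-library sketch, assuming records are terminated by a line containing only $$$$. It is not SdfDataset's implementation; index_sdf, read_record, and molblock_only are hypothetical helpers, and the two records are toy placeholders rather than valid molfiles.

```python
import os
import tempfile
from typing import List, Tuple


def index_sdf(path: str) -> List[Tuple[int, int]]:
    """Return (start, end) byte offsets for each record in an SDF file."""
    ranges = []
    start = offset = 0
    has_content = False
    with open(path, "rb") as handle:
        for line in handle:
            offset += len(line)
            if line.strip() == b"$$$$":
                ranges.append((start, offset))
                start, has_content = offset, False
            elif line.strip():
                has_content = True
    if has_content:
        # Final record that lacks a trailing $$$$ delimiter.
        ranges.append((start, offset))
    return ranges


def read_record(path: str, span: Tuple[int, int]) -> str:
    """Read one record by seeking to its byte range."""
    start, end = span
    with open(path, "rb") as handle:
        handle.seek(start)
        return handle.read(end - start).decode("utf-8")


def molblock_only(record: str) -> str:
    """Keep text up to and including the first 'M  END' line."""
    kept = []
    for line in record.splitlines():
        kept.append(line)
        if line.startswith("M  END"):
            break
    return "\n".join(kept)


# Toy placeholder records (not chemically valid molfiles).
record_a = "ethanol\n\n\nM  END\n> <NAME>\nethanol\n\n$$$$\n"
record_b = "benzene\n\n\nM  END\n$$$$\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.sdf")
    with open(path, "wb") as handle:
        handle.write((record_a + record_b).encode("utf-8"))
    spans = index_sdf(path)
    second = read_record(path, spans[1])
    first_block = molblock_only(read_record(path, spans[0]))
```

Only the list of offsets is held in memory; each record is read on demand with a seek, and the molblock helper drops the data fields that follow the terminator line.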

Conformer Generation And Optimization

from cosmolkit import EmbedParameters, Molecule

mol = Molecule.from_smiles("CC(=O)NC").with_hydrogens()

params = EmbedParameters.etkdg_v3()
params.random_seed = 0xF00D
params.num_threads = 1
params.track_failures = True

embedded = mol.with_3d_conformer(params)
print(embedded.num_conformers())
print(embedded.coords_3d().shape)
print(params.failures)

multi = mol.with_3d_conformers(5, params)
print(multi.num_conformers())

if embedded.has_uff_params():
    uff = embedded.with_uff_optimized(max_iters=200)
    print(uff.energy())

if embedded.has_mmff_params():
    mmff = embedded.with_mmff_optimized(max_iters=200)
    print(mmff.needs_more())
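
The max_iters argument and the needs_more() check suggest an iteration-capped minimizer that reports whether it stopped before converging. The sketch below illustrates that general pattern on a one-dimensional spring. It is an assumption about the semantics, not COSMolKit's optimizer: ToyResult and minimize_spring are hypothetical, and the energy function is a toy.

```python
from dataclasses import dataclass


@dataclass
class ToyResult:
    """Hypothetical result type; NOT a COSMolKit class."""

    position: float
    energy: float
    needs_more: bool


def minimize_spring(
    x: float,
    rest: float = 1.0,
    k: float = 1.0,
    step: float = 0.1,
    tol: float = 1e-6,
    max_iters: int = 200,
) -> ToyResult:
    """Steepest descent on E = 0.5 * k * (x - rest)**2 with an iteration cap."""
    for _ in range(max_iters):
        gradient = k * (x - rest)
        if abs(gradient) < tol:
            break
        x -= step * gradient
    gradient = k * (x - rest)
    energy = 0.5 * k * (x - rest) ** 2
    # needs_more is True when the cap was hit before the gradient got small.
    return ToyResult(position=x, energy=energy, needs_more=abs(gradient) >= tol)


short = minimize_spring(0.0, max_iters=50)
if short.needs_more:
    # Resume from the last position instead of starting over.
    done = minimize_spring(short.position, max_iters=200)
else:
    done = short
```

Returning a flag instead of raising lets the caller decide whether to continue, accept the partial result, or discard it.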

Feature Areas

  • Molecular graph construction and inspection
  • SMILES parsing and writing
  • MOL/SDF reading and writing
  • MOL2 reading with RDKit-style Mol2ParserParams
  • XYZ block reading
  • Hydrogen transforms and Kekulization
  • Sanitization and chemistry problem detection
  • 2D coordinate generation and SVG/PNG depiction
  • Native 3D conformer generation with DG/KDG/ETDG/ETKDG parameter presets
  • UFF/MMFF optimization of generated or imported 3D conformers
  • Morgan and Avalon fingerprints
  • Distance-geometry bounds matrices
  • Substructure matching and SMARTS parse metadata
  • Ordered batch transforms and exports
  • PDB/mmCIF molecule-block parsing and protein projection APIs
  • Support-status metadata for public features
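
For readers unfamiliar with distance-geometry bounds matrices, the following sketch shows one standard step, triangle-inequality smoothing of upper distance bounds. It is a generic textbook illustration in pure Python, not COSMolKit's code; smooth_upper_bounds is a hypothetical helper and the three-atom matrix is made up.

```python
from typing import List


def smooth_upper_bounds(upper: List[List[float]]) -> List[List[float]]:
    """Tighten upper distance bounds so that u[i][j] <= u[i][k] + u[k][j]."""
    n = len(upper)
    out = [row[:] for row in upper]  # work on a copy, leave the input intact
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via_k = out[i][k] + out[k][j]
                if via_k < out[i][j]:
                    out[i][j] = via_k
    return out


INF = float("inf")

# Made-up bounds: atoms 0-1 and 1-2 are bonded, 0-2 is unconstrained.
upper = [
    [0.0, 1.5, INF],
    [1.5, 0.0, 1.5],
    [INF, 1.5, 0.0],
]
smoothed = smooth_upper_bounds(upper)
print(smoothed[0][2])  # 3.0
```

The unconstrained 0-2 distance is capped at the sum of the two bonded bounds, which is the kind of information a bounds matrix carries into embedding.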

Design Principles

COSMolKit aims to be Python-friendly, batch-friendly, and suitable for model-building workflows.

  • Correctness comes before breadth.
  • Public transforms use value semantics.
  • Mutation-capable workflows are explicit.
  • Unsupported chemistry should fail clearly.
  • RDKit-parity behavior is the correctness floor for supported cheminformatics features.
  • High-throughput APIs should preserve input order and expose per-record failures.

Examples

Python examples live in python/examples/.

Roadmap

Status labels:

  • ✅ available in the public Python API
  • 🧪 implemented or partially available, still being hardened
  • 🚧 planned / not yet public

Chemistry Core

Goal: keep the supported molecular core correct before expanding breadth.

  • ✅ Molecule, atom, and bond graph model
  • ✅ SMILES parsing
  • ✅ SMILES writing with RDKit-style writer options for supported branches
  • ✅ Ring perception, valence handling, aromaticity, and Kekulization
  • ✅ Hydrogen addition and removal
  • ✅ Sanitization for supported chemistry workflows
  • ✅ Stereochemistry inspection for supported atom and bond states
  • ✅ Distance-geometry bounds matrices
  • ✅ Native 3D conformer generation and UFF/MMFF post-optimization for supported molecules
  • 🧪 Morgan fingerprints and Tanimoto similarity
  • 🧪 Avalon fingerprints
  • 🧪 Substructure matching and Python SMARTS parse metadata
  • 🚧 Broader descriptor APIs such as formula, molecular weight, and ring statistics
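
As a reference for what Tanimoto similarity computes, here is a small pure-Python sketch over on-bit indices such as those printed by fp.on_bits() in the quick start. The helpers tanimoto and to_dense are hypothetical and independent of COSMolKit. Returning 0.0 for two empty fingerprints is a convention chosen for this sketch only.

```python
from typing import Iterable, List


def tanimoto(bits_a: Iterable[int], bits_b: Iterable[int]) -> float:
    """Tanimoto similarity: |A intersect B| / |A union B| over on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union


def to_dense(on_bits: Iterable[int], n_bits: int) -> List[int]:
    """Expand on-bit indices into a dense 0/1 list of length n_bits."""
    dense = [0] * n_bits
    for index in on_bits:
        dense[index] = 1
    return dense


# Made-up on-bit indices for two fingerprints.
fp_a = [1, 5, 9, 12]
fp_b = [5, 9, 40]

similarity = tanimoto(fp_a, fp_b)
print(similarity)  # 0.4
dense_a = to_dense(fp_a, n_bits=64)
```

The two fingerprints share 2 bits out of 5 distinct bits, hence 0.4. The dense form is the shape most numerical libraries expect.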

File I/O and Depiction

Goal: make common molecule import, export, and visualization workflows usable from Python.

  • ✅ MOL/SDF reading
  • ✅ MOL2 reading
  • ✅ XYZ block reading
  • ✅ SDF dataset indexing for large files
  • ✅ SDF writing for supported V2000/V3000 branches
  • ✅ PDB block to molecule conversion
  • ✅ mmCIF block to molecule conversion through the same molecule-conversion profile
  • ✅ 2D coordinate generation
  • ✅ SVG drawing
  • ✅ PNG export
  • 🧪 RDKit-style visual parity testing for supported depiction output
  • 🚧 Annotation overlays and richer drawing customization
  • ✅ 3D conformer generation and embedding APIs

Batch-Native Workflows

Goal: make high-throughput molecule preparation and export a core product identity.

  • ✅ Ordered MoleculeBatch.from_smiles_list()
  • ✅ Batch transforms for sanitization, hydrogens, Kekulization, and 2D coordinates
  • ✅ Configurable parallelism with with_parallel_jobs()
  • ✅ Configurable progress display with with_progress_bar()
  • ✅ Per-record errors, valid masks, and error reports
  • ✅ Batch SMILES, image, and SDF export paths
  • 🧪 Golden parity tests for parallel batch behavior
  • 🚧 More streaming and chunked dataset workflows
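
Chunked processing, as in dataset.batches(size=...), boils down to pulling fixed-size groups from a stream without materializing the whole input. A minimal standard-library sketch follows. The chunked helper is hypothetical, not a COSMolKit function.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def chunked(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of up to `size` items, preserving order."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# A generator stands in for a large record stream.
stream = (f"record-{i}" for i in range(7))
chunks = list(chunked(stream, size=3))
print([len(chunk) for chunk in chunks])  # [3, 3, 1]
```

Only one chunk is held in memory at a time when the caller iterates lazily, and the last chunk may be shorter than the requested size.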

Protein and Structural Biology

Goal: provide practical Biopython-like structure workflows without forcing users through low-level structural tables.

  • ✅ Protein.from_pdb() / Protein.from_mmcif() high-level entry points
  • ✅ Protein chain, residue, and atom iteration
  • ✅ Protein-only projection from broader structural data
  • 🧪 PDB/mmCIF structural parsing
  • 🚧 Selection utilities for chains, residues, atoms, and neighborhoods
  • 🚧 Ligand, nucleic-acid, and mixed-structure ergonomic APIs

Python API and ML Readiness

Goal: expose verified molecular behavior through a practical Python interface.

  • ✅ Value-style molecule transformations
  • ✅ Graph, coordinate, fingerprint, and bounds-matrix accessors
  • ✅ Python examples for drawing, SDF-to-SMILES, batch processing, and proteins
  • 🧪 Type stubs and documentation coverage
  • 🚧 Stable model-ready graph exports
  • 🚧 NumPy / PyTorch oriented adapters
  • 🚧 Molecular tokenization and AI-native geometry helpers

Browser and Deployment

Goal: support lightweight chemistry workflows outside native Python processes.

  • 🚧 WASM compilation target
  • 🚧 JavaScript bindings
  • 🚧 Browser-native SMILES/SDF parsing and depiction

Respect for RDKit

COSMolKit is developed with deep respect for RDKit and the broader open-source cheminformatics community. The goal is an independent Rust-native implementation that preserves interoperability and RDKit-parity behavior where appropriate, while offering a deterministic Python API and AI-native extension surface.
