AI-native cheminformatics: Rust core + RDKit bridge + Python AI API

molcore-chem

AI-native cheminformatics toolkit — Rust-accelerated fingerprints and PyG conversion, with full RDKit compatibility and a built-in MCP server.

pip install molcore-chem

Overview

molcore extends RDKit workflows rather than replacing them. The hot paths — fingerprint generation and PyTorch Geometric graph conversion — are rewritten in Rust using Rayon parallelism and zero-copy array transfer, while standardization, descriptors, and scaffold splitting delegate to RDKit through an isolated bridge layer.

| Capability | Implementation | Notes |
| --- | --- | --- |
| ECFP4 fingerprints | Rust (Rayon + u64 bit-packing) | 35–132× faster than RDKit |
| PyG graph conversion | Rust (IntoPyArray → torch.from_numpy) | 4.3× faster, zero-copy |
| Tanimoto matrix | Rust (Rayon + popcount) | 4.3–29× faster at scale |
| Standardization, descriptors, scaffold split | RDKit (via rdkit_bridge.py) | Parity speed, cleaner API |
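
To make the "u64 bit-packing" and "popcount" entries in the table concrete, here is a minimal pure-Python sketch of the idea. It is not molcore's Rust code: `pack_bits` and `tanimoto` are hypothetical names, and returning 0.0 for two empty fingerprints is an arbitrary choice made for this illustration.

```python
def pack_bits(on_bits, nbits=2048):
    """Pack the set-bit positions of a fingerprint into a list of 64-bit words."""
    words = [0] * (nbits // 64)
    for b in on_bits:
        words[b // 64] |= 1 << (b % 64)
    return words


def tanimoto(a, b):
    """Tanimoto = |A & B| / |A | B|, counted one 64-bit word at a time."""
    inter = sum(bin(x & y).count("1") for x, y in zip(a, b))
    union = sum(bin(x | y).count("1") for x, y in zip(a, b))
    return inter / union if union else 0.0  # assumption: empty vs empty scores 0.0


fp_a = pack_bits([1, 70, 500, 2047])
fp_b = pack_bits([1, 70, 900])
print(len(fp_a))             # 32 words of 64 bits each
print(tanimoto(fp_a, fp_b))  # 2 shared bits / 5 distinct bits = 0.4
```

A 2048-bit fingerprint becomes 32 words, so one comparison needs 32 AND/OR operations plus bit counts rather than 2048 per-bit checks. A compiled implementation would typically use a hardware popcount instruction where this sketch uses `bin(...).count("1")`.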

Quickstart

from molcore.molecule import Mol
from molcore.pipeline import featurize_smiles
from molcore.predictor import PropertyPredictor
from molcore.io import MolDataset
import numpy as np

# Parse — immutable, Rust-backed
mol = Mol.from_smiles("CC(=O)Oc1ccccc1C(=O)O")   # aspirin
data = mol.to_pyg()                                # PyG Data, zero-copy, 9 node features

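# smiles_list is your own list of SMILES strings; logp_values (used below) holds one float per SMILES
# Batch fingerprints, Rust Rayon parallel
fps = featurize_smiles(smiles_list, backend="rust")   # (N, 2048) uint8 Tensor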

# Full dataset pipeline
ds = MolDataset.from_smiles(smiles_list, compute_fps=True, compute_desc=True)
ds.labels = np.array(logp_values, dtype=np.float32)
train_ds, val_ds, test_ds = ds.scaffold_split()

# Train GCN with MC Dropout uncertainty
pred = PropertyPredictor(hidden=64, epochs=100)
pred.fit(train_ds, val_dataset=val_ds)
means, stds = pred.predict_with_uncertainty(["CCO", "c1ccccc1"], n_samples=30)


Benchmarks

All numbers were measured on Apple M-series (arm64), CPU-only, Python 3.12.

ECFP4 Fingerprints

| Batch size | molcore (Rust) | RDKit | Speedup |
| --- | --- | --- | --- |
| 1 000 SMILES | 1.3M mol/s | 14 800 mol/s | 88× |
| 10 000 SMILES | 2.0M mol/s | 15 100 mol/s | 132× |

Tanimoto Similarity Matrix

| Query × Library | molcore (Rust) | RDKit BulkTanimoto | Speedup |
| --- | --- | --- | --- |
| 50 × 1 000 | 31M pairs/s | 7.3M pairs/s | 4.3× |
| 500 × 10 000 | 224M pairs/s | 7.7M pairs/s | 29× |

End-to-End Pre-training Pipeline (500 molecules)

| Step | molcore | RDKit | Speedup |
| --- | --- | --- | --- |
| Standardize | 242 ms | 225 ms | ~parity |
| ECFP4 fingerprints | 1.1 ms | 37.3 ms | 35× |
| 7 Lipinski descriptors | 124 ms | 114 ms | ~parity |
| Scaffold split | 33 ms | 35 ms | ~parity |
| PyG conversion (200 mols) | 3.3 ms | 14.4 ms | 4.3× |

GNN Property Prediction — ESOL Solubility

ESOL dataset (Delaney 2004, 1128 molecules), scaffold split. Scaffold split is substantially harder than the random split used in published MoleculeNet baselines — results are not directly comparable to the published RMSE ≈ 0.58.

| Configuration | RMSE | |
| --- | --- | --- |
| GCN, hidden=64, 3 layers, 300 epochs | 1.038 | 0.727 |
| Optuna-tuned (30 trials): hidden=128, 4 layers | 1.090 | 0.709 |
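
A scaffold split is harder than a random split because whole scaffold families land in a single partition, so the test set contains chemotypes the model never saw during training. The sketch below shows one common way to do this (largest groups go to train first). It is an illustration, not molcore's implementation: `scaffold_split` here is a standalone function, and `toy_scaffold` is a hypothetical stand-in for a real scaffold function such as a Bemis-Murcko scaffold computed with RDKit.

```python
from collections import defaultdict


def scaffold_split(smiles, scaffold_of, train_frac=0.8, val_frac=0.1):
    """Assign whole scaffold groups to train/val/test, largest groups first."""
    groups = defaultdict(list)
    for idx, smi in enumerate(smiles):
        groups[scaffold_of(smi)].append(idx)

    # Largest scaffold families first, so rare scaffolds end up in val/test.
    ordered = sorted(groups.values(), key=len, reverse=True)

    n = len(smiles)
    train, val, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= train_frac * n:
            train.extend(group)
        elif len(val) + len(group) <= val_frac * n:
            val.extend(group)
        else:
            test.extend(group)
    return train, val, test


def toy_scaffold(smi):
    """Toy scaffold key based on substring matching (illustration only)."""
    if "c1ccncc1" in smi:
        return "pyridine"
    if "c1ccccc1" in smi:
        return "benzene"
    return "acyclic"


smiles = [
    "Cc1ccccc1", "Oc1ccccc1", "Nc1ccccc1", "Fc1ccccc1",
    "Clc1ccccc1", "Brc1ccccc1", "CCc1ccccc1", "COc1ccccc1",
    "Cc1ccncc1", "CCO",
]
train, val, test = scaffold_split(smiles, toy_scaffold)
print(len(train), len(val), len(test))  # 8 1 1
```

In this toy data every benzene derivative goes to train, while the lone pyridine and the acyclic molecule are held out, so validation and test measure generalization to unseen scaffolds.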

Features

Billion-Scale Streaming Screen

Screen libraries that do not fit in RAM using any Iterable[str] of SMILES — file iterators, database cursors, or generators. Peak memory is O(chunk_size × nbits/8).

from molcore.streaming import stream_screen, StreamingScreen

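def from_file(path):
    with open(path) as fh:
        for line in fh:
            fields = line.split()
            if fields:              # skip blank lines instead of raising IndexError
                yield fields[0]     # first whitespace-separated column is the SMILES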

# Tanimoto similarity + SMARTS filter in a single pass
hits = stream_screen(
    from_file("chembl_34.smi"),
    query="c1ccc(N)cc1",
    query_smarts="[NH2]",
    threshold=0.4,
    chunk_size=10_000,
    progress=True,
)
for smiles, tanimoto_score in hits:
    print(smiles, tanimoto_score)

# Stateful version — screen multiple chunks, inspect running stats
screen = StreamingScreen(query="c1ccc(N)cc1", threshold=0.4)
for chunk in my_chunks:
    chunk_hits = screen.screen_chunk(chunk)
    save_hits(chunk_hits)
print(screen.stats)  # {n_screened, n_hits, hit_rate, elapsed_s, rate_mol_s}
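
The O(chunk_size × nbits/8) bound quoted above can be checked with simple arithmetic, and chunks such as `my_chunks` can be produced lazily from any iterable. The helpers below are a hypothetical sketch (`chunked` and `chunk_fp_bytes` are not molcore functions). The byte count covers only the bit-packed fingerprints, not the SMILES strings or interpreter overhead.

```python
from itertools import islice


def chunked(iterable, chunk_size):
    """Yield lists of at most chunk_size items from any iterable."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def chunk_fp_bytes(chunk_size, nbits=2048):
    """Bytes needed to hold one chunk of bit-packed fingerprints."""
    return chunk_size * nbits // 8


print(chunk_fp_bytes(10_000))  # 2560000 bytes, about 2.56 MB per chunk

# 25 dummy strings split into chunks of 10
sizes = [len(c) for c in chunked(("C" * i for i in range(1, 26)), 10)]
print(sizes)  # [10, 10, 5]
```

Because `chunked` pulls items only as each chunk is requested, the source can be a file iterator or database cursor of any length.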

MCP Server

Any MCP-compatible host (Claude Desktop, Continue, Cursor) can invoke molcore tools directly without a local Python installation.

molcore mcp                                    # stdio transport
molcore mcp --transport http --port 8765       # HTTP transport

Claude Desktop — add to claude_desktop_config.json:

{
  "mcpServers": {
    "molcore": {
      "command": "python",
      "args": ["-m", "molcore.mcp_server"],
      "env": {}
    }
  }
}

Nine tools are exposed: featurize, screen_smarts, screen_similarity, admet_screen, synthesizability, generate, retro_score, active_suggest, and pareto_optimize.
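
For readers curious what a host sends when it invokes one of these tools: MCP tool calls are JSON-RPC 2.0 messages with the method `tools/call`. The sketch below only builds such a message. The argument names `smiles` and `kind` are assumptions made for illustration, since the actual input schema of `featurize` is not documented here. A real session also starts with an initialization handshake, which is omitted.

```python
import json


def tool_call_request(request_id, tool, arguments):
    """Build a JSON-RPC 2.0 message in the shape MCP uses for tool calls."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }


# "smiles" and "kind" are hypothetical argument names, not a documented schema.
msg = tool_call_request(1, "featurize", {"smiles": ["CCO"], "kind": "ecfp4"})
print(json.dumps(msg, indent=2))
```

In practice the MCP host builds and sends this message for you; the sketch is only meant to show what travels over the stdio or HTTP transport.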

SDF and Parquet I/O

from molcore.io import MolDataset

ds = MolDataset.from_sdf("library.sdf")
ds = MolDataset.from_sdf("library.sdf", compute_fps=True, compute_desc=True)
ds.write_sdf("output.sdf")
ds.write_parquet("library.parquet")           # Arrow columnar, snappy-compressed
ds2 = MolDataset.read_parquet("library.parquet")

Pandas Integration

import molcore.pandas_tools as mpt

df = mpt.load_sdf("library.sdf")                  # DataFrame with 'Mol' + 'smiles' columns
df = mpt.add_descriptors(df, preset="lipinski")   # MolWt, LogP, TPSA, HBD, HBA, …
df = mpt.add_fingerprints(df, kind="ecfp4")       # adds 'fp' column
df = mpt.filter_by_smarts(df, "c1ccncc1")         # substructure filter (result reassigned to df)
df = mpt.standardize_smiles(df)                   # strip salts → neutralize → canonical tautomer

Descriptors

from molcore.rdkit_bridge import calc_named_descriptors

arr, names = calc_named_descriptors(smiles, preset="lipinski")   # 7 descriptors
arr, names = calc_named_descriptors(smiles, preset="druglike")   # 15 descriptors
arr, names = calc_named_descriptors(smiles, preset="all")        # ~200 descriptors
arr, names = calc_named_descriptors(smiles, names=["MolWt", "TPSA", "BertzCT"])

Each call returns an (N, D) float32 array together with the list of D descriptor names.

Fingerprint Types

from molcore.pipeline import featurize_smiles

fps = featurize_smiles(smiles, kind="ecfp4")                # (N, 2048), Rust parallel
fps = featurize_smiles(smiles, kind="maccs")                # (N, 167)
fps = featurize_smiles(smiles, kind="atom_pairs")           # (N, 2048)
fps = featurize_smiles(smiles, kind="topological_torsions") # (N, 2048)
fps = featurize_smiles(smiles, kind="rdkit")                # (N, 2048) RDKit path-based
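
As background on what an `ecfp4`-style fingerprint encodes, the sketch below hashes atom environments of growing radius (radius 2, i.e. diameter 4) and folds them into a fixed-length bit vector. It is deliberately simplified and will not reproduce molcore or RDKit bits: real ECFP uses richer atom invariants and removes duplicate environments, and `circular_fp` is a hypothetical name.

```python
import zlib


def circular_fp(atoms, bonds, radius=2, nbits=2048):
    """Fold hashed atom environments of growing radius into an nbits bit vector."""
    neighbors = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)

    def stable_hash(text):
        return zlib.crc32(text.encode())  # deterministic across runs

    # Radius 0: each atom is identified by its element symbol alone.
    ids = [stable_hash(sym) for sym in atoms]
    bits = [0] * nbits
    for ident in ids:
        bits[ident % nbits] = 1

    # Each round merges an atom's identifier with its neighbours' identifiers.
    for _ in range(radius):
        ids = [
            stable_hash(f"{ids[i]}:{sorted(ids[j] for j in neighbors[i])}")
            for i in range(len(atoms))
        ]
        for ident in ids:
            bits[ident % nbits] = 1
    return bits


# Ethanol, heavy atoms only: C-C-O
fp = circular_fp(["C", "C", "O"], [(0, 1), (1, 2)])
print(len(fp), sum(fp))
```

Sorting the neighbour identifiers before hashing makes the result independent of atom ordering, which is why the same molecule written with its atoms in a different order yields the same bits.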

2D Depiction

from molcore.io import MolDataset
from molcore.molecule import Mol

mol = Mol.from_smiles("CC(=O)Oc1ccccc1C(=O)O")
mol              # renders inline in Jupyter via _repr_svg_
mol.to_png("aspirin.png")

ds = MolDataset.from_sdf("library.sdf")
ds               # renders 8-molecule grid inline
ds.draw_grid(n=20, mols_per_row=4)

Standardization

from molcore.rdkit_bridge import standardize

clean = standardize("[Na+].OC(=O)c1ccccc1")   # → "OC(=O)c1ccccc1"
# strips salts → neutralizes charges → canonical tautomer → canonical SMILES

MCS and R-Group Decomposition

from molcore.rdkit_bridge import find_mcs, rgroup_decompose

smarts = find_mcs(["CC(=O)Oc1ccccc1", "CC(=O)Oc1ccc(F)cc1", "CC(=O)Oc1ccc(Cl)cc1"])

rows = rgroup_decompose("c1ccc([*:1])cc1", smiles_list)
# → [{"Core": "c1ccccc1", "R1": "F"}, {"Core": "c1ccccc1", "R1": "Cl"}, ...]

GCN Predictor with MC Dropout Uncertainty

from molcore.predictor import PropertyPredictor

pred = PropertyPredictor(hidden=64, n_layers=3, epochs=100, dropout=0.1)
pred.fit(train_ds, val_dataset=val_ds, verbose=True)

predictions = pred.predict(smiles_list)                          # numpy array
means, stds = pred.predict_with_uncertainty(smiles_list, n_samples=30)

pred.save("logp_model.pt")
pred2 = PropertyPredictor.load("logp_model.pt")
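
The idea behind MC Dropout uncertainty is to leave dropout switched on at inference time, run several stochastic forward passes, and report their mean and standard deviation. The toy below applies this to a plain linear model using only the standard library. It is a sketch of the technique, not the GCN inside PropertyPredictor, and `forward` and `mc_dropout_predict` are hypothetical names.

```python
import random
import statistics


def forward(x, weights, drop_p, rng):
    """One stochastic pass of a linear model with dropout left switched on."""
    total = 0.0
    for xi, wi in zip(x, weights):
        if rng.random() >= drop_p:               # keep this input
            total += xi * wi / (1.0 - drop_p)    # inverted-dropout rescaling
    return total


def mc_dropout_predict(x, weights, n_samples=30, drop_p=0.1, seed=0):
    """Mean and standard deviation over n_samples dropout-perturbed passes."""
    rng = random.Random(seed)
    samples = [forward(x, weights, drop_p, rng) for _ in range(n_samples)]
    return statistics.mean(samples), statistics.stdev(samples)


x = [1.0, 2.0, 3.0, 4.0]
w = [0.5, -0.25, 0.1, 0.2]
mean, std = mc_dropout_predict(x, w)
print(f"{mean:.3f} +/- {std:.3f}")
```

With `drop_p=0.0` every pass is identical, so the standard deviation collapses to zero; a larger spread across passes is read as lower model confidence for that input.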

Drug-Target Interaction Prediction

from molcore import DTIDataset, DTIPredictor

ds = DTIDataset(
    smiles    = ["CC(=O)O",    "c1ccccc1"],
    sequences = ["MKTLLILAVL", "ACDEFGHIKL"],
    labels    = [6.5,           7.2],          # pIC50
)

train, val, test = ds.scaffold_split(train_frac=0.8, val_frac=0.1)

pred = DTIPredictor(hidden=64, n_layers=3, epochs=100, model_type="gcn")
pred.fit(train, val_dataset=val)

affinities = pred.predict(["CCO"], ["MKTLLILAVL"])   # (N,) float32 pIC50
metrics    = pred.score(test)                         # {r2, mae, rmse, n}

model_type accepts "gcn", "gat", or "gin". ESM-2 protein embeddings are available via pip install molcore-chem[bio].


Installation

pip install molcore-chem

Requires Python 3.11+. RDKit and PyTorch are declared dependencies — no manual conda setup required. Pre-compiled Rust extensions are included in the wheel.

GPU (CUDA 12.1):

pip install torch --index-url https://download.pytorch.org/whl/cu121   # install the CUDA build first
pip install molcore-chem                                               # then molcore reuses the installed torch

Build from Source

git clone https://github.com/Anteneh-T-Tessema/molcore
cd molcore
./setup_dev.sh    # creates .venv, builds Rust extension, runs tests
source .venv/bin/activate

Requires Rust 1.70+:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

Architecture

SMILES strings
  │
  ▼  Rust ingest (RDKit-backed aromaticity perception)
  │  — sanitize, kekulize, ring perception, implicit H
  ▼
petgraph StableGraph (immutable after construction)
  │
  ├─▶ ecfp4_batch()          → (N × 2048) uint8  ─▶ torch.from_numpy()  ─▶ Tensor
  │   Rayon parallel · u64 bit-pack · hardware popcount · 35–132× faster
  │
  ├─▶ mol_to_graph_arrays()  → node_feats (9-dim), edge_index, edge_attr ─▶ PyG Data
  │   Zero-copy IntoPyArray · 4.3× faster than manual Python construction
  │
  └─▶ tanimoto_matrix()      → (Q × L) float32
      Rayon parallel · u64 popcount · 29× faster at scale

Python layer (molcore/)
  molecule.py      — frozen Mol dataclass (FrozenInstanceError on mutation)
  pipeline.py      — featurize_smiles() batch-first entry point
  rdkit_bridge.py  — all RDKit calls isolated here (one file to update)
  io.py            — MolDataset: SDF + Parquet + DataFrame bridge
  predictor.py     — PropertyPredictor: 3-layer GCN + MC Dropout
  dti.py           — DTIPredictor: GCN/GAT/GIN ligand + 1D-CNN protein encoder
  pandas_tools.py  — DataFrame-first API for existing RDKit workflows
  agentic_rag.py   — ChemRAG: iterative chemical literature retrieval
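
For readers new to PyTorch Geometric, the `edge_index` array named in the diagram is a COO-style pair of rows (source atoms, destination atoms), and an undirected graph is normally represented by storing each edge in both directions. The sketch below shows that layout with plain lists. Whether molcore emits edges in exactly this order is an assumption, and `bonds_to_edge_index` is a hypothetical helper.

```python
def bonds_to_edge_index(bonds):
    """Turn undirected bonds into COO rows: [sources, destinations], 2 * n_bonds columns."""
    src, dst = [], []
    for a, b in bonds:
        src += [a, b]   # store a->b and b->a
        dst += [b, a]
    return [src, dst]


# Ethanol heavy atoms C-C-O: bonds (0, 1) and (1, 2)
edge_index = bonds_to_edge_index([(0, 1), (1, 2)])
print(edge_index)  # [[0, 1, 1, 2], [1, 0, 2, 1]]
```

Per-bond features (`edge_attr`) follow the same column order, so each bond's feature row appears once per direction.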

Design Invariants

  1. Mol is always immutable — transforms return new instances.
  2. RDKit is never in hot paths — all RDKit calls are isolated to rdkit_bridge.py.
  3. All Rust→Python array transfers use IntoPyArray — no Python-side copy loops.
  4. Batch API is primary — per-molecule methods are convenience wrappers.
  5. Backend flags are explicit — "rust" or "rdkit" is always caller-supplied.
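
Invariant 1 can be illustrated with a frozen dataclass from the standard library. `ToyMol` below is a hypothetical stand-in, not molcore's Mol class; it only shows how mutation raises FrozenInstanceError and how a "transform" produces a new instance instead.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class ToyMol:
    """Hypothetical stand-in for an immutable molecule record."""
    smiles: str
    name: str = ""


aspirin = ToyMol("CC(=O)Oc1ccccc1C(=O)O")

try:
    aspirin.name = "aspirin"            # mutation is rejected
except FrozenInstanceError as exc:
    print("rejected:", exc)

named = replace(aspirin, name="aspirin")      # a transform returns a new instance
print(named is aspirin, aspirin.name == "")   # False True
```

Frozen instances are also hashable by default, which makes them safe to share across threads and usable as dictionary keys.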

Development

maturin develop --release --features extension-module   # build Rust extension
cargo test -p molcore-core                              # Rust unit tests
pytest tests/ evals/ -q                                 # 1061 Python/eval tests
python benchmarks/prove_scale.py                        # throughput benchmark (JSON)
python benchmarks/bench_e2e.py --n 1000                 # end-to-end benchmark
ruff check molcore/                                     # lint

License

MIT — see LICENSE.
