AI-native cheminformatics: Rust core + RDKit bridge + Python AI API

molcore-chem

AI-native cheminformatics toolkit — Rust-accelerated fingerprints and PyG conversion, with full RDKit compatibility and a built-in MCP server.

pip install molcore-chem

Overview

molcore extends RDKit workflows rather than replacing them. The hot paths — fingerprint generation and PyTorch Geometric graph conversion — are rewritten in Rust using Rayon parallelism and zero-copy array transfer, while standardization, descriptors, and scaffold splitting delegate to RDKit through an isolated bridge layer.

Capability	Implementation	Notes
ECFP4 fingerprints	Rust (Rayon + u64 bit-packing)	35–132× faster than RDKit
PyG graph conversion	Rust (IntoPyArray → torch.from_numpy)	4.3× faster, zero-copy
Tanimoto matrix	Rust (Rayon + popcount)	4.3–29× faster at scale
Standardization, descriptors, scaffold split	RDKit (via rdkit_bridge.py)	Parity speed, cleaner API

Quickstart

from molcore.molecule import Mol
from molcore.pipeline import featurize_smiles
from molcore.predictor import PropertyPredictor
from molcore.io import MolDataset
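import numpy as np

# smiles_list (a list of SMILES strings) and logp_values (one float per SMILES)
# are supplied by the caller.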

# Parse — immutable, Rust-backed
mol = Mol.from_smiles("CC(=O)Oc1ccccc1C(=O)O")   # aspirin
data = mol.to_pyg()                                # PyG Data, zero-copy, 9 node features

# Batch fingerprints — Rust Rayon parallel
fps = featurize_smiles(smiles_list, backend="rust")   # (N, 2048) uint8 Tensor

# Full dataset pipeline
ds = MolDataset.from_smiles(smiles_list, compute_fps=True, compute_desc=True)
ds.labels = np.array(logp_values, dtype=np.float32)
train_ds, val_ds, test_ds = ds.scaffold_split()

# Train GCN with MC Dropout uncertainty
pred = PropertyPredictor(hidden=64, epochs=100)
pred.fit(train_ds, val_dataset=val_ds)
means, stds = pred.predict_with_uncertainty(["CCO", "c1ccccc1"], n_samples=30)

Benchmarks

All numbers on Apple M-series (arm64), CPU-only, Python 3.12.

ECFP4 Fingerprints

Batch size	molcore (Rust)	RDKit	Speedup
1 000 SMILES	1.3M mol/s	14 800 mol/s	88×
10 000 SMILES	2.0M mol/s	15 100 mol/s	132×

Tanimoto Similarity Matrix

Query × Library	molcore (Rust)	RDKit BulkTanimoto	Speedup
50 × 1 000	31M pairs/s	7.3M pairs/s	4.3×
500 × 10 000	224M pairs/s	7.7M pairs/s	29×
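
The bit-packing and popcount approach named in the table can be illustrated in plain Python. This is only a sketch of the technique, not molcore's Rust implementation: the helper names (pack_bits, popcount, tanimoto, tanimoto_matrix) are hypothetical, and the real code parallelizes the outer loop with Rayon.

```python
def pack_bits(on_bits, nbits=2048):
    """Pack a collection of set-bit indices into 64-bit words (a list of ints)."""
    words = [0] * ((nbits + 63) // 64)
    for i in on_bits:
        words[i // 64] |= 1 << (i % 64)
    return words


def popcount(word):
    """Count set bits in one word; most CPUs do this in a single instruction."""
    return bin(word).count("1")


def tanimoto(a, b):
    """Tanimoto = |A AND B| / |A OR B|, computed word by word."""
    inter = sum(popcount(x & y) for x, y in zip(a, b))
    union = sum(popcount(x | y) for x, y in zip(a, b))
    return inter / union if union else 0.0


def tanimoto_matrix(queries, library):
    """Dense (Q x L) similarity matrix as nested lists."""
    return [[tanimoto(q, entry) for entry in library] for q in queries]


q = pack_bits({1, 70, 500})
lib = [pack_bits({1, 70, 500}), pack_bits({1, 70, 900, 2047}), pack_bits({3})]
matrix = tanimoto_matrix([q], lib)
print(matrix)  # [[1.0, 0.4, 0.0]]
```

Packing 2048 bits into 32 words means each pairwise comparison touches 32 integers rather than 2048 flags, which is where most of the per-pair saving comes from.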

End-to-End Pre-training Pipeline (500 molecules)

Step	molcore	RDKit	Speedup
Standardize	242 ms	225 ms	~parity
ECFP4 fingerprints	1.1 ms	37.3 ms	35×
7 Lipinski descriptors	124 ms	114 ms	~parity
Scaffold split	33 ms	35 ms	~parity
PyG conversion (200 mols)	3.3 ms	14.4 ms	4.3×

GNN Property Prediction — ESOL Solubility

ESOL dataset (Delaney 2004, 1128 molecules), scaffold split. Scaffold split is substantially harder than the random split used in published MoleculeNet baselines — results are not directly comparable to the published RMSE ≈ 0.58.

Configuration	RMSE	R²
GCN, hidden=64, 3 layers, 300 epochs	1.038	0.727
Optuna-tuned (30 trials): hidden=128, 4 layers	1.090	0.709
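
To see why a scaffold split is harder than a random split, it helps to look at how such a split is usually built: molecules sharing a scaffold are kept together, so the test set contains scaffolds the model never saw. The sketch below is a generic greedy grouping, assuming scaffold strings have already been computed; group_split is a hypothetical helper and is not claimed to match what ds.scaffold_split() does internally.

```python
from collections import defaultdict


def group_split(keys, train_frac=0.8, val_frac=0.1):
    """Split indices so that items sharing a key never straddle two subsets.

    keys[i] is a precomputed group label for item i (for a scaffold split,
    the scaffold SMILES). Largest groups are assigned first.
    """
    groups = defaultdict(list)
    for idx, key in enumerate(keys):
        groups[key].append(idx)

    n = len(keys)
    train, val, test = [], [], []
    # Descending group size, then key, so the result is deterministic
    ordered = sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for _, members in ordered:
        if len(train) + len(members) <= train_frac * n:
            train.extend(members)
        elif len(val) + len(members) <= val_frac * n:
            val.extend(members)
        else:
            test.extend(members)
    return train, val, test


scaffolds = ["c1ccccc1"] * 6 + ["c1ccncc1"] * 2 + ["C1CCCCC1", "c1ccoc1"]
train, val, test = group_split(scaffolds)
print(train, val, test)  # [0, 1, 2, 3, 4, 5, 6, 7] [8] [9]
```

Because rare scaffolds end up in validation and test, the held-out molecules are structurally unlike the training set, which is consistent with the higher RMSE reported above.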

Features

Billion-Scale Streaming Screen

Screen libraries that do not fit in RAM using any Iterable[str] of SMILES — file iterators, database cursors, or generators. Peak memory is O(chunk_size × nbits/8).

from molcore.streaming import stream_screen, StreamingScreen

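def from_file(path):
    # Yield the first whitespace-separated token (the SMILES) of each line.
    with open(path) as fh:
        for line in fh:
            fields = line.split()
            if fields:               # skip blank lines instead of raising IndexError
                yield fields[0]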

# Tanimoto similarity + SMARTS filter in a single pass
hits = stream_screen(
    from_file("chembl_34.smi"),
    query="c1ccc(N)cc1",
    query_smarts="[NH2]",
    threshold=0.4,
    chunk_size=10_000,
    progress=True,
)
for smiles, tanimoto_score in hits:
    print(smiles, tanimoto_score)

# Stateful version — screen multiple chunks, inspect running stats
screen = StreamingScreen(query="c1ccc(N)cc1", threshold=0.4)
for chunk in my_chunks:
    chunk_hits = screen.screen_chunk(chunk)
    save_hits(chunk_hits)
print(screen.stats)  # {n_screened, n_hits, hit_rate, elapsed_s, rate_mol_s}
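
The O(chunk_size × nbits/8) memory bound follows from holding only one chunk of bit-packed fingerprints at a time. The sketch below shows the chunking pattern and the arithmetic; chunked and fingerprint_bytes_per_chunk are hypothetical helpers, not part of molcore, and the estimate covers fingerprint storage only (not the SMILES strings or parsed molecules).

```python
from itertools import islice


def chunked(iterable, chunk_size):
    """Yield lists of at most chunk_size items without materializing the input."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def fingerprint_bytes_per_chunk(chunk_size, nbits=2048):
    """Bit-packed fingerprint storage for one chunk: chunk_size * nbits / 8 bytes."""
    return chunk_size * nbits // 8


# A generator of 25 placeholder strings is consumed in chunks of 10
sizes = [len(c) for c in chunked((f"C{i}" for i in range(25)), 10)]
print(sizes)  # [10, 10, 5]

peak = fingerprint_bytes_per_chunk(10_000)
print(peak)   # 2560000 bytes, about 2.56 MB for chunk_size=10_000, nbits=2048
```

With the chunk_size=10_000 used above, fingerprint storage stays near 2.56 MB regardless of how many molecules the input iterator eventually yields.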

MCP Server

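Any MCP-compatible host (Claude Desktop, Continue, Cursor) can invoke molcore tools directly. A host that connects to an already-running server over the HTTP transport needs no Python installation of its own; the stdio configuration shown below launches python itself, so it requires a local interpreter with molcore-chem installed.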

molcore mcp                                    # stdio transport
molcore mcp --transport http --port 8765       # HTTP transport

Claude Desktop — add to claude_desktop_config.json:

{
  "mcpServers": {
    "molcore": {
      "command": "python",
      "args": ["-m", "molcore.mcp_server"],
      "env": {}
    }
  }
}

Nine tools are exposed: featurize, screen_smarts, screen_similarity, admet_screen, synthesizability, generate, retro_score, active_suggest, and pareto_optimize.
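
On the wire, an MCP host invokes a tool by sending a JSON-RPC 2.0 request with the method tools/call. The sketch below builds such a message for the featurize tool. The envelope follows the MCP specification, but the argument name "smiles" is an assumption made for illustration; the actual input schema of each molcore tool is whatever the server reports in response to tools/list.

```python
import json


def tools_call_request(request_id, tool_name, arguments):
    """Serialize an MCP tools/call request as a JSON-RPC 2.0 message."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    })


# "smiles" is a hypothetical argument name; check the tool's schema first
msg = tools_call_request(1, "featurize", {"smiles": ["CCO"]})
print(msg)
```

In normal use the host application builds these messages itself, so this is mainly useful when debugging a server by hand.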

SDF and Parquet I/O

from molcore.io import MolDataset

ds = MolDataset.from_sdf("library.sdf")
ds = MolDataset.from_sdf("library.sdf", compute_fps=True, compute_desc=True)
ds.write_sdf("output.sdf")
ds.write_parquet("library.parquet")           # Arrow columnar, snappy-compressed
ds2 = MolDataset.read_parquet("library.parquet")

Pandas Integration

import molcore.pandas_tools as mpt

df = mpt.load_sdf("library.sdf")                  # DataFrame with 'Mol' + 'smiles' columns
df = mpt.add_descriptors(df, preset="lipinski")   # MolWt, LogP, TPSA, HBD, HBA, …
df = mpt.add_fingerprints(df, kind="ecfp4")       # adds 'fp' column
df = mpt.filter_by_smarts(df, "c1ccncc1")         # substructure filter; returns the matching rows
df = mpt.standardize_smiles(df)                   # strip salts → neutralize → canonical tautomer

Descriptors

from molcore.rdkit_bridge import calc_named_descriptors

arr, names = calc_named_descriptors(smiles, preset="lipinski")   # 7 descriptors
arr, names = calc_named_descriptors(smiles, preset="druglike")   # 15 descriptors
arr, names = calc_named_descriptors(smiles, preset="all")        # ~200 descriptors
arr, names = calc_named_descriptors(smiles, names=["MolWt", "TPSA", "BertzCT"])

Each call returns an (N, D) float32 array together with the list of D descriptor names.

Fingerprint Types

from molcore.pipeline import featurize_smiles

fps = featurize_smiles(smiles, kind="ecfp4")                # (N, 2048), Rust parallel
fps = featurize_smiles(smiles, kind="maccs")                # (N, 167)
fps = featurize_smiles(smiles, kind="atom_pairs")           # (N, 2048)
fps = featurize_smiles(smiles, kind="topological_torsions") # (N, 2048)
fps = featurize_smiles(smiles, kind="rdkit")                # (N, 2048) RDKit path-based
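
For readers new to circular fingerprints, the idea behind ECFP4 (radius 2) can be sketched on a toy graph: each atom starts with an identifier, the identifiers are repeatedly re-hashed together with their neighbors', and every identifier is folded into a fixed-length bit vector. This is a deliberately simplified illustration, not molcore's or RDKit's algorithm: real ECFP uses richer atom invariants, includes bond orders, and removes duplicate substructures. All names below are hypothetical.

```python
import zlib


def stable_hash(obj):
    """Deterministic 32-bit hash (the built-in hash() is salted for strings)."""
    return zlib.crc32(repr(obj).encode())


def circular_fingerprint(atoms, bonds, radius=2, nbits=2048):
    """Return the set of on-bit positions for a toy molecular graph.

    atoms: list of element symbols; bonds: list of (i, j) index pairs.
    """
    neighbors = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)

    # Radius 0: each identifier depends on the element and degree only
    ids = [stable_hash((sym, len(neighbors[i]))) for i, sym in enumerate(atoms)]
    on_bits = {x % nbits for x in ids}

    for _ in range(radius):
        # Fold in sorted neighbor identifiers so atom ordering does not matter
        ids = [
            stable_hash((ids[i], tuple(sorted(ids[j] for j in neighbors[i]))))
            for i in range(len(atoms))
        ]
        on_bits |= {x % nbits for x in ids}
    return on_bits


# Ethanol heavy atoms, written in two different atom orders
fp_a = circular_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
fp_b = circular_fingerprint(["O", "C", "C"], [(0, 1), (1, 2)])
print(fp_a == fp_b)  # True
```

Sorting the neighbor identifiers before hashing is what makes the result independent of how the atoms happen to be numbered.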

2D Depiction

from molcore.io import MolDataset
from molcore.molecule import Mol

mol = Mol.from_smiles("CC(=O)Oc1ccccc1C(=O)O")
mol              # renders inline in Jupyter via _repr_svg_
mol.to_png("aspirin.png")

ds = MolDataset.from_sdf("library.sdf")
ds               # renders 8-molecule grid inline
ds.draw_grid(n=20, mols_per_row=4)

Standardization

from molcore.rdkit_bridge import standardize

clean = standardize("[Na+].OC(=O)c1ccccc1")   # → "OC(=O)c1ccccc1"
# strips salts → neutralizes charges → canonical tautomer → canonical SMILES

MCS and R-Group Decomposition

from molcore.rdkit_bridge import find_mcs, rgroup_decompose

smarts = find_mcs(["CC(=O)Oc1ccccc1", "CC(=O)Oc1ccc(F)cc1", "CC(=O)Oc1ccc(Cl)cc1"])

rows = rgroup_decompose("c1ccc([*:1])cc1", smiles_list)
# → [{"Core": "c1ccccc1", "R1": "F"}, {"Core": "c1ccccc1", "R1": "Cl"}, ...]

GCN Predictor with MC Dropout Uncertainty

from molcore.predictor import PropertyPredictor

pred = PropertyPredictor(hidden=64, n_layers=3, epochs=100, dropout=0.1)
pred.fit(train_ds, val_dataset=val_ds, verbose=True)

predictions = pred.predict(smiles_list)                          # numpy array
means, stds = pred.predict_with_uncertainty(smiles_list, n_samples=30)

pred.save("logp_model.pt")
pred2 = PropertyPredictor.load("logp_model.pt")
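
MC Dropout estimates uncertainty by leaving dropout active at inference time, running several stochastic forward passes, and reporting the mean and standard deviation of the outputs. The sketch below shows that aggregation on a toy linear model using only the standard library. It illustrates the general technique and is not PropertyPredictor's code; dropout_forward and the toy weights are hypothetical.

```python
import random
import statistics


def dropout_forward(x, weights, p, rng):
    """One stochastic forward pass of a toy linear model with inverted dropout."""
    total = 0.0
    for xi, wi in zip(x, weights):
        keep = rng.random() >= p              # drop each input with probability p
        if keep:
            total += xi * wi / (1.0 - p)      # rescale so the expected value is unchanged
    return total


def predict_with_uncertainty(x, weights, p=0.1, n_samples=30, seed=0):
    """Mean and standard deviation over n_samples passes with dropout left on."""
    rng = random.Random(seed)
    samples = [dropout_forward(x, weights, p, rng) for _ in range(n_samples)]
    return statistics.mean(samples), statistics.pstdev(samples)


x, w = [1.0, 2.0, 3.0], [0.5, -1.0, 2.0]
mean0, std0 = predict_with_uncertainty(x, w, p=0.0)   # no dropout: deterministic
mean1, std1 = predict_with_uncertainty(x, w, p=0.5)   # dropout on: outputs vary
print(mean0, std0)  # 4.5 0.0
```

A larger n_samples gives a more stable estimate of the spread at the cost of proportionally more forward passes.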

Drug-Target Interaction Prediction

from molcore import DTIDataset, DTIPredictor

ds = DTIDataset(
    smiles    = ["CC(=O)O",    "c1ccccc1"],
    sequences = ["MKTLLILAVL", "ACDEFGHIKL"],
    labels    = [6.5,           7.2],          # pIC50
)

train, val, test = ds.scaffold_split(train_frac=0.8, val_frac=0.1)

pred = DTIPredictor(hidden=64, n_layers=3, epochs=100, model_type="gcn")
pred.fit(train, val_dataset=val)

affinities = pred.predict(["CCO"], ["MKTLLILAVL"])   # (N,) float32 pIC50
metrics    = pred.score(test)                         # {r2, mae, rmse, n}

model_type accepts "gcn", "gat", or "gin". ESM-2 protein embeddings are available via pip install molcore-chem[bio].

Installation

pip install molcore-chem

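Requires Python 3.11+. RDKit and PyTorch are declared dependencies, so no manual conda setup is required. Pre-compiled Rust extensions are included in the wheels; release 0.7.0 publishes wheels for CPython 3.11 and 3.12 on macOS (ARM64 and x86-64), and other platforms install from the source distribution, which needs a Rust toolchain.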

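GPU (CUDA 12.1): install the CUDA build of PyTorch first. If molcore-chem is installed first, pip pulls in torch as a dependency and then skips the second command because the requirement is already satisfied.

pip install torch --index-url https://download.pytorch.org/whl/cu121
pip install molcore-chem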

Build from Source

git clone https://github.com/Anteneh-T-Tessema/molcore
cd molcore
./setup_dev.sh    # creates .venv, builds Rust extension, runs tests
source .venv/bin/activate

Requires Rust 1.70+:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

Architecture

SMILES strings
  │
  ▼  Rust ingest (RDKit-backed aromaticity perception)
  │  — sanitize, kekulize, ring perception, implicit H
  ▼
petgraph StableGraph (immutable after construction)
  │
  ├─▶ ecfp4_batch()          → (N × 2048) uint8  ─▶ torch.from_numpy()  ─▶ Tensor
  │   Rayon parallel · u64 bit-pack · hardware popcount · 35–132× faster
  │
  ├─▶ mol_to_graph_arrays()  → node_feats (9-dim), edge_index, edge_attr ─▶ PyG Data
  │   Zero-copy IntoPyArray · 4.3× faster than manual Python construction
  │
  └─▶ tanimoto_matrix()      → (Q × L) float32
      Rayon parallel · u64 popcount · 29× faster at scale

Python layer (molcore/)
  molecule.py      — frozen Mol dataclass (FrozenInstanceError on mutation)
  pipeline.py      — featurize_smiles() batch-first entry point
  rdkit_bridge.py  — all RDKit calls isolated here (one file to update)
  io.py            — MolDataset: SDF + Parquet + DataFrame bridge
  predictor.py     — PropertyPredictor: 3-layer GCN + MC Dropout
  dti.py           — DTIPredictor: GCN/GAT/GIN ligand + 1D-CNN protein encoder
  pandas_tools.py  — DataFrame-first API for existing RDKit workflows
  agentic_rag.py   — ChemRAG: iterative chemical literature retrieval

Design Invariants

Mol is always immutable — transforms return new instances.
RDKit is never in hot paths — all RDKit calls are isolated to rdkit_bridge.py.
All Rust→Python array transfers use IntoPyArray — no Python-side copy loops.
Batch API is primary — per-molecule methods are convenience wrappers.
Backend flags are explicit — "rust" or "rdkit" is always caller-supplied.
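
The immutability invariant maps directly onto Python's frozen dataclasses, which molecule.py is described as using. The sketch below shows the behavior with a hypothetical stand-in class (FrozenMol is not molcore's Mol): assignment raises FrozenInstanceError, and a transform produces a new instance through dataclasses.replace.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class FrozenMol:
    """Hypothetical stand-in for an immutable molecule record."""
    smiles: str
    name: str = ""


aspirin = FrozenMol("CC(=O)Oc1ccccc1C(=O)O")

try:
    aspirin.name = "aspirin"              # mutation is rejected
    mutated = True
except FrozenInstanceError:
    mutated = False

named = replace(aspirin, name="aspirin")  # transform returns a new instance
print(mutated, aspirin.name == "", named.name)  # False True aspirin
```

Frozen instances are also hashable by default when eq is enabled, so they can be used as dictionary keys or set members.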

Development

maturin develop --release --features extension-module   # build Rust extension
cargo test -p molcore-core                              # Rust unit tests
pytest tests/ evals/ -q                                 # 1061 Python/eval tests
python benchmarks/prove_scale.py                        # throughput benchmark (JSON)
python benchmarks/bench_e2e.py --n 1000                 # end-to-end benchmark
ruff check molcore/                                     # lint

Documentation

Quickstart notebook — Open in Colab
Migrating from RDKit — API mapping for common RDKit patterns
End-to-end GNN example — ESOL solubility benchmark
Virtual screening pipeline

License

MIT — see LICENSE.

Project details

Release history

0.7.0 (this version): May 21, 2026
0.4.0: May 15, 2026

Download files

Source Distribution

molcore_chem-0.7.0.tar.gz (155.7 kB), uploaded May 21, 2026

Built Distributions

molcore_chem-0.7.0-cp312-cp312-macosx_11_0_arm64.whl (522.0 kB), uploaded May 21, 2026, CPython 3.12, macOS 11.0+ ARM64
molcore_chem-0.7.0-cp312-cp312-macosx_10_12_x86_64.whl (529.5 kB), uploaded May 21, 2026, CPython 3.12, macOS 10.12+ x86-64
molcore_chem-0.7.0-cp311-cp311-macosx_11_0_arm64.whl (521.9 kB), uploaded May 21, 2026, CPython 3.11, macOS 11.0+ ARM64
molcore_chem-0.7.0-cp311-cp311-macosx_10_12_x86_64.whl (529.5 kB), uploaded May 21, 2026, CPython 3.11, macOS 10.12+ x86-64
