AI-native cheminformatics: Rust core + RDKit bridge + Python AI API
molcore-chem
AI-native cheminformatics toolkit — Rust-accelerated fingerprints and PyG conversion, with full RDKit compatibility and a built-in MCP server.
pip install molcore-chem
Overview
molcore extends RDKit workflows rather than replacing them. The hot paths — fingerprint generation and PyTorch Geometric graph conversion — are rewritten in Rust using Rayon parallelism and zero-copy array transfer, while standardization, descriptors, and scaffold splitting delegate to RDKit through an isolated bridge layer.
| Capability | Implementation | Notes |
|---|---|---|
| ECFP4 fingerprints | Rust (Rayon + u64 bit-packing) | 35–132× faster than RDKit |
| PyG graph conversion | Rust (IntoPyArray → torch.from_numpy) | 4.3× faster, zero-copy |
| Tanimoto matrix | Rust (Rayon + popcount) | 4.3–29× faster at scale |
| Standardization, descriptors, scaffold split | RDKit (via rdkit_bridge.py) | Parity speed, cleaner API |
Quickstart
from molcore.molecule import Mol
from molcore.pipeline import featurize_smiles
from molcore.predictor import PropertyPredictor
from molcore.io import MolDataset
import numpy as np
# Parse — immutable, Rust-backed
mol = Mol.from_smiles("CC(=O)Oc1ccccc1C(=O)O") # aspirin
data = mol.to_pyg() # PyG Data, zero-copy, 9 node features
# Batch fingerprints (Rust, Rayon-parallel); smiles_list is a caller-supplied list of SMILES strings
fps = featurize_smiles(smiles_list, backend="rust") # (N, 2048) uint8 Tensor
# Full dataset pipeline; logp_values is a caller-supplied sequence of labels, one per SMILES
ds = MolDataset.from_smiles(smiles_list, compute_fps=True, compute_desc=True)
ds.labels = np.array(logp_values, dtype=np.float32)
train_ds, val_ds, test_ds = ds.scaffold_split()
# Train GCN with MC Dropout uncertainty
pred = PropertyPredictor(hidden=64, epochs=100)
pred.fit(train_ds, val_dataset=val_ds)
means, stds = pred.predict_with_uncertainty(["CCO", "c1ccccc1"], n_samples=30)
Benchmarks
All numbers were measured on Apple M-series (arm64), CPU-only, with Python 3.12.
ECFP4 Fingerprints
| Batch size | molcore (Rust) | RDKit | Speedup |
|---|---|---|---|
| 1 000 SMILES | 1.3M mol/s | 14 800 mol/s | 88× |
| 10 000 SMILES | 2.0M mol/s | 15 100 mol/s | 132× |
Tanimoto Similarity Matrix
| Query × Library | molcore (Rust) | RDKit BulkTanimoto | Speedup |
|---|---|---|---|
| 50 × 1 000 | 31M pairs/s | 7.3M pairs/s | 4.3× |
| 500 × 10 000 | 224M pairs/s | 7.7M pairs/s | 29× |
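The Overview table attributes the Tanimoto speedup to bit-packing and popcount. The sketch below illustrates that idea in plain NumPy; it is not molcore's Rust implementation. It packs bits into bytes and counts them with a lookup table, where the Rust core is described as using u64 words and hardware popcount. The names `pack_bits` and `tanimoto_matrix` here are local helpers written for this example.

```python
import numpy as np

# 256-entry lookup table: number of set bits in each possible byte value.
POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint8)

def pack_bits(fps: np.ndarray) -> np.ndarray:
    """Pack an (N, nbits) array of 0/1 values into (N, nbits // 8) bytes."""
    return np.packbits(fps.astype(np.uint8), axis=1)

def tanimoto_matrix(queries: np.ndarray, library: np.ndarray) -> np.ndarray:
    """Return a (Q, L) float32 Tanimoto matrix for byte-packed fingerprints."""
    q_bits = POPCOUNT[queries].sum(axis=1, dtype=np.int64)   # bits set per query
    l_bits = POPCOUNT[library].sum(axis=1, dtype=np.int64)   # bits set per library row
    out = np.zeros((queries.shape[0], library.shape[0]), dtype=np.float32)
    for i, q in enumerate(queries):
        inter = POPCOUNT[library & q].sum(axis=1, dtype=np.int64)  # |A AND B|
        union = q_bits[i] + l_bits - inter                          # |A OR B|
        # Two empty fingerprints have union 0; their similarity is left at 0.
        np.divide(inter, union, out=out[i], where=union > 0)
    return out

# Toy 8-bit fingerprints: rows 0 and 1 share two of four distinct bits.
fps = np.array([[1, 1, 0, 0, 1, 0, 0, 0],
                [1, 0, 0, 0, 1, 0, 0, 1],
                [0, 0, 0, 0, 0, 0, 0, 0]], dtype=np.uint8)
packed = pack_bits(fps)
sim = tanimoto_matrix(packed[:1], packed)   # [[1.0, 0.5, 0.0]]
```

Packing turns each comparison into a handful of AND and popcount operations, which is why throughput grows with the size of the query × library product.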
End-to-End Pre-training Pipeline (500 molecules)
| Step | molcore | RDKit | Speedup |
|---|---|---|---|
| Standardize | 242 ms | 225 ms | ~parity |
| ECFP4 fingerprints | 1.1 ms | 37.3 ms | 35× |
| 7 Lipinski descriptors | 124 ms | 114 ms | ~parity |
| Scaffold split | 33 ms | 35 ms | ~parity |
| PyG conversion (200 mols) | 3.3 ms | 14.4 ms | 4.3× |
GNN Property Prediction — ESOL Solubility
Evaluated on the ESOL dataset (Delaney 2004, 1128 molecules) with a scaffold split. A scaffold split is substantially harder than the random split used in published MoleculeNet baselines, so these results are not directly comparable to the published RMSE ≈ 0.58.
| Configuration | RMSE | R² |
|---|---|---|
| GCN, hidden=64, 3 layers, 300 epochs | 1.038 | 0.727 |
| Optuna-tuned (30 trials): hidden=128, 4 layers | 1.090 | 0.709 |
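A scaffold split is harder because every test molecule belongs to a scaffold family the model never saw during training. The sketch below shows one common way to build such a split: group molecules by scaffold, then assign whole groups, largest first. It is an illustration only. molcore delegates its scaffold split to RDKit, and its exact assignment rule is not documented here. The scaffold keys are hypothetical labels assumed to be precomputed.

```python
from collections import defaultdict

def scaffold_split(scaffolds, train_frac=0.8, val_frac=0.1):
    """Split indices so that no scaffold appears in more than one subset.

    `scaffolds` holds one precomputed scaffold key per molecule (hypothetical input).
    """
    groups = defaultdict(list)
    for idx, key in enumerate(scaffolds):
        groups[key].append(idx)

    # Largest scaffold families first, so they land in the training set.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n = len(scaffolds)
    train, val, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= train_frac * n:
            train.extend(group)
        elif len(val) + len(group) <= val_frac * n:
            val.extend(group)
        else:
            test.extend(group)
    return train, val, test

# Ten molecules in four hypothetical scaffold families.
scaffolds = ["benzene"] * 6 + ["pyridine"] * 2 + ["indole", "furan"]
train, val, test = scaffold_split(scaffolds)
```

Because whole families move together, the realized fractions only approximate `train_frac` and `val_frac` on small or skewed datasets.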
Features
Billion-Scale Streaming Screen
Screen libraries that do not fit in RAM by passing any Iterable[str] of SMILES: file iterators, database cursors, or generators. Peak memory is O(chunk_size × nbits/8) bytes.
from molcore.streaming import stream_screen, StreamingScreen
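def from_file(path):
    # Yield the first whitespace-separated token (the SMILES) of each non-empty line
    with open(path) as fh:
        for line in fh:
            fields = line.split()
            if fields:
                yield fields[0]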
# Tanimoto similarity + SMARTS filter in a single pass
hits = stream_screen(
    from_file("chembl_34.smi"),
    query="c1ccc(N)cc1",
    query_smarts="[NH2]",
    threshold=0.4,
    chunk_size=10_000,
    progress=True,
)
for smiles, tanimoto_score in hits:
    print(smiles, tanimoto_score)
# Stateful version — screen multiple chunks, inspect running stats
screen = StreamingScreen(query="c1ccc(N)cc1", threshold=0.4)
for chunk in my_chunks:
    chunk_hits = screen.screen_chunk(chunk)
    save_hits(chunk_hits)
print(screen.stats) # {n_screened, n_hits, hit_rate, elapsed_s, rate_mol_s}
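The stateful example above iterates over `my_chunks` without showing where the chunks come from. One way to produce them from any iterable, together with the arithmetic behind the O(chunk_size × nbits/8) memory bound, might look like the sketch below. `chunked` and `packed_chunk_bytes` are helper names invented for this example and are not part of molcore.

```python
from itertools import islice
from typing import Iterable, Iterator, List

def chunked(smiles: Iterable[str], chunk_size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `chunk_size` items from any iterable."""
    it = iter(smiles)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk

def packed_chunk_bytes(chunk_size: int, nbits: int = 2048) -> int:
    """Bytes needed to hold one chunk of bit-packed fingerprints."""
    return chunk_size * nbits // 8

# 25 toy SMILES from a generator, consumed 10 at a time.
toy_smiles = ("C" * (i + 1) + "O" for i in range(25))
sizes = [len(chunk) for chunk in chunked(toy_smiles, chunk_size=10)]  # [10, 10, 5]

per_chunk = packed_chunk_bytes(10_000)  # 2_560_000 bytes, about 2.6 MB
```

Only one chunk is materialized at a time, so the fingerprint buffer stays at a few megabytes regardless of library size.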
MCP Server
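Any MCP-compatible host (Claude Desktop, Continue, Cursor) can invoke molcore tools directly. With the HTTP transport the host itself needs no local Python installation; the stdio transport launches the server as a local Python process.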
molcore mcp # stdio transport
molcore mcp --transport http --port 8765 # HTTP transport
Claude Desktop — add to claude_desktop_config.json:
{
  "mcpServers": {
    "molcore": {
      "command": "python",
      "args": ["-m", "molcore.mcp_server"],
      "env": {}
    }
  }
}
Nine tools are exposed: featurize, screen_smarts, screen_similarity, admet_screen,
synthesizability, generate, retro_score, active_suggest, and pareto_optimize.
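For orientation, MCP hosts invoke a tool by sending a JSON-RPC 2.0 request with the method `tools/call`. The sketch below builds such a message for the `featurize` tool. The envelope follows the MCP specification, but the argument name `smiles` is a guess made for illustration; the real argument schema is whatever the server reports through `tools/list`.

```python
import json

def tool_call_request(request_id: int, tool: str, arguments: dict) -> str:
    """Serialize an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    return json.dumps(message)

# "smiles" is a hypothetical argument name; check the tool's published schema.
payload = tool_call_request(1, "featurize", {"smiles": ["CCO"]})
decoded = json.loads(payload)
```

In practice the MCP host builds and sends these messages itself; the sketch only shows what travels over the stdio or HTTP transport.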
SDF and Parquet I/O
from molcore.io import MolDataset
ds = MolDataset.from_sdf("library.sdf")
ds = MolDataset.from_sdf("library.sdf", compute_fps=True, compute_desc=True)
ds.write_sdf("output.sdf")
ds.write_parquet("library.parquet") # Arrow columnar, snappy-compressed
ds2 = MolDataset.read_parquet("library.parquet")
Pandas Integration
import molcore.pandas_tools as mpt
df = mpt.load_sdf("library.sdf") # DataFrame with 'Mol' + 'smiles' columns
df = mpt.add_descriptors(df, preset="lipinski") # MolWt, LogP, TPSA, HBD, HBA, …
df = mpt.add_fingerprints(df, kind="ecfp4") # adds 'fp' column
df = mpt.filter_by_smarts(df, "c1ccncc1") # substructure filter in-place
df = mpt.standardize_smiles(df) # strip salts → neutralize → canonical tautomer
Descriptors
from molcore.rdkit_bridge import calc_named_descriptors
arr, names = calc_named_descriptors(smiles, preset="lipinski") # 7 descriptors
arr, names = calc_named_descriptors(smiles, preset="druglike") # 15 descriptors
arr, names = calc_named_descriptors(smiles, preset="all") # ~200 descriptors
arr, names = calc_named_descriptors(smiles, names=["MolWt", "TPSA", "BertzCT"])
Returns (N, D) float32 arrays.
Fingerprint Types
fps = featurize_smiles(smiles, kind="ecfp4") # (N, 2048) — Rust parallel
fps = featurize_smiles(smiles, kind="maccs") # (N, 167)
fps = featurize_smiles(smiles, kind="atom_pairs") # (N, 2048)
fps = featurize_smiles(smiles, kind="topological_torsions") # (N, 2048)
fps = featurize_smiles(smiles, kind="rdkit") # (N, 2048) RDKit path-based
2D Depiction
mol = Mol.from_smiles("CC(=O)Oc1ccccc1C(=O)O")
mol # renders inline in Jupyter via _repr_svg_
mol.to_png("aspirin.png")
ds = MolDataset.from_sdf("library.sdf")
ds # renders 8-molecule grid inline
ds.draw_grid(n=20, mols_per_row=4)
Standardization
from molcore.rdkit_bridge import standardize
clean = standardize("[Na+].OC(=O)c1ccccc1") # → "OC(=O)c1ccccc1"
# strips salts → neutralizes charges → canonical tautomer → canonical SMILES
MCS and R-Group Decomposition
from molcore.rdkit_bridge import find_mcs, rgroup_decompose
smarts = find_mcs(["CC(=O)Oc1ccccc1", "CC(=O)Oc1ccc(F)cc1", "CC(=O)Oc1ccc(Cl)cc1"])
rows = rgroup_decompose("c1ccc([*:1])cc1", smiles_list)
# → [{"Core": "c1ccccc1", "R1": "F"}, {"Core": "c1ccccc1", "R1": "Cl"}, ...]
GCN Predictor with MC Dropout Uncertainty
from molcore.predictor import PropertyPredictor
pred = PropertyPredictor(hidden=64, n_layers=3, epochs=100, dropout=0.1)
pred.fit(train_ds, val_dataset=val_ds, verbose=True)
predictions = pred.predict(smiles_list) # numpy array
means, stds = pred.predict_with_uncertainty(smiles_list, n_samples=30)
pred.save("logp_model.pt")
pred2 = PropertyPredictor.load("logp_model.pt")
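MC Dropout estimates uncertainty by keeping dropout active at inference time, running several stochastic forward passes, and reporting the mean and standard deviation of the outputs. The NumPy sketch below shows the mechanism on a tiny two-layer network with random weights. It is not the GCN used by PropertyPredictor, whose internals are not shown here; `mc_dropout_predict` is a name made up for this example.

```python
import numpy as np

def mc_dropout_predict(x, w_hidden, w_out, p=0.1, n_samples=30, seed=0):
    """Return (mean, std) over stochastic forward passes of a tiny MLP."""
    rng = np.random.default_rng(seed)
    samples = []
    for _ in range(n_samples):
        hidden = np.maximum(x @ w_hidden, 0.0)   # ReLU layer
        keep = rng.random(hidden.shape) >= p     # fresh dropout mask on every pass
        hidden = hidden * keep / (1.0 - p)       # inverted-dropout scaling
        samples.append(hidden @ w_out)
    samples = np.stack(samples)                  # shape (n_samples, N)
    return samples.mean(axis=0), samples.std(axis=0)

# Toy inputs and untrained random weights, purely for illustration.
rng = np.random.default_rng(42)
x = rng.normal(size=(4, 8))           # 4 inputs with 8 features each
w_hidden = rng.normal(size=(8, 16))
w_out = rng.normal(size=16)

means, stds = mc_dropout_predict(x, w_hidden, w_out, p=0.1, n_samples=30)
```

A larger `n_samples` gives a smoother estimate at proportionally higher inference cost; with `p=0` every pass is identical and the reported spread collapses to zero.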
Drug-Target Interaction Prediction
from molcore import DTIDataset, DTIPredictor
ds = DTIDataset(
    smiles=["CC(=O)O", "c1ccccc1"],
    sequences=["MKTLLILAVL", "ACDEFGHIKL"],
    labels=[6.5, 7.2],  # pIC50
)
train, val, test = ds.scaffold_split(train_frac=0.8, val_frac=0.1)
pred = DTIPredictor(hidden=64, n_layers=3, epochs=100, model_type="gcn")
pred.fit(train, val_dataset=val)
affinities = pred.predict(["CCO"], ["MKTLLILAVL"]) # (N,) float32 pIC50
metrics = pred.score(test) # {r2, mae, rmse, n}
model_type accepts "gcn", "gat", or "gin". ESM-2 protein embeddings are available
via pip install molcore-chem[bio].
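A convolutional protein encoder needs the amino-acid sequence as a numeric array. A common choice is zero-padded one-hot encoding, sketched below. Whether DTIPredictor uses one-hot inputs, learned embeddings, or something else is not stated here, so treat this as an assumption; `one_hot_sequences` is a helper written for this example.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard residues
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_sequences(sequences, max_len):
    """Encode sequences as an (N, max_len, 20) float32 array, zero-padded."""
    out = np.zeros((len(sequences), max_len, len(AMINO_ACIDS)), dtype=np.float32)
    for n, seq in enumerate(sequences):
        for pos, aa in enumerate(seq[:max_len]):   # truncate overly long sequences
            idx = AA_INDEX.get(aa)
            if idx is not None:                    # unknown residues stay all-zero
                out[n, pos, idx] = 1.0
    return out

encoded = one_hot_sequences(["MKTLLILAVL", "ACDEFGHIKL"], max_len=12)
# encoded.shape is (2, 12, 20); the last two positions of each row are padding.
```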
Installation
pip install molcore-chem
Requires Python 3.11+. RDKit and PyTorch are declared dependencies — no manual conda setup required. Pre-compiled Rust extensions are included in the wheel.
GPU (CUDA 12.1):
pip install molcore-chem
pip install torch --index-url https://download.pytorch.org/whl/cu121
Build from Source
git clone https://github.com/Anteneh-T-Tessema/molcore
cd molcore
./setup_dev.sh # creates .venv, builds Rust extension, runs tests
source .venv/bin/activate
Requires Rust 1.70+:
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
Architecture
SMILES strings
│
▼ Rust ingest (RDKit-backed aromaticity perception)
│ — sanitize, kekulize, ring perception, implicit H
▼
petgraph StableGraph (immutable after construction)
│
├─▶ ecfp4_batch() → (N × 2048) uint8 ─▶ torch.from_numpy() ─▶ Tensor
│ Rayon parallel · u64 bit-pack · hardware popcount · 35–132× faster
│
├─▶ mol_to_graph_arrays() → node_feats (9-dim), edge_index, edge_attr ─▶ PyG Data
│ Zero-copy IntoPyArray · 4.3× faster than manual Python construction
│
└─▶ tanimoto_matrix() → (Q × L) float32
Rayon parallel · u64 popcount · 29× faster at scale
Python layer (molcore/)
molecule.py — frozen Mol dataclass (FrozenInstanceError on mutation)
pipeline.py — featurize_smiles() batch-first entry point
rdkit_bridge.py — all RDKit calls isolated here (one file to update)
io.py — MolDataset: SDF + Parquet + DataFrame bridge
predictor.py — PropertyPredictor: 3-layer GCN + MC Dropout
dti.py — DTIPredictor: GCN/GAT/GIN ligand + 1D-CNN protein encoder
pandas_tools.py — DataFrame-first API for existing RDKit workflows
agentic_rag.py — ChemRAG: iterative chemical literature retrieval
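The listing above describes Mol as a frozen dataclass that raises FrozenInstanceError on mutation. The standard-library behavior this relies on can be seen with a small stand-in class. `ToyMol` is hypothetical and is not molcore's Mol.

```python
from dataclasses import FrozenInstanceError, dataclass, replace

@dataclass(frozen=True)
class ToyMol:
    """Hypothetical stand-in for an immutable molecule record."""
    smiles: str
    name: str = ""

aspirin = ToyMol("CC(=O)Oc1ccccc1C(=O)O")

try:
    aspirin.name = "aspirin"      # assignment on a frozen dataclass is rejected
    mutated = True
except FrozenInstanceError:
    mutated = False

# "Transforms" build a new instance; the original stays unchanged.
named = replace(aspirin, name="aspirin")
```

Frozen dataclasses with default equality are also hashable, so instances can serve as dictionary keys or be shared across threads without defensive copies.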
Design Invariants
- Mol is always immutable: transforms return new instances.
- RDKit is never in hot paths: all RDKit calls are isolated to rdkit_bridge.py.
- All Rust→Python array transfers use IntoPyArray, with no Python-side copy loops.
- Batch API is primary: per-molecule methods are convenience wrappers.
- Backend flags are explicit: "rust" or "rdkit" is always caller-supplied.
Development
maturin develop --release --features extension-module # build Rust extension
cargo test -p molcore-core # Rust unit tests
pytest tests/ evals/ -q # 1061 Python/eval tests
python benchmarks/prove_scale.py # throughput benchmark (JSON)
python benchmarks/bench_e2e.py --n 1000 # end-to-end benchmark
ruff check molcore/ # lint
Documentation
- Quickstart notebook — Open in Colab
- Migrating from RDKit — API mapping for common RDKit patterns
- End-to-end GNN example — ESOL solubility benchmark
- Virtual screening pipeline
License
MIT — see LICENSE.