Skip to main content

MCP server and Python toolkit for perception, rendering, and analysis of molecules and reaction schemes in ChemDraw CDXML.

Project description

cdxml-toolkit

Chemistry office automation toolkit with MCP (Model Context Protocol) server. Lets LLM agents draw reaction schemes, parse ELN exports, analyze LCMS data, and produce publication-ready ChemDraw (CDXML) output.

The goal: any chemist with a consumer GPU can run a local LLM agent that helps with routine chemistry office tasks. The toolkit provides 15 grounded, validated chemistry tools that LLMs call via MCP — the agent reasons about chemistry while the tools handle SMILES resolution, 2D coordinate generation, and CDXML layout.

Built and tested with Claude Code (Opus 4.6). I directed the design and architecture; Claude did the implementation. I'm a PhD organic chemist, not a programmer — this project wouldn't exist without Claude Code, and I thank Anthropic.

Installation

Prerequisites: Windows with ChemDraw (ChemOffice 2015+) installed. Python 3.10–3.13 (3.14 is not yet supported by TensorFlow/DECIMER).

# 1. Create a conda environment and install
conda create -n cdxml python=3.12 pip -y
conda activate cdxml
pip install cdxml-toolkit

# 2. Run the doctor to check your setup
cdxml-doctor --no-tests

Everything is included by default: RDKit, MCP server, ChemDraw COM, Office support, PDF analysis, image processing, DECIMER neural image extraction, OPSIN, and OCR.

On first run, cdxml-doctor will extract the bundled JRE for OPSIN (~45 MB, one-time) and download DECIMER neural models (~570 MB). Subsequent runs are fast.

If ChemScript is not configured, cdxml-doctor will detect your ChemDraw installation, show what it found, and offer to set everything up automatically:

=== ChemScript setup ===

  Found ChemScript DLLs:
    Managed:  C:\...\CambridgeSoft.ChemScript16.dll (32-bit)
    Native:   C:\...\ChemScript160.dll (32-bit)

  32-bit ChemScript requires a 32-bit Python environment.
  The doctor will run the following commands:

    set CONDA_SUBDIR=win-32 && conda create -n chemscript32 python=3.10 pip -y
    C:\Users\YOU\miniconda3\envs\chemscript32\python.exe -m pip install pythonnet

  Proceed? [y/N] y

  Creating chemscript32 conda env...
  chemscript32 env created.
  Installing pythonnet in chemscript32...
  pythonnet installed.
  Saving config...
  ChemScript configured. Run cdxml-doctor again to verify.

Run cdxml-doctor --no-tests again to confirm ChemScript shows OK.

ChemScript is optional — without it, OPSIN handles IUPAC name resolution as an offline fallback. ChemScript adds bidirectional name-to-structure conversion and aligned naming.

Alternatively, install from GitHub for the latest development version:

pip install "cdxml-toolkit @ git+https://github.com/leehiufung911/cdxml-toolkit.git@main"

MCP server (Claude Desktop)

The primary interface is the MCP server. Connect it to Claude Desktop and chat naturally: "Draw deucravacitinib", "Help me complete my lab book", "Extract structures from this image".

Open %APPDATA%\Claude\claude_desktop_config.json. It will look something like this:

{
  "preferences": {
    ...
  }
}

Add an "mcpServers" key at the top level, next to "preferences" (change YOUR_USERNAME to your Windows username):

{
  "mcpServers": {
    "cdxml-toolkit": {
      "command": "C:\\Users\\YOUR_USERNAME\\miniconda3\\envs\\cdxml\\python.exe",
      "args": ["-m", "cdxml_toolkit.mcp_server"]
    }
  },
  "preferences": {
    ...
  }
}

Restart Claude Desktop. Verify by asking:

> Resolve "aspirin", then draw it.

Expected: 2 tool calls (resolve_name, draw_molecule), produces an aspirin CDXML file.

The same config pattern works with other MCP-compatible agents (Claude Code, opencode, qwen-agent, etc.).

Agent instructions file

Copy CLAUDE.md from the repository root into your agent's working directory. This file contains critical rules that prevent the agent from hallucinating chemistry:

  • Never write SMILES from built-in knowledge or vision. Every molecule must come from a tool (resolve_name, modify_molecule, extract_structures_from_image, etc.).
  • Never use vision to identify molecular structures. Image reading can recognize that "this is a reaction scheme" but cannot reliably determine exact atom connectivity. Always use extract_structures_from_image for that — it runs DECIMER neural network OCR and returns validated SMILES.
  • Never edit SMILES directly. Use modify_molecule which provides an MCS diff to verify the change.

For Claude Code, name it CLAUDE.md. For other agents, use agents.md or whatever your framework reads as system instructions.

MCP tools (15)

Chemistry resolution

Tool Description
resolve_name Name/abbreviation/CAS/formula to rich molecule JSON (5-tier: reagent DB, condensed formula, ChemScript, OPSIN, PubChem)
modify_molecule 6 operations: analyze, name_surgery, smarts, set_smiles, set_name, reaction. 162 named reaction templates. Returns MCS-based structural diffs.

Structure rendering

Tool Description
draw_molecule Single molecule to CDXML
render_scheme YAML/compact text/reaction JSON to publication-ready CDXML. Forgiving parser handles common LLM YAML mistakes.

Perception (reading existing chemistry)

Tool Description
parse_reaction ELN exports (CDXML/CDX/CSV/RXN) to semantic JSON with species, roles, SMILES, equivalents
summarize_reaction Context-efficient view of reaction JSON (select only the fields you need)
extract_structures_from_image Image to SMILES + confidence scores via DECIMER neural network
parse_scheme CDXML scheme to structured species/steps/topology JSON

Analysis

Tool Description
parse_analysis_file LCMS (Waters/manual) or NMR (MestReNova) PDF to structured peak data
format_lab_entry Structured entry dicts to formatted lab book text. Re-reads LCMS PDFs for exact numbers.

Office integration

Tool Description
extract_cdxml_from_office Pull embedded ChemDraw OLE objects from PPTX/DOCX
embed_cdxml_in_office Inject CDXML as editable ChemDraw OLE into PPTX/DOCX
convert_cdx_cdxml Bidirectional CDX/CDXML conversion
search_compound Find a molecule across experiment directories by SMILES similarity
render_to_png CDXML to PNG via ChemDraw COM

Design principles

Never trust LLM-generated SMILES. The agent always goes through resolve_name to get grounded SMILES from databases. Direct SMILES generation is the #1 source of chemistry hallucination.

Verify every transformation. modify_molecule returns aligned IUPAC name diffs and MCS-based molecular diffs after every edit. The agent can confirm the transformation is correct.

Never flood the agent. Large outputs (CDXML, JSON) always write to files and return {ok: true, output_path: "...", size: 23456}. The agent never gets 30KB of XML in its context window.

Forgiving inputs. The YAML parser accepts 9+ common LLM mistakes (inline structures, substrates as alias for structures, text as string not list, bare SMILES, above_arrow as list/string). Input parameters accept bare SMILES strings, stringified JSON arrays, and fuzzy operation names.

Actionable errors. Every error tells the agent what to do instead: "Did you mean: BOC_deprotection?", not "KeyError".

Progressive discovery. Call any tool with no arguments to get usage examples and schema reference.

CLI tools

All tools are also available as command-line scripts:

Command Description
cdxml-mcp MCP server (primary interface)
cdxml-parse Parse reaction files to JSON
cdxml-render Render JSON/YAML/compact text to CDXML
cdxml-convert CDX/CDXML bidirectional conversion
cdxml-image CDXML to PNG/SVG (ChemDraw COM)
cdxml-merge Merge multiple reaction schemes
cdxml-layout Clean up reaction layout (pure Python)
cdxml-ole Embed CDXML as editable OLE in PPTX/DOCX
cdxml-lcms Parse LCMS PDF reports
cdxml-nmr Extract NMR data from MestReNova PDFs
cdxml-format-entry Format lab book entries
cdxml-discover Discover experiment files in a directory
cdxml-doctor Diagnostics, test runner, and ChemScript setup guide

Scheme DSL

The renderer accepts three input formats:

YAML (what agents typically write):

layout: sequential
structures:
  SM:
    smiles: "Brc1ncnc2sccc12"
  Product:
    smiles: "c1nc(N2CCOCC2)c2ccsc2n1"
steps:
  - substrates: [SM]
    products: [Product]
    above_arrow:
      structures: [Morph]
    below_arrow:
      text: ["Pd2(dba)3", "BINAP", "Cs2CO3", "Dioxane, 105 C"]

Compact text ("Mermaid for reactions"):

SM: {Brc1ncnc2sccc12}
SM --> Product{c1nc(N2CCOCC2)c2ccsc2n1}
  above: Morph{C1COCCN1}
  below: "Pd2(dba)3", "BINAP", "Cs2CO3"

Reaction JSON (from parse_reaction):

cdxml-render --from-json reaction.json -o scheme.cdxml

Running tests

# Using cdxml-doctor (recommended — also prints diagnostics)
cdxml-doctor

# Or directly with pytest
pytest tests/ -v

License

MIT

Attribution

See NOTICE.md for third-party data attribution (ChemScanner, RDKit).

Author

Hiu Fung Kevin Lee (@leehiufung911)

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

cdxml_toolkit-0.5.17.tar.gz (48.5 MB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

cdxml_toolkit-0.5.17-py3-none-any.whl (48.5 MB view details)

Uploaded Python 3

File details

Details for the file cdxml_toolkit-0.5.17.tar.gz.

File metadata

  • Download URL: cdxml_toolkit-0.5.17.tar.gz
  • Upload date:
  • Size: 48.5 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for cdxml_toolkit-0.5.17.tar.gz
Algorithm Hash digest
SHA256 ba060d0d45f5fa12a0a0fb24bd23220424845f001cfa02afe904214c38ea7e97
MD5 2a893233f87f2225df1e6456fbc8051d
BLAKE2b-256 ce72480af11e31d9040316206894a389459a7967df757055ac46ba73d3fce9a7

See more details on using hashes here.

File details

Details for the file cdxml_toolkit-0.5.17-py3-none-any.whl.

File metadata

  • Download URL: cdxml_toolkit-0.5.17-py3-none-any.whl
  • Upload date:
  • Size: 48.5 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for cdxml_toolkit-0.5.17-py3-none-any.whl
Algorithm Hash digest
SHA256 c1d125551a5ac3d717ef29d4d8e6bea8e12a3d0868e71aa2a75496b6d0614071
MD5 4617ecedc70ae76be66a5b084e9603f0
BLAKE2b-256 1fc21c024c011befe1848372b41dfff33c44ef718a6f72c79c18b67276ea4788

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page