Microstate version-control system for drug discovery workflows
MicrostateLedger
MicrostateLedger is a microstate version-control system for drug discovery workflows. It turns protonation, tautomer, stereochemistry, conformation, and mapping decisions into first-class tracked objects.
The project is designed for pipelines that span multiple tools (RDKit -> Docking -> MD -> QM) and need traceability, reproducibility, and team-safe collaboration.
Table of Contents
- 1. Why MicrostateLedger
- 2. Core Capabilities
- 3. System Architecture
- 4. Repository Layout
- 5. Requirements
- 6. Installation
- 7. Quick Start
- 8. Configuration
- 9. CLI Command Guide
- 10. Data Model and Stable IDs
- 11. Atom Mapping and Reversibility
- 12. Reproducibility and Auditability
- 13. Optional Engines and Graceful Degradation
- 14. License
1. Why MicrostateLedger
In practical molecular workflows, the same compound can appear in many chemically distinct states:
- Protonation states
- Tautomers
- Stereochemical variants (including undefined centers)
- Multiple conformers
- Tool-specific atom indexing or topology representations
Without strict tracking, teams frequently lose consistency between stages. MicrostateLedger solves this by introducing ledger-backed object lineage and stable IDs across the full workflow.
2. Core Capabilities
- Stable IDs for compounds, microstates, conformers, poses, receptors, and artifacts
- Full provenance in SQLite (runs, decisions, anomalies, edges, artifacts)
- Policy-driven enumeration and pruning (top-k, caps, stage-specific rules)
- Sidecar atom mapping for cross-software consistency
- Diff utilities for chemistry, geometry, charge, and pipeline execution
- Batch execution with resume support and crash-safe state files
- Optional DVC/DataLad/lwreg integrations for data governance
3. System Architecture
MicrostateLedger is organized in three layers:
- Ledger Core (msl/): CLI, DB schema access, config handling, ID generation, and run/audit recording.
- Tool Drivers (scripts/): RDKit ingest/enumeration, conformer generation, docking prep/ingest helpers, charge tools, MD feedback checks.
- Object Store (objects/, runs/, reports/): all generated artifacts, run outputs, and report files.
Each workflow action writes both files and ledger records, so every output can be traced to inputs, policy, tool, and run context.
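The dual write described above (a file in the object store plus a ledger row) could look roughly like the sketch below. The table layout, column names, and the record_artifact helper are assumptions made for illustration; the project's real schema lives in schemas/ and may differ.

```python
import hashlib
import sqlite3
from pathlib import Path


def record_artifact(conn, objects_dir, run_id, data):
    """Store bytes under a content-addressed path and log them in the ledger.

    Hypothetical helper and schema, not MicrostateLedger's actual API.
    """
    digest = hashlib.sha256(data).hexdigest()
    # Shard by the first two hex characters to keep directories small.
    path = Path(objects_dir) / digest[:2] / digest
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(data)
    with conn:  # commits the insert on success, rolls it back on error
        conn.execute(
            "CREATE TABLE IF NOT EXISTS artifacts "
            "(run_id TEXT, path TEXT, sha256 TEXT)"
        )
        conn.execute(
            "INSERT INTO artifacts (run_id, path, sha256) VALUES (?, ?, ?)",
            (run_id, str(path), digest),
        )
    return digest
```

Because the path is derived from the content hash, writing the same bytes twice lands on the same file, while each run still gets its own ledger row.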
4. Repository Layout
- msl/: Python package source
- scripts/: stage driver scripts
- schemas/: SQL schema definitions
- policies/: policy templates
- tests/: unit and regression tests
- bin/: wrapper scripts for environment-isolated execution
- msl.toml: default project config
- README.md: user guide
5. Requirements
Minimum:
- Linux (recommended)
- Python 3.10+
- SQLite 3
Optional tooling by stage:
- RDKit, Dimorphite (enumeration/perception)
- Meeko, AutoDock Vina (docking)
- OpenFF Interchange, OpenMM (MD parameterization/smoke runs)
- PyPE_RESP / external QM tools (charge workflows)
- DVC, DataLad, lwreg (governance integrations)
SQLite operational notes:
- Prefer local SSD paths for ledger.sqlite; SQLite on NFS/shared mounts may have lock latency.
- For high parallelism, use one ledger per job and merge results later via exported artifacts/provenance.
- If you hit "database is locked", retry with serialized writers (or separate ledgers) instead of forcing shared writes.
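For scripts that write to a ledger directly, the retry advice above can be sketched with the standard library alone. The function name write_with_retry, the attempt count, and the backoff values are illustrative assumptions; MicrostateLedger's own writers may handle locks differently.

```python
import sqlite3
import time


def write_with_retry(db_path, sql, params=(), attempts=5, base_delay=0.1):
    """Run one write statement, retrying while SQLite reports a lock.

    Hypothetical helper; returns the number of attempts that were needed.
    """
    for attempt in range(attempts):
        # timeout makes SQLite itself wait up to 5 s for a lock to clear
        conn = sqlite3.connect(db_path, timeout=5.0)
        try:
            with conn:  # commits on success, rolls back on error
                conn.execute(sql, params)
            return attempt + 1
        except sqlite3.OperationalError as exc:
            # Re-raise anything that is not a lock, and the final failure.
            if "locked" not in str(exc).lower() or attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff
        finally:
            conn.close()
```

Opening a fresh connection per attempt keeps the example simple; a long-running writer would more likely hold one connection and serialize its own writes.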
6. Installation
6.1 Install from PyPI (recommended for users)
python -m pip install MicrostateLedger
CLI entrypoint after install:
msl --help
6.2 Install from source (development)
git clone https://github.com/woshuizhaol/MicrostateLedger.git
cd MicrostateLedger
python -m pip install -e .
6.3 Optional extras
python -m pip install "MicrostateLedger[cheminformatics]"
python -m pip install "MicrostateLedger[docking]"
python -m pip install "MicrostateLedger[md]"
python -m pip install "MicrostateLedger[charges]"
python -m pip install "MicrostateLedger[repro]"
6.4 Shared-server wrapper mode
If your team uses per-tool isolated environments, use wrappers in bin/:
./bin/msl --help
This mode is useful on shared compute servers where tools live in different envs.
7. Quick Start
7.1 Initialize the ledger
msl init
7.2 Ingest and perceive
msl ingest "CCO"
msl perceive <compound_id>
7.3 Enumerate microstates
msl enumerate <compound_id>
7.4 Generate conformers
msl conformers <microstate_id> --n 50
7.5 Docking preparation and ingest
msl dock-prep <microstate_id> <conformer_id>
msl dock-ingest <conformer_id> <pose.pdbqt> --score -7.5
msl dock-select <microstate_id> --k 20
msl select-final <microstate_id> --pose-id <pose_id>
7.6 Charge and MD setup
msl charge <microstate_id> --method pype_resp --auto-qm --conformer-id <conformer_id>
msl md-param <microstate_id> --charges-json <charges.json>
7.7 MD feedback and transition tracking
msl md-feedback <microstate_id> <probe.sdf>
msl md-transition <microstate_id> <probe.sdf>
7.8 One-command demo
bash scripts/demo_full_pipeline.sh "CCO" demo_ethanol
8. Configuration
The default project config file is msl.toml.
Typical keys include:
- ledger_db: SQLite path
- objects_dir: artifact root
- runs_dir: run output root
- reports_dir: reports root
- policy_path: active policy YAML
- envs.<stage>: per-stage environment selection
Policy behavior is controlled through policies/default.yaml, including:
- pH range and enumeration limits
- top-k/cap constraints
- stage-level keep/drop behavior
- optional fallback behavior when tools are missing
9. CLI Command Guide
9.1 Core lifecycle
- msl init: initialize ledger database
- msl migrate: apply schema migrations
- msl ingest: register a standardized compound
- msl perceive: generate risk signals before expansion
- msl enumerate: generate microstates
- msl conformers: create conformers
9.2 Docking and selection
- msl receptor-add: register receptor structure
- msl dock-prep: export docking-ready ligand + mapping sidecar
- msl dock-ingest: register docking poses and scores
- msl dock-select: rank/select top poses
- msl select-final: mark final microstate candidate for downstream
9.3 Charges and MD
- msl charge: generate/import per-atom charges
- msl md-param: build MD-ready system artifacts
- msl md-feedback: detect MD anomalies and record suggestions
- msl md-transition: record observed state transition
9.4 Diff and provenance
- msl diff-microstate: compare chemistry-level state definitions
- msl diff-conformer: compare geometry (RMSD/torsions)
- msl diff-charge: compare charge vectors by canonical atom ids
- msl diff-pipeline: compare run-level stage/tool/params
- msl prov-export: export provenance as PROV-JSON
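To make the idea behind diff-charge concrete, here is a minimal sketch of comparing two charge sets keyed by canonical atom id rather than by tool-specific index. The function name, the tolerance, the atom ids, and the charge values are all made up for illustration and do not reflect the command's actual output format.

```python
def diff_charges(charges_a, charges_b, tol=1e-3):
    """Compare two {canonical_atom_id: partial_charge} mappings.

    Reports atoms present on only one side and atoms whose charge
    differs by more than tol. Hypothetical helper for illustration.
    """
    ids_a, ids_b = set(charges_a), set(charges_b)
    changed = {
        atom: (charges_a[atom], charges_b[atom])
        for atom in sorted(ids_a & ids_b)
        if abs(charges_a[atom] - charges_b[atom]) > tol
    }
    return {
        "only_a": sorted(ids_a - ids_b),
        "only_b": sorted(ids_b - ids_a),
        "changed": changed,
    }


# Illustrative values only, not computed charges.
method_1 = {"a1": -0.40, "a2": 0.10, "a3": 0.30}
method_2 = {"a1": -0.35, "a2": 0.10, "a4": 0.25}
print(diff_charges(method_1, method_2))
```

Keying by canonical id means the comparison stays valid even when two tools order the atoms differently.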
9.5 Maintenance and ecosystem
- msl clean-invalid-microstates: remove sanitize-failed records
- msl batch: resumable batch execution
- msl dvc-track, msl dvc-init: DVC utilities
- msl datalad-init, msl datalad-save: DataLad utilities
- msl lwreg-init, msl lwreg-register: lwreg integration
- msl demo: full demonstration pipeline
10. Data Model and Stable IDs
Core entities:
- Compound
- Microstate
- Conformer
- Pose
- System
- Decision
- Anomaly
- Edge
- Artifact
ID strategy highlights:
- CompoundID: derived from registration hash
- MicrostateID: derived from fixedH_inchi + charge + stereo + coordination signature
- ConformerID/PoseID/ArtifactID: derived from content hash
This allows deterministic identity under fixed policy/tooling and explicit tracking of intended variability.
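One way to picture hash-derived IDs is the sketch below, which builds a microstate ID from the four fields named above. The choice of SHA-256, the "MS" prefix, the 16-character truncation, the field separator, and the example field values are all assumptions; the project's actual ID format is not specified here.

```python
import hashlib


def content_id(prefix, payload, length=16):
    """Derive a short, deterministic ID from raw bytes (hypothetical format)."""
    return f"{prefix}-{hashlib.sha256(payload).hexdigest()[:length]}"


def microstate_id(fixedh_inchi, charge, stereo, coordination):
    """Hash the identity-defining fields of a microstate.

    The unit-separator character keeps adjacent fields from running
    together, so ("ab", "c") and ("a", "bc") hash differently.
    """
    fields = [fixedh_inchi, str(charge), stereo, coordination]
    return content_id("MS", "\x1f".join(fields).encode("utf-8"))


# Ethanol's standard InChI is used here only as a stand-in input string.
inchi = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
neutral = microstate_id(inchi, 0, "none", "none")
anion = microstate_id(inchi, -1, "none", "none")
print(neutral, anion)
```

The same inputs always give the same ID, and changing any single field (here the charge) gives a different one, which is the property the ID strategy relies on.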
11. Atom Mapping and Reversibility
MicrostateLedger uses sidecar mapping files to preserve canonical atom identity across tool conversions.
Typical artifacts:
- Docking: ligand.map.json
- MD/topology: atommap.json
Goal:
- A canonical atom id can be traced from RDKit representation to docking and MD representations.
- Mapping evidence remains attached to run/artifact records in the ledger.
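The reversibility goal can be illustrated with a small sidecar that maps canonical atom ids to a tool's atom indices and back. The JSON layout (a single canonical_to_tool object), the atom ids, and the helper names are assumptions for this sketch; the real ligand.map.json and atommap.json formats may differ.

```python
import json


def build_sidecar(canonical_ids, tool_indices):
    """Pair canonical atom ids with tool atom indices, enforcing one-to-one."""
    if len(canonical_ids) != len(tool_indices):
        raise ValueError("id and index lists must have the same length")
    if len(set(canonical_ids)) != len(canonical_ids) or len(set(tool_indices)) != len(tool_indices):
        raise ValueError("mapping must be one-to-one to stay reversible")
    return {"canonical_to_tool": dict(zip(canonical_ids, tool_indices))}


def invert(sidecar):
    """Recover the tool-index -> canonical-id direction from a sidecar."""
    return {index: atom for atom, index in sidecar["canonical_to_tool"].items()}


# A tool that reorders atoms: canonical a1, a2, a3 end up at indices 2, 0, 1.
sidecar = build_sidecar(["a1", "a2", "a3"], [2, 0, 1])
restored = json.loads(json.dumps(sidecar))  # round-trip as it would be on disk
tool_to_canonical = invert(restored)
print(tool_to_canonical[0])
```

Rejecting duplicate ids or indices up front is what guarantees the inverse lookup is well defined.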
12. Reproducibility and Auditability
MicrostateLedger records provenance at each stage:
- runs: stage, tool, params, status
- artifacts: path + hash linkage
- decisions: why objects were kept/dropped
- anomalies: what failed/drifted and suggested next actions
- edges: graph lineage between objects
For reproducibility analysis:
- Use repeated runs under fixed seed/policy and compare IDs/artifact hashes
- Export pipeline diffs and PROV-JSON for external review
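A repeated-run comparison can also be done outside the CLI by hashing the files each run produced. The sketch below assumes two run output directories with comparable relative layouts; the function names are hypothetical and it does not read the ledger itself.

```python
import hashlib
from pathlib import Path


def hash_tree(root):
    """Map each file's relative path under root to its SHA-256 digest."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def compare_runs(run_a, run_b):
    """Return relative paths that are missing on one side or differ in content."""
    hashes_a, hashes_b = hash_tree(run_a), hash_tree(run_b)
    return sorted(
        name
        for name in hashes_a.keys() | hashes_b.keys()
        if hashes_a.get(name) != hashes_b.get(name)
    )
```

An empty result from compare_runs means both directories hold byte-identical files; any listed path is a candidate for closer inspection with the diff commands.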
13. Optional Engines and Graceful Degradation
Some stages are optional and may be unavailable in minimal installs.
Expected behavior:
- Core stages still run where dependencies exist.
- Missing optional tools should produce explicit decisions/logs instead of silent failure.
- You can combine minimal core usage with selectively enabled advanced stages.
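The "explicit decision instead of silent failure" behavior can be sketched as a dependency probe that always returns a recorded outcome. The decision fields and the stage-to-module pairing are assumptions for illustration; rdkit, meeko, and openmm are the import names of the optional packages listed under Requirements.

```python
import importlib.util


def plan_stage(stage, module_name):
    """Return an explicit run/skip decision for a stage with an optional dependency.

    Hypothetical helper; module_name should be a top-level import name.
    """
    available = importlib.util.find_spec(module_name) is not None
    return {
        "stage": stage,
        "module": module_name,
        "action": "run" if available else "skip",
        "reason": "dependency found"
        if available
        else f"optional module '{module_name}' is not installed",
    }


for stage, module in [("enumerate", "rdkit"), ("docking", "meeko"), ("md", "openmm")]:
    decision = plan_stage(stage, module)
    print(f"{decision['stage']}: {decision['action']} ({decision['reason']})")
```

The printed result depends on what is installed locally; the point is that a missing tool yields a skip record with a reason rather than an import error partway through a run.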
14. License
See LICENSE.