
Microstate version-control system for drug discovery workflows


MicrostateLedger

MicrostateLedger is a microstate version-control system for drug discovery workflows. It turns protonation, tautomer, stereochemistry, conformation, and mapping decisions into first-class tracked objects.

The project is designed for pipelines that span multiple tools (RDKit -> Docking -> MD -> QM) and need traceability, reproducibility, and team-safe collaboration.


1. Why MicrostateLedger

In practical molecular workflows, the same compound can appear in many chemically distinct states:

  • Protonation states
  • Tautomers
  • Stereochemical variants (including undefined centers)
  • Multiple conformers
  • Tool-specific atom indexing or topology representations

Without strict tracking, teams frequently lose consistency between stages. MicrostateLedger solves this by introducing ledger-backed object lineage and stable IDs across the full workflow.

2. Core Capabilities

  • Stable IDs for compounds, microstates, conformers, poses, receptors, and artifacts
  • Full provenance in SQLite (runs, decisions, anomalies, edges, artifacts)
  • Policy-driven enumeration and pruning (Top-k, caps, stage-specific rules)
  • Sidecar atom mapping for cross-software consistency
  • Diff utilities for chemistry, geometry, charge, and pipeline execution
  • Batch execution with resume support and crash-safe state files
  • Optional DVC/DataLad/lwreg integrations for data governance

3. System Architecture

MicrostateLedger is organized in three layers:

  1. Ledger Core (msl/): CLI, DB schema access, config handling, ID generation, and run/audit recording.
  2. Tool Drivers (scripts/): RDKit ingest/enumeration, conformer generation, docking prep/ingest helpers, charge tools, MD feedback checks.
  3. Object Store (objects/, runs/, reports/): all generated artifacts, run outputs, and report files.

Each workflow action writes both files and ledger records, so every output can be traced to inputs, policy, tool, and run context.
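As a rough illustration of what tracing an output back to its inputs can look like, the sketch below walks a list of lineage edges backwards from one object. The (parent, child) tuple format and the object names are assumptions made for this example, not the actual ledger schema.

```python
from collections import deque


def ancestors(edges, target):
    """Return every upstream object of `target`, nearest first (breadth-first)."""
    # Index edges by child so each lookup of an object's parents is O(1).
    parents = {}
    for parent, child in edges:
        parents.setdefault(child, []).append(parent)

    seen, order = set(), []
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for parent in parents.get(node, []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


# Hypothetical lineage: compound -> microstate -> conformer -> pose, plus a receptor.
edges = [
    ("compound_1", "microstate_1"),
    ("microstate_1", "conformer_1"),
    ("conformer_1", "pose_1"),
    ("receptor_1", "pose_1"),
]
print(ancestors(edges, "pose_1"))
```

A breadth-first walk lists the closest inputs first, which tends to be the order a reviewer wants when asking where a pose came from.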

4. Repository Layout

  • msl/: Python package source
  • scripts/: stage driver scripts
  • schemas/: SQL schema definitions
  • policies/: policy templates
  • tests/: unit and regression tests
  • bin/: wrapper scripts for environment-isolated execution
  • msl.toml: default project config
  • README.md: user guide

5. Requirements

Minimum:

  • Linux (recommended)
  • Python 3.10+
  • SQLite 3

Optional tooling by stage:

  • RDKit, Dimorphite (enumeration/perception)
  • Meeko, AutoDock Vina (docking)
  • OpenFF Interchange, OpenMM (MD parameterization/smoke runs)
  • PyPE_RESP / external QM tools (charge workflows)
  • DVC, DataLad, lwreg (governance integrations)

SQLite operational notes:

  • Prefer local SSD paths for ledger.sqlite; SQLite file locking on NFS and other shared mounts can be slow or unreliable.
  • For high parallelism, use one ledger per job and merge results later via exported artifacts/provenance.
  • If you hit "database is locked" errors, retry with serialized writers (or use separate ledgers) instead of forcing concurrent writes to a shared file.
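For scripts that write to a ledger directly, the retry advice above might be sketched as follows with Python's standard sqlite3 module. The helper name write_with_retry, the retry count, and the delays are assumptions chosen for illustration; this is not part of the msl CLI.

```python
import sqlite3
import time


def write_with_retry(db_path, sql, params=(), retries=5, base_delay=0.1):
    """Run one write statement, retrying when SQLite reports a locked database.

    Returns the number of attempts that were needed.
    """
    for attempt in range(retries):
        try:
            # timeout= makes SQLite itself wait for the lock before raising.
            conn = sqlite3.connect(db_path, timeout=5.0)
            try:
                with conn:  # commits on success, rolls back on error
                    conn.execute(sql, params)
                return attempt + 1
            finally:
                conn.close()
        except sqlite3.OperationalError as exc:
            # Re-raise anything that is not a lock error, or the final failure.
            if "locked" not in str(exc) or attempt == retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # exponential backoff
```

Opening a short-lived connection per write keeps lock hold times small, which is usually what matters when several jobs share one file.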

6. Installation

6.1 Install from PyPI (recommended for users)

python -m pip install MicrostateLedger

CLI entrypoint after install:

msl --help

6.2 Install from source (development)

git clone https://github.com/woshuizhaol/MicrostateLedger.git
cd MicrostateLedger
python -m pip install -e .

6.3 Optional extras

python -m pip install "MicrostateLedger[cheminformatics]"
python -m pip install "MicrostateLedger[docking]"
python -m pip install "MicrostateLedger[md]"
python -m pip install "MicrostateLedger[charges]"
python -m pip install "MicrostateLedger[repro]"

6.4 Shared-server wrapper mode

If your team uses per-tool isolated environments, use wrappers in bin/:

./bin/msl --help

This mode is useful on shared compute servers where tools live in different envs.

7. Quick Start

7.1 Initialize the ledger

msl init

7.2 Ingest and perceive

msl ingest "CCO"
msl perceive <compound_id>

7.3 Enumerate microstates

msl enumerate <compound_id>

7.4 Generate conformers

msl conformers <microstate_id> --n 50

7.5 Docking preparation and ingest

msl dock-prep <microstate_id> <conformer_id>
msl dock-ingest <conformer_id> <pose.pdbqt> --score -7.5
msl dock-select <microstate_id> --k 20
msl select-final <microstate_id> --pose-id <pose_id>

7.6 Charge and MD setup

msl charge <microstate_id> --method pype_resp --auto-qm --conformer-id <conformer_id>
msl md-param <microstate_id> --charges-json <charges.json>

7.7 MD feedback and transition tracking

msl md-feedback <microstate_id> <probe.sdf>
msl md-transition <microstate_id> <probe.sdf>

7.8 One-command demo

bash scripts/demo_full_pipeline.sh "CCO" demo_ethanol

8. Configuration

The default project config file is msl.toml.

Typical keys include:

  • ledger_db: SQLite path
  • objects_dir: artifact root
  • runs_dir: run output root
  • reports_dir: reports root
  • policy_path: active policy YAML
  • envs.<stage>: per-stage environment selection

Policy behavior is controlled through policies/default.yaml, including:

  • pH range and enumeration limits
  • top-k/cap constraints
  • stage-level keep/drop behavior
  • optional fallback behavior when tools are missing
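To make the key list concrete, here is a minimal sketch that parses a config of this shape in Python. The key names follow the list above, but every value, the [envs] table layout, and the idea that the first five keys are mandatory are assumptions for illustration only. tomllib is in the standard library from Python 3.11; on 3.10 the tomli backport exposes the same API.

```python
try:
    import tomllib  # standard library from Python 3.11
except ModuleNotFoundError:  # Python 3.10: fall back to the tomli backport
    import tomli as tomllib

# Hypothetical msl.toml contents; values are invented for this example.
EXAMPLE_TOML = """
ledger_db = "ledger.sqlite"
objects_dir = "objects"
runs_dir = "runs"
reports_dir = "reports"
policy_path = "policies/default.yaml"

[envs]
docking = "vina-env"
md = "openmm-env"
"""

# Assumption: treat these keys as required for the purposes of the sketch.
REQUIRED_KEYS = ("ledger_db", "objects_dir", "runs_dir", "reports_dir", "policy_path")


def load_config(text):
    """Parse TOML text and fail early if an expected key is absent."""
    cfg = tomllib.loads(text)
    missing = [key for key in REQUIRED_KEYS if key not in cfg]
    if missing:
        raise KeyError(f"missing config keys: {missing}")
    return cfg


def env_for_stage(cfg, stage, default=None):
    """Look up the environment name for a stage, or return `default`."""
    return cfg.get("envs", {}).get(stage, default)


cfg = load_config(EXAMPLE_TOML)
print(env_for_stage(cfg, "docking"))
```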

9. CLI Command Guide

9.1 Core lifecycle

  • msl init: initialize ledger database
  • msl migrate: apply schema migrations
  • msl ingest: register a standardized compound
  • msl perceive: generate risk signals before expansion
  • msl enumerate: generate microstates
  • msl conformers: create conformers

9.2 Docking and selection

  • msl receptor-add: register receptor structure
  • msl dock-prep: export docking-ready ligand + mapping sidecar
  • msl dock-ingest: register docking poses and scores
  • msl dock-select: rank/select top poses
  • msl select-final: mark final microstate candidate for downstream
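The ranking step behind a top-k selection can be sketched in a few lines of Python. This is an illustration of the general idea, not the logic msl dock-select actually runs: the pose tuples and the decision record fields are hypothetical, and the sketch assumes Vina-style scores where a more negative value is better.

```python
import heapq


def select_top_k(poses, k):
    """Keep the k best-scoring poses and record a keep/drop decision for each.

    `poses` is a list of (pose_id, score) pairs; lower scores rank first.
    """
    kept = heapq.nsmallest(k, poses, key=lambda pose: pose[1])
    kept_ids = {pose_id for pose_id, _ in kept}
    decisions = [
        {
            "pose_id": pose_id,
            "score": score,
            "action": "keep" if pose_id in kept_ids else "drop",
        }
        for pose_id, score in poses
    ]
    return kept, decisions


# Hypothetical poses with docking scores in kcal/mol.
poses = [("pose_a", -7.5), ("pose_b", -6.1), ("pose_c", -8.2), ("pose_d", -5.0)]
kept, decisions = select_top_k(poses, k=2)
print([pose_id for pose_id, _ in kept])
```

Recording a decision for dropped poses as well as kept ones mirrors the ledger's goal of explaining why an object did not move downstream.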

9.3 Charges and MD

  • msl charge: generate/import per-atom charges
  • msl md-param: build MD-ready system artifacts
  • msl md-feedback: detect MD anomalies and record suggestions
  • msl md-transition: record observed state transition

9.4 Diff and provenance

  • msl diff-microstate: compare chemistry-level state definitions
  • msl diff-conformer: compare geometry (RMSD/torsions)
  • msl diff-charge: compare charge vectors by canonical atom ids
  • msl diff-pipeline: compare run-level stage/tool/params
  • msl prov-export: export provenance as PROV-JSON

9.5 Maintenance and ecosystem

  • msl clean-invalid-microstates: remove sanitize-failed records
  • msl batch: resumable batch execution
  • msl dvc-track, msl dvc-init: DVC utilities
  • msl datalad-init, msl datalad-save: DataLad utilities
  • msl lwreg-init, msl lwreg-register: lwreg integration
  • msl demo: full demonstration pipeline

10. Data Model and Stable IDs

Core entities:

  • Compound
  • Microstate
  • Conformer
  • Pose
  • System
  • Decision
  • Anomaly
  • Edge
  • Artifact

ID strategy highlights:

  • CompoundID: derived from registration hash
  • MicrostateID: derived from fixedH_inchi + charge + stereo + coordination signature
  • ConformerID/PoseID/ArtifactID: derived from content hash

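Because IDs are derived from content, the same input under the same policy and tool versions yields the same IDs, while any intended variation (a different charge, stereo assignment, or geometry) shows up as a distinct ID.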
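A minimal sketch of content-derived IDs, assuming SHA-256, a short prefix, and a 16-character truncation. None of these choices is confirmed by the project; the field order follows the MicrostateID description above, and the InChI string is the standard InChI of ethanol used as a stand-in for a fixed-H InChI.

```python
import hashlib


def content_id(prefix, data, length=16):
    """Derive a short, deterministic ID from raw bytes (hypothetical format)."""
    digest = hashlib.sha256(data).hexdigest()
    return f"{prefix}_{digest[:length]}"


def microstate_id(fixedh_inchi, charge, stereo, coordination):
    """Hash the identity-defining fields of a microstate into one ID."""
    # The ASCII unit separator keeps adjacent fields from running together.
    key = "\x1f".join([fixedh_inchi, str(charge), stereo, coordination])
    return content_id("MS", key.encode("utf-8"))


inchi = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"  # ethanol, illustrative stand-in
neutral = microstate_id(inchi, 0, "none", "none")
anion = microstate_id(inchi, -1, "none", "none")
print(neutral != anion)  # a different charge gives a different ID
```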

11. Atom Mapping and Reversibility

MicrostateLedger uses sidecar mapping files to preserve canonical atom identity across tool conversions.

Typical artifacts:

  • Docking: ligand.map.json
  • MD/topology: atommap.json

Goal:

  • A canonical atom id can be traced from RDKit representation to docking and MD representations.
  • Mapping evidence remains attached to run/artifact records in the ledger.
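The sidecar idea can be illustrated with a small round trip in Python. The JSON layout below (a single tool_to_canonical list) is an assumption for this sketch and not the actual schema of ligand.map.json or atommap.json.

```python
import json


def make_sidecar(tool_order):
    """Build a mapping where tool_order[i] is the canonical id of the tool's atom i."""
    # A valid mapping is a permutation of 0..n-1, so it can always be inverted.
    if sorted(tool_order) != list(range(len(tool_order))):
        raise ValueError("mapping must be a permutation of 0..n-1")
    return {"tool_to_canonical": list(tool_order)}


def reorder_to_canonical(values_in_tool_order, sidecar):
    """Rearrange per-atom values from tool order back into canonical order."""
    mapping = sidecar["tool_to_canonical"]
    if len(mapping) != len(values_in_tool_order):
        raise ValueError("value count does not match mapping size")
    out = [None] * len(mapping)
    for tool_idx, canonical_idx in enumerate(mapping):
        out[canonical_idx] = values_in_tool_order[tool_idx]
    return out


# A tool that wrote canonical atoms 2, 0, 1 in that order.
sidecar = make_sidecar([2, 0, 1])
restored = json.loads(json.dumps(sidecar))  # survives a JSON round trip
charges_tool_order = [-0.4, 0.1, 0.3]
print(reorder_to_canonical(charges_tool_order, restored))
```

Validating that the mapping is a permutation up front is what makes the conversion reversible: every canonical atom appears exactly once.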

12. Reproducibility and Auditability

MicrostateLedger records provenance at each stage:

  • runs: stage, tool, params, status
  • artifacts: path + hash linkage
  • decisions: why objects were kept/dropped
  • anomalies: what failed/drifted and suggested next actions
  • edges: graph lineage between objects

For reproducibility analysis:

  • Use repeated runs under fixed seed/policy and compare IDs/artifact hashes
  • Export pipeline diffs and PROV-JSON for external review
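Comparing two runs by artifact hash might look like the sketch below. It assumes each run can be summarized as a dict of artifact name to SHA-256 digest; how those digests are read out of the ledger is left open, and the artifact names and byte contents are invented.

```python
import hashlib


def sha256_bytes(data):
    """Hex digest of raw bytes, used here as a stand-in for a stored artifact hash."""
    return hashlib.sha256(data).hexdigest()


def compare_runs(hashes_a, hashes_b):
    """Classify artifacts as identical, changed, or present in only one run."""
    names_a, names_b = set(hashes_a), set(hashes_b)
    shared = names_a & names_b
    return {
        "identical": sorted(n for n in shared if hashes_a[n] == hashes_b[n]),
        "changed": sorted(n for n in shared if hashes_a[n] != hashes_b[n]),
        "only_in_a": sorted(names_a - names_b),
        "only_in_b": sorted(names_b - names_a),
    }


# Two hypothetical runs: same pose file, slightly different charges.
run_a = {"ligand.pdbqt": sha256_bytes(b"pose-1"), "charges.json": sha256_bytes(b"q=0.10")}
run_b = {"ligand.pdbqt": sha256_bytes(b"pose-1"), "charges.json": sha256_bytes(b"q=0.11")}
report = compare_runs(run_a, run_b)
print(report)
```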

13. Optional Engines and Graceful Degradation

Some stages are optional and may be unavailable in minimal installs.

Expected behavior:

  • Core stages still run where dependencies exist.
  • Missing optional tools should produce explicit decisions/logs instead of silent failure.
  • You can combine minimal core usage with selectively enabled advanced stages.
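One way to get an explicit record instead of a silent failure is to probe for the optional module before a stage starts. The sketch below uses importlib.util.find_spec from the standard library; the plan_stage helper and the record fields are hypothetical, and the stage-to-module pairs are examples rather than the project's actual wiring.

```python
import importlib.util


def plan_stage(stage, module_name):
    """Decide whether a stage can run, and say why, without importing the module."""
    available = importlib.util.find_spec(module_name) is not None
    if available:
        reason = "dependency found"
    else:
        reason = f"optional dependency '{module_name}' is not installed"
    return {
        "stage": stage,
        "module": module_name,
        "action": "run" if available else "skip",
        "reason": reason,
    }


# Example probes; the result depends on what is installed locally.
for stage, module in [("enumerate", "rdkit"), ("md-param", "openmm")]:
    print(plan_stage(stage, module))
```

Checking with find_spec avoids the cost and side effects of importing a heavy toolkit just to learn whether it exists.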

14. License

See LICENSE.
