QC, leakage detection, split design, and claim validation for single-cell perturbation studies.

PerturbGuard

FastQC-style guardrails for single-cell perturbation datasets, splits, claims, and model benchmarks.

PerturbGuard helps you answer the uncomfortable but necessary question:

Is this perturbation benchmark actually valid, or did leakage, missing controls, confounding, weak target effects, or a flawed split design make it look better than it is?

It works with AnnData .h5ad files and produces interactive HTML reports, machine-readable CSV tables, JSON summaries, and Markdown dataset cards.

Why Use It

It is easy to overclaim what a perturbation model can do. A model can appear to generalize because the split leaked perturbations between train and test, a batch variable predicts the label, controls are missing, the perturbation failed to produce its expected target effect, or a drug target was treated as a single gene when it is really a pathway or class.

PerturbGuard audits those failure modes before you trust a dataset, split, claim, or benchmark.
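
To make the "batch variable predicts the label" failure mode concrete, here is a minimal sketch in plain Python. It is not PerturbGuard's implementation: the function name `shortcut_accuracy` and the toy records are assumptions made for illustration. It scores how well a simple lookup from batch to most common perturbation reproduces the labels. A score near 1.0 suggests that batch alone is enough to predict the perturbation.

```python
from collections import Counter, defaultdict


def shortcut_accuracy(rows, key="batch", label="perturbation"):
    """In-sample accuracy of predicting `label` from `key` alone (majority vote per key)."""
    by_key = defaultdict(Counter)
    for row in rows:
        by_key[row[key]][row[label]] += 1
    # Each key value predicts its most common label; count the cells it gets right.
    correct = sum(counts.most_common(1)[0][1] for counts in by_key.values())
    return correct / len(rows)


# Toy data (hypothetical): every condition was run in its own batch.
confounded = [
    {"batch": "b1", "perturbation": "control"},
    {"batch": "b1", "perturbation": "control"},
    {"batch": "b2", "perturbation": "KO_A"},
    {"batch": "b2", "perturbation": "KO_A"},
]
# Toy data (hypothetical): both batches contain both conditions.
balanced = [
    {"batch": "b1", "perturbation": "control"},
    {"batch": "b1", "perturbation": "KO_A"},
    {"batch": "b2", "perturbation": "control"},
    {"batch": "b2", "perturbation": "KO_A"},
]

print(shortcut_accuracy(confounded))  # 1.0
print(shortcut_accuracy(balanced))    # 0.5
```

The score is computed in-sample for brevity; a real check would likely use a held-out set and compare against a chance baseline.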

Install

From GitHub:

pip install "perturbguard @ git+https://github.com/prithvirajanR/perturbguard.git@v1.0.0"

For local development:

git clone https://github.com/prithvirajanR/perturbguard.git
cd perturbguard
pip install -e ".[dev]"
pytest -q

Quickstart

perturbguard simulate --scenario batch_confounded --out data/demo/batch_confounded.h5ad
perturbguard audit --data data/demo/batch_confounded.h5ad --out results/demo_audit

Open:

results/demo_audit/report.html

You get a searchable report with status filters, summary cards, linked plots, recommendations, and CSV tables under results/demo_audit/tables/.

Common Workflows

Audit a raw public dataset

perturbguard infer-config --data data/raw.h5ad --out configs/inferred.yaml
perturbguard repair --data data/raw.h5ad --config configs/inferred.yaml --out data/repaired.h5ad
perturbguard audit --data data/repaired.h5ad --config configs/inferred.yaml --out results/audit

Generate and validate a split

perturbguard split \
  --data data/repaired.h5ad \
  --strategy leave-perturbation-out \
  --out results/split

perturbguard claim \
  --data data/repaired.h5ad \
  --split results/split/split.csv \
  --claim unseen_perturbation \
  --out results/claim
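
As a rough picture of what an `unseen_perturbation` claim requires, the sketch below checks that no perturbation appears in both the train and test partitions. It is an illustration only, not the logic PerturbGuard runs: the function name `leaked_perturbations`, the `(perturbation, partition)` pair layout, and the partition names `train` and `test` are assumptions, and the real `split.csv` schema may differ.

```python
def leaked_perturbations(assignments):
    """Return perturbations that appear in both the train and test partitions."""
    train = {p for p, part in assignments if part == "train"}
    test = {p for p, part in assignments if part == "test"}
    return sorted(train & test)


# Hypothetical split assignments: (perturbation, partition).
clean = [("KO_A", "train"), ("KO_B", "train"), ("KO_C", "test")]
leaky = [("KO_A", "train"), ("KO_B", "train"), ("KO_B", "test")]

print(leaked_perturbations(clean))  # []
print(leaked_perturbations(leaky))  # ['KO_B']
```

Control cells usually appear in every partition by design, so a check like this would normally exclude them before comparing the sets.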

Evaluate model predictions

perturbguard evaluate \
  --data data/repaired.h5ad \
  --predictions predictions.csv \
  --out results/evaluation

Feature Map

| Area | What PerturbGuard Does |
| --- | --- |
| AnnData validation | Checks loadability, matrix presence, cell/gene counts, controls, perturbation metadata, duplicate obs/var names, and malformed files. |
| Config inference | Suggests YAML schema mappings from common metadata aliases in messy public datasets. |
| Repair | Writes normalized .h5ad files with canonical metadata, unique indices, inferred controls, and repair action logs. |
| QC audit | Audits perturbation support, controls, cell counts, confounding, metadata shortcuts, target effects, guide consistency, and target mapping. |
| Split generation | Creates random, leave-target-gene-out, leave-perturbation-out, metadata holdout, strict combination, and seen-component combination splits. |
| Claim checking | Verifies whether a split supports claims like unseen perturbation, unseen target gene, or unseen combinations. |
| Leakage detection | Detects train/val/test leakage, combination leakage, unsupported claims, split imbalance, and invalid split labels. |
| Model evaluation | Audits prediction CSVs for overall metrics, per-group performance, and confidence calibration. |
| Adversarial checks | Tests whether metadata-only shortcuts can predict perturbation identity. |
| Target mapping | Classifies targets as measured genes, drug target classes, pathway/class annotations, missing, or unmapped. |
| Design planning | Checks planned cells, controls, replicate support, and batch support before running an experiment. |
| Benchmark manifests | Validates dataset/split/claim/model/metrics manifests and runs claim-support checks. |
| Large files | Profiles .h5ad files in backed mode before expensive audits. |
| Dataset cards | Generates Markdown dataset cards with audit counts, uses, and limitations. |
| Reports | Writes interactive HTML reports, plot links, CSV tables, summary.json, and recommendations. |
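
Combination leakage is the least obvious entry in the list above, so a small sketch may help. The code below sorts a held-out combination by how much of it training has already seen. It reflects one plausible reading of "strict" versus "seen-component" splits and is not taken from PerturbGuard: the `+` separator, the function names, and the status strings are all assumptions.

```python
def components(perturbation, sep="+"):
    """Split a combination label such as 'A+B' into an order-independent set."""
    return frozenset(perturbation.split(sep))


def combo_status(test_combo, train_perturbations, sep="+"):
    """Describe how much of a held-out combination was already seen in training."""
    target = components(test_combo, sep)
    train_sets = {components(p, sep) for p in train_perturbations}
    if target in train_sets:
        return "combination seen in training"
    # Pool every single component that occurs anywhere in training.
    seen = set().union(*train_sets)
    if target <= seen:
        return "all components seen"
    if target & seen:
        return "some components seen"
    return "no components seen"


train = ["A", "B", "A+C"]  # hypothetical training perturbations
print(combo_status("B+A", ["A+B"]))  # combination seen in training
print(combo_status("A+B", train))    # all components seen
print(combo_status("A+D", train))    # some components seen
print(combo_status("D+E", train))    # no components seen
```

Under this reading, a claim about fully novel combinations would need every test combination to report "no components seen", while a seen-component claim would need "all components seen".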

CLI Commands

perturbguard simulate
perturbguard validate
perturbguard audit
perturbguard split
perturbguard claim
perturbguard evaluate
perturbguard repair
perturbguard infer-config
perturbguard target-map
perturbguard compare-datasets
perturbguard design-check
perturbguard power-check
perturbguard benchmark-check
perturbguard profile-large
perturbguard adversarial-check
perturbguard dataset-card

Input Formats

AnnData

Required for full audits:

  • perturbation
  • is_control, or enough configured control labels to infer controls

Recommended:

  • target_gene
  • guide_id
  • perturbation_type
  • batch
  • replicate
  • cell_type
  • donor
  • plate
  • timepoint
  • dose
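
The control requirement above (`is_control`, or enough configured control labels to infer controls) can be pictured with a short sketch. This is an assumption about how such inference might work, not PerturbGuard's actual rule: the function name `infer_is_control`, the case-insensitive matching, and the example labels are all hypothetical.

```python
def infer_is_control(perturbations, control_labels):
    """Flag a cell as control when its perturbation label matches a configured control label."""
    # Assumed normalization: ignore case and surrounding whitespace.
    normalized = {label.strip().lower() for label in control_labels}
    return [p.strip().lower() in normalized for p in perturbations]


# Hypothetical per-cell perturbation labels from a messy dataset.
labels = ["DMSO", "KO_A", "non-targeting", "KO_B", "dmso "]
flags = infer_is_control(labels, control_labels={"DMSO", "non-targeting"})
print(flags)  # [True, False, True, False, True]
```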

Prediction CSV

Required for perturbguard evaluate:

  • cell_id
  • y_true
  • y_pred

Optional:

  • confidence
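
For orientation, the sketch below builds a prediction CSV with the columns listed above and computes accuracy and macro F1 using only the standard library. It illustrates the file shape and two common metrics; it is not the code `perturbguard evaluate` runs, and the helper names and sample rows are assumptions.

```python
import csv
import io

REQUIRED = {"cell_id", "y_true", "y_pred"}


def load_predictions(text):
    """Parse prediction CSV text and check that the required columns exist."""
    reader = csv.DictReader(io.StringIO(text))
    missing = REQUIRED - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")
    return list(reader)


def accuracy(rows):
    """Fraction of cells whose predicted label equals the true label."""
    return sum(r["y_true"] == r["y_pred"] for r in rows) / len(rows)


def macro_f1(rows):
    """Unweighted mean of per-label F1 over every label seen in y_true or y_pred."""
    labels = {r["y_true"] for r in rows} | {r["y_pred"] for r in rows}
    scores = []
    for label in labels:
        tp = sum(r["y_true"] == label and r["y_pred"] == label for r in rows)
        fp = sum(r["y_true"] != label and r["y_pred"] == label for r in rows)
        fn = sum(r["y_true"] == label and r["y_pred"] != label for r in rows)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical predictions; `confidence` is the optional column.
CSV_TEXT = """cell_id,y_true,y_pred,confidence
c1,KO_A,KO_A,0.9
c2,KO_A,KO_B,0.6
c3,KO_B,KO_B,0.8
c4,control,control,0.7
"""

rows = load_predictions(CSV_TEXT)
print(accuracy(rows))            # 0.75
print(round(macro_f1(rows), 3))  # 0.778
```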

Target Mapping CSV

Recommended for perturbguard target-map:

  • perturbation
  • target
  • optional target_type
  • optional source
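
To show why `target_type` matters, here is a hedged sketch that sorts mapping rows into the categories named in the Feature Map (measured gene, drug target class, pathway/class, missing, unmapped). The function name, the category strings, and the example targets are hypothetical; PerturbGuard's own vocabulary and rules may differ.

```python
def classify_target(target, measured_genes, target_type=None):
    """Assign a coarse category to one target-mapping row (illustrative rules only)."""
    if not target:
        return "missing"
    if target in measured_genes:
        return "measured_gene"
    # Assumed annotation values for targets that are not single measured genes.
    if target_type in {"drug_target_class", "pathway_class"}:
        return target_type
    return "unmapped"


measured = {"TP53", "MYC"}  # hypothetical genes present in the expression matrix
print(classify_target("TP53", measured))                       # measured_gene
print(classify_target("HDAC", measured, "drug_target_class"))  # drug_target_class
print(classify_target("", measured))                           # missing
print(classify_target("XYZ1", measured))                       # unmapped
```

Keeping class-level targets out of the measured-gene bucket avoids scoring a drug as if it perturbed a single gene.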

Benchmark Manifest

dataset: data/repaired.h5ad
split: results/split/split.csv
claim: unseen_perturbation
model:
  name: my-model
metrics:
  - accuracy
  - macro_f1
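
A quick structural check of a manifest like the one above might look like the following sketch. It works on an already-parsed dictionary so it needs no YAML library. Treating all five top-level keys as required is an assumption, as are the function name and the error messages; `perturbguard benchmark-check` may enforce different rules.

```python
REQUIRED_KEYS = ("dataset", "split", "claim", "model", "metrics")


def manifest_problems(manifest):
    """Return a list of human-readable problems found in a parsed manifest."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in manifest]
    metrics = manifest.get("metrics")
    if metrics is not None and not (isinstance(metrics, list) and metrics):
        problems.append("metrics must be a non-empty list")
    return problems


# The manifest shown above, as it would look after YAML parsing.
manifest = {
    "dataset": "data/repaired.h5ad",
    "split": "results/split/split.csv",
    "claim": "unseen_perturbation",
    "model": {"name": "my-model"},
    "metrics": ["accuracy", "macro_f1"],
}
incomplete = {k: v for k, v in manifest.items() if k != "split"}

print(manifest_problems(manifest))    # []
print(manifest_problems(incomplete))  # ['missing key: split']
```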

Real Data Smoke Test

PerturbGuard has been smoke-tested locally on public example AnnData files for SciPlex, Jorge, and LINCS-style perturbation datasets.

Example SciPlex run (PowerShell):

Invoke-WebRequest `
  -Uri "https://huggingface.co/cyclopeta/PerturbNet_reproduce/resolve/main/example_data/sciplex_example.h5ad" `
  -OutFile data/real/sciplex_example.h5ad

perturbguard audit `
  --data data/real/sciplex_example.h5ad `
  --config configs/sciplex_example.yaml `
  --out results/real/sciplex_audit

Full smoke matrix:

python scripts/run_smoke_matrix.py --include-real

What It Is Not

PerturbGuard is not a perturbation prediction model and does not claim biological truth for unmeasured perturbations. It is a guardrail system: it tells you when your dataset, split, controls, metadata, or benchmark claim may not support the conclusion you want to draw.

Release

Current stable release: 1.0.0

Built and verified with:

  • Python 3.11+
  • AnnData
  • pandas / NumPy / SciPy
  • scikit-learn
  • Plotly
  • Typer
