
ct-validation

An open framework for benchmarking gene-indication evidence against clinical trial outcomes.

ct-validation tests whether a set of gene-indication pairs is enriched for clinical success. It computes risk ratios and odds ratios with confidence intervals across clinical phase transitions and supports semantic disease matching through ontology-based similarity.

Paper: Kostiuk K, Igumnov D, Fedichev P, Feizi A. ct-validation: an open framework for benchmarking gene-indication evidence against clinical trial outcomes. (2026)

Installation

Requires Python 3.11+.

pip install ct-validation

Optional extras:

pip install ct-validation[plot]  # forest plot visualization
pip install ct-validation[mcp]   # MCP server for agent workflows
pip install ct-validation[parse] # data source parsers
pip install ct-validation[fetch] # ChEMBL fetching script dependencies

Quick start

Python API

import ct_validation as ctv

results = ctv.validate(
    clinical_trials="data/clinical_trials/gene_indication_max_phase.parquet",
    targets="data/genetic_evidence/genetic_evidence.parquet",
    similarity_lookup="data/mappings/efo_similarity_lookup_0.5.parquet",
)
print(results)
#   phase_label  n_yes   n_no  rr  rr_ci_lower  rr_ci_upper  ...

Batch mode — compare multiple evidence sources at once:

results = ctv.validate(
    clinical_trials="data/clinical_trials/gene_indication_max_phase.parquet",
    targets=[
        "data/genetic_evidence/gwas_catalog.parquet",
        "data/genetic_evidence/clinvar.parquet",
        "data/genetic_evidence/omim.parquet",
    ],
    similarity_lookup="data/mappings/efo_similarity_lookup_0.5.parquet",
)
# returns a list of DataFrames, one per evidence source

Prioritized mode — test whether a novel source adds value over an established baseline:

results = ctv.validate(
    clinical_trials="data/clinical_trials/gene_indication_max_phase.parquet",
    targets="data/genetic_evidence/novel_score.parquet",
    baseline_evidence="data/genetic_evidence/established_genetics.parquet",
    similarity_lookup="data/mappings/efo_similarity_lookup_0.5.parquet",
)
# pairs supported only by baseline are excluded

Expand a disease set using semantic similarity:

expanded = ctv.get_expanded_disease_set(
    efo_ids={"EFO:0000270", "EFO:0000384"},
    similarity_pairs="data/mappings/efo_similarity_lookup_0.5.parquet",
    similarity_threshold=0.8,
)
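Under the hood, expansion is a threshold filter over the pairwise lookup. A toy pandas sketch of the idea — not the library's implementation, and assuming the lookup lists each pair in both directions; the EFO IDs on the right-hand side are placeholders:

```python
import pandas as pd

def expand_sketch(efo_ids, similarity_pairs, similarity_threshold=0.8):
    # Keep every disease whose similarity to a seed disease meets the threshold,
    # then union with the seed set itself.
    hits = similarity_pairs[
        similarity_pairs["efo_id_1"].isin(efo_ids)
        & (similarity_pairs["similarity"] >= similarity_threshold)
    ]
    return set(efo_ids) | set(hits["efo_id_2"])

pairs = pd.DataFrame({
    "efo_id_1": ["EFO:0000270", "EFO:0000270", "EFO:0000384"],
    "efo_id_2": ["EFO:0005854", "EFO:0009190", "EFO:0003767"],
    "similarity": [0.92, 0.55, 0.81],
})
expanded = expand_sketch({"EFO:0000270", "EFO:0000384"}, pairs)
# two seeds plus the two neighbours at or above 0.8
```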

CLI

# With config file
ct-validation --config configs/default.yaml

# With explicit arguments
ct-validation \
    --clinical-trials ct.parquet \
    --targets evidence.parquet \
    --similarity-lookup similarity.parquet \
    -o results/

# Batch mode (multiple evidence sources)
ct-validation \
    --clinical-trials ct.parquet \
    --targets gwas.parquet --targets clinvar.parquet --targets omim.parquet \
    -o results/

MCP server

ct-validation-mcp

Exposes two tools for agent-based workflows:

  • ct_validate — compute phase-transition enrichment
  • expand_disease_set — expand EFO IDs via semantic similarity

Input schemas

| Input | Columns | Description |
| --- | --- | --- |
| clinical_trials | gene, efo_id, max_phase | Target-indication pairs with the highest phase reached |
| targets | gene, efo_id | Gene-indication pairs with supporting evidence |
| similarity_lookup | efo_id_1, efo_id_2, similarity | Pairwise EFO similarity (optional) |
| baseline_evidence | gene, efo_id | Baseline evidence for prioritized mode (optional) |
| gene_universe | one gene per line (text file) | Restrict analysis to these genes (optional) |

All inputs accept Parquet files or pandas DataFrames (except gene_universe, which is a text file or a Python set).
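Since in-memory DataFrames are accepted, the schemas above can be illustrated directly; a minimal hypothetical example (the gene symbols and EFO IDs are placeholders):

```python
import pandas as pd

# Highest phase reached per gene-indication pair
clinical_trials = pd.DataFrame({
    "gene": ["PCSK9", "IL6R", "HTR2A"],
    "efo_id": ["EFO:0004911", "EFO:0000685", "EFO:0003761"],
    "max_phase": [4, 3, 2],
})

# Pairs with supporting genetic evidence
targets = pd.DataFrame({
    "gene": ["PCSK9", "IL6R"],
    "efo_id": ["EFO:0004911", "EFO:0000685"],
})

# Frames shaped like these could be passed to ctv.validate(...) in place of
# Parquet paths.
```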

Output schema

| Column | Description |
| --- | --- |
| phase_from, phase_to | Phase transition (e.g. 1→2, 1→4) |
| n_yes, n_no | Pairs entering the phase (with/without evidence) |
| x_yes, x_no | Pairs reaching the target phase |
| rate_yes, rate_no | Progression rates |
| rr, rr_ci_lower, rr_ci_upper | Risk ratio with 95% CI (Katz log method) |
| or, or_ci_lower, or_ci_upper | Odds ratio with 95% CI (Woolf logit method) |

Enrichment logic

For each phase transition, target-indication pairs that reached at least the starting phase are divided into supported and unsupported groups. The risk ratio is:

RR = (x_yes / n_yes) / (x_no / n_no)

A risk ratio greater than one indicates that genetically supported pairs are more likely to progress. When a similarity lookup is provided, a pair (gene, disease) is considered supported if there exists evidence (gene, disease') with similarity above the threshold (default 0.8).
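The ratio and its Katz log-scale interval can be written out in a few lines. A self-contained sketch, not the package's code, with illustrative counts:

```python
import math

def risk_ratio_katz(x_yes, n_yes, x_no, n_no, z=1.96):
    # Risk ratio with a 95% CI computed on the log scale (Katz method):
    # log(RR) +/- z * sqrt(1/x_yes - 1/n_yes + 1/x_no - 1/n_no)
    rr = (x_yes / n_yes) / (x_no / n_no)
    se = math.sqrt(1 / x_yes - 1 / n_yes + 1 / x_no - 1 / n_no)
    lo = math.exp(math.log(rr) - z * se)
    hi = math.exp(math.log(rr) + z * se)
    return rr, lo, hi

# 40 of 100 supported pairs progress vs. 200 of 1000 unsupported
rr, lo, hi = risk_ratio_katz(40, 100, 200, 1000)
# rr == 2.0, and the interval excludes 1, i.e. enrichment
```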

Prioritized mode

When baseline_evidence is provided, pairs supported only by the baseline are excluded. This tests whether a novel evidence source adds predictive value beyond an established benchmark.
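In set terms, the evaluation universe drops pairs that only the baseline supports. A pandas sketch of that filter (hypothetical frames and IDs, not the package's internals):

```python
import pandas as pd

novel = pd.DataFrame({"gene": ["A", "B"], "efo_id": ["EFO:1", "EFO:2"]})
baseline = pd.DataFrame({"gene": ["B", "C"], "efo_id": ["EFO:2", "EFO:3"]})
trials = pd.DataFrame({
    "gene": ["A", "B", "C", "D"],
    "efo_id": ["EFO:1", "EFO:2", "EFO:3", "EFO:4"],
    "max_phase": [4, 2, 3, 1],
})

# Pairs supported ONLY by the baseline (here: C / EFO:3) leave the analysis.
merged = baseline.merge(novel, on=["gene", "efo_id"], how="left", indicator=True)
baseline_only = merged.loc[merged["_merge"] == "left_only", ["gene", "efo_id"]]

kept = trials.merge(baseline_only, on=["gene", "efo_id"], how="left", indicator=True)
kept = kept[kept["_merge"] == "left_only"].drop(columns="_merge")
# kept now holds A, B, and D; the baseline-only pair is excluded
```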

Visualization

import ct_validation as ctv

results = ctv.validate(...)
ctv.forest_plot(results, metric="rr", title="Phase I → Approved")

Data source parsers

The scripts/ directory contains reproducible parsers for public databases:

Genetic evidence (scripts/parse/genetic_evidence/):

  • GWAS Catalog — genome-wide significant associations (p < 1e-8)
  • ClinVar — pathogenic/likely pathogenic variants
  • OMIM — established molecular basis (mapping code 3)
  • Open Targets — genetic evidence streams (score ≥ 0.5)
  • Genebass — exome-wide associations (p ≤ 1e-7)

Clinical trials (scripts/parse/clinical_trials/):

  • ChEMBL — gene-drug and drug-indication links (pChEMBL > 7.0)
  • Open Targets — known drug and indication data
  • STITCH — high-confidence activation/inhibition links
  • DGIdb — drug-gene interactions
  • TrialPanorama — interventional studies

Ontology (scripts/r/):

  • EFO semantic similarity matrix (Lin + Resnik information content)

See DATA_SOURCES.md for download links, versions, and fetching instructions.

Configure paths in configs/parsing.yaml and run:

python scripts/parse/run_parsing.py

Configuration

See configs/default.yaml for validation settings and configs/parsing.yaml for data source paths. All config values can be overridden via CLI arguments.

License

MIT
