# Benchmarking gene-indication evidence against clinical trial outcomes

## Project description
An open framework for benchmarking gene-indication evidence against clinical trial outcomes.
ct-validation tests whether a set of gene-indication pairs is enriched for clinical success. It computes risk ratios and odds ratios with confidence intervals across clinical phase transitions and supports semantic disease matching through ontology-based similarity.
Paper: Kostiuk K, Igumnov D, Fedichev P, Feizi A. ct-validation: an open framework for benchmarking gene-indication evidence against clinical trial outcomes. (2026)
## Installation

Requires Python 3.11+.

```shell
pip install ct-validation
```

Optional extras (quote the brackets in shells such as zsh):

```shell
pip install "ct-validation[plot]"   # forest plot visualization
pip install "ct-validation[mcp]"    # MCP server for agent workflows
pip install "ct-validation[parse]"  # data source parsers
pip install "ct-validation[fetch]"  # ChEMBL fetching script dependencies
```
## Quick start

### Python API

```python
import ct_validation as ctv

results = ctv.validate(
    clinical_trials="data/clinical_trials/gene_indication_max_phase.parquet",
    targets="data/genetic_evidence/genetic_evidence.parquet",
    similarity_lookup="data/mappings/efo_similarity_lookup_0.5.parquet",
)
print(results)
# phase_label  n_yes  n_no  rr  rr_ci_lower  rr_ci_upper ...
```
Batch mode — compare multiple evidence sources at once:

```python
results = ctv.validate(
    clinical_trials="data/clinical_trials/gene_indication_max_phase.parquet",
    targets=[
        "data/genetic_evidence/gwas_catalog.parquet",
        "data/genetic_evidence/clinvar.parquet",
        "data/genetic_evidence/omim.parquet",
    ],
    similarity_lookup="data/mappings/efo_similarity_lookup_0.5.parquet",
)
# returns a list of DataFrames, one per evidence source
```
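Because batch mode returns one DataFrame per source, a common next step is to stack them with a source label for side-by-side comparison. A minimal pandas sketch; the two frames below are toy stand-ins for `ctv.validate` outputs (values invented, columns per the output schema):

```python
import pandas as pd

# Toy per-source results shaped like the output schema (values invented).
gwas = pd.DataFrame({"phase_from": [1], "phase_to": [4], "rr": [1.8]})
clinvar = pd.DataFrame({"phase_from": [1], "phase_to": [4], "rr": [2.3]})

# Stack with a source label so transitions can be compared across sources.
combined = pd.concat(
    {"gwas_catalog": gwas, "clinvar": clinvar},
    names=["source", None],
).reset_index(level="source")
# One row per (source, transition), ready for sorting or plotting by rr.
```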
Prioritized mode — test whether a novel source adds value over an established baseline:

```python
results = ctv.validate(
    clinical_trials="data/clinical_trials/gene_indication_max_phase.parquet",
    targets="data/genetic_evidence/novel_score.parquet",
    baseline_evidence="data/genetic_evidence/established_genetics.parquet",
    similarity_lookup="data/mappings/efo_similarity_lookup_0.5.parquet",
)
# pairs supported only by baseline are excluded
```
Expand a disease set using semantic similarity:

```python
expanded = ctv.get_expanded_disease_set(
    efo_ids={"EFO:0000270", "EFO:0000384"},
    similarity_pairs="data/mappings/efo_similarity_lookup_0.5.parquet",
    similarity_threshold=0.8,
)
```
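Conceptually, the expansion is a threshold lookup over the similarity pairs. A minimal pandas sketch of that idea (toy data; column names follow the `similarity_lookup` schema, and the lookup is shown one-directional, whereas the real table may store both directions):

```python
import pandas as pd

# Toy pairwise similarity table with the similarity_lookup schema.
pairs = pd.DataFrame({
    "efo_id_1": ["EFO:0000270", "EFO:0000270", "EFO:0000384"],
    "efo_id_2": ["EFO:0000341", "EFO:0009999", "EFO:0005140"],
    "similarity": [0.92, 0.55, 0.85],
})

def expand_disease_set(efo_ids, pairs, threshold=0.8):
    """Return the seed IDs plus any ID linked to a seed above the threshold."""
    hits = pairs[
        pairs["efo_id_1"].isin(efo_ids) & (pairs["similarity"] >= threshold)
    ]["efo_id_2"]
    return set(efo_ids) | set(hits)

expanded = expand_disease_set({"EFO:0000270", "EFO:0000384"}, pairs)
# EFO:0000341 (0.92) and EFO:0005140 (0.85) pass the 0.8 threshold;
# EFO:0009999 (0.55) does not.
```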
### CLI

```shell
# With config file
ct-validation --config configs/default.yaml

# With explicit arguments
ct-validation \
    --clinical-trials ct.parquet \
    --targets evidence.parquet \
    --similarity-lookup similarity.parquet \
    -o results/

# Batch mode (multiple evidence sources)
ct-validation \
    --clinical-trials ct.parquet \
    --targets gwas.parquet --targets clinvar.parquet --targets omim.parquet \
    -o results/
```
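The config-file form accepts the same settings as the flags. A hypothetical sketch of what such a file might look like; all key names here are guesses mirroring the CLI flags, so check the repository's `configs/default.yaml` for the actual schema:

```yaml
# Hypothetical config; key names inferred from the CLI flags above.
clinical_trials: data/clinical_trials/gene_indication_max_phase.parquet
targets:
  - data/genetic_evidence/gwas_catalog.parquet
  - data/genetic_evidence/clinvar.parquet
similarity_lookup: data/mappings/efo_similarity_lookup_0.5.parquet
output: results/
```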
## MCP server

```shell
ct-validation-mcp
```

Exposes two tools for agent-based workflows:

- `ct_validate` — compute phase-transition enrichment
- `expand_disease_set` — expand EFO IDs via semantic similarity
## Input schemas

| Input | Columns | Description |
|---|---|---|
| `clinical_trials` | `gene`, `efo_id`, `max_phase` | Target-indication pairs with highest phase reached |
| `targets` | `gene`, `efo_id` | Gene-indication pairs with supporting evidence |
| `similarity_lookup` | `efo_id_1`, `efo_id_2`, `similarity` | Pairwise EFO similarity (optional) |
| `baseline_evidence` | `gene`, `efo_id` | Baseline evidence for prioritized mode (optional) |
| `gene_universe` | one gene per line (text file) | Restrict analysis to these genes (optional) |

All inputs accept Parquet files or pandas DataFrames (except `gene_universe`, which is a text file or a Python set).
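Since DataFrames are accepted directly, inputs can be built in memory and checked against the schemas above before validation. A small sketch with toy values (genes and EFO IDs are illustrative only):

```python
import pandas as pd

# Toy inputs matching the schemas above (all values illustrative).
clinical_trials = pd.DataFrame({
    "gene": ["PCSK9", "IL6R", "HMGCR"],
    "efo_id": ["EFO:0004541", "EFO:0000685", "EFO:0004541"],
    "max_phase": [4, 3, 4],
})
targets = pd.DataFrame({
    "gene": ["PCSK9", "IL6R"],
    "efo_id": ["EFO:0004541", "EFO:0000685"],
})

# Sanity-check the required columns before passing these to ctv.validate(...).
assert {"gene", "efo_id", "max_phase"} <= set(clinical_trials.columns)
assert {"gene", "efo_id"} <= set(targets.columns)
```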
## Output schema

| Column | Description |
|---|---|
| `phase_from`, `phase_to` | Phase transition (e.g. 1→2, 1→4) |
| `n_yes`, `n_no` | Pairs entering phase (with/without evidence) |
| `x_yes`, `x_no` | Pairs reaching target phase |
| `rate_yes`, `rate_no` | Progression rates |
| `rr`, `rr_ci_lower`, `rr_ci_upper` | Risk ratio with 95% CI (Katz log method) |
| `or`, `or_ci_lower`, `or_ci_upper` | Odds ratio with 95% CI (Woolf logit method) |
## Enrichment logic

For each phase transition, target-indication pairs that reached at least the starting phase are divided into supported and unsupported groups. The risk ratio is:

```
RR = (x_yes / n_yes) / (x_no / n_no)
```

A risk ratio greater than one indicates that genetically supported pairs are more likely to progress. When a similarity lookup is provided, a pair (gene, disease) is considered supported if there exists evidence (gene, disease') with similarity above the threshold (default 0.8).
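The risk ratio and its Katz log-method confidence interval can be reproduced directly from the four counts. A minimal sketch with toy counts (z = 1.96 for a 95% interval); this is a standalone illustration of the formula, not the package's internal code:

```python
import math

def rr_katz(x_yes, n_yes, x_no, n_no, z=1.96):
    """Risk ratio with a Katz log-method confidence interval."""
    rr = (x_yes / n_yes) / (x_no / n_no)
    # Standard error of log(RR) under the Katz approximation.
    se = math.sqrt(1 / x_yes - 1 / n_yes + 1 / x_no - 1 / n_no)
    lo = math.exp(math.log(rr) - z * se)
    hi = math.exp(math.log(rr) + z * se)
    return rr, lo, hi

# Toy counts: 30/100 supported pairs progress vs 20/200 unsupported.
rr, lo, hi = rr_katz(30, 100, 20, 200)
# rr = (30/100) / (20/200) = 3.0
```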
## Prioritized mode

When `baseline_evidence` is provided, pairs supported only by the baseline are excluded. This tests whether a novel evidence source adds predictive value beyond an established benchmark.
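One way the exclusion described above can be expressed is as an anti-join on (gene, efo_id) pairs; the package's internals may differ. A toy sketch:

```python
import pandas as pd

novel = pd.DataFrame({"gene": ["A", "B"], "efo_id": ["EFO:1", "EFO:2"]})
baseline = pd.DataFrame({"gene": ["B", "C"], "efo_id": ["EFO:2", "EFO:3"]})
trials = pd.DataFrame({"gene": ["A", "B", "C", "D"],
                       "efo_id": ["EFO:1", "EFO:2", "EFO:3", "EFO:4"]})

# Pairs supported ONLY by the baseline (anti-join of baseline vs novel).
baseline_only = (
    baseline.merge(novel, on=["gene", "efo_id"], how="left", indicator=True)
    .query("_merge == 'left_only'")[["gene", "efo_id"]]
)

# Drop baseline-only pairs from the trial universe before computing enrichment.
kept = (
    trials.merge(baseline_only, on=["gene", "efo_id"], how="left",
                 indicator=True)
    .query("_merge == 'left_only'")[["gene", "efo_id"]]
)
# ("C", "EFO:3") is baseline-only and excluded; A, B, and D remain.
```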
## Visualization

```python
import ct_validation as ctv

results = ctv.validate(...)
ctv.forest_plot(results, metric="rr", title="Phase I → Approved")
```
## Data source parsers

The `scripts/` directory contains reproducible parsers for public databases:

**Genetic evidence** (`scripts/parse/genetic_evidence/`):
- GWAS Catalog — genome-wide significant associations (p < 1e-8)
- ClinVar — pathogenic/likely pathogenic variants
- OMIM — established molecular basis (mapping code 3)
- Open Targets — genetic evidence streams (score ≥ 0.5)
- Genebass — exome-wide associations (p ≤ 1e-7)
**Clinical trials** (`scripts/parse/clinical_trials/`):
- ChEMBL — gene-drug and drug-indication links (pChEMBL > 7.0)
- Open Targets — known drug and indication data
- STITCH — high-confidence activation/inhibition links
- DGIdb — drug-gene interactions
- TrialPanorama — interventional studies
**Ontology** (`scripts/r/`):
- EFO semantic similarity matrix (Lin + Resnik information content)
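Lin similarity scores a pair of ontology terms by the information content (IC) of their most informative common ancestor relative to the terms themselves. A toy sketch of the formula only (term names and IC values invented; the actual matrix is built by the R scripts above):

```python
# Toy information content per term (IC is typically -log of annotation frequency).
ic = {"disease": 0.5, "diabetes": 3.2, "type_2_diabetes": 4.1}

def lin_similarity(ic_a, ic_b, ic_mica):
    """Lin similarity: 2 * IC(most informative common ancestor) / (IC(a) + IC(b))."""
    return 2 * ic_mica / (ic_a + ic_b)

# "diabetes" is the common ancestor of itself and "type_2_diabetes".
sim = lin_similarity(ic["diabetes"], ic["type_2_diabetes"], ic["diabetes"])
# sim = 2 * 3.2 / (3.2 + 4.1) ≈ 0.877
```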
See `DATA_SOURCES.md` for download links, versions, and fetching instructions.

Configure paths in `configs/parsing.yaml` and run:

```shell
python scripts/parse/run_parsing.py
```
## Configuration

See `configs/default.yaml` for validation settings and `configs/parsing.yaml` for data source paths. All config values can be overridden via CLI arguments.
## License

MIT
## File details

### ct_validation-0.1.0.tar.gz (source distribution)

- Size: 15.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.11.6

| Algorithm | Hash digest |
|---|---|
| SHA256 | `c539e7cab929fc6c59d2c31a987f1517c9d778ba779c83f71977bb983a117ace` |
| MD5 | `3afbe4bf180a625529fa380873f6428b` |
| BLAKE2b-256 | `58afdc38f92e45ef92137dbedd6ff58ba256afd50a05d968a833641d32b8407c` |
### ct_validation-0.1.0-py3-none-any.whl (built distribution)

- Size: 22.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.11.6

| Algorithm | Hash digest |
|---|---|
| SHA256 | `ab79392ed61d7c1dba644a06f0081da59c0156b9c8e554687aa1b24d670d1eca` |
| MD5 | `a7194477ac2e3626adab4431173d0070` |
| BLAKE2b-256 | `4f4e678ec5c02bf4eabb58b96b44115bbb1338e4eecc09d53e91b2c61c5d3c20` |