A Python toolkit for polygenic risk scoring and Mendelian randomization.
genal
Polygenic risk scoring (PRS) and Mendelian randomization (MR) in Python
Quick links
- Documentation (tutorial + API): https://genal.readthedocs.io
- Introduction/tutorial page: https://genal.readthedocs.io/en/latest/introduction.html
- API reference: https://genal.readthedocs.io/en/latest/api.html
- Paper (citation): https://doi.org/10.1093/bioadv/vbae207
What can you do with genal?
genal is built around one main object, genal.Geno, which wraps a pandas DataFrame of SNP-level data and provides end-to-end workflows:
- Preprocess messy GWAS summary statistics: validate/standardize columns and alleles; fill missing SNP, CHR, POS, NEA, SE, P using reference data.
- Select independent instruments: LD clumping (PLINK2).
- Compute PRS: weighted/unweighted scoring from PLINK genotype files, including optional proxy SNP support when variants are missing.
- Run two-sample MR: multiple methods + sensitivity analyses (heterogeneity, pleiotropy), with plotting utilities.
- MR-PRESSO (parallel implementation): detect and correct horizontal pleiotropy.
- Colocalization: approximate Bayes-factor colocalization to assess whether two GWAS signals likely share a causal variant.
- Association testing: single-SNP regression in individual-level data (PLINK2) to recalibrate weights.
- Utilities: liftover between builds, GWAS Catalog querying, and filtering variants around a gene (useful for protein/drug-target analyses).
Flowchart (high-level pipeline)
Install
pip install genal-python
import genal
Setup: PLINK + reference data
Many genal features rely on PLINK v2.0 (clumping, PRS, association testing, proxy search, allele-frequency updates).
import genal
# Option 1: let genal download/install PLINK2 for your system (requires internet)
genal.install_plink()
# Option 2: point genal to an existing plink2 executable
genal.set_plink("/path/to/plink2") # or "/path/to/folder/containing/plink2"
genal uses reference data in two different ways:
Types of reference data
1) Build-only reference (for preprocessing)
Used by: Geno.preprocess_data().
- Purpose: fill/validate rsIDs, CHR/POS, and alleles.
- Argument: reference_panel="37" or reference_panel="38".
- You can also provide a custom reference (a DataFrame with the expected columns, or a .bim/.pvar path).
2) LD reference panel (for LD operations)
Used by: Geno.clump(), proxy search in Geno.prs() / Geno.query_outcome(), and Geno.update_eaf().
- Purpose: derive LD (and/or allele frequency) from 1000G-like panels.
- Argument: reference_panel="EUR_37" (or "AFR_38", etc.), or a path to a custom PLINK fileset. genal downloads built-in panels automatically the first time you use them.
By default, reference files are stored under ~/.genal/, and temporary PLINK outputs go to ./tmp_GENAL/ (you can remove it with genal.delete_tmp()).
Quickstart (GWAS → instruments → PRS → MR)
import pandas as pd
import genal
# 1) Load exposure GWAS summary statistics into a DataFrame
exposure_df = pd.read_csv("exposure_gwas.tsv", sep="\t")
G_exposure = genal.Geno(
exposure_df,
CHR="CHR", POS="POS", SNP="SNP",
EA="EA", NEA="NEA",
BETA="BETA", SE="SE", P="P", EAF="EAF",
keep_columns=False,
)
# 2) Basic cleaning + columns filling (build-only reference for rsID/alleles/coords)
G_exposure.preprocess_data(preprocessing="Fill_delete", reference_panel="37")
# 3) LD clumping to select independent instruments
G_instruments = G_exposure.clump(p1=5e-8, r2=0.01, kb=10000, reference_panel="EUR_37")
# 4) PRS in a target cohort (PLINK1 bed/bim/fam or PLINK2 pgen/pvar/psam)
# If genotype files are split by chromosome, use '$' as a placeholder.
G_instruments.prs(name="my_prs", path="my_genotypes_chr$") # writes my_prs.csv
# 5) Two-sample MR
# 5.1) Load outcome GWAS summary statistics into a DataFrame
outcome_df = pd.read_csv("outcome_gwas.tsv", sep="\t")
G_outcome = genal.Geno(outcome_df, CHR="CHR", POS="POS", EA="EA", NEA="NEA", BETA="BETA", SE="SE", P="P", EAF="EAF", keep_columns=False)
# 5.2) Clean and standardize outcome GWAS summary statistics
G_outcome.preprocess_data(preprocessing="Fill_delete", reference_panel="37")
# 5.3) Query outcome GWAS (optionally search for proxies if some SNPs are missing)
G_instruments.query_outcome(G_outcome, proxy=True, reference_panel="EUR_37")
# 5.4) Run MR (optionally with heterogeneity test and odds-ratio CIs)
mr_results = G_instruments.MR(action=2, heterogeneity=True, odds=False)
# 5.5) Plot MR results
G_instruments.MR_plot(filename="mr_scatter", figure_size=(10, 6))
Core concept: the Geno object
genal.Geno is the central class.
- G.data: main table containing SNP-level data (a pandas.DataFrame)
- G.phenotype: stored after G.set_phenotype(...) (phenotype DataFrame + metadata)
- G.MR_data: stored after G.query_outcome(...) (exposure/outcome association tables used by MR)
- G.MR_results: stored after G.MR(...) (results table + harmonized SNP table; used by plotting)
- G.MRpresso_subset_data: stored after G.MRpresso(...) (outlier-removed harmonized table)
Most methods either:
- return a new Geno object (e.g., clump()), or
- update attributes and/or write output files (e.g., prs() writes a CSV).
Core API (what to call for common tasks)
- Load/hold SNP-level data: genal.Geno(df, ...) → main data is in .data
- Preprocess/standardize summary stats: Geno.preprocess_data(...) (modifies .data)
- LD clumping (C+T instruments): Geno.clump(...) → returns a new Geno
- PRS on genotype files: Geno.prs(...) → saves a CSV in the current directory
- Align rsIDs to a target genotype dataset: Geno.update_snpids(path=..., replace=...)
- Extract genotype subset: Geno.extract_snps(path=...) → writes extracted files under tmp_GENAL/
- Two-sample MR pipeline: Geno.query_outcome(...) → Geno.MR(...) (+ MR_plot)
- Leave-one-out MR: Geno.MR_loo(...) → Geno.MR_loo_plot(...) (identify influential variants)
- MR-PRESSO: Geno.MRpresso(...) (parallel; outlier + distortion tests)
- Colocalization: Geno.colocalize(...) (approximate Bayes factors; returns posterior probabilities)
- Association testing (individual-level): Geno.set_phenotype(...) → Geno.association_test(...)
- Liftover builds: Geno.lift(...) (downloads chain files if needed; optional UCSC liftOver executable)
- GWAS Catalog query: Geno.query_gwas_catalog(...) (GWAS Catalog REST API v2; row-level statuses are returned in ASSOC_STATUS)
- Gene-window filtering: Geno.filter_by_gene(...) (filter variants around a gene; useful for protein/drug-target analyses)
Tutorial
The full tutorial with richer narrative is in the docs: https://genal.readthedocs.io/en/latest/introduction.html
Abbreviated tutorial
1) Create a Geno object from a DataFrame
import pandas as pd
import genal
df = pd.read_csv("my_gwas.tsv", sep="\t")
# Extract CHR and POS from a marker column if needed
df[["CHR", "POS", "_"]] = df["MarkerName"].str.split(":", expand=True)
G = genal.Geno(
df,
CHR="CHR", POS="POS",
EA="Allele1", NEA="Allele2",
BETA="Effect", SE="StdErr",
P="P", EAF="Freq1",
keep_columns=False,
)
Tip:
- Use keep_columns=True to keep non-main columns.
2) Preprocess summary stats (recommended before any workflow)
G.preprocess_data(preprocessing="Fill_delete", reference_panel="37")
What preprocessing typically does (depending on options):
- validates types, formats, and values of CHR/POS/EA/NEA/BETA/SE/P/EAF columns
- detects OR vs beta columns (and log-transforms OR when needed)
- fills missing columns (e.g., rsID from CHR/POS, SE from BETA+P, P from BETA+SE)
- computes FSTAT (F-statistic) from BETA and SE when possible, with a fallback method when only P is present
- handles duplicates and invalid rows under "Fill_delete"
You can inspect the standardized dataset at any time:
G.data.head()
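The F-statistic mentioned above is commonly approximated as the squared z-score of each instrument; a minimal sketch of that calculation (an assumption about the usual convention — genal's exact formula and its P-only fallback may differ):

```python
# Approximate single-SNP instrument-strength F-statistic.
# FSTAT ~ (BETA / SE)^2, i.e. the squared z-score; values above ~10 are
# conventionally taken to indicate a sufficiently strong instrument.
def approx_fstat(beta: float, se: float) -> float:
    z = beta / se
    return z * z

print(approx_fstat(0.05, 0.01))  # ~25: a strong instrument
```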
Tips:
- Use reference_panel="37" vs "38" to match your GWAS build.
- If your GWAS reports odds ratios, use effect_column="OR" in preprocess_data.
- If you do not want to delete duplicated rows or rows with missing values, use preprocessing="Fill" instead of "Fill_delete".
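If your source GWAS reports only an odds ratio with a 95% confidence interval, the log-odds beta and SE that preprocessing works with can be recovered by hand. A sketch using the standard epidemiology formulas (this is generic arithmetic, not genal code):

```python
import math

def or_to_beta_se(or_value: float, ci_lower: float, ci_upper: float):
    """Convert an odds ratio and its 95% CI to a log-odds beta and SE.

    The 95% CI spans beta +/- 1.96 * SE on the log scale, so SE is the
    width of the log-CI divided by 2 * 1.96.
    """
    beta = math.log(or_value)
    se = (math.log(ci_upper) - math.log(ci_lower)) / (2 * 1.96)
    return beta, se

beta, se = or_to_beta_se(1.25, 1.10, 1.42)
```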
3) Clump to create instruments
G_clumped = G.clump(
p1=5e-8,
r2=0.01,
kb=10000,
p2=0.01,
reference_panel="EUR_37",
)
G_clumped is a new Geno object containing the instrument set.
4) PRS (optionally with proxies)
Compute a PRS in a target cohort using PLINK genotype files.
- Supported inputs: bed/bim/fam or pgen/pvar/psam
- If your genotypes are split by chromosome, use a $ placeholder (e.g., Pop_chr$).
# Basic PRS
G_clumped.prs(
name="prs",
path="target_chr$",
)
# writes prs.csv
If many variants are missing in the target genotype data, you can try proxying:
G_clumped.prs(
name="prs_proxy",
path="target_chr$",
proxy=True,
reference_panel="EUR_37",
r2=0.8,
kb=5000,
window_snps=5000,
)
# writes prs_proxy.csv
Notes:
- If CHR/POS are present, genal prefers coordinate-based matching to find variants in your genotype data.
- prs() writes a CSV file and does not return a DataFrame.
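Conceptually, the weighted score for each individual is the sum of effect-allele dosages times effect sizes. A toy sketch of that computation (genal delegates the real work to PLINK2; the numbers here are illustrative only):

```python
# Toy weighted PRS: score_i = sum_j BETA_j * dosage_ij, where dosage_ij is
# the count of effect alleles (0, 1, or 2) carried by individual i at SNP j.
betas = [0.10, -0.05, 0.20]   # per-SNP effect sizes
dosages = [                   # one row per individual, one column per SNP
    [2, 1, 0],
    [0, 2, 1],
]
scores = [sum(b * d for b, d in zip(betas, row)) for row in dosages]
print(scores)  # ~[0.15, 0.10]
```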
5) Two-sample MR
Load an outcome GWAS into another Geno, preprocess it, then extract the exposure instruments from it.
outcome_df = pd.read_csv("outcome_gwas.tsv", sep="\t")
# Specify column names
G_outcome = genal.Geno(
outcome_df,
CHR="chromosome", POS="base_pair_location",
EA="effect_allele", NEA="other_allele",
BETA="beta", SE="standard_error",
P="p_value", EAF="effect_allele_frequency",
keep_columns=False,
)
G_outcome.preprocess_data(preprocessing="Fill_delete", reference_panel="37")
Prepare MR input (with optional proxying for missing instruments):
G_clumped.query_outcome(
G_outcome,
name="outcome",
proxy=True,
reference_panel="EUR_37",
)
Run MR:
res = G_clumped.MR(
action=2,
exposure_name="exposure",
outcome_name="outcome",
methods=["IVW", "IVW-FE", "WM", "Simple-mode", "Egger"],
heterogeneity=True,
odds=False, # set True if you want ORs for binary outcomes
)
res
About action (palindromic SNP handling during harmonization):
- action=1: do not flip palindromes (assumes alleles already aligned)
- action=2: conservative flipping using allele frequencies (recommended default)
- action=3: drop all palindromes (most conservative)
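A palindromic SNP is one whose two alleles are strand-complementary (A/T or C/G), so the strand cannot be inferred from the alleles alone — which is why action matters. A minimal check illustrating the definition (not genal's internal code):

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def is_palindromic(ea: str, nea: str) -> bool:
    """True if the effect/non-effect allele pair is strand-complementary."""
    return COMPLEMENT.get(ea.upper()) == nea.upper()

print(is_palindromic("A", "T"))  # True  -> strand is ambiguous
print(is_palindromic("A", "G"))  # False -> strand can be resolved
```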
Plot the MR scatter:
G_clumped.MR_plot(filename="mr_scatter", figure_size=(10, 6))
You can also draw a funnel plot of single-SNP ratio estimates (Wald ratios):
G_clumped.MR_funnel(
methods=["IVW", "WM", "Egger"], # vertical reference lines (optional)
filename="mr_funnel",
figure_size=(10, 6),
)
6) Sensitivity: MR-PRESSO
MR-PRESSO detects potential horizontal pleiotropy (outliers) and recomputes an IVW estimate after removing them.
mod_table, GlobalTest, OutlierTest, BiasTest = G_clumped.MRpresso(
action=2,
n_iterations=10000,
cpus=-1, # use all CPU cores
)
To highlight MR-PRESSO outliers on plots, pass use_mrpresso_data=True (outliers are colored in red):
G_clumped.MR_plot(filename="mr_scatter_mrpresso", figure_size=(10, 6), use_mrpresso_data=True)
G_clumped.MR_funnel(filename="mr_funnel_mrpresso", figure_size=(10, 6), use_mrpresso_data=True)
G_clumped.MR_loo_plot(filename="loo_forest_mrpresso", figure_size=(10, 8), use_mrpresso_data=True)
If you want to rerun MR methods after removing outliers with MR-PRESSO, you can use the use_mrpresso_data=True argument in MR():
res = G_clumped.MR(
action=2,
exposure_name="exposure",
outcome_name="outcome",
methods=["IVW", "IVW-FE", "WM", "Simple-mode", "Egger"],
heterogeneity=True,
odds=False, # set True if you want ORs for binary outcomes
use_mrpresso_data=True,
)
res
7) Sensitivity: Leave-One-Out MR
Leave-one-out MR helps identify influential variants that may be driving the causal estimate.
# Run leave-one-out analysis (default uses IVW)
loo_results = G_clumped.MR_loo(method="IVW", heterogeneity=False, odds=False)
Visualize the results with a forest plot:
# Default: show top influential instruments
G_clumped.MR_loo_plot(filename="loo_forest", figure_size=(10, 8))
Tips:
- MR_loo accepts the same action, use_mrpresso_data, and method parameters as MR.
- MR_loo_plot supports top_influential=True (default) for a compact figure showing the most influential SNPs, or top_influential=False for paginated output with all instruments.
- MR_loo_plot(..., use_mrpresso_data=True) colors MR-PRESSO outliers in red (requires running MRpresso() first). When outliers exist, an extra summary row ("MR-PRESSO corrected") is added using the same MR method as the leave-one-out analysis.
- MR_loo_plot(..., methods=["WM", "Egger"]) adds extra overall estimates for the requested methods (computed on all instruments).
- Set odds=True in MR_loo if you want odds ratio scaling on the plot.
8) Single-SNP association tests (individual-level data)
Use individual-level data to re-estimate SNP–trait effects in a specific cohort (e.g., different ancestry, different measurement protocol).
- Load phenotype table (must have matching FID/IID with genotype files):
df_pheno = pd.read_csv("phenotypes.csv")
- Attach phenotype to your instrument set:
# It's recommended to make a copy of the `Geno` object to avoid confusion between the original and the adjusted dataset
G_adj = G_clumped.copy()
G_adj.set_phenotype(df_pheno, PHENO="sbp", IID="IID", FID="FID")
- Run per-SNP regression (optionally with covariates):
G_adj.association_test(
path="Pop_chr$",
covar=["age", "sex"],
)
This updates G_adj.data[["BETA","SE","P"]] with cohort-specific estimates and recomputes FSTAT to be consistent with the updated values.
9) Lift to a different build
Lift variants between builds (e.g., hg19 → hg38):
lifted = G.lift(start="hg19", end="hg38", replace=False)
lifted.head()
For large datasets, you can provide a UCSC LiftOver executable via liftover_path.
10) Query the GWAS Catalog
Attach a per-SNP list of associated traits using the GWAS Catalog API:
G_clumped.query_gwas_catalog(p_threshold=5e-8)
G_clumped.data[["SNP", "ASSOC", "ASSOC_STATUS", "ASSOC_ERROR"]].head()
Each ASSOC entry is a list of dictionaries with fixed keys:
[
{
"trait": "rheumatoid arthritis",
"p_value": 2e-29,
"study_accession": "GCST90132227",
}
]
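For downstream filtering it can be convenient to flatten these per-SNP lists into one row per association. A sketch using plain pandas, assuming the ASSOC structure shown above (the column names in the toy frame mirror the example):

```python
import pandas as pd

# Toy frame mimicking the ASSOC column structure shown above (assumption:
# each cell is a list of {"trait", "p_value", "study_accession"} dicts).
df = pd.DataFrame({
    "SNP": ["rs123", "rs456"],
    "ASSOC": [
        [{"trait": "rheumatoid arthritis", "p_value": 2e-29,
          "study_accession": "GCST90132227"}],
        [],  # no catalog hits for this SNP
    ],
})
tidy = (
    df.explode("ASSOC")           # one row per (SNP, association) pair
      .dropna(subset=["ASSOC"])   # drop SNPs with no hits
      .reset_index(drop=True)
)
# Expand each association dict into its own columns.
tidy = pd.concat(
    [tidy[["SNP"]], pd.json_normalize(tidy["ASSOC"].tolist())], axis=1
)
print(tidy)
```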
Additional features
Filter variants around a gene
Keep only variants within a window around a gene, using different gene identifier types:
# ±250kb around APOE (build 37 coordinates by default)
apoe_region = G_clumped.filter_by_gene(
gene="APOE",
id_type="symbol",
window_size=250000, # 250kb
build="37",
replace=False,
)
Notes:
- Set replace=True to modify the object in place.
- Several gene identifier types are supported: symbol, HGNC, name, Ensembl, NCBI, UCSC, Vega.
Colocalization (ABF)
Run approximate Bayes factor colocalization between two Geno datasets (typically restricted to a locus/region).
# Example assumes both datasets contain (at least):
# CHR, POS, EA, NEA, BETA, SE, P (and ideally EAF)
coloc_res = G_trait_1.colocalize(
G_trait_2,
method="abf",
trait1_type="quant", # If trait 1 is a quantitative trait
trait2_type="cc", # If trait 2 is a case-control trait
n1=700_000, # sample size for trait 1
n2=40_000, # sample size for trait 2
)
coloc_res
Tip: colocalization is usually performed on a single region, not genome-wide.
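At the core of ABF colocalization is a per-SNP approximate Bayes factor. A sketch of the Wakefield log-ABF in the form popularized by the coloc R package (this is an assumption about the method family, not genal's exact implementation):

```python
import math

def log_abf(beta: float, se: float, w: float = 0.15 ** 2) -> float:
    """Wakefield log approximate Bayes factor for a single SNP.

    V = se^2 is the sampling variance; W is the prior variance of the
    effect size (0.15^2 is the coloc default for quantitative traits --
    an assumption; genal's defaults may differ).
    """
    v = se ** 2
    z = beta / se
    r = w / (v + w)
    # log ABF = 0.5 * (log(V / (V + W)) + z^2 * W / (V + W))
    return 0.5 * (math.log(1 - r) + z * z * r)

# A stronger signal at the same SE yields a larger log-ABF.
print(log_abf(0.20, 0.02) > log_abf(0.05, 0.02))  # True
```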
Update allele frequencies (EAF)
If your dataset is missing EAF (or you want EAF from a specific ancestry reference), compute it using PLINK:
G.update_eaf(reference_panel="EUR_37", replace=True, fill=True)
Notes:
- Matching is performed by CHR/POS when available, otherwise by SNP ID.
- EAF is aligned to the dataset’s effect allele (EA) when possible.
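The alignment step in the last note amounts to flipping the frequency when the reference panel reports it for the other allele. A simplified sketch (genal also handles strand flips and ambiguous cases, which this helper ignores):

```python
def align_eaf(ref_af: float, ref_allele: str, ea: str, nea: str):
    """Align a reference allele frequency to the dataset's effect allele.

    Returns the frequency of `ea`, or None if the reference allele matches
    neither EA nor NEA (a simplification for illustration).
    """
    if ref_allele == ea:
        return ref_af          # already the effect allele's frequency
    if ref_allele == nea:
        return 1.0 - ref_af    # frequency of the other allele: flip it
    return None

print(align_eaf(0.30, "A", "A", "G"))  # ~0.3 (already aligned)
print(align_eaf(0.30, "G", "A", "G"))  # ~0.7 (flipped to the effect allele)
```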
Standardize effect sizes
Standardize BETA and SE (useful when combining results across studies or preparing inputs for some downstream methods).
# Option 1: provide the outcome standard deviation
G_std = G.copy()
G_std.standardize_betas(outcome_sd=15.0, replace=True)
# Option 2: estimate from sample size (n_samples)
G_std.standardize_betas(n_samples=500_000, replace=True)
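The two options above correspond to two standard rescalings; a sketch of the underlying arithmetic (the z-score-based approximation is one common convention — genal's exact formula may differ):

```python
import math

def standardize_with_sd(beta: float, se: float, outcome_sd: float):
    """Express BETA/SE in outcome-standard-deviation units."""
    return beta / outcome_sd, se / outcome_sd

def standardize_with_n(beta: float, se: float, n: int):
    """Approximate standardized effects from the z-score and sample size.

    Uses beta_std ~ z / sqrt(n + z^2), se_std ~ 1 / sqrt(n + z^2) -- a
    common approximation when the outcome SD is unknown (an assumption).
    """
    z = beta / se
    denom = math.sqrt(n + z * z)
    return z / denom, 1.0 / denom

b, s = standardize_with_sd(1.5, 0.3, 15.0)  # ~(0.1, 0.02)
```

Note that both rescalings leave the z-score (and hence the P-value) unchanged, which is why standardization is safe to apply before MR.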
Extract variants / match variant IDs to genotype data
Two utilities are useful when genotype variant IDs do not match rsIDs:
- Match/update SNP IDs in your table based on CHR/POS in genotype files:
  G.update_snpids(path="Pop_chr$", replace=True)
- Extract the current SNP list from genotype data into tmp_GENAL/:
  G.extract_snps(path="Pop_chr$")
Save and re-use datasets
Save a Geno to disk (handy for caching large outcome GWAS).
save() uses the object’s .name as the filename stem (by default it is a short random ID, but you can set it):
G.name = "Trait"
G.save(path=".") # writes Trait.parquet
G.save(path=".", fmt="parquet") # writes Trait.parquet
You can pass a saved .parquet path directly to query_outcome():
G.query_outcome("Trait.parquet", name="Trait", proxy=True, reference_panel="EUR_37")
FAQ / Troubleshooting
Q: genal says the PLINK path is not set.
A: Run genal.install_plink() (downloads PLINK2) or genal.set_plink("/path/to/plink2").
Q: preprocess_data can’t find rsIDs / too many SNPs are missing.
A: This usually means a build mismatch. If your GWAS is GRCh38/hg38, use reference_panel="38" (or lift your data with Geno.lift(start="hg38", end="hg19") to match build 37). Also confirm your CHR/POS columns are numeric and in base-pair coordinates.
Q: PRS/MR uses far fewer SNPs than expected.
A: Common causes are SNP-ID mismatches between summary stats and genotype/outcome datasets. If you have CHR/POS, genal prefers position-based matching; you can also try Geno.update_snpids(path="target_chr$") to align rsIDs to your target genotype files. For missing variants, enable proxy=True with an ancestry-matched LD reference panel.
Q: Clumping returns no SNPs.
A: Check that your P column is valid and that p1 isn’t too strict. Also ensure your SNP IDs exist in (or match) the LD reference panel used for clumping.
Q: MR results look off for binary outcomes.
A: Ensure your GWAS effects are on the log-odds scale. If your GWAS reports odds ratios, pass effect_column="OR" during preprocessing. Use odds=True in Geno.MR(...) to add an odds-ratio column with 95% CIs.
Q: Palindromic SNPs are being removed/flipped.
A: Control this with action in MR/MRpresso: 1 (no flipping), 2 (frequency-based flipping; default), 3 (drop all palindromes).
Q: GWAS Catalog queries fail or time out.
A: This is expected for many SNPs (catalog coverage is incomplete and rate limits exist). Check ASSOC_STATUS and ASSOC_ERROR for the row-level outcome. Only canonical dbSNP rsIDs (rs + digits) are queried. Consider increasing timeout or limiting max_associations.
Q: tmp_GENAL/ is large.
A: genal writes temporary PLINK outputs there and does not automatically delete them (for debugging purposes). Delete the whole folder with genal.delete_tmp().
Citation
If you use genal, please cite:
Genal: A Python Toolkit for Genetic Risk Scoring and Mendelian Randomization
Cyprien A. Rivier, Santiago Clocchiatti-Tuozzo, Shufan Huo, Victor Torres-Lopez, Daniela Renedo, Kevin N. Sheth, Guido J. Falcone, Julian N. Acosta.
Bioinformatics Advances 2024.
DOI: https://doi.org/10.1093/bioadv/vbae207
If you use methods derived from other packages (e.g., MR-PRESSO), please also cite the original method papers.
License
GPL-3.0-or-later (see LICENSE).