A simulation tool for generating glycomic relative abundance datasets with customizable biological group differences and controllable batch-effect injection
GlycoForge is a simulation tool for generating glycomic relative-abundance datasets with customizable biological group differences and controllable batch-effect injection.
Key Features
- Two simulation modes: Fully synthetic or templated (extract biological effects from input reference data, then simulate batch effects)
- Controllable effects injection: Systematic grid search over biological effect or batch effect strength parameters
- Motif-level effects: For both bio and batch effects, desired motif differences (e.g., Neu5Ac: down) can be introduced. These are propagated in a dynamically constructed biosynthetic network to ensure physiological glycomics data (e.g., corresponding increase in desialylated glycans in the example of Neu5Ac: down)
- MNAR missing data simulation: Mimics left-censored patterns biased toward low-abundance glycans
Quick Start
Installation
- Python >= 3.10 required.
- Core dependency: glycowork>=1.6.4
pip install glycoforge
OR
git clone https://github.com/BojarLab/GlycoForge.git
cd GlycoForge
python3.10 -m venv .venv
source .venv/bin/activate
pip install -e .
Usage
See run_simulation.ipynb for interactive examples, or use_cases/batch_correction/ for batch correction workflows.
How the simulator works
We keep everything in the CLR (centered log-ratio) space:
- First, draw a healthy baseline composition from a Dirichlet prior: p_H ~ Dirichlet(alpha_H).
- Flip to CLR: z_H = clr(p_H).
- For selected glycans, push the signal using real or synthetic effect sizes: z_U = z_H + m * lambda * d_robust, where m is the differential mask, lambda is bio_strength, and d_robust is the effect vector after robust_effect_size_processing.
  - Synthetic mode: draw synthetic effect sizes (log-fold changes) and pass them through the same robust processing pipeline.
  - Templated mode: start from the Cohen's d values returned by glycowork.get_differential_expression; define_differential_mask lets you restrict the injection to significant hits or top-N glycans before scaling.
- Invert back to proportions: p_U = invclr(z_U), then scale by k_dir to get alpha_U. Note that the healthy and unhealthy Dirichlet strengths use different k_dir values, and a separate variance_ratio controls their relative magnitude.
- Batch effects ride on top as direction vectors u_b, so a clean CLR sample Y_clean becomes Y_with_batch = Y_clean + kappa_mu * u_b + epsilon, with var_b controlling spread.
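The steps above can be sketched end to end with NumPy. This is a minimal illustration, not GlycoForge's implementation: the helper names clr and invclr, the example mask and effect vector, the way u_b is drawn, and the Gaussian form of epsilon (variance var_b) are all assumptions made for the sketch.

```python
import numpy as np

def clr(p):
    """Centered log-ratio transform of a composition (last axis sums to 1)."""
    log_p = np.log(p)
    return log_p - log_p.mean(axis=-1, keepdims=True)

def invclr(z):
    """Inverse CLR: a softmax that maps CLR coordinates back to proportions."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n_glycans = 6

# Healthy baseline: p_H ~ Dirichlet(alpha_H), then z_H = clr(p_H)
alpha_H = np.full(n_glycans, 10.0)
p_H = rng.dirichlet(alpha_H)
z_H = clr(p_H)

# Illustrative mask and effect vector (assumed values, not library output)
m = np.array([1.0, 1.0, 0.0, 0.0, 0.0, 1.0])
d_robust = np.array([1.0, -0.5, 0.3, 0.0, 0.2, -1.0])
bio_strength = 1.5  # lambda

# Inject the signal in CLR space and map back to proportions
z_U = z_H + m * bio_strength * d_robust
p_U = invclr(z_U)

# Scale to a Dirichlet concentration (k_dir and variance_ratio are example values)
k_dir, variance_ratio = 100.0, 2.0
alpha_U = (k_dir / variance_ratio) * p_U

# Batch effect: shift clean CLR samples along a zero-sum, unit-norm direction u_b
u_b = rng.normal(size=n_glycans)
u_b -= u_b.mean()
u_b /= np.linalg.norm(u_b)
kappa_mu, var_b = 1.0, 0.5
Y_clean = clr(rng.dirichlet(alpha_U, size=4))
epsilon = rng.normal(scale=np.sqrt(var_b), size=Y_clean.shape)  # assumed noise model
Y_with_batch = Y_clean + kappa_mu * u_b + epsilon
```

Because the shift is applied in CLR space, glycans outside the mask keep their ratios to one another after invclr; only their absolute proportions are renormalized.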
Simulation Modes
The pipeline entry point is glycoforge.simulate() with two modes controlled by data_source. Configuration files are in sample_config/.
Synthetic mode (data_source="simulated"): fully synthetic simulation
No real data dependency. Ideal for controlled experiments with known ground truth.
Pipeline steps:
- Initializes log-normal healthy baseline: alpha_H = ones(n_glycans) * 10
- For each random seed, generates alpha_U by randomly scaling alpha_H:
  - up_frac (default 30%) upregulated with scale factors from up_scale_range=(1.1, 3.0)
  - down_frac (default 30%) downregulated with scale factors from down_scale_range=(0.3, 0.9)
  - Remaining glycans (~40%) stay unchanged
- Samples clean cohorts from Dirichlet(alpha_H) and Dirichlet(alpha_U) with n_H healthy and n_U unhealthy samples
- Defines batch effect direction vectors u_dict once per simulation run (fixed seed ensures reproducible batch geometry across parameter sweep)
- Applies batch effects controlled by kappa_mu (shift strength) and var_b (variance scaling)
- Optionally applies MNAR (Missing Not At Random) missingness:
  - missing_fraction: proportion of missing values (0.0-1.0)
  - mnar_bias: intensity-dependent bias (default 2.0, range 0.5-5.0)
  - Left-censored pattern: low-abundance glycans more likely to be missing
- Grid search over kappa_mu and var_b produces multiple datasets under identical batch effect structure
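As a rough sketch of the alpha_U generation and cohort sampling steps, the snippet below scales a flat alpha_H with the defaults listed above. The function name sketch_scale_alpha is hypothetical, and drawing scale factors uniformly within each range is an assumption; the package may select glycans or draw factors differently.

```python
import numpy as np

def sketch_scale_alpha(alpha_H, up_frac=0.3, down_frac=0.3,
                       up_scale_range=(1.1, 3.0),
                       down_scale_range=(0.3, 0.9), seed=0):
    """Scale a random subset of alpha_H up and another disjoint subset down."""
    rng = np.random.default_rng(seed)
    n = alpha_H.size
    n_up = int(round(up_frac * n))
    n_down = int(round(down_frac * n))
    order = rng.permutation(n)            # random, non-overlapping assignment
    up_idx = order[:n_up]
    down_idx = order[n_up:n_up + n_down]
    alpha_U = alpha_H.astype(float).copy()
    # Assumption: factors are drawn uniformly within each range
    alpha_U[up_idx] *= rng.uniform(*up_scale_range, size=n_up)
    alpha_U[down_idx] *= rng.uniform(*down_scale_range, size=n_down)
    return alpha_U, up_idx, down_idx

alpha_H = np.ones(20) * 10
alpha_U, up_idx, down_idx = sketch_scale_alpha(alpha_H, seed=42)

# Clean cohorts: n_H healthy and n_U unhealthy samples (rows are compositions)
rng = np.random.default_rng(42)
n_H, n_U = 5, 5
healthy = rng.dirichlet(alpha_H, size=n_H)
unhealthy = rng.dirichlet(alpha_U, size=n_U)
```

With 20 glycans and the default fractions, 6 glycans are scaled up, 6 are scaled down, and 8 keep their baseline concentration of 10, which gives a known ground truth for downstream evaluation.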
Key parameters: n_glycans, n_H, n_U, kappa_mu, var_b, missing_fraction, mnar_bias
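The MNAR step can be illustrated as follows. This is one possible weighting scheme, assumed for the sketch and not taken from the GlycoForge source: each entry's chance of being dropped grows with how low its log-abundance is, raised to the power mnar_bias. The function name sketch_mnar is hypothetical.

```python
import numpy as np

def sketch_mnar(X, missing_fraction=0.2, mnar_bias=2.0, seed=0):
    """Set a fraction of entries to NaN, favouring low-abundance values."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    out = X.copy()
    n_missing = int(round(missing_fraction * X.size))
    if n_missing == 0:
        return out
    log_x = np.log(X.ravel() + 1e-12)
    span = log_x.max() - log_x.min()
    # "lowness" is 1 for the smallest value and 0 for the largest
    lowness = (log_x.max() - log_x) / span if span > 0 else np.ones_like(log_x)
    weights = lowness ** mnar_bias + 1e-9   # larger bias -> stronger left-censoring
    probs = weights / weights.sum()
    idx = rng.choice(X.size, size=n_missing, replace=False, p=probs)
    out.flat[idx] = np.nan
    return out

rng = np.random.default_rng(1)
X = rng.dirichlet(np.ones(30), size=50)     # 50 samples x 30 glycans
masked = sketch_mnar(X, missing_fraction=0.2, mnar_bias=2.0)
is_missing = np.isnan(masked)
```

Comparing X[is_missing].mean() with X[~is_missing].mean() is a quick check that the dropped values are, on average, the lower-abundance ones.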
Templated mode (data_source="real"): extract biological effect from input reference data + simulate batch effect
Starts from real glycomics data to preserve biological signal structure. Accepts a CSV file or a glycowork.glycan_data dataset.
Pipeline steps:
- Loads CSV and extracts healthy/unhealthy sample columns by prefix (configurable via column_prefix)
- Runs CLR-based differential expression via glycowork.get_differential_expression to compute Cohen's d effect sizes
- Reindexes effect sizes to match input glycan order (fills missing glycans with 0.0)
- Applies differential_mask to select which glycans receive biological signal injection:
  - "All": inject into all glycans
  - "significant": only glycans marked significant by glycowork
  - "Top-N": top N glycans by absolute effect size (e.g., "Top-10")
- Processes effect sizes through robust_effect_size_processing:
  - Centers effect sizes to remove global shift
  - Applies Winsorization to clip extreme outliers (auto-selects percentile 85-99, or uses winsorize_percentile)
  - Normalizes by baseline (baseline_method: median, MAD, or p75)
  - Returns normalized d_robust scaled by bio_strength
- Injects effects in CLR space: z_U = z_H + mask * bio_strength * d_robust
- Converts back to proportions: p_U = invclr(z_U)
- Scales by Dirichlet concentration: alpha_H = k_dir * p_H and alpha_U = (k_dir / variance_ratio) * p_U
- Samples clean cohorts from Dirichlet(alpha_H) and Dirichlet(alpha_U) with n_H healthy and n_U unhealthy samples
- Defines batch effect direction vectors u_dict once per run (fixed seed ensures fair comparison across parameter combinations)
- Applies batch effects: y_batch = y_clean + kappa_mu * sigma * u_b + epsilon, where epsilon ~ N(0, sqrt(var_b) * sigma)
- Optionally applies MNAR missingness (same as Synthetic mode: left-censored pattern biased toward low-abundance glycans)
- Grid search over bio_strength, k_dir, variance_ratio, kappa_mu, var_b to systematically test biological signal and batch effect interactions
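To make the masking and robust-processing steps concrete, here is a small sketch. The functions sketch_mask and sketch_robust_effects are hypothetical stand-ins, not the package's define_differential_mask or robust_effect_size_processing. Centering on the median, symmetric clipping at a percentile of the absolute values, and the exact baseline definitions are assumptions.

```python
import numpy as np

def sketch_mask(effect_sizes, significant, differential_mask="All"):
    """Return a 0/1 mask selecting which glycans receive the injected signal."""
    d = np.asarray(effect_sizes, dtype=float)
    if differential_mask == "All":
        return np.ones_like(d)
    if differential_mask == "significant":
        return np.asarray(significant, dtype=float)
    if differential_mask.startswith("Top-"):
        n = int(differential_mask.split("-", 1)[1])
        mask = np.zeros_like(d)
        mask[np.argsort(-np.abs(d))[:n]] = 1.0   # largest |effect| first
        return mask
    raise ValueError(f"unknown differential_mask: {differential_mask!r}")

def sketch_robust_effects(effect_sizes, winsorize_percentile=95,
                          baseline_method="median"):
    """Center, winsorize, and normalize effect sizes (assumed definitions)."""
    d = np.asarray(effect_sizes, dtype=float)
    d = d - np.median(d)                               # remove global shift
    limit = np.percentile(np.abs(d), winsorize_percentile)
    d = np.clip(d, -limit, limit)                      # clip extreme outliers
    if baseline_method == "median":
        baseline = np.median(np.abs(d))
    elif baseline_method == "MAD":
        baseline = np.median(np.abs(d - np.median(d)))
    elif baseline_method == "p75":
        baseline = np.percentile(np.abs(d), 75)
    else:
        raise ValueError(f"unknown baseline_method: {baseline_method!r}")
    return d / baseline if baseline > 0 else d

# Illustrative Cohen's d values and significance flags (made-up numbers)
d = np.array([2.0, -0.5, 0.1, 8.0, -1.5, 0.3])
significant = np.array([True, False, False, True, True, False])

mask = sketch_mask(d, significant, "Top-2")
d_robust = sketch_robust_effects(d, winsorize_percentile=95)
```

In this example the outlier 8.0 is clipped before normalization, so it cannot dominate the injected shift once d_robust is multiplied by bio_strength.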
Key parameters: data_file, column_prefix, bio_strength, k_dir, variance_ratio, differential_mask, winsorize_percentile, baseline_method, kappa_mu, var_b, missing_fraction, mnar_bias
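The batch-effect formula and the parameter sweep can be sketched like this. It is an illustration under stated assumptions: sigma is taken as the per-glycan standard deviation of the clean CLR data, directions are zero-sum unit vectors, and the helper add_batch_effect and the grid values are hypothetical.

```python
import itertools
import numpy as np

def clr(p):
    """Centered log-ratio transform along the last axis."""
    log_p = np.log(p)
    return log_p - log_p.mean(axis=-1, keepdims=True)

n_glycans, n_batches, n_samples = 8, 3, 12
rng = np.random.default_rng(0)   # fixed seed: same batch geometry for the whole sweep

# One direction vector per batch, drawn once and reused for every grid point
u_dict = {}
for b in range(n_batches):
    u = rng.normal(size=n_glycans)
    u -= u.mean()                            # stay in the zero-sum CLR subspace
    u_dict[b] = u / np.linalg.norm(u)

y_clean = clr(rng.dirichlet(np.full(n_glycans, 10.0), size=n_samples))
batch_ids = np.arange(n_samples) % n_batches
sigma = y_clean.std(axis=0, ddof=1)          # assumed per-glycan scale

def add_batch_effect(y_clean, batch_ids, u_dict, sigma, kappa_mu, var_b, seed=0):
    """y_batch = y_clean + kappa_mu * sigma * u_b + epsilon."""
    noise_rng = np.random.default_rng(seed)
    shift = np.stack([u_dict[int(b)] for b in batch_ids])
    # epsilon ~ N(0, sqrt(var_b) * sigma), with the second argument a std. dev.
    epsilon = noise_rng.normal(size=y_clean.shape) * np.sqrt(var_b) * sigma
    return y_clean + kappa_mu * sigma * shift + epsilon

# Sweep kappa_mu and var_b over the same clean data and batch directions
datasets = {
    (kappa_mu, var_b): add_batch_effect(y_clean, batch_ids, u_dict, sigma,
                                        kappa_mu, var_b)
    for kappa_mu, var_b in itertools.product([0.0, 0.5, 1.0], [0.0, 0.5])
}
```

Holding u_dict and y_clean fixed means that differences between grid points come only from kappa_mu and var_b, which is what makes comparisons across the sweep fair.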
Use Cases
The use_cases/batch_correction/ directory demonstrates:
- Calling the glycoforge simulation, then applying a correction workflow
- Visualizing batch correction effectiveness metrics
Limitation
Two biological groups only: The current implementation targets a healthy/unhealthy setup. Supporting multi-stage disease (>=3 groups) would require refactoring Dirichlet parameter generation and evaluation metrics.