glycoforge

GlycoForge is a simulation tool for generating glycomic relative-abundance datasets with customizable biological group differences and controllable batch-effect injection.

Key Features

Two simulation modes: fully synthetic, or hybrid (extract biological effect sizes from input reference data, then simulate batch effects)
Controllable effect injection: systematic grid search over biological-effect and batch-effect strength parameters
MNAR missing data simulation: Mimics left-censored patterns biased toward low-abundance glycans

Quick Start

Installation

Python >= 3.10 required.
Core dependency: glycowork>=1.6.4

git clone https://github.com/BojarLab/GlycoForge.git
cd GlycoForge
python3.10 -m venv .venv
source .venv/bin/activate
pip install -e .

Usage

See run_simulation.ipynb for interactive examples, or use_cases/batch_correction/ for batch correction workflows.

How the simulator works

We keep everything in the CLR (centered log-ratio) space:

First, draw a healthy baseline composition from a Dirichlet prior: p_H ~ Dirichlet(alpha_H).
Flip to CLR: z_H = clr(p_H).
For selected glycans, push the signal using real or synthetic effect sizes: z_U = z_H + m * lambda * d_robust, where m is the differential mask, lambda is bio_strength, and d_robust is the effect vector after robust_effect_size_processing.
- Simplified mode: draw synthetic effect sizes (log-fold changes) and pass them through the same robust processing pipeline.
- Hybrid mode: start from the Cohen’s d values returned by glycowork.get_differential_expression; define_differential_mask lets you restrict the injection to significant hits or top-N glycans before scaling.
Invert back to proportions, p_U = invclr(z_U), and scale by k_dir to get alpha_U. The healthy and unhealthy Dirichlet concentrations differ: a separate variance_ratio controls their relative magnitude (alpha_H = k_dir * p_H, alpha_U = (k_dir / variance_ratio) * p_U).
Batch effects ride on top as direction vectors u_b, so a clean CLR sample Y_clean becomes Y_with_batch = Y_clean + kappa_mu * u_b + epsilon, with var_b controlling the spread of the noise term epsilon.
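
The CLR steps above can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not GlycoForge's own code: the clr and invclr helpers are written out here, and the mask, the effect vector, and all numeric values are hypothetical placeholders.

```python
import numpy as np

def clr(p):
    """Centered log-ratio transform of a strictly positive composition."""
    log_p = np.log(p)
    return log_p - log_p.mean(axis=-1, keepdims=True)

def invclr(z):
    """Inverse CLR: a softmax that maps back onto the simplex."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n_glycans = 8

# Healthy baseline drawn from a Dirichlet prior (illustrative alpha_H).
alpha_prior = np.full(n_glycans, 10.0)
p_H = rng.dirichlet(alpha_prior)
z_H = clr(p_H)

# Hypothetical inputs: a mask selecting the first three glycans and a
# random stand-in for the processed effect vector d_robust.
m = np.zeros(n_glycans)
m[:3] = 1.0
d_robust = rng.normal(size=n_glycans)
bio_strength = 1.5  # lambda in the formula above

# Inject the signal in CLR space, then return to proportions.
z_U = z_H + m * bio_strength * d_robust
p_U = invclr(z_U)

# Scale to Dirichlet concentrations (values chosen for illustration).
k_dir, variance_ratio = 100.0, 2.0
alpha_H = k_dir * p_H
alpha_U = (k_dir / variance_ratio) * p_U
```

Because the shift is applied only where the mask is 1, the ratios among unmasked glycans are unchanged after inverting back to proportions.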

Simulation Modes

The pipeline entry point is glycoforge.simulate() with two modes controlled by data_source. Configuration files are in sample_config/.

Simplified mode (data_source="simulated") – Fully synthetic simulation

No real data dependency. Ideal for controlled experiments with known ground truth.

Pipeline steps:

Initializes uniform healthy baseline: alpha_H = ones(n_glycans) * 10
For each random seed, generates alpha_U by randomly scaling alpha_H:
- up_frac (default 30%) of glycans are upregulated, with scale factors drawn from up_scale_range=(1.1, 3.0)
- down_frac (default 30%) of glycans are downregulated, with scale factors drawn from down_scale_range=(0.3, 0.9)
- Remaining glycans (~40%) stay unchanged
Samples clean cohorts from Dirichlet(alpha_H) and Dirichlet(alpha_U) with n_H healthy and n_U unhealthy samples
Defines batch effect direction vectors u_dict once per simulation run (a fixed seed ensures reproducible batch geometry across the parameter sweep)
Applies batch effects controlled by kappa_mu (shift strength) and var_b (variance scaling)
Optionally applies MNAR (Missing Not At Random) missingness:
- missing_fraction: proportion of missing values (0.0-1.0)
- mnar_bias: intensity-dependent bias (default 2.0, range 0.5-5.0)
- Left-censored pattern: low-abundance glycans more likely to be missing
Grid search over kappa_mu and var_b produces multiple datasets that share an identical batch-effect structure
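
The baseline, scaling, and sampling steps above can be sketched as follows. The function name make_alpha_U is hypothetical, and drawing the scale factors uniformly from each range is an assumption; the text only states the ranges, not the distribution.

```python
import numpy as np

def make_alpha_U(alpha_H, up_frac=0.3, down_frac=0.3,
                 up_scale_range=(1.1, 3.0), down_scale_range=(0.3, 0.9),
                 seed=0):
    """Hypothetical helper: scale random subsets of alpha_H up or down."""
    rng = np.random.default_rng(seed)
    n = alpha_H.size
    n_up = int(round(up_frac * n))
    n_down = int(round(down_frac * n))

    # Disjoint random subsets for up- and downregulated glycans.
    order = rng.permutation(n)
    up_idx = order[:n_up]
    down_idx = order[n_up:n_up + n_down]

    alpha_U = alpha_H.astype(float).copy()
    # Assumption: scale factors are uniform within the stated ranges.
    alpha_U[up_idx] *= rng.uniform(*up_scale_range, size=n_up)
    alpha_U[down_idx] *= rng.uniform(*down_scale_range, size=n_down)
    return alpha_U, up_idx, down_idx

n_glycans, n_H, n_U = 20, 15, 15
alpha_H = np.ones(n_glycans) * 10          # uniform healthy baseline
alpha_U, up_idx, down_idx = make_alpha_U(alpha_H, seed=42)

# Clean cohorts: one row per sample, each row sums to 1.
rng = np.random.default_rng(42)
healthy = rng.dirichlet(alpha_H, size=n_H)
unhealthy = rng.dirichlet(alpha_U, size=n_U)
```

Returning the index arrays alongside alpha_U keeps the ground truth available, which is the main reason to use a fully synthetic setup.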

Key parameters: n_glycans, n_H, n_U, kappa_mu, var_b, missing_fraction, mnar_bias
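
To make missing_fraction and mnar_bias concrete, here is one way a left-censored MNAR pattern could be produced. It is a sketch under assumptions: the function apply_mnar and the rank-based weighting are illustrative choices, not the weighting GlycoForge necessarily uses.

```python
import numpy as np

def apply_mnar(X, missing_fraction=0.1, mnar_bias=2.0, seed=0):
    """Hypothetical helper: mask low-abundance entries preferentially."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    flat = X.ravel()

    # Rank every entry onto [0, 1]: 0 is the lowest abundance, 1 the highest.
    ranks = flat.argsort().argsort() / max(flat.size - 1, 1)

    # Assumed weighting: lower abundance gets a larger drop weight, and a
    # larger mnar_bias makes the preference for low values steeper.
    weights = (1.0 - ranks + 1e-6) ** mnar_bias
    probs = weights / weights.sum()

    # Drop a fixed number of entries, sampled without replacement.
    n_missing = int(round(missing_fraction * flat.size))
    drop = rng.choice(flat.size, size=n_missing, replace=False, p=probs)

    out = flat.copy()
    out[drop] = np.nan
    return out.reshape(X.shape)

rng = np.random.default_rng(1)
X = rng.dirichlet(np.ones(20) * 10, size=30)   # 30 samples x 20 glycans
X_missing = apply_mnar(X, missing_fraction=0.2, mnar_bias=2.0, seed=1)
is_missing = np.isnan(X_missing)
```

Setting mnar_bias close to 0 in this sketch flattens the weights toward uniform, which approaches missing completely at random.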

Hybrid mode (data_source="real") – Extract biological effect from input reference data + simulate batch effect

Starts from real glycomics data to preserve biological signal structure. Accepts a CSV file or a glycowork.glycan_data dataset.

Pipeline steps:

Loads CSV and extracts healthy/unhealthy sample columns by prefix (configurable via column_prefix)
Runs CLR-based differential expression via glycowork.get_differential_expression to compute Cohen's d effect sizes
Reindexes effect sizes to match input glycan order (fills missing glycans with 0.0)
Applies differential_mask to select which glycans receive biological signal injection:
- "All": inject into all glycans
- "significant": only glycans marked significant by glycowork
- "Top-N": top N glycans by absolute effect size (e.g., "Top-10")
Processes effect sizes through robust_effect_size_processing:
- Centers effect sizes to remove global shift
- Applies Winsorization to clip extreme outliers (auto-selects percentile 85-99, or uses winsorize_percentile)
- Normalizes by baseline (baseline_method: median, MAD, or p75)
- Returns normalized d_robust scaled by bio_strength
Injects effects in CLR space: z_U = z_H + mask * bio_strength * d_robust
Converts back to proportions: p_U = invclr(z_U)
Scales by Dirichlet concentration: alpha_H = k_dir * p_H and alpha_U = (k_dir / variance_ratio) * p_U
Samples clean cohorts from Dirichlet(alpha_H) and Dirichlet(alpha_U) with n_H healthy and n_U unhealthy samples
Defines batch effect direction vectors u_dict once per run (fixed seed ensures fair comparison across parameter combinations)
Applies batch effects: y_batch = y_clean + kappa_mu * sigma * u_b + epsilon, where epsilon ~ N(0, sqrt(var_b) * sigma)
Optionally applies MNAR missingness (same as Simplified mode: left-censored pattern biased toward low-abundance glycans)
Grid search over bio_strength, k_dir, variance_ratio, kappa_mu, var_b to systematically test biological signal and batch effect interactions
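
The centering, Winsorization, normalization, and masking steps above can be illustrated with a small NumPy sketch. The functions robust_effects and top_n_mask are hypothetical stand-ins, not the library's robust_effect_size_processing or define_differential_mask. Median centering, a fixed percentile, and the exact meaning of each baseline_method are assumptions.

```python
import numpy as np

def robust_effects(d, winsorize_percentile=95, baseline_method="median"):
    """Hypothetical sketch: center, Winsorize, and normalize effect sizes."""
    d = np.asarray(d, dtype=float)

    # Center to remove a global shift (assumption: median centering).
    centered = d - np.median(d)

    # Winsorize: clip magnitudes above the chosen percentile of |d|.
    limit = np.percentile(np.abs(centered), winsorize_percentile)
    clipped = np.clip(centered, -limit, limit)

    # Normalize by a baseline magnitude (assumed definitions).
    if baseline_method == "median":
        baseline = np.median(np.abs(clipped))
    elif baseline_method == "mad":
        baseline = np.median(np.abs(clipped - np.median(clipped)))
    elif baseline_method == "p75":
        baseline = np.percentile(np.abs(clipped), 75)
    else:
        raise ValueError(f"unknown baseline_method: {baseline_method}")
    return clipped / baseline if baseline > 0 else clipped

def top_n_mask(d, n):
    """Hypothetical sketch of a "Top-N" mask by absolute effect size."""
    mask = np.zeros(len(d))
    mask[np.argsort(-np.abs(np.asarray(d)))[:n]] = 1.0
    return mask

# Illustrative Cohen's d values for eight glycans (made-up numbers).
d = np.array([0.2, -0.1, 0.5, 3.0, -0.4, 0.0, 0.1, -2.5])
d_robust = robust_effects(d, winsorize_percentile=95)
mask = top_n_mask(d, n=3)

# CLR-space shift that would be added to z_H.
bio_strength = 1.0
shift = mask * bio_strength * d_robust
```

With the median baseline, the typical magnitude of d_robust is 1, so bio_strength can be read as a multiple of a typical effect.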

Key parameters: data_file, column_prefix, bio_strength, k_dir, variance_ratio, differential_mask, winsorize_percentile, baseline_method, kappa_mu, var_b, missing_fraction, mnar_bias
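
The batch-effect formula and the grid search can be sketched together. The helper names are hypothetical. Taking sigma as the per-glycan standard deviation of the clean CLR data, reading sqrt(var_b) * sigma as the noise standard deviation, and using unit-norm, zero-sum direction vectors are all assumptions.

```python
import itertools
import numpy as np

def make_batch_directions(n_batches, n_glycans, seed=0):
    """Hypothetical helper: one fixed unit direction vector per batch."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=(n_batches, n_glycans))
    u -= u.mean(axis=1, keepdims=True)             # zero-sum, like CLR data
    u /= np.linalg.norm(u, axis=1, keepdims=True)  # unit length
    return {b: u[b] for b in range(n_batches)}

def apply_batch(Y_clean, batch_labels, u_dict, kappa_mu, var_b, seed=0):
    """y_batch = y_clean + kappa_mu * sigma * u_b + epsilon (assumed form)."""
    rng = np.random.default_rng(seed)
    sigma = Y_clean.std(axis=0, ddof=1)            # per-glycan spread
    Y = Y_clean.copy()
    for i, b in enumerate(batch_labels):
        eps = rng.normal(0.0, np.sqrt(var_b) * sigma)
        Y[i] += kappa_mu * sigma * u_dict[b] + eps
    return Y

# Stand-in clean CLR data: 12 samples x 6 glycans, rows centered.
rng = np.random.default_rng(7)
Y_clean = rng.normal(size=(12, 6))
Y_clean -= Y_clean.mean(axis=1, keepdims=True)
batch_labels = [0, 1, 2] * 4

# Directions are defined once, then reused for every grid point.
u_dict = make_batch_directions(n_batches=3, n_glycans=6, seed=123)

datasets = {}
for kappa_mu, var_b in itertools.product([0.0, 0.5, 1.0], [0.0, 0.5]):
    datasets[(kappa_mu, var_b)] = apply_batch(
        Y_clean, batch_labels, u_dict, kappa_mu, var_b, seed=0
    )
```

Because u_dict is built before the loop, every dataset in the grid shares the same batch geometry and differs only in shift strength and noise level.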

Use Cases

The use_cases/batch_correction/ directory demonstrates:

Running a GlycoForge simulation followed by a ComBat correction workflow
Visualizing batch-correction effectiveness metrics

Limitations and Future Work

Two biological groups only: The current implementation targets a healthy/unhealthy setup. Supporting multi-stage disease (>=3 groups) would require refactoring the Dirichlet parameter generation and the evaluation metrics.
Packaging: Source-first distribution for now. PyPI release planned once API stabilizes.

Release history

0.1.3 (Feb 18, 2026)
0.1.2 (Dec 20, 2025)
0.1.0 (Dec 19, 2025, this version)
