GlycoForge is a simulation tool for generating glycomic relative-abundance datasets with customizable biological group differences and controllable batch-effect injection.

Key Features

  • Two simulation modes: Fully synthetic, or hybrid (extract biological effect sizes from input reference data, then simulate batch effects)
  • Controllable effect injection: Systematic grid search over biological-effect and batch-effect strength parameters
  • MNAR missing data simulation: Mimics left-censored patterns biased toward low-abundance glycans

Quick Start

Installation

  • Python >= 3.10 required.
  • Core dependency: glycowork>=1.6.4

git clone https://github.com/BojarLab/GlycoForge.git
cd GlycoForge
python3.10 -m venv .venv
source .venv/bin/activate
pip install -e .

Usage

See run_simulation.ipynb for interactive examples, or use_cases/batch_correction/ for batch correction workflows.

How the simulator works

All effects are injected in CLR (centered log-ratio) space:

  • First, draw a healthy baseline composition from a Dirichlet prior: p_H ~ Dirichlet(alpha_H).
  • Flip to CLR: z_H = clr(p_H).
  • For selected glycans, push the signal using real or synthetic effect sizes: z_U = z_H + m * lambda * d_robust, where m is the differential mask, lambda is bio_strength, and d_robust is the effect vector after robust_effect_size_processing.
    • Simplified mode: draw synthetic effect sizes (log-fold changes) and pass them through the same robust processing pipeline.
    • Hybrid mode: start from the Cohen’s d values returned by glycowork.get_differential_expression; define_differential_mask lets you restrict the injection to significant hits or top-N glycans before scaling.
  • Invert back to proportions, p_U = invclr(z_U), and scale by k_dir to get alpha_U. The healthy and unhealthy groups use different Dirichlet strengths: alpha_H = k_dir * p_H and alpha_U = (k_dir / variance_ratio) * p_U, so variance_ratio controls their relative magnitude.
  • Batch effects ride on top as direction vectors u_b, so a clean CLR sample Y_clean becomes Y_with_batch = Y_clean + kappa_mu * u_b + epsilon, with var_b controlling spread.
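The biological-signal steps above can be sketched with plain NumPy. This is a minimal illustration under stated assumptions, not GlycoForge's own code: the helper names `clr` and `invclr`, the example mask, the random placeholder for d_robust, and the numeric values of lambda, k_dir, and variance_ratio are all hypothetical. The alpha_U scaling follows the formula given in the Hybrid mode steps.

```python
import numpy as np

def clr(p):
    """Centered log-ratio transform of a strictly positive composition."""
    log_p = np.log(p)
    return log_p - log_p.mean(axis=-1, keepdims=True)

def invclr(z):
    """Inverse CLR: a softmax maps CLR coordinates back to proportions."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n_glycans = 8

# Healthy baseline composition drawn from a Dirichlet prior
alpha_prior = np.full(n_glycans, 10.0)
p_H = rng.dirichlet(alpha_prior)
z_H = clr(p_H)

# Illustrative values only: mask the first three glycans and use a
# random placeholder in place of a processed effect vector
m = np.zeros(n_glycans)
m[:3] = 1.0
d_robust = rng.normal(size=n_glycans)
lam = 1.5  # plays the role of bio_strength

# Inject the effect in CLR space, then return to proportions
z_U = z_H + m * lam * d_robust
p_U = invclr(z_U)

# Dirichlet concentrations for both groups (example k_dir, variance_ratio)
k_dir, variance_ratio = 100.0, 2.0
alpha_H = k_dir * p_H
alpha_U = (k_dir / variance_ratio) * p_U
```

Because the softmax is invariant to a constant shift, `invclr` still returns a valid composition even though z_U is no longer centered after injection; unmasked glycans keep their proportions relative to one another.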

Simulation Modes

The pipeline entry point is glycoforge.simulate() with two modes controlled by data_source. Configuration files are in sample_config/.

Simplified mode (data_source="simulated") – Fully synthetic simulation

No real data dependency. Ideal for controlled experiments with known ground truth.

Pipeline steps:

  1. Initializes uniform healthy baseline: alpha_H = ones(n_glycans) * 10
  2. For each random seed, generates alpha_U by randomly scaling alpha_H:
    • up_frac (default 30%) upregulated with scale factors from up_scale_range=(1.1, 3.0)
    • down_frac (default 30%) downregulated with scale factors from down_scale_range=(0.3, 0.9)
    • Remaining glycans (~40%) stay unchanged
  3. Samples clean cohorts from Dirichlet(alpha_H) and Dirichlet(alpha_U) with n_H healthy and n_U unhealthy samples
  4. Defines batch effect direction vectors u_dict once per simulation run (fixed seed ensures reproducible batch geometry across parameter sweep)
  5. Applies batch effects controlled by kappa_mu (shift strength) and var_b (variance scaling)
  6. Optionally applies MNAR (Missing Not At Random) missingness:
    • missing_fraction: proportion of missing values (0.0-1.0)
    • mnar_bias: intensity-dependent bias (default 2.0, range 0.5-5.0)
    • Left-censored pattern: low-abundance glycans more likely to be missing
  7. Grid search over kappa_mu and var_b produces multiple datasets under identical batch effect structure
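Steps 1 to 3 can be sketched as follows. This is an illustration under assumptions rather than the package's implementation: the function name `make_alpha_U` is hypothetical, and scale factors are drawn uniformly from the stated ranges because the text does not specify the distribution.

```python
import numpy as np

def make_alpha_U(alpha_H, up_frac=0.3, down_frac=0.3,
                 up_scale_range=(1.1, 3.0), down_scale_range=(0.3, 0.9),
                 seed=0):
    """Scale a random subset of alpha_H up and another subset down."""
    rng = np.random.default_rng(seed)
    n = alpha_H.size
    n_up = int(round(up_frac * n))
    n_down = int(round(down_frac * n))
    # A single permutation keeps the up and down sets disjoint
    order = rng.permutation(n)
    up_idx = order[:n_up]
    down_idx = order[n_up:n_up + n_down]
    alpha_U = alpha_H.astype(float).copy()
    # Assumption: uniform draws within each scale range
    alpha_U[up_idx] *= rng.uniform(*up_scale_range, size=n_up)
    alpha_U[down_idx] *= rng.uniform(*down_scale_range, size=n_down)
    return alpha_U, up_idx, down_idx

n_glycans, n_H, n_U = 20, 15, 15
alpha_H = np.ones(n_glycans) * 10           # uniform healthy baseline
alpha_U, up_idx, down_idx = make_alpha_U(alpha_H, seed=42)

rng = np.random.default_rng(42)
healthy = rng.dirichlet(alpha_H, size=n_H)    # shape (n_H, n_glycans)
unhealthy = rng.dirichlet(alpha_U, size=n_U)  # shape (n_U, n_glycans)
```

The returned index arrays serve as ground truth: they record which glycans were truly up- or downregulated, which is what makes this mode useful for controlled experiments.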

Key parameters: n_glycans, n_H, n_U, kappa_mu, var_b, missing_fraction, mnar_bias
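One way to picture the MNAR step is a weighted draw in which low values are more likely to be dropped. The sketch below is an assumption about how such a mechanism could look, not the package's algorithm: the function name `apply_mnar` and the rank-based exponential weighting are hypothetical, and missingness is assigned per entry rather than per glycan.

```python
import numpy as np

def apply_mnar(X, missing_fraction=0.2, mnar_bias=2.0, seed=0):
    """Set a fraction of entries to NaN, favoring low-abundance values."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    flat = X.ravel().copy()
    # Normalized rank of each entry: 0 = lowest value, 1 = highest value
    ranks = flat.argsort().argsort() / (flat.size - 1)
    # Assumption: drop probability decays exponentially with rank,
    # so a larger mnar_bias concentrates missingness at low abundance
    weights = np.exp(-mnar_bias * ranks)
    weights /= weights.sum()
    n_missing = int(round(missing_fraction * flat.size))
    drop = rng.choice(flat.size, size=n_missing, replace=False, p=weights)
    flat[drop] = np.nan
    return flat.reshape(X.shape)

rng = np.random.default_rng(1)
X = rng.dirichlet(np.full(10, 2.0), size=50)   # 50 samples x 10 glycans
X_miss = apply_mnar(X, missing_fraction=0.2, mnar_bias=2.0)
```

With mnar_bias set to 0 the weights become uniform, which reduces this sketch to missing-completely-at-random.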

Hybrid mode (data_source="real") – Extract biological effect from input reference data + simulate batch effect

Starts from real glycomics data to preserve biological signal structure. Accepts a CSV file or a glycowork.glycan_data dataset.

Pipeline steps:

  1. Loads CSV and extracts healthy/unhealthy sample columns by prefix (configurable via column_prefix)
  2. Runs CLR-based differential expression via glycowork.get_differential_expression to compute Cohen's d effect sizes
  3. Reindexes effect sizes to match input glycan order (fills missing glycans with 0.0)
  4. Applies differential_mask to select which glycans receive biological signal injection:
    • "All": inject into all glycans
    • "significant": only glycans marked significant by glycowork
    • "Top-N": top N glycans by absolute effect size (e.g., "Top-10")
  5. Processes effect sizes through robust_effect_size_processing:
    • Centers effect sizes to remove global shift
    • Applies Winsorization to clip extreme outliers (auto-selects percentile 85-99, or uses winsorize_percentile)
    • Normalizes by baseline (baseline_method: median, MAD, or p75)
    • Returns the normalized effect vector d_robust, which is scaled by bio_strength at injection (step 6)
  6. Injects effects in CLR space: z_U = z_H + mask * bio_strength * d_robust
  7. Converts back to proportions: p_U = invclr(z_U)
  8. Scales by Dirichlet concentration: alpha_H = k_dir * p_H and alpha_U = (k_dir / variance_ratio) * p_U
  9. Samples clean cohorts from Dirichlet(alpha_H) and Dirichlet(alpha_U) with n_H healthy and n_U unhealthy samples
  10. Defines batch effect direction vectors u_dict once per run (fixed seed ensures fair comparison across parameter combinations)
  11. Applies batch effects: y_batch = y_clean + kappa_mu * sigma * u_b + epsilon, where epsilon ~ N(0, sqrt(var_b) * sigma)
  12. Optionally applies MNAR missingness (same as Simplified mode: left-censored pattern biased toward low-abundance glycans)
  13. Grid search over bio_strength, k_dir, variance_ratio, kappa_mu, var_b to systematically test biological signal and batch effect interactions
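Steps 4 and 5 can be approximated in a few lines. The sketch below is a simplified stand-in, not the package's define_differential_mask or robust_effect_size_processing: the function names `sketch_mask` and `sketch_robust_effects` are hypothetical, centering on the median is an assumption, and the Winsorization percentile is fixed rather than auto-selected.

```python
import numpy as np

def sketch_mask(d, significant, mode="All"):
    """Return a 0/1 mask selecting which glycans receive the effect."""
    d = np.asarray(d, dtype=float)
    if mode == "All":
        return np.ones_like(d)
    if mode == "significant":
        return np.asarray(significant, dtype=float)
    if mode.startswith("Top-"):
        n = int(mode.split("-", 1)[1])
        mask = np.zeros_like(d)
        mask[np.argsort(-np.abs(d))[:n]] = 1.0   # largest |d| first
        return mask
    raise ValueError(f"unknown mode: {mode}")

def sketch_robust_effects(d, winsorize_percentile=95, baseline_method="median"):
    """Center, Winsorize, and normalize a vector of effect sizes."""
    d = np.asarray(d, dtype=float)
    centered = d - np.median(d)                  # assumption: median centering
    limit = np.percentile(np.abs(centered), winsorize_percentile)
    clipped = np.clip(centered, -limit, limit)   # symmetric Winsorization
    if baseline_method == "median":
        baseline = np.median(np.abs(clipped))
    elif baseline_method == "MAD":
        baseline = np.median(np.abs(clipped - np.median(clipped)))
    elif baseline_method == "p75":
        baseline = np.percentile(np.abs(clipped), 75)
    else:
        raise ValueError(f"unknown baseline_method: {baseline_method}")
    return clipped / baseline if baseline > 0 else clipped

# Made-up Cohen's d values with one extreme outlier
d = np.array([0.1, -0.2, 0.05, 3.0, -0.4, 0.3, -0.1, 0.2])
significant = np.abs(d) > 0.25
mask = sketch_mask(d, significant, mode="Top-2")
d_robust = sketch_robust_effects(d)

bio_strength = 1.0
shift = mask * bio_strength * d_robust   # added to z_H in CLR space (step 6)
```

Winsorizing before normalizing keeps a single extreme glycan from dominating the injected shift, while the mask limits the shift to the selected glycans.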

Key parameters: data_file, column_prefix, bio_strength, k_dir, variance_ratio, differential_mask, winsorize_percentile, baseline_method, kappa_mu, var_b, missing_fraction, mnar_bias

Use Cases

The use_cases/batch_correction/ directory demonstrates:

  • Running a GlycoForge simulation followed by a ComBat correction workflow
  • Visualizing batch-correction effectiveness metrics

Limitations and Future Work

  1. Two biological groups only: The current implementation targets a healthy/unhealthy setup. Supporting multi-stage disease (>=3 groups) requires refactoring Dirichlet parameter generation and evaluation metrics.
  2. Packaging: Source-first distribution for now. PyPI release planned once API stabilizes.
