Skip to main content

Identify fertile ground in genomes for minimal-edit regulatory design.

Project description

fertilizer

The Fertile Ground Hypothesis is that genomes are full of "almost-regulatory" regions — sequences that do not do anything on their own, but can be minimally edited to achieve subtle and precise activity. Many near-motifs, for example, sit one or two substitutions away from binding a transcription factor and recruiting its downstream regulatory activity. fertilizer helps identify the fertile ground in a genome that is most useful for your design task.

[!WARNING] Everything about this package, including the unit tests and this README (except for this line), was generated by Claude. We are in the process of validating everything, but use with caution for now.

Installation

fertilizer is installable with uv. The PyPI distribution is named fertilizer-genomics (the fertilizer name was taken), but the importable module is still fertilizer.

# install from PyPI
uv pip install fertilizer-genomics

# or install directly from GitHub
uv pip install git+https://github.com/jmschrei/fertilizer.git

# or install from a local clone
git clone https://github.com/jmschrei/fertilizer.git
cd fertilizer
uv pip install -e .

For a development environment with test and lint tooling:

uv venv
uv pip install -e ".[dev]"
pytest tests/

Usage

fertilizer exposes two subcommands. Typical workflow:

fertilizer extract -w A.bw B.bw C.bw -b regions.bed -o signals.tsv
fertilizer diff    -i signals.tsv -c A B C -o diff.tsv

fertilizer extract — signal aggregation

Compute the mean signal in each bigWig over each region in the concatenated BED input. One row per region, one column per bigWig. Input row order is preserved.

flag description
-w, --bigwigs one or more bigWig signal tracks
-b, --beds one or more BED region files (first three columns are used; # comments are skipped)
-o, --output path to the output TSV
-s, --stat per-region summary statistic: mean (default), max, min, sum, std, coverage. Maps directly to pyBigWig's stats(type=...)
-j, --n-jobs parallel workers (default -1, all cores)

Zeros never mean "missing" — the output is always numeric, never NaN. A region whose mean signal genuinely resolves to zero (empty bigWig, uncovered span) is reported as 0.0 silently. A region with a locus-level problem (unknown chromosome, coordinates past the end of the chromosome, zero-length interval, negative start) is also reported as 0.0 but triggers a single FertilizerWarning — one warning per distinct issue type per run, regardless of how many rows or bigWigs were affected.

Example signals.tsv:

chrom   start   end     A       B       C
chr1    0       500     2.0     7.0     3.1
chr1    500     1000    6.0     7.0     5.8
chr1    400     600     4.0     7.0     4.4
chr2    0       100     0.0     0.0     0.0

Column names come from each bigWig's filename stem, so passing two bigWigs with the same basename (even from different directories) is rejected up front.

fertilizer diff — differential activity

Identify loci that vary across two or more conditions. Takes a TSV whose columns include (at minimum) the non-negative numeric columns named via -c — the output of fertilizer extract, which also carries chrom/start/end, is the canonical input and is passed through verbatim — names the columns to compare, and writes a filtered TSV containing only loci that pass the significance threshold, with effect size, p-value, q-value, and the name of the condition carrying the highest signal. Any columns present in the input beyond the tested conditions are preserved in the output.

The test is a negative-binomial GLM likelihood-ratio test per locus, inspired by DESeq2 (Love, Huber, Anders, 2014) and adapted to the one-replicate-per-condition setting this package targets. Steps:

  1. Size factors. DESeq2's median-of-ratios, computed jointly across all conditions from loci with positive signal in every condition.
  2. Dispersion. With no within-condition replicates, per-locus variance cannot be estimated from repeated observations. A dispersion parameter α is estimated across loci under the null-majority assumption. Default (--fit-type common) uses the median of per-locus method-of-moments estimates, scaled by a closed-form correction for the small-df median-of-χ² bias; an alternative parametric trend α(μ) = a/μ + b (--fit-type parametric) is available, fit by robust (MAD-trimmed) weighted least squares and subject to the same scale correction.
  3. NB-GLM LRT + BH. Per locus, the null (all conditions share one mean μ₀) is fit by vectorized Newton iteration on the intercept-only NB score equation; the full model is saturated (μⱼ = Xⱼ). The statistic T = 2·(ℓ_full − ℓ_null) is distributed as χ²(K−1) asymptotically under the null. Benjamini–Hochberg q-values control the FDR across loci.

FDR assumes independent (or positively dependent) loci. BH controls FDR under independence or positive regression dependency (PRDS). BigWig signal on adjacent windows is correlated — overlapping tiling windows, shared peaks, or bin sizes smaller than the underlying signal's autocorrelation length all introduce dependence. For typical "one row per peak / per gene" inputs this is fine. For dense sliding-window inputs (e.g. 100 bp windows stepped every 50 bp), q-values will be optimistic; prefer non-overlapping windows or thin by autocorrelation length before trusting the FDR.

Columns appended to the output:

column meaning
effect_size max_k |log2(X_k/s_k + pc) − log2(grand_mean + pc)|, i.e. the largest fold change of any single condition vs the per-locus grand mean on the size-factor-normalized scale (computed as a log-difference; equivalent to a log2 ratio when pc is small)
p_value χ²(K−1) p-value from the NB-GLM LRT
q_value Benjamini–Hochberg q-value
max_condition name of the condition column with the highest normalized signal

Size factors, the estimated dispersion fit, its trend coefficients, and the number of kept/total loci are printed to stderr. The dispersion-fit label is one of:

label meaning
common --fit-type common succeeded (single α = median of per-locus MoM estimates)
parametric --fit-type parametric succeeded (fitted α(μ) = a/μ + b)
common-fallback --fit-type parametric was requested but failed; fell back to the common fit
override --dispersion was supplied; the fixed α was used at every locus
zero --fit-type zero was supplied; Poisson was forced at every locus

Size-factor spread, Poisson fallbacks, --fit-type zero, common-fallback, and K=2 calibration all emit a FertilizerDiffWarning on stderr.

CLI flag effect
-i, --input input TSV
-c, --conditions two or more column names to compare
-o, --output output TSV, filtered to loci passing the threshold, with extra columns appended
--q-threshold keep loci with q ≤ this (default 0.05; set to 1.0 to keep all rows)
--p-threshold additionally keep only loci with raw p ≤ this (default: off)
--fit-type dispersion model: common (default, median of MoM estimates, bias-corrected), parametric (fits α(μ) = a/μ + b), or zero (forces Poisson — diagnostic only, strictly anti-conservative if real overdispersion exists)
--min-signal minimum mean normalized signal for loci included in dispersion estimation (default 5.0)
--dispersion override the fitted α with a fixed value applied to every locus; bypasses --fit-type and --min-signal entirely and sets dispersion_fit = override. Useful for sensitivity analyses (e.g. re-run at 0.05 and 0.10 to see how much calls depend on α)
--pseudocount pseudocount for the effect-size log2 transform only; does not affect the LRT. Must be > 0 (default 0.5)

Differences from DESeq2 (non-exhaustive):

  • Per-locus dispersion MLE. DESeq2 uses Cox-Reid adjusted profile likelihood; we use method-of-moments. MoM is less efficient per locus but is consistent and does not fail to converge.
  • Shrinkage. DESeq2 shrinks per-locus dispersion toward the trend via a log-normal empirical-Bayes prior and retains dispersion outliers. We use the trend value directly for every locus (equivalent to infinite shrinkage, no outlier retention) — robust for the small-K setting, but cannot capture genuinely heterogeneous per-locus dispersion.
  • Bias correction. Because median(χ²(K−1))/(K−1) < 1 for small K (0.69 at K=3, 0.84 at K=5, 0.91 at K=8), median-of-MoM underestimates α. We apply the closed-form correction.
  • Log2 fold change shrinkage. DESeq2 optionally shrinks LFC estimates (apeglm / ashr); we report raw max |log2FC vs grand mean| as the effect size.
  • Observation-level outliers. DESeq2 uses Cook's distance to flag and optionally refit without outliers. We don't.
  • Independent filtering. DESeq2 filters low-signal loci out of multiple-testing correction to maximize power at a given FDR. We don't — use --min-signal (dispersion-only) or pre-filter the input TSV if you want this.
  • Arbitrary designs. DESeq2's LRT supports any full vs reduced formula. We hard-code the all-conditions-equal null vs per-condition-mean full.
  • Integer counts. DESeq2 is designed for integer RNA-seq counts; the NB likelihood here is evaluated with scipy.special.gammaln and is numerically correct for any non-negative float input. (This matches the common practice of passing fractional RSEM/salmon expected counts to DESeq2 via tximport, and is required here because bigWig region means are real-valued.)

Known calibration behavior. Our test suite verifies Type-I error at α=0.05 stays under 0.08 (~1.6× nominal) on Poisson nulls for K ∈ {3, 5, 8} and μ ∈ {30, 100, 500}, and under 0.09 (~1.8× nominal) on NB nulls for K ∈ {3, 5, 8}, μ = 100, α_true ∈ {0.05, 0.10}. At low μ (< ~5) the delta-method and the median-bias correction both degrade, and low-μ loci are excluded from dispersion estimation via --min-signal. If you care about exact FDR for a specific dataset, cross-validate against DESeq2.

Supply a mix of positive and negative loci. The size-factor and dispersion steps both assume that most loci are not differential — the "null majority" is what anchors the normalization and the variance estimate. If you run diff on a set of regions pre-filtered to be those you expect to change, you will get sub-optimal results: the estimated library-size differences will absorb real biological differences, and the dispersion estimate will be inflated by the true positives. For best results pass a genome-wide or background-matched set of loci containing both putative-differential and expected-stable regions.

Example (three conditions):

fertilizer diff -i signals.tsv -c A B C -o diff.tsv --q-threshold 0.05

Output diff.tsv (filtered, numbers illustrative):

chrom   start   end     A     B     C       effect_size     p_value     q_value     max_condition
chr1    0       500     2.0   2.1   8.4     2.07            0.0004      0.0032      C
chr3    1200    1700    15.2  3.6   4.1     2.08            0.0011      0.0060      A
chr7    900     1400    8.8   8.9   30.1    1.78            0.0018      0.0090      C

Python API

Both subcommands are thin wrappers around library functions. The same work can be done from Python:

import numpy as np
import pandas as pd

from fertilizer.extract import bigwig_region_means, load_regions, FertilizerWarning
from fertilizer.diff import (
    differential_analysis,
    size_factors,
    bh_qvalues,
    DiffResult,
    FertilizerDiffWarning,
)

# --- extract: region means for one bigWig -------------------------
regions = load_regions(["regions.bed"])                  # chrom/start/end
values, issues = bigwig_region_means(regions, "A.bw")    # np.ndarray, set[str]
# `issues` is a subset of {"missing_chrom", "out_of_bounds", "invalid_region"}

# --- diff: NB-GLM LRT on a (n_loci, n_conditions) array -----------
counts = pd.read_csv("signals.tsv", sep="\t")[["A", "B", "C"]].to_numpy(float)
res: DiffResult = differential_analysis(counts, fit_type="common")
# res.p_value, res.q_value, res.effect_size, res.size_factors,
# res.per_locus_dispersion, res.dispersion_fit, res.dispersion_trend,
# res.max_condition_idx  (all np.ndarray except the two scalars)

FertilizerWarning (locus-level issues from extract) and FertilizerDiffWarning (size-factor spread, Poisson fallbacks from diff) are both UserWarning subclasses — catch them with warnings.catch_warnings() or filter them with warnings.simplefilter(..., FertilizerWarning).

Project layout

fertilizer/
├── pyproject.toml             # package metadata + uv/hatchling build config
├── README.md
├── LICENSE
├── src/
│   └── fertilizer/
│       ├── __init__.py
│       ├── cli.py             # top-level argparse dispatcher
│       ├── extract.py         # signal aggregation + `extract` subcommand
│       └── diff.py            # differential-activity analysis + `diff` subcommand
└── tests/
    ├── test_cli.py
    └── test_diff.py

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

fertilizer_genomics-0.1.0.tar.gz (28.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

fertilizer_genomics-0.1.0-py3-none-any.whl (21.7 kB view details)

Uploaded Python 3

File details

Details for the file fertilizer_genomics-0.1.0.tar.gz.

File metadata

  • Download URL: fertilizer_genomics-0.1.0.tar.gz
  • Upload date:
  • Size: 28.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for fertilizer_genomics-0.1.0.tar.gz
Algorithm Hash digest
SHA256 47f9916ddb1b5ee90e78d2aca710719c2c49bd3e394b19d8bc40c3bf2bcb3340
MD5 5b9c9e27398affceafa1eb0761801796
BLAKE2b-256 85565686201fbf666e7662631f280ce764e056bfb83aa3dae28181c2b1e9754c

See more details on using hashes here.

File details

Details for the file fertilizer_genomics-0.1.0-py3-none-any.whl.

File metadata

File hashes

Hashes for fertilizer_genomics-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 430ada73b9fca16c733446a9be244d61dea80bea398c93e413b8f3181868eb6c
MD5 c432149fb1141f38ffaf58017cb1a73d
BLAKE2b-256 bfb75ec5825cac4336cddf2bef8b2c0b112bb0c6a0c013261df674cf95fa1f54

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page