Identify fertile ground in genomes for minimal-edit regulatory design.
Project description
fertilizer
The Fertile Ground Hypothesis is that genomes are full of "almost-regulatory" regions — sequences that do not do anything on their own, but can be minimally edited to achieve subtle and precise activity. Many near-motifs, for example, sit one or two substitutions away from binding a transcription factor and recruiting its downstream regulatory activity. fertilizer helps identify the fertile ground in a genome that is most useful for your design task.
[!WARNING] Everything about this package, including the unit tests and this README (except for this line), was generated by Claude. We are in the process of validating everything, but use with caution for now.
Installation
fertilizer is installable with uv. The PyPI
distribution is named fertilizer-genomics (the fertilizer name was taken),
but the importable module is still fertilizer.
# install from PyPI
uv pip install fertilizer-genomics
# or install directly from GitHub
uv pip install git+https://github.com/jmschrei/fertilizer.git
# or install from a local clone
git clone https://github.com/jmschrei/fertilizer.git
cd fertilizer
uv pip install -e .
For a development environment with test and lint tooling:
uv venv
uv pip install -e ".[dev]"
pytest tests/
Usage
fertilizer exposes two subcommands. Typical workflow:
fertilizer extract -w A.bw B.bw C.bw -b regions.bed -o signals.tsv
fertilizer diff -i signals.tsv -c A B C -o diff.tsv
fertilizer extract — signal aggregation
Compute the mean signal in each bigWig over each region in the concatenated BED input. One row per region, one column per bigWig. Input row order is preserved.
| flag | description |
|---|---|
-w, --bigwigs |
one or more bigWig signal tracks |
-b, --beds |
one or more BED region files (first three columns are used; # comments are skipped) |
-o, --output |
path to the output TSV |
-s, --stat |
per-region summary statistic: mean (default), max, min, sum, std, coverage. Maps directly to pyBigWig's stats(type=...) |
-j, --n-jobs |
parallel workers (default -1, all cores) |
Zeros never mean "missing" — the output is always numeric, never NaN. A region whose mean signal genuinely resolves to zero (empty bigWig, uncovered span) is reported as 0.0 silently. A region with a locus-level problem (unknown chromosome, coordinates past the end of the chromosome, zero-length interval, negative start) is also reported as 0.0 but triggers a single FertilizerWarning — one warning per distinct issue type per run, regardless of how many rows or bigWigs were affected.
Example signals.tsv:
chrom start end A B C
chr1 0 500 2.0 7.0 3.1
chr1 500 1000 6.0 7.0 5.8
chr1 400 600 4.0 7.0 4.4
chr2 0 100 0.0 0.0 0.0
Column names come from each bigWig's filename stem, so passing two bigWigs with the same basename (even from different directories) is rejected up front.
fertilizer diff — differential activity
Identify loci that vary across two or more conditions. Takes a TSV whose columns include (at minimum) the non-negative numeric columns named via -c — the output of fertilizer extract, which also carries chrom/start/end, is the canonical input and is passed through verbatim — names the columns to compare, and writes a filtered TSV containing only loci that pass the significance threshold, with effect size, p-value, q-value, and the name of the condition carrying the highest signal. Any columns present in the input beyond the tested conditions are preserved in the output.
The test is a negative-binomial GLM likelihood-ratio test per locus, inspired by DESeq2 (Love, Huber, Anders, 2014) and adapted to the one-replicate-per-condition setting this package targets. Steps:
- Size factors. DESeq2's median-of-ratios, computed jointly across all conditions from loci with positive signal in every condition.
- Dispersion. With no within-condition replicates, per-locus variance cannot be estimated from repeated observations. A dispersion parameter
αis estimated across loci under the null-majority assumption. Default (--fit-type common) uses the median of per-locus method-of-moments estimates, scaled by a closed-form correction for the small-df median-of-χ² bias; an alternative parametric trendα(μ) = a/μ + b(--fit-type parametric) is available, fit by robust (MAD-trimmed) weighted least squares and subject to the same scale correction. - NB-GLM LRT + BH. Per locus, the null (all conditions share one mean
μ₀) is fit by vectorized Newton iteration on the intercept-only NB score equation; the full model is saturated (μⱼ = Xⱼ). The statisticT = 2·(ℓ_full − ℓ_null)is distributed as χ²(K−1) asymptotically under the null. Benjamini–Hochberg q-values control the FDR across loci.
FDR assumes independent (or positively dependent) loci. BH controls FDR under independence or positive regression dependency (PRDS). BigWig signal on adjacent windows is correlated — overlapping tiling windows, shared peaks, or bin sizes smaller than the underlying signal's autocorrelation length all introduce dependence. For typical "one row per peak / per gene" inputs this is fine. For dense sliding-window inputs (e.g. 100 bp windows stepped every 50 bp), q-values will be optimistic; prefer non-overlapping windows or thin by autocorrelation length before trusting the FDR.
Columns appended to the output:
| column | meaning |
|---|---|
effect_size |
max_k |log2(X_k/s_k + pc) − log2(grand_mean + pc)|, i.e. the largest fold change of any single condition vs the per-locus grand mean on the size-factor-normalized scale (computed as a log-difference; equivalent to a log2 ratio when pc is small) |
p_value |
χ²(K−1) p-value from the NB-GLM LRT |
q_value |
Benjamini–Hochberg q-value |
max_condition |
name of the condition column with the highest normalized signal |
Size factors, the estimated dispersion fit, its trend coefficients, and the number of kept/total loci are printed to stderr. The dispersion-fit label is one of:
| label | meaning |
|---|---|
common |
--fit-type common succeeded (single α = median of per-locus MoM estimates) |
parametric |
--fit-type parametric succeeded (fitted α(μ) = a/μ + b) |
common-fallback |
--fit-type parametric was requested but failed; fell back to the common fit |
override |
--dispersion was supplied; the fixed α was used at every locus |
zero |
--fit-type zero was supplied; Poisson was forced at every locus |
Size-factor spread, Poisson fallbacks, --fit-type zero, common-fallback, and K=2 calibration all emit a FertilizerDiffWarning on stderr.
| CLI flag | effect |
|---|---|
-i, --input |
input TSV |
-c, --conditions |
two or more column names to compare |
-o, --output |
output TSV, filtered to loci passing the threshold, with extra columns appended |
--q-threshold |
keep loci with q ≤ this (default 0.05; set to 1.0 to keep all rows) |
--p-threshold |
additionally keep only loci with raw p ≤ this (default: off) |
--fit-type |
dispersion model: common (default, median of MoM estimates, bias-corrected), parametric (fits α(μ) = a/μ + b), or zero (forces Poisson — diagnostic only, strictly anti-conservative if real overdispersion exists) |
--min-signal |
minimum mean normalized signal for loci included in dispersion estimation (default 5.0) |
--dispersion |
override the fitted α with a fixed value applied to every locus; bypasses --fit-type and --min-signal entirely and sets dispersion_fit = override. Useful for sensitivity analyses (e.g. re-run at 0.05 and 0.10 to see how much calls depend on α) |
--pseudocount |
pseudocount for the effect-size log2 transform only; does not affect the LRT. Must be > 0 (default 0.5) |
Differences from DESeq2 (non-exhaustive):
- Per-locus dispersion MLE. DESeq2 uses Cox-Reid adjusted profile likelihood; we use method-of-moments. MoM is less efficient per locus but is consistent and does not fail to converge.
- Shrinkage. DESeq2 shrinks per-locus dispersion toward the trend via a log-normal empirical-Bayes prior and retains dispersion outliers. We use the trend value directly for every locus (equivalent to infinite shrinkage, no outlier retention) — robust for the small-K setting, but cannot capture genuinely heterogeneous per-locus dispersion.
- Bias correction. Because median(χ²(K−1))/(K−1) < 1 for small K (0.69 at K=3, 0.84 at K=5, 0.91 at K=8), median-of-MoM underestimates α. We apply the closed-form correction.
- Log2 fold change shrinkage. DESeq2 optionally shrinks LFC estimates (apeglm / ashr); we report raw
max |log2FC vs grand mean|as the effect size. - Observation-level outliers. DESeq2 uses Cook's distance to flag and optionally refit without outliers. We don't.
- Independent filtering. DESeq2 filters low-signal loci out of multiple-testing correction to maximize power at a given FDR. We don't — use
--min-signal(dispersion-only) or pre-filter the input TSV if you want this. - Arbitrary designs. DESeq2's LRT supports any
fullvsreducedformula. We hard-code the all-conditions-equal null vs per-condition-mean full. - Integer counts. DESeq2 is designed for integer RNA-seq counts; the NB likelihood here is evaluated with
scipy.special.gammalnand is numerically correct for any non-negative float input. (This matches the common practice of passing fractional RSEM/salmon expected counts to DESeq2 viatximport, and is required here because bigWig region means are real-valued.)
Known calibration behavior. Our test suite verifies Type-I error at α=0.05 stays under 0.08 (~1.6× nominal) on Poisson nulls for K ∈ {3, 5, 8} and μ ∈ {30, 100, 500}, and under 0.09 (~1.8× nominal) on NB nulls for K ∈ {3, 5, 8}, μ = 100, α_true ∈ {0.05, 0.10}. At low μ (< ~5) the delta-method and the median-bias correction both degrade, and low-μ loci are excluded from dispersion estimation via --min-signal. If you care about exact FDR for a specific dataset, cross-validate against DESeq2.
Supply a mix of positive and negative loci. The size-factor and dispersion steps both assume that most loci are not differential — the "null majority" is what anchors the normalization and the variance estimate. If you run
diffon a set of regions pre-filtered to be those you expect to change, you will get sub-optimal results: the estimated library-size differences will absorb real biological differences, and the dispersion estimate will be inflated by the true positives. For best results pass a genome-wide or background-matched set of loci containing both putative-differential and expected-stable regions.
Example (three conditions):
fertilizer diff -i signals.tsv -c A B C -o diff.tsv --q-threshold 0.05
Output diff.tsv (filtered, numbers illustrative):
chrom start end A B C effect_size p_value q_value max_condition
chr1 0 500 2.0 2.1 8.4 2.07 0.0004 0.0032 C
chr3 1200 1700 15.2 3.6 4.1 2.08 0.0011 0.0060 A
chr7 900 1400 8.8 8.9 30.1 1.78 0.0018 0.0090 C
Python API
Both subcommands are thin wrappers around library functions. The same work can be done from Python:
import numpy as np
import pandas as pd
from fertilizer.extract import bigwig_region_means, load_regions, FertilizerWarning
from fertilizer.diff import (
differential_analysis,
size_factors,
bh_qvalues,
DiffResult,
FertilizerDiffWarning,
)
# --- extract: region means for one bigWig -------------------------
regions = load_regions(["regions.bed"]) # chrom/start/end
values, issues = bigwig_region_means(regions, "A.bw") # np.ndarray, set[str]
# `issues` is a subset of {"missing_chrom", "out_of_bounds", "invalid_region"}
# --- diff: NB-GLM LRT on a (n_loci, n_conditions) array -----------
counts = pd.read_csv("signals.tsv", sep="\t")[["A", "B", "C"]].to_numpy(float)
res: DiffResult = differential_analysis(counts, fit_type="common")
# res.p_value, res.q_value, res.effect_size, res.size_factors,
# res.per_locus_dispersion, res.dispersion_fit, res.dispersion_trend,
# res.max_condition_idx (all np.ndarray except the two scalars)
FertilizerWarning (locus-level issues from extract) and FertilizerDiffWarning (size-factor spread, Poisson fallbacks from diff) are both UserWarning subclasses — catch them with warnings.catch_warnings() or filter them with warnings.simplefilter(..., FertilizerWarning).
Project layout
fertilizer/
├── pyproject.toml # package metadata + uv/hatchling build config
├── README.md
├── LICENSE
├── src/
│ └── fertilizer/
│ ├── __init__.py
│ ├── cli.py # top-level argparse dispatcher
│ ├── extract.py # signal aggregation + `extract` subcommand
│ └── diff.py # differential-activity analysis + `diff` subcommand
└── tests/
├── test_cli.py
└── test_diff.py
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file fertilizer_genomics-0.1.0.tar.gz.
File metadata
- Download URL: fertilizer_genomics-0.1.0.tar.gz
- Upload date:
- Size: 28.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
47f9916ddb1b5ee90e78d2aca710719c2c49bd3e394b19d8bc40c3bf2bcb3340
|
|
| MD5 |
5b9c9e27398affceafa1eb0761801796
|
|
| BLAKE2b-256 |
85565686201fbf666e7662631f280ce764e056bfb83aa3dae28181c2b1e9754c
|
File details
Details for the file fertilizer_genomics-0.1.0-py3-none-any.whl.
File metadata
- Download URL: fertilizer_genomics-0.1.0-py3-none-any.whl
- Upload date:
- Size: 21.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
430ada73b9fca16c733446a9be244d61dea80bea398c93e413b8f3181868eb6c
|
|
| MD5 |
c432149fb1141f38ffaf58017cb1a73d
|
|
| BLAKE2b-256 |
bfb75ec5825cac4336cddf2bef8b2c0b112bb0c6a0c013261df674cf95fa1f54
|