Identify fertile ground in genomes for minimal-edit regulatory design.

These details have not been verified by PyPI

Project links

Project description

fertilizer

The Fertile Ground Hypothesis is that genomes are full of "almost-regulatory" regions — sequences that do not do anything on their own, but can be minimally edited to achieve subtle and precise activity. Many near-motifs, for example, sit one or two substitutions away from binding a transcription factor and recruiting its downstream regulatory activity. fertilizer helps identify the fertile ground in a genome that is most useful for your design task.

[!WARNING] Everything about this package, including the unit tests and this README (except for this line), was generated by Claude. We are in the process of validating everything, but use with caution for now.

Installation

fertilizer is installable with uv. The PyPI distribution is named fertilizer-genomics (the fertilizer name was taken), but the importable module is still fertilizer.

# install from PyPI
uv pip install fertilizer-genomics

# or install directly from GitHub
uv pip install git+https://github.com/jmschrei/fertilizer.git

# or install from a local clone
git clone https://github.com/jmschrei/fertilizer.git
cd fertilizer
uv pip install -e .

For a development environment with test and lint tooling:

uv venv
uv pip install -e ".[dev]"
pytest tests/

Usage

fertilizer exposes two subcommands. Typical workflow:

fertilizer extract -w A.bw B.bw C.bw -b regions.bed -o signals.tsv
fertilizer diff    -i signals.tsv -c A B C -o diff.tsv

`fertilizer extract` — signal aggregation

Compute the mean signal in each bigWig over each region in the concatenated BED input. One row per region, one column per bigWig. Input row order is preserved.

flag	description
`-w`, `--bigwigs`	one or more bigWig signal tracks
`-b`, `--beds`	one or more BED region files (first three columns are used; `#` comments are skipped)
`-o`, `--output`	path to the output TSV
`-s`, `--stat`	per-region summary statistic: `mean` (default), `max`, `min`, `sum`, `std`, `coverage`. Maps directly to pyBigWig's `stats(type=...)`
`-j`, `--n-jobs`	parallel workers (default `-1`, all cores)

Zeros never mean "missing" — the output is always numeric, never NaN. A region whose mean signal genuinely resolves to zero (empty bigWig, uncovered span) is reported as 0.0 silently. A region with a locus-level problem (unknown chromosome, coordinates past the end of the chromosome, zero-length interval, negative start) is also reported as 0.0 but triggers a single FertilizerWarning — one warning per distinct issue type per run, regardless of how many rows or bigWigs were affected.

Example signals.tsv:

chrom   start   end     A       B       C
chr1    0       500     2.0     7.0     3.1
chr1    500     1000    6.0     7.0     5.8
chr1    400     600     4.0     7.0     4.4
chr2    0       100     0.0     0.0     0.0

Column names come from each bigWig's filename stem, so passing two bigWigs with the same basename (even from different directories) is rejected up front.

`fertilizer diff` — differential activity

Identify loci that vary across two or more conditions. Takes a TSV whose columns include (at minimum) the non-negative numeric columns named via -c — the output of fertilizer extract, which also carries chrom/start/end, is the canonical input and is passed through verbatim — names the columns to compare, and writes a filtered TSV containing only loci that pass the significance threshold, with effect size, p-value, q-value, and the name of the condition carrying the highest signal. Any columns present in the input beyond the tested conditions are preserved in the output.

The test is a negative-binomial GLM likelihood-ratio test per locus, inspired by DESeq2 (Love, Huber, Anders, 2014) and adapted to the one-replicate-per-condition setting this package targets. Steps:

Size factors. DESeq2's median-of-ratios, computed jointly across all conditions from loci with positive signal in every condition.
Dispersion. With no within-condition replicates, per-locus variance cannot be estimated from repeated observations. A dispersion parameter α is estimated across loci under the null-majority assumption. Default (--fit-type common) uses the median of per-locus method-of-moments estimates, scaled by a closed-form correction for the small-df median-of-χ² bias; an alternative parametric trend α(μ) = a/μ + b (--fit-type parametric) is available, fit by robust (MAD-trimmed) weighted least squares and subject to the same scale correction.
NB-GLM LRT + BH. Per locus, the null (all conditions share one mean μ₀) is fit by vectorized Newton iteration on the intercept-only NB score equation; the full model is saturated (μⱼ = Xⱼ). The statistic T = 2·(ℓ_full − ℓ_null) is distributed as χ²(K−1) asymptotically under the null. Benjamini–Hochberg q-values control the FDR across loci.

FDR assumes independent (or positively dependent) loci. BH controls FDR under independence or positive regression dependency (PRDS). BigWig signal on adjacent windows is correlated — overlapping tiling windows, shared peaks, or bin sizes smaller than the underlying signal's autocorrelation length all introduce dependence. For typical "one row per peak / per gene" inputs this is fine. For dense sliding-window inputs (e.g. 100 bp windows stepped every 50 bp), q-values will be optimistic; prefer non-overlapping windows or thin by autocorrelation length before trusting the FDR.

Columns appended to the output:

column	meaning
`effect_size`	max_k \|log2(X_k/s_k + pc) − log2(grand_mean + pc)\|, i.e. the largest fold change of any single condition vs the per-locus grand mean on the size-factor-normalized scale (computed as a log-difference; equivalent to a log2 ratio when pc is small)
`p_value`	χ²(K−1) p-value from the NB-GLM LRT
`q_value`	Benjamini–Hochberg q-value
`max_condition`	name of the condition column with the highest normalized signal

Size factors, the estimated dispersion fit, its trend coefficients, and the number of kept/total loci are printed to stderr. The dispersion-fit label is one of:

label	meaning
`common`	`--fit-type common` succeeded (single α = median of per-locus MoM estimates)
`parametric`	`--fit-type parametric` succeeded (fitted `α(μ) = a/μ + b`)
`common-fallback`	`--fit-type parametric` was requested but failed; fell back to the common fit
`override`	`--dispersion` was supplied; the fixed α was used at every locus
`zero`	`--fit-type zero` was supplied; Poisson was forced at every locus

Size-factor spread, Poisson fallbacks, --fit-type zero, common-fallback, and K=2 calibration all emit a FertilizerDiffWarning on stderr.

CLI flag	effect
`-i`, `--input`	input TSV
`-c`, `--conditions`	two or more column names to compare
`-o`, `--output`	output TSV, filtered to loci passing the threshold, with extra columns appended
`--q-threshold`	keep loci with q ≤ this (default `0.05`; set to `1.0` to keep all rows)
`--p-threshold`	additionally keep only loci with raw p ≤ this (default: off)
`--fit-type`	dispersion model: `common` (default, median of MoM estimates, bias-corrected), `parametric` (fits `α(μ) = a/μ + b`), or `zero` (forces Poisson — diagnostic only, strictly anti-conservative if real overdispersion exists)
`--min-signal`	minimum mean normalized signal for loci included in dispersion estimation (default `5.0`)
`--dispersion`	override the fitted α with a fixed value applied to every locus; bypasses `--fit-type` and `--min-signal` entirely and sets `dispersion_fit = override`. Useful for sensitivity analyses (e.g. re-run at 0.05 and 0.10 to see how much calls depend on α)
`--pseudocount`	pseudocount for the effect-size log2 transform only; does not affect the LRT. Must be > 0 (default `0.5`)

Differences from DESeq2 (non-exhaustive):

Per-locus dispersion MLE. DESeq2 uses Cox-Reid adjusted profile likelihood; we use method-of-moments. MoM is less efficient per locus but is consistent and does not fail to converge.
Shrinkage. DESeq2 shrinks per-locus dispersion toward the trend via a log-normal empirical-Bayes prior and retains dispersion outliers. We use the trend value directly for every locus (equivalent to infinite shrinkage, no outlier retention) — robust for the small-K setting, but cannot capture genuinely heterogeneous per-locus dispersion.
Bias correction. Because median(χ²(K−1))/(K−1) < 1 for small K (0.69 at K=3, 0.84 at K=5, 0.91 at K=8), median-of-MoM underestimates α. We apply the closed-form correction.
Log2 fold change shrinkage. DESeq2 optionally shrinks LFC estimates (apeglm / ashr); we report raw max |log2FC vs grand mean| as the effect size.
Observation-level outliers. DESeq2 uses Cook's distance to flag and optionally refit without outliers. We don't.
Independent filtering. DESeq2 filters low-signal loci out of multiple-testing correction to maximize power at a given FDR. We don't — use --min-signal (dispersion-only) or pre-filter the input TSV if you want this.
Arbitrary designs. DESeq2's LRT supports any full vs reduced formula. We hard-code the all-conditions-equal null vs per-condition-mean full.
Integer counts. DESeq2 is designed for integer RNA-seq counts; the NB likelihood here is evaluated with scipy.special.gammaln and is numerically correct for any non-negative float input. (This matches the common practice of passing fractional RSEM/salmon expected counts to DESeq2 via tximport, and is required here because bigWig region means are real-valued.)

Known calibration behavior. Our test suite verifies Type-I error at α=0.05 stays under 0.08 (~1.6× nominal) on Poisson nulls for K ∈ {3, 5, 8} and μ ∈ {30, 100, 500}, and under 0.09 (~1.8× nominal) on NB nulls for K ∈ {3, 5, 8}, μ = 100, α_true ∈ {0.05, 0.10}. At low μ (< ~5) the delta-method and the median-bias correction both degrade, and low-μ loci are excluded from dispersion estimation via --min-signal. If you care about exact FDR for a specific dataset, cross-validate against DESeq2.

Supply a mix of positive and negative loci. The size-factor and dispersion steps both assume that most loci are not differential — the "null majority" is what anchors the normalization and the variance estimate. If you run diff on a set of regions pre-filtered to be those you expect to change, you will get sub-optimal results: the estimated library-size differences will absorb real biological differences, and the dispersion estimate will be inflated by the true positives. For best results pass a genome-wide or background-matched set of loci containing both putative-differential and expected-stable regions.

Example (three conditions):

fertilizer diff -i signals.tsv -c A B C -o diff.tsv --q-threshold 0.05

Output diff.tsv (filtered, numbers illustrative):

chrom   start   end     A     B     C       effect_size     p_value     q_value     max_condition
chr1    0       500     2.0   2.1   8.4     2.07            0.0004      0.0032      C
chr3    1200    1700    15.2  3.6   4.1     2.08            0.0011      0.0060      A
chr7    900     1400    8.8   8.9   30.1    1.78            0.0018      0.0090      C

Python API

Both subcommands are thin wrappers around library functions. The same work can be done from Python:

import numpy as np
import pandas as pd

from fertilizer.extract import bigwig_region_means, load_regions, FertilizerWarning
from fertilizer.diff import (
    differential_analysis,
    size_factors,
    bh_qvalues,
    DiffResult,
    FertilizerDiffWarning,
)

# --- extract: region means for one bigWig -------------------------
regions = load_regions(["regions.bed"])                  # chrom/start/end
values, issues = bigwig_region_means(regions, "A.bw")    # np.ndarray, set[str]
# `issues` is a subset of {"missing_chrom", "out_of_bounds", "invalid_region"}

# --- diff: NB-GLM LRT on a (n_loci, n_conditions) array -----------
counts = pd.read_csv("signals.tsv", sep="\t")[["A", "B", "C"]].to_numpy(float)
res: DiffResult = differential_analysis(counts, fit_type="common")
# res.p_value, res.q_value, res.effect_size, res.size_factors,
# res.per_locus_dispersion, res.dispersion_fit, res.dispersion_trend,
# res.max_condition_idx  (all np.ndarray except the two scalars)

FertilizerWarning (locus-level issues from extract) and FertilizerDiffWarning (size-factor spread, Poisson fallbacks from diff) are both UserWarning subclasses — catch them with warnings.catch_warnings() or filter them with warnings.simplefilter(..., FertilizerWarning).

Project layout

fertilizer/
├── pyproject.toml             # package metadata + uv/hatchling build config
├── README.md
├── LICENSE
├── src/
│   └── fertilizer/
│       ├── __init__.py
│       ├── cli.py             # top-level argparse dispatcher
│       ├── extract.py         # signal aggregation + `extract` subcommand
│       └── diff.py            # differential-activity analysis + `diff` subcommand
└── tests/
    ├── test_cli.py
    └── test_diff.py

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

0.1.0

Apr 24, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

fertilizer_genomics-0.1.0.tar.gz (28.8 kB view details)

Uploaded Apr 24, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

fertilizer_genomics-0.1.0-py3-none-any.whl (21.7 kB view details)

Uploaded Apr 24, 2026 Python 3

File details

Details for the file fertilizer_genomics-0.1.0.tar.gz.

File metadata

Download URL: fertilizer_genomics-0.1.0.tar.gz
Upload date: Apr 24, 2026
Size: 28.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for fertilizer_genomics-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`47f9916ddb1b5ee90e78d2aca710719c2c49bd3e394b19d8bc40c3bf2bcb3340`
MD5	`5b9c9e27398affceafa1eb0761801796`
BLAKE2b-256	`85565686201fbf666e7662631f280ce764e056bfb83aa3dae28181c2b1e9754c`

See more details on using hashes here.

File details

Details for the file fertilizer_genomics-0.1.0-py3-none-any.whl.

File metadata

Download URL: fertilizer_genomics-0.1.0-py3-none-any.whl
Upload date: Apr 24, 2026
Size: 21.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for fertilizer_genomics-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`430ada73b9fca16c733446a9be244d61dea80bea398c93e413b8f3181868eb6c`
MD5	`c432149fb1141f38ffaf58017cb1a73d`
BLAKE2b-256	`bfb75ec5825cac4336cddf2bef8b2c0b112bb0c6a0c013261df674cf95fa1f54`

See more details on using hashes here.

fertilizer-genomics 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

fertilizer

Installation

Usage

`fertilizer extract` — signal aggregation

`fertilizer diff` — differential activity

Python API

Project layout

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes

fertilizer-genomics 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

fertilizer

Installation

Usage

fertilizer extract — signal aggregation

fertilizer diff — differential activity

Python API

Project layout

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes

`fertilizer extract` — signal aggregation

`fertilizer diff` — differential activity