
Predict key regulatory sequence elements and TF-binding features distinguishing any two gene sets


PlantSetDelta

PlantSetDelta (psd) is a command-line toolkit for identifying regulatory sequence features that distinguish any two (or more) gene sets across plant species. It builds a feature matrix from regulatory regions around transcription start sites (TSS) and transcription termination sites (TTS), trains classical ML models via PyCaret, and exports interpretable top features.

This repository provides:

  • k-mer features (fast C++ backend; optional Python fallback)
  • TF-binding features:
    • For Arabidopsis thaliana (ath): BED-peak–based TF-bin features (strict binning) and an additional sliding-window version (precomputed)
    • For other species: TF-bin features predicted by a DeeperDeepSEA model trained on Arabidopsis TF targets

Design principle: psd build is offline. All required models/resources must be downloaded in advance using psd download (on a login node with internet), or supplied explicitly via paths.




Installation

1) Recommended environment

  • Python 3.9–3.10
  • Linux recommended (C++ k-mer accelerator compiles automatically on first run)
  • Conda is strongly recommended for a clean environment
conda create -n psd python=3.10 -y
conda activate psd

2) Install PlantSetDelta

From pip:

pip install plantsetdelta

From source:

git clone https://github.com/bwang889/plantsetdelta.git
cd plantsetdelta
pip install .

3) External dependency (important)

bedtools is required if you use Arabidopsis strict TF features computed from BED peaks (or any workflow relying on pybedtools intersection).

Install via conda:

conda install -c bioconda bedtools -y

Data download and offline cache

PlantSetDelta uses a local cache directory for large files (parquet features, deep models, TF BED peaks). You control the cache directory via:

  • Environment variable: PSD_DATA_DIR

If not set, the package uses its internal plantsetdelta/data directory (not recommended for cluster usage).
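That resolution order can be sketched in a few lines; the fallback path below is illustrative, not the package's actual internal layout:

```python
import os
from pathlib import Path

def resolve_cache_dir() -> Path:
    """Return the directory used for large PSD files.

    PSD_DATA_DIR wins when set; otherwise fall back to a directory
    next to this script (a stand-in for the package-internal default).
    """
    env = os.environ.get("PSD_DATA_DIR")
    if env:
        return Path(env)
    # Not recommended on clusters: compute nodes may not share this path
    return Path(__file__).resolve().parent / "data"
```

On HPC, exporting PSD_DATA_DIR once in your job script keeps login and compute nodes pointed at the same cache.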

Recommended on HPC

Use a shared filesystem path so compute nodes can access the same files:

export PSD_DATA_DIR=/path/to/shared/psd_cache

Download precomputed data and models (online only)

Run these on a node with internet access:

# Precomputed features for built-in species
psd download --species ath
psd download --species bna
psd download --species osa
psd download --species zma

# DeeperDeepSEA models (for cross-species prediction / "other")
# Recommended: download all model variants needed for sliding/strict TF prediction
psd download --species other

# Arabidopsis TF BED peaks resource (only needed for BED-based strict TF recompute)
psd download --resource ath_tf_bed

Concepts: regulatory windows, bins, TF modes

Regulatory windows (default)

PlantSetDelta extracts regulatory regions around:

  • TSS: -1 kb upstream to +0.5 kb downstream
  • TTS: -0.5 kb upstream to +1 kb downstream

These are configurable in psd build via:

  • --tss-upstream, --tss-downstream
  • --tts-upstream, --tts-downstream

All interval values must be multiples of the bin size, which is fixed at 100 bp in this package.

Binning

Each regulatory region is split into 100 bp bins. For the default 1500 bp windows, you get 15 bins per region.

TF features are expressed as:

  • TFNAME_bin_1, TFNAME_bin_2, ..., TFNAME_bin_N
  • Then suffixed by region during merge:
    • ..._tss and ..._tts
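Combining the bin and suffix rules, the feature names for one region can be generated as below (the TF name ABF1 and the exact suffix order are illustrative):

```python
def bin_feature_names(tf_name, upstream, downstream, region, bin_size=100):
    """Names of strict-bin TF features for one regulatory region.

    The window spans upstream + downstream bp, split into fixed
    100 bp bins, so the default TSS window (1000 + 500) gives 15 bins.
    """
    total = upstream + downstream
    if total % bin_size != 0:
        raise ValueError("window length must be a multiple of the bin size")
    n_bins = total // bin_size
    return [f"{tf_name}_bin_{i}_{region}" for i in range(1, n_bins + 1)]

# Default TSS window: 1000 bp upstream + 500 bp downstream -> 15 bins
names = bin_feature_names("ABF1", 1000, 500, "tss")
```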

TF feature modes: strict vs sliding-window

This package supports two TF feature generation paradigms:

1) Strict bin (non-sliding)

Each bin corresponds exactly to the genomic 100 bp interval.

  • Strict TF features can be derived from BED peaks for Arabidopsis (true strict), or predicted with a center=100 DeeperDeepSEA model for non-ath species.

2) Sliding-window

The model output represents a window larger than the 100 bp input bin (e.g. centered prediction window), leading to a sliding effect across bins.

--sliding-window switch (build time)

psd build provides:

  • --sliding-window yes (default): use sliding-window TF features
  • --sliding-window no: use strict-bin TF features

Behavior by species:

  • --sliding-window yes:
    • ath: use precomputed ath sliding TF parquet
    • bna/osa/zma: use precomputed TF parquet (already sliding)
    • other: compute TF with the center=200 model (requires genome+gtf)
  • --sliding-window no:
    • ath: use precomputed ath strict TF parquet (or recompute via BED if needed)
    • bna/osa/zma: compute TF with the center=100 model (requires genome+gtf)
    • other: compute TF with the center=100 model (requires genome+gtf)

Note: k-mer features are independent of this TF mode switch.
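The behavior above boils down to a small dispatch; a sketch (the returned strings are shorthand descriptions, not real API values):

```python
def tf_feature_source(species: str, sliding: bool) -> str:
    """Condensed form of the species/mode behavior table."""
    precomputed = {"ath", "bna", "osa", "zma"}
    if sliding:
        if species in precomputed:
            return "precomputed sliding parquet"
        return "center=200 model (needs genome+gtf)"
    if species == "ath":
        return "precomputed strict parquet"
    return "center=100 model (needs genome+gtf)"
```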


Quick start

Scenario A (simplest): built-in species, default settings, sliding-window TF

  1. Download precomputed data (once):
export PSD_DATA_DIR=/path/to/shared/psd_cache
psd download --species ath
  2. Prepare label file labels.csv:
gene_id,label
AT1G01010,0
AT1G01020,1
AT1G01030,1
AT1G01040,0
  3. Build features:
psd build --species ath --label labels.csv -o out_ath
  4. Train:
psd train -d out_ath/train_data.csv -o out_ath --n-runs 10
  5. Export top features:
psd top -m out_ath/best_model.pkl -d out_ath/train_data.csv -o out_ath

Build: generate feature matrix

Command

psd build --species <ath|bna|osa|zma|other> --label labels.csv -o out_dir [options...]

Required inputs

  • --label: CSV with columns gene_id and label
  • If --features is not supplied, you must pass --species

Key options

TF mode switch (most important)

--sliding-window yes|no    # default: yes

Use precomputed data (built-in species)

If intervals are default and species is one of ath/bna/osa/zma, PlantSetDelta will try to use precomputed parquet data.

Changing any interval (--tss-upstream, etc.) triggers a recompute, which requires genome+gtf.

Recompute from genome + annotation

For recompute (or for species other), you must provide:

--genome-fa genome.fa
--gtf annotation.gff3   # accepts GFF3 or GTF but must contain gene features

Custom intervals

All must be multiples of 100:

--tss-upstream 1000 --tss-downstream 500
--tts-upstream 500  --tts-downstream 1000

k-mer size

--k 7   # allowed: 5, 6, 7
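For intuition, a k-mer feature is simply the count of each overlapping length-k substring in a regulatory sequence. A pure-Python sketch of what the C++ backend computes (raw counts; any normalization the real backend applies is not shown):

```python
from collections import Counter

def kmer_counts(seq: str, k: int) -> Counter:
    """Count overlapping k-mers in an uppercased DNA sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

counts = kmer_counts("ACGTACGTAC", k=5)  # 6 overlapping 5-mers
```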

Build examples (from simple to complex)

Example 1: Built-in species, default (sliding-window TF)

psd build --species osa --label labels.csv -o out_osa

Example 2: Arabidopsis strict TF (precomputed strict parquet)

psd build --species ath --label labels.csv -o out_ath_strict --sliding-window no

Example 3: Non-ath strict TF (forces center=100 model prediction)

This requires genome + annotation:

psd download --species other
psd build \
  --species bna \
  --label labels.csv \
  --sliding-window no \
  --genome-fa Brassica_napus.fa \
  --gtf zs11.gff3 \
  -o out_bna_strict

Example 4: Other species, default sliding-window TF (center=200 model)

psd download --species other
psd build \
  --species other \
  --label labels.csv \
  --genome-fa genome.fa \
  --gtf annotation.gff3 \
  -o out_other_sliding

Example 5: Other species, strict TF (center=100 model)

psd download --species other
psd build \
  --species other \
  --label labels.csv \
  --sliding-window no \
  --genome-fa genome.fa \
  --gtf annotation.gff3 \
  -o out_other_strict

Example 6: Custom regulatory windows (forces recompute)

psd build \
  --species ath \
  --label labels.csv \
  --tss-upstream 2000 --tss-downstream 500 \
  --tts-upstream 500 --tts-downstream 1500 \
  -o out_custom_intervals \
  --sliding-window yes \
  --genome-fa Arabidopsis.fa \
  --gtf Arabidopsis.gff3

Any custom interval requires genome+gtf because precomputed parquets only cover the default 1500 bp windows.

Example 7: Provide your own feature matrix (skip all automatic feature engineering)

If you already have a feature CSV with gene_id plus feature columns:

psd build --label labels.csv --features my_features.csv -o out_custom_features

Notes:

  • Your feature CSV must contain gene_id
  • The build step will merge labels + features, export raw_all_features*, then select top features (default logic)
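Roughly what that merge amounts to, sketched with pandas (the column handling is inferred from the train_data.csv layout described under Outputs, not taken from the package source):

```python
import pandas as pd

def merge_labels_features(labels_csv, features_csv) -> pd.DataFrame:
    """Inner-join labels and features on gene_id, then reorder so the
    label is the last column and gene_id is dropped (train_data.csv layout).
    """
    labels = pd.read_csv(labels_csv)    # columns: gene_id, label
    feats = pd.read_csv(features_csv)   # columns: gene_id, <features...>
    merged = feats.merge(labels, on="gene_id", how="inner")
    feature_cols = [c for c in merged.columns if c not in ("gene_id", "label")]
    return merged[feature_cols + ["label"]]
```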

Train: AutoML with PyCaret

Command

psd train -d out_dir/train_data.csv -o out_dir [options...]

What psd train does

For each run (seed), PlantSetDelta performs two evaluations:

  1. Training-set cross-validation (CV) comparison
    PyCaret runs k-fold cross-validation on the training partition and produces a model comparison table.
    The comparison is sorted by Balanced Accuracy (custom metric added via add_metric).

  2. Holdout test evaluation (PyCaret default split)
    Using PyCaret's default train/holdout split inside setup(), each candidate model is evaluated on the holdout set via predict_model(). A second comparison table is saved for the holdout test metrics.

This gives you both:

  • a CV-based view (robust to data splits), and
  • a holdout-based view (a quick sanity check on unseen data within the same dataset).

Note: The "test" results here refer to PyCaret's internal holdout set created by setup(), not an external test file.
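Balanced Accuracy is the unweighted mean of per-class recalls, which keeps a majority-class guesser from looking good on imbalanced gene sets. A minimal scikit-learn example (PyCaret registers the metric via add_metric):

```python
from sklearn.metrics import balanced_accuracy_score

y_true = [0, 0, 0, 0, 1]   # imbalanced labels
y_pred = [0, 0, 0, 0, 0]   # always predict the majority class

# recall(class 0) = 1.0, recall(class 1) = 0.0 -> balanced accuracy 0.5,
# even though plain accuracy would be 0.8
score = balanced_accuracy_score(y_true, y_pred)
```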

Key options

  • --n-runs: number of training runs (seeds 0..n_runs-1; default 10)
  • --n-jobs: number of CPU cores used by PyCaret (default 4)
  • --ml-models: restrict the model set (repeatable). Example:
psd train -d out/train_data.csv -o out --n-runs 10 --ml-models rf --ml-models xgboost

Use a broad model pool:

psd train -d out/train_data.csv -o out --n-runs 10 --ml-models all

Outputs

All outputs are written to --out-dir.

1) Feature engineering outputs (psd build)

Main outputs used for training

  • train_data.csv
    Final training matrix (label is last column; gene_id removed).

  • train_data_with_gene.csv
    Same as train_data.csv, but includes the gene_id column for traceability.

Raw exports (for transparency and reproducibility)

Depending on your build configuration (species, intervals, TSS/TTS enabled, --slim), you may see:

  • raw_kmer_features.csv, raw_tf_features.csv
    Merged k-mer and TF features, with gene_id.

  • raw_all_features.csv, raw_all_features_with_gene.csv
    Fully merged matrix (k-mer + TF + label), with and without gene_id.

  • If --slim is not used, intermediate matrices are exported as well:

    • raw_kmer_tss.csv, raw_kmer_tts.csv
    • raw_tf_tss.csv, raw_tf_tts.csv

Notes on built-in feature selection during build

During psd build, features are filtered to:

  • top_n = max(1, int(n_samples * 0.2))
  • by absolute Pearson correlation with label

This reduces dimensionality for classical ML training. If you want to change this behavior, modify the feature-selection section in plantsetdelta/cli.py.
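As a paraphrase of that rule (not the actual cli.py code), the filter looks like:

```python
import pandas as pd

def select_top_features(df: pd.DataFrame) -> pd.DataFrame:
    """Keep the top max(1, int(n_samples * 0.2)) feature columns,
    ranked by absolute Pearson correlation with the label column.
    """
    top_n = max(1, int(len(df) * 0.2))
    feats = df.drop(columns=["label"])
    corr = feats.corrwith(df["label"]).abs()
    keep = corr.sort_values(ascending=False).head(top_n).index
    return df[list(keep) + ["label"]]
```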

2) Training outputs (psd train)

psd train produces two sets of results: CV and holdout test (PyCaret default split). All are placed under --out-dir.

Model files

  • best_models/best_model_<i>.pkl
    Best model per run (seed), where i starts at 1.

  • best_model.pkl
    Final selected model across runs, chosen by CV Balanced Accuracy.

Result tables (CV vs test clearly separated)

Per-seed tables (stored under model_results/):

  • model_results/model_comparison_cv_batch_<seed>.csv
    Training-set cross-validation comparison for that seed.

  • model_results/model_comparison_test_batch_<seed>.csv
    Holdout test comparison for that seed.

Aggregated tables (stored at the root of --out-dir):

  • all_model_comparison_results_cv.csv
    Merged CV comparisons across all seeds.

  • all_model_comparison_results_test.csv
    Merged holdout test comparisons across all seeds.

Plots (CV vs test clearly separated)

  • balanced_acc_cv_plot.pdf, auc_cv_plot.pdf
    Distributions across seeds for CV metrics.

  • balanced_acc_test_plot.pdf, auc_test_plot.pdf
    Distributions across seeds for holdout test metrics.

3) Feature interpretation outputs (psd top)

  • top10_features.csv
    Top 10 features, aggregated across per-seed best models.

  • top10_lollipop.pdf
    Lollipop plot for the top 10 features.

Advanced recipes

1) Full offline workflow on HPC (recommended)

On login node (with internet):

export PSD_DATA_DIR=/share/psd_cache
psd download --species ath
psd download --species other
psd download --resource ath_tf_bed

On compute node (no internet):

export PSD_DATA_DIR=/share/psd_cache
psd build --species other --label labels.csv --genome-fa genome.fa --gtf anno.gff3 -o out
psd train -d out/train_data.csv -o out --n-runs 10
psd top -m out/best_model.pkl -d out/train_data.csv -o out

2) Reproducible comparison: sliding vs strict TF

# Sliding-window
psd build --species ath --label labels.csv -o out_slide --sliding-window yes

# Strict-bin
psd build --species ath --label labels.csv -o out_strict --sliding-window no

Train and compare models in each output directory.

3) Reduce runtime

  • Reduce --n-runs (e.g., 3–5) for quick iterations
  • Restrict --ml-models to a small subset
  • Consider filtering your label set to fewer genes for pilot runs

Troubleshooting

“Model not found … Please run psd download --species other”

You are trying to compute TF features for species other (or strict TF for non-ath species) but the DeeperDeepSEA model file is missing in PSD_DATA_DIR.

Fix:

psd download --species other

Ensure PSD_DATA_DIR points to the same directory on compute nodes.

“bedtools not found” / pybedtools errors

Install bedtools:

conda install -c bioconda bedtools -y

“No gene features found” / empty FASTA

Your annotation may not contain gene entries, or gene IDs are not in the expected attributes.

Fix:

  • Ensure GFF3/GTF includes gene lines
  • Ensure gene IDs are present in ID= (GFF3) or gene_id (GTF)
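A quick sanity check for which attribute style your annotation uses; this regex-based parser is illustrative, not the package's own:

```python
import re

def gene_id_from_attributes(attrs):
    """Extract a gene ID from a GFF3 'ID=...' or GTF 'gene_id "..."'
    attribute string; return None if neither style is present.
    """
    m = re.search(r"(?:^|;)ID=([^;]+)", attrs)
    if m:
        return m.group(1)
    m = re.search(r'gene_id "([^"]+)"', attrs)
    if m:
        return m.group(1)
    return None
```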

Contig name mismatch

If your genome FASTA uses chr1 but GFF uses 1, the extractor will skip everything.

Fix:

  • Normalize contig names between FASTA and GFF/GTF before running.
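If the only mismatch is a chr prefix, a one-pass rename is enough; a minimal sketch under that assumption:

```python
def normalize_contig(name: str) -> str:
    """Strip a leading 'chr'/'Chr' so 'chr1', 'Chr1' and '1' all compare equal."""
    n = name.strip()
    if n.lower().startswith("chr"):
        n = n[3:]
    return n
```

Apply the same function to the FASTA headers and to the first column of the GFF/GTF before running the extractor.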

TF feature matrix seems shifted / window definition confusion

This is exactly why --sliding-window exists. Use:

  • --sliding-window no for strict bin semantics
  • --sliding-window yes for sliding-window semantics

Acknowledgements

  • Selene SDK
  • PyCaret
  • Biopython
  • bedtools / pybedtools

Project details


Download files

Download the file for your platform.

Source Distribution

plantsetdelta-0.1.0.tar.gz (41.5 kB)

Uploaded Source

Built Distribution


plantsetdelta-0.1.0-py3-none-any.whl (40.4 kB)

Uploaded Python 3

File details

Details for the file plantsetdelta-0.1.0.tar.gz.

File metadata

  • Download URL: plantsetdelta-0.1.0.tar.gz
  • Upload date:
  • Size: 41.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.19

File hashes

Hashes for plantsetdelta-0.1.0.tar.gz

Algorithm   | Hash digest
SHA256      | c56bb64b5bac98ec106857449153532f3e616bd35dc241297672f78d844f68df
MD5         | b8ef2f3a0a6cd565042b7fe50a373eeb
BLAKE2b-256 | bc8f993ec09bc5ec55a48274da7a6167e854bffe5eb8a97ea38e2d6cc5740f1d


File details

Details for the file plantsetdelta-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: plantsetdelta-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 40.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.19

File hashes

Hashes for plantsetdelta-0.1.0-py3-none-any.whl

Algorithm   | Hash digest
SHA256      | edebf12b532baa32d71e729c5bdcdac562e1e2e1a4b662556b6841b283df0434
MD5         | 4133dfca315c15aa1d747592386299f6
BLAKE2b-256 | 2acc71f0a8cfa5c10ef2d4daad7d78550d760d9ca7978d3be806281fa52b5ed5

