
Predict key regulatory sequence elements and TF-binding features distinguishing any two gene sets


PlantSetDelta

PlantSetDelta (psd) is a command-line toolkit for identifying regulatory sequence features that distinguish any two (or more) gene sets across plant species. It builds a feature matrix from regulatory regions around transcription start sites (TSS) and transcription termination sites (TTS), trains classical ML models via PyCaret, and exports interpretable top features.

This repository provides:

  • k-mer features (fast C++ backend; optional Python fallback)
  • TF-binding features:
    • For Arabidopsis thaliana (ath): BED-peak–based TF-bin features (strict binning) and an additional sliding-window version (precomputed)
    • For other species: TF-bin features predicted by a DeeperDeepSEA model trained on Arabidopsis TF targets

Design principle: psd build is offline. All required models/resources must be downloaded in advance using psd download (on a login node with internet), or supplied explicitly via paths.




Installation

1) Recommended environment

  • Python 3.9–3.10
  • Linux recommended (C++ k-mer accelerator compiles automatically on first run)
  • Conda is strongly recommended for a clean environment
conda create -n psd python=3.10 -y
conda activate psd

2) Install PlantSetDelta

From pip:

pip install plantsetdelta

From source:

git clone https://github.com/bwang889/plantsetdelta.git
cd plantsetdelta
pip install .

3) External dependency (important)

bedtools is required if you use Arabidopsis strict TF features computed from BED peaks (or any workflow relying on pybedtools intersection).

Install via conda:

conda install -c bioconda bedtools -y

Data download and offline cache

PlantSetDelta uses a local cache directory for large files (parquet features, deep models, TF BED peaks). You control the cache directory via:

  • Environment variable: PSD_DATA_DIR

If not set, the package uses its internal plantsetdelta/data directory (not recommended for cluster usage).
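That resolution order can be sketched in a few lines; the fallback path below is illustrative, not the package's actual internal layout:

```python
import os
from pathlib import Path

def resolve_cache_dir() -> Path:
    """Return the directory used for large PSD files.

    PSD_DATA_DIR wins when set; otherwise fall back to a directory
    next to this script (a stand-in for the package-internal default).
    """
    env = os.environ.get("PSD_DATA_DIR")
    if env:
        return Path(env)
    # Not recommended on clusters: compute nodes may not share this path
    return Path(__file__).resolve().parent / "data"
```

On HPC, exporting PSD_DATA_DIR once in your job script keeps login and compute nodes pointed at the same cache.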

Recommended on HPC

Use a shared filesystem path so compute nodes can access the same files:

export PSD_DATA_DIR=/path/to/shared/psd_cache

Download precomputed data and models (online only)

Run these on a node with internet access:

# Precomputed features for built-in species
psd download --species ath
psd download --species bna
psd download --species osa
psd download --species zma

# DeeperDeepSEA models (for cross-species prediction / "other")
# Recommended: download all model variants needed for sliding/strict TF prediction
psd download --species other

# Arabidopsis TF BED peaks resource (only needed for BED-based strict TF recompute)
psd download --resource ath_tf_bed

Concepts: regulatory windows, bins, TF modes

Regulatory windows (default)

PlantSetDelta extracts regulatory regions around:

  • TSS: -1 kb upstream to +0.5 kb downstream
  • TTS: -0.5 kb upstream to +1 kb downstream

These are configurable in psd build via:

  • --tss-upstream, --tss-downstream
  • --tts-upstream, --tts-downstream

All interval values must be multiples of the bin size, which is fixed at 100 bp in this package.

Binning

Each regulatory region is split into 100 bp bins. For the default 1500 bp windows, you get 15 bins per region.

TF features are expressed as:

  • TFNAME_bin_1, TFNAME_bin_2, ..., TFNAME_bin_N
  • Then suffixed by region during merge:
    • ..._tss and ..._tts
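Combining the bin and suffix rules, the feature names for one region can be generated as below (the TF name ABF1 and the exact suffix order are illustrative):

```python
def bin_feature_names(tf_name, upstream, downstream, region, bin_size=100):
    """Names of strict-bin TF features for one regulatory region.

    The window spans upstream + downstream bp, split into fixed
    100 bp bins, so the default TSS window (1000 + 500) gives 15 bins.
    """
    total = upstream + downstream
    if total % bin_size != 0:
        raise ValueError("window length must be a multiple of the bin size")
    n_bins = total // bin_size
    return [f"{tf_name}_bin_{i}_{region}" for i in range(1, n_bins + 1)]

# Default TSS window: 1000 bp upstream + 500 bp downstream -> 15 bins
names = bin_feature_names("ABF1", 1000, 500, "tss")
```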

TF feature modes: strict vs sliding-window

This package supports two TF feature generation paradigms:

1) Strict bin (non-sliding)

Each bin corresponds exactly to the genomic 100 bp interval.

  • Strict TF features can be derived from BED peaks for Arabidopsis (true strict), or predicted with a center=100 DeeperDeepSEA model for non-ath species.

2) Sliding-window

The model output represents a window larger than the 100 bp input bin (e.g. centered prediction window), leading to a sliding effect across bins.

--sliding-window switch (build time)

psd build provides:

  • --sliding-window yes (default): use sliding-window TF features
  • --sliding-window no: use strict-bin TF features

Behavior by species:

  • --sliding-window yes:
    • ath: use precomputed ath sliding TF parquet
    • bna/osa/zma: use precomputed TF parquet (already sliding)
    • other: compute TF with the center=200 model (requires genome+gtf)
  • --sliding-window no:
    • ath: use precomputed ath strict TF parquet (or recompute via BED if needed)
    • bna/osa/zma: compute TF with the center=100 model (requires genome+gtf)
    • other: compute TF with the center=100 model (requires genome+gtf)

Note: k-mer features are independent of this TF mode switch.
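The behavior above boils down to a small dispatch; a sketch (the returned strings are shorthand descriptions, not real API values):

```python
def tf_feature_source(species: str, sliding: bool) -> str:
    """Condensed form of the species/mode behavior table."""
    precomputed = {"ath", "bna", "osa", "zma"}
    if sliding:
        if species in precomputed:
            return "precomputed sliding parquet"
        return "center=200 model (needs genome+gtf)"
    if species == "ath":
        return "precomputed strict parquet"
    return "center=100 model (needs genome+gtf)"
```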


Quick start

Scenario A (simplest): built-in species, default settings, sliding-window TF

  1. Download precomputed data (once):
export PSD_DATA_DIR=/path/to/shared/psd_cache
psd download --species ath
  2. Prepare label file labels.csv:
gene_id,label
AT1G01010,0
AT1G01020,1
AT1G01030,1
AT1G01040,0
  3. Build features:
psd build --species ath --label labels.csv -o out_ath
  4. Train:
psd train -d out_ath/train_data.csv -o out_ath --n-runs 10
  5. Export top features:
psd top -m out_ath/best_model.pkl -d out_ath/train_data.csv -o out_ath

Build: generate feature matrix

Command

psd build --species <ath|bna|osa|zma|other> --label labels.csv -o out_dir [options...]

Required inputs

  • --label: CSV with columns gene_id and label
  • If --features is not supplied, you must pass --species

Key options

TF mode switch (most important)

--sliding-window yes|no    # default: yes

Use precomputed data (built-in species)

If intervals are default and species is one of ath/bna/osa/zma, PlantSetDelta will try to use precomputed parquet data.

Changing any interval (--tss-upstream, etc.) triggers a recompute, which requires genome+gtf.

Recompute from genome + annotation

For recompute (or for species other), you must provide:

--genome-fa genome.fa
--gtf annotation.gff3   # accepts GFF3 or GTF but must contain gene features

Custom intervals

All must be multiples of 100:

--tss-upstream 1000 --tss-downstream 500
--tts-upstream 500  --tts-downstream 1000

k-mer size

--k 7   # allowed: 5, 6, 7
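For intuition, a k-mer feature is simply the count of each overlapping length-k substring in a regulatory sequence. A pure-Python sketch of what the C++ backend computes (raw counts; any normalization the real backend applies is not shown):

```python
from collections import Counter

def kmer_counts(seq: str, k: int) -> Counter:
    """Count overlapping k-mers in an uppercased DNA sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

counts = kmer_counts("ACGTACGTAC", k=5)  # 6 overlapping 5-mers
```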

Build examples (from simple to complex)

Example 1: Built-in species, default (sliding-window TF)

psd build --species osa --label labels.csv -o out_osa

Example 2: Arabidopsis strict TF (precomputed strict parquet)

psd build --species ath --label labels.csv -o out_ath_strict --sliding-window no

Example 3: Non-ath strict TF (forces center=100 model prediction)

This requires genome + annotation:

psd download --species other
psd build \
  --species bna \
  --label labels.csv \
  --sliding-window no \
  --genome-fa Brassica_napus.fa \
  --gtf zs11.gff3 \
  -o out_bna_strict

Example 4: Other species, default sliding-window TF (center=200 model)

psd download --species other
psd build \
  --species other \
  --label labels.csv \
  --genome-fa genome.fa \
  --gtf annotation.gff3 \
  -o out_other_sliding

Example 5: Other species, strict TF (center=100 model)

psd download --species other
psd build \
  --species other \
  --label labels.csv \
  --sliding-window no \
  --genome-fa genome.fa \
  --gtf annotation.gff3 \
  -o out_other_strict

Example 6: Custom regulatory windows (forces recompute)

psd build \
  --species ath \
  --label labels.csv \
  --tss-upstream 2000 --tss-downstream 500 \
  --tts-upstream 500 --tts-downstream 1500 \
  -o out_custom_intervals \
  --sliding-window yes \
  --genome-fa Arabidopsis.fa \
  --gtf Arabidopsis.gff3

Any custom interval requires genome+gtf because precomputed parquets only cover the default 1500 bp windows.

Example 7: Provide your own feature matrix (skip all automatic feature engineering)

If you already have a feature CSV with gene_id plus feature columns:

psd build --label labels.csv --features my_features.csv -o out_custom_features

Notes:

  • Your feature CSV must contain gene_id
  • The build step will merge labels + features, export raw_all_features*, then select top features (default logic)
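Roughly what that merge amounts to, sketched with pandas (the column handling is inferred from the train_data.csv layout described under Outputs, not taken from the package source):

```python
import pandas as pd

def merge_labels_features(labels_csv, features_csv) -> pd.DataFrame:
    """Inner-join labels and features on gene_id, then reorder so the
    label is the last column and gene_id is dropped (train_data.csv layout).
    """
    labels = pd.read_csv(labels_csv)    # columns: gene_id, label
    feats = pd.read_csv(features_csv)   # columns: gene_id, <features...>
    merged = feats.merge(labels, on="gene_id", how="inner")
    feature_cols = [c for c in merged.columns if c not in ("gene_id", "label")]
    return merged[feature_cols + ["label"]]
```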

Train: AutoML with PyCaret

Command

psd train -d out_dir/train_data.csv -o out_dir [options...]

What psd train does

For each run (seed), PlantSetDelta performs two evaluations:

  1. Training-set cross-validation (CV) comparison
    PyCaret runs k-fold cross-validation on the training partition and produces a model comparison table.
    The comparison is sorted by Balanced Accuracy (custom metric added via add_metric).

  2. Holdout test evaluation (PyCaret default split)
    Using PyCaret's default train/holdout split inside setup(), each candidate model is evaluated on the holdout set via predict_model(). A second comparison table is saved for the holdout test metrics.

This gives you both:

  • a CV-based view (robust to data splits), and
  • a holdout-based view (a quick sanity check on unseen data within the same dataset).

Note: The "test" results here refer to PyCaret's internal holdout set created by setup(), not an external test file.
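Balanced Accuracy is the unweighted mean of per-class recalls, which keeps a majority-class guesser from looking good on imbalanced gene sets. A minimal scikit-learn example (PyCaret registers the metric via add_metric):

```python
from sklearn.metrics import balanced_accuracy_score

y_true = [0, 0, 0, 0, 1]   # imbalanced labels
y_pred = [0, 0, 0, 0, 0]   # always predict the majority class

# recall(class 0) = 1.0, recall(class 1) = 0.0 -> balanced accuracy 0.5,
# even though plain accuracy would be 0.8
score = balanced_accuracy_score(y_true, y_pred)
```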

Key options

  • --n-runs: number of training runs (seeds 0..n_runs-1; default 10)
  • --n-jobs: number of CPU cores used by PyCaret (default 4)
  • --ml-models: restrict the model set (repeatable). Example:
psd train -d out/train_data.csv -o out --n-runs 10 --ml-models rf --ml-models xgboost

Use a broad model pool:

psd train -d out/train_data.csv -o out --n-runs 10 --ml-models all

Outputs

All outputs are written to --out-dir.

1) Feature engineering outputs (psd build)

Main outputs used for training

  • train_data.csv
    Final training matrix (label is last column; gene_id removed).

  • train_data_with_gene.csv
    Same as train_data.csv, but includes the gene_id column for traceability.

Raw exports (for transparency and reproducibility)

Depending on your build configuration (species, intervals, TSS/TTS enabled, --slim), you may see:

  • raw_kmer_features.csv, raw_tf_features.csv
    Merged k-mer and TF features, with gene_id.

  • raw_all_features.csv, raw_all_features_with_gene.csv
    Fully merged matrix (k-mer + TF + label), with and without gene_id.

  • If --slim is not used, intermediate matrices are exported as well:

    • raw_kmer_tss.csv, raw_kmer_tts.csv
    • raw_tf_tss.csv, raw_tf_tts.csv

Notes on built-in feature selection during build

During psd build, features are filtered to:

  • top_n = max(1, int(n_samples * 0.2))
  • by absolute Pearson correlation with label

This reduces dimensionality for classical ML training. If you want to change this behavior, modify the feature-selection section in plantsetdelta/cli.py.
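As a paraphrase of that rule (not the actual cli.py code), the filter looks like:

```python
import pandas as pd

def select_top_features(df: pd.DataFrame) -> pd.DataFrame:
    """Keep the top max(1, int(n_samples * 0.2)) feature columns,
    ranked by absolute Pearson correlation with the label column.
    """
    top_n = max(1, int(len(df) * 0.2))
    feats = df.drop(columns=["label"])
    corr = feats.corrwith(df["label"]).abs()
    keep = corr.sort_values(ascending=False).head(top_n).index
    return df[list(keep) + ["label"]]
```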

2) Training outputs (psd train)

psd train produces two sets of results: CV and holdout test (PyCaret default split). All are placed under --out-dir.

Model files

  • best_models/best_model_<i>.pkl
    Best model per run (seed), where i starts at 1.

  • best_model.pkl
    Final selected model across runs, chosen by CV Balanced Accuracy.

Result tables (CV vs test clearly separated)

Per-seed tables (stored under model_results/):

  • model_results/model_comparison_cv_batch_<seed>.csv
    Training-set cross-validation comparison for that seed.

  • model_results/model_comparison_test_batch_<seed>.csv
    Holdout test comparison for that seed.

Aggregated tables (stored at the root of --out-dir):

  • all_model_comparison_results_cv.csv
    Merged CV comparisons across all seeds.

  • all_model_comparison_results_test.csv
    Merged holdout test comparisons across all seeds.

Plots (CV vs test clearly separated)

  • balanced_acc_cv_plot.pdf, auc_cv_plot.pdf
    Distributions across seeds for CV metrics.

  • balanced_acc_test_plot.pdf, auc_test_plot.pdf
    Distributions across seeds for holdout test metrics.

3) Feature interpretation outputs (psd top)

  • top10_features.csv
    Top 10 features, aggregated across per-seed best models.

  • top10_lollipop.pdf
    Lollipop plot for the top 10 features.

Advanced recipes

1) Full offline workflow on HPC (recommended)

On login node (with internet):

export PSD_DATA_DIR=/share/psd_cache
psd download --species ath
psd download --species other
psd download --resource ath_tf_bed

On compute node (no internet):

export PSD_DATA_DIR=/share/psd_cache
psd build --species other --label labels.csv --genome-fa genome.fa --gtf anno.gff3 -o out
psd train -d out/train_data.csv -o out --n-runs 10
psd top -m out/best_model.pkl -d out/train_data.csv -o out

2) Reproducible comparison: sliding vs strict TF

# Sliding-window
psd build --species ath --label labels.csv -o out_slide --sliding-window yes

# Strict-bin
psd build --species ath --label labels.csv -o out_strict --sliding-window no

Train and compare models in each output directory.

3) Reduce runtime

  • Reduce --n-runs (e.g., 3–5) for quick iterations
  • Restrict --ml-models to a small subset
  • Consider filtering your label set to fewer genes for pilot runs

Troubleshooting

“Model not found … Please run psd download --species other”

You are trying to compute TF features for species other (or strict TF for non-ath species) but the DeeperDeepSEA model file is missing in PSD_DATA_DIR.

Fix:

psd download --species other

Ensure PSD_DATA_DIR points to the same directory on compute nodes.

“bedtools not found” / pybedtools errors

Install bedtools:

conda install -c bioconda bedtools -y

“No gene features found” / empty FASTA

Your annotation may not contain gene entries, or gene IDs are not in the expected attributes.

Fix:

  • Ensure GFF3/GTF includes gene lines
  • Ensure gene IDs are present in ID= (GFF3) or gene_id (GTF)
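A quick sanity check for which attribute style your annotation uses; this regex-based parser is illustrative, not the package's own:

```python
import re

def gene_id_from_attributes(attrs):
    """Extract a gene ID from a GFF3 'ID=...' or GTF 'gene_id "..."'
    attribute string; return None if neither style is present.
    """
    m = re.search(r"(?:^|;)ID=([^;]+)", attrs)
    if m:
        return m.group(1)
    m = re.search(r'gene_id "([^"]+)"', attrs)
    if m:
        return m.group(1)
    return None
```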

Contig name mismatch

If your genome FASTA uses chr1 but GFF uses 1, the extractor will skip everything.

Fix:

  • Normalize contig names between FASTA and GFF/GTF before running.
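If the only mismatch is a chr prefix, a one-pass rename is enough; a minimal sketch under that assumption:

```python
def normalize_contig(name: str) -> str:
    """Strip a leading 'chr'/'Chr' so 'chr1', 'Chr1' and '1' all compare equal."""
    n = name.strip()
    if n.lower().startswith("chr"):
        n = n[3:]
    return n
```

Apply the same function to the FASTA headers and to the first column of the GFF/GTF before running the extractor.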

TF feature matrix seems shifted / window definition confusion

This is exactly why --sliding-window exists. Use:

  • --sliding-window no for strict bin semantics
  • --sliding-window yes for sliding-window semantics

Acknowledgements

  • Selene SDK
  • PyCaret
  • Biopython
  • bedtools / pybedtools

Project details


Download files

Download the file for your platform.

Source Distribution

plantsetdelta-0.1.0.tar.gz (41.5 kB)

Uploaded Source

Built Distribution


plantsetdelta-0.1.0-py3-none-any.whl (40.4 kB)

Uploaded Python 3

File details

Details for the file plantsetdelta-0.1.0.tar.gz.

File metadata

  • Download URL: plantsetdelta-0.1.0.tar.gz
  • Upload date:
  • Size: 41.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.19

File hashes

Hashes for plantsetdelta-0.1.0.tar.gz

Algorithm   | Hash digest
SHA256      | c56bb64b5bac98ec106857449153532f3e616bd35dc241297672f78d844f68df
MD5         | b8ef2f3a0a6cd565042b7fe50a373eeb
BLAKE2b-256 | bc8f993ec09bc5ec55a48274da7a6167e854bffe5eb8a97ea38e2d6cc5740f1d


File details

Details for the file plantsetdelta-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: plantsetdelta-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 40.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.19

File hashes

Hashes for plantsetdelta-0.1.0-py3-none-any.whl

Algorithm   | Hash digest
SHA256      | edebf12b532baa32d71e729c5bdcdac562e1e2e1a4b662556b6841b283df0434
MD5         | 4133dfca315c15aa1d747592386299f6
BLAKE2b-256 | 2acc71f0a8cfa5c10ef2d4daad7d78550d760d9ca7978d3be806281fa52b5ed5

