PlantSetDelta
Predict key regulatory sequence elements and TF-binding features distinguishing any two gene sets.
PlantSetDelta (psd) is a command-line toolkit for identifying regulatory sequence features that distinguish any two (or multiple) gene sets across plant species. It builds a feature matrix from regulatory regions around TSS and TTS, trains classical ML models via PyCaret, and exports interpretable top features.
This repository provides:
- k-mer features (fast C++ backend; optional Python fallback)
- TF-binding features:
  - For Arabidopsis thaliana (ath): BED-peak–based TF-bin features (strict binning) and an additional sliding-window version (precomputed)
  - For other species: TF-bin features predicted by a DeeperDeepSEA model trained on Arabidopsis TF targets
Design principle:
`psd build` is offline. All required models/resources must be downloaded in advance using `psd download` (on a login node with internet access), or supplied explicitly via paths.
Contents
- Installation
- Data download and offline cache
- Concepts: regulatory windows, bins, TF modes
- Quick start
- Build: generate feature matrix
- Train: AutoML with PyCaret
- Top: interpret model and export top features
- Input formats and requirements
- Outputs
- Advanced recipes
- Troubleshooting
Installation
1) Recommended environment
- Python 3.9–3.10
- Linux recommended (C++ k-mer accelerator compiles automatically on first run)
- Conda is strongly recommended for a clean environment
conda create -n psd python=3.10 -y
conda activate psd
2) Install PlantSetDelta
From pip:
pip install plantsetdelta
From source:
git clone https://github.com/bwang889/plantsetdelta.git
cd plantsetdelta
pip install .
3) External dependency (important)
bedtools is required if you use Arabidopsis strict TF features computed from BED peaks (or any workflow relying on pybedtools intersection).
Install via conda:
conda install -c bioconda bedtools -y
Data download and offline cache
PlantSetDelta uses a local cache directory for large files (parquet features, deep models, TF BED peaks). You control the cache directory via:
- Environment variable:
PSD_DATA_DIR
If not set, the package uses its internal plantsetdelta/data directory (not recommended for cluster usage).
Recommended on HPC
Use a shared filesystem path so compute nodes can access the same files:
export PSD_DATA_DIR=/path/to/shared/psd_cache
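The lookup order described above can be sketched as follows (`resolve_cache_dir` and the default path are illustrative names, not the package's actual internals):

```python
import os

def resolve_cache_dir(default_pkg_dir="plantsetdelta/data"):
    """Return PSD_DATA_DIR if set, else the package-internal data dir."""
    return os.environ.get("PSD_DATA_DIR", default_pkg_dir)

# With the environment variable exported, the shared path wins
os.environ["PSD_DATA_DIR"] = "/path/to/shared/psd_cache"
cache = resolve_cache_dir()  # "/path/to/shared/psd_cache"
```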
Download precomputed data and models (online only)
Run these on a node with internet access:
# Precomputed features for built-in species
psd download --species ath
psd download --species bna
psd download --species osa
psd download --species zma
# DeeperDeepSEA models (for cross-species prediction / "other")
# Recommended: download all model variants needed for sliding/strict TF prediction
psd download --species other
# Arabidopsis TF BED peaks resource (only needed for BED-based strict TF recompute)
psd download --resource ath_tf_bed
Concepts: regulatory windows, bins, TF modes
Regulatory windows (default)
PlantSetDelta extracts regulatory regions around:
- TSS: `-1 kb` upstream to `+0.5 kb` downstream
- TTS: `-0.5 kb` upstream to `+1 kb` downstream
These are configurable in psd build via:
`--tss-upstream`, `--tss-downstream`, `--tts-upstream`, `--tts-downstream`
All values must be multiples of the bin size (`BIN_SIZE`, fixed at 100 bp in this package).
Binning
Each regulatory region is split into 100 bp bins. For the default 1500 bp windows, you get 15 bins per region.
TF features are expressed as:
- `TFNAME_bin_1`, `TFNAME_bin_2`, ..., `TFNAME_bin_N`
- Then suffixed by region during merge: `..._tss` and `..._tts`
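The bin arithmetic and column naming above can be sketched like this (the TF name `MYB44` is an illustrative placeholder):

```python
def n_bins(upstream, downstream, bin_size=100):
    """Number of fixed-size bins covering one regulatory window."""
    total = upstream + downstream
    assert total % bin_size == 0, "window length must be a multiple of the bin size"
    return total // bin_size

# Default TSS window: 1000 bp upstream + 500 bp downstream = 1500 bp -> 15 bins
tss_bins = n_bins(1000, 500)

# Column names for one TF in the TSS region, after the merge-time suffix
cols = [f"MYB44_bin_{i}_tss" for i in range(1, tss_bins + 1)]
```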
TF feature modes: strict vs sliding-window
This package supports two TF feature generation paradigms:
1) Strict bin (non-sliding)
Each bin corresponds exactly to the genomic 100 bp interval.
- Arabidopsis strict TF can be derived from BED peaks (true strict) or from a center=100 DeeperDeepSEA model (for non-ath species strict prediction).
2) Sliding-window
The model output represents a window larger than the 100 bp input bin (e.g. centered prediction window), leading to a sliding effect across bins.
--sliding-window switch (build time)
psd build provides:
- `--sliding-window yes` (default): use sliding-window TF features
- `--sliding-window no`: use strict-bin TF features
Behavior by species:
| TF mode | ath | bna/osa/zma | other |
|---|---|---|---|
| sliding-window = yes | use precomputed ath sliding TF parquet | use precomputed TF parquet (already sliding) | compute TF with center=200 model (requires genome+gtf) |
| sliding-window = no | use precomputed ath strict TF parquet (or recompute via BED if needed) | compute TF with center=100 model (requires genome+gtf) | compute TF with center=100 model (requires genome+gtf) |
Note: k-mer features are independent of this TF mode switch.
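The behavior table can be condensed into a small lookup (a sketch that mirrors the table above, not the package's internal dispatch):

```python
def tf_mode_action(species, sliding_window=True):
    """Return which TF-feature source the behavior table prescribes."""
    precomputed = {"ath", "bna", "osa", "zma"}
    if sliding_window:
        if species in precomputed:
            return "precomputed sliding TF parquet"
        return "center=200 model prediction (requires genome+gtf)"
    if species == "ath":
        return "precomputed strict TF parquet"
    return "center=100 model prediction (requires genome+gtf)"
```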
Quick start
Scenario A (simplest): built-in species, default settings, sliding-window TF
- Download precomputed data (once):
export PSD_DATA_DIR=/path/to/shared/psd_cache
psd download --species ath
- Prepare label file
labels.csv:
gene_id,label
AT1G01010,0
AT1G01020,1
AT1G01030,1
AT1G01040,0
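Before building, the label file can be sanity-checked with pandas (a minimal sketch; the check itself is not part of psd):

```python
import io
import pandas as pd

csv_text = """gene_id,label
AT1G01010,0
AT1G01020,1
AT1G01030,1
AT1G01040,0
"""
labels = pd.read_csv(io.StringIO(csv_text))

# The two required columns, with binary labels for a two-set comparison
assert list(labels.columns) == ["gene_id", "label"]
assert labels["label"].isin([0, 1]).all()
```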
- Build features:
psd build --species ath --label labels.csv -o out_ath
- Train:
psd train -d out_ath/train_data.csv -o out_ath --n-runs 10
- Export top features:
psd top -m out_ath/best_model.pkl -d out_ath/train_data.csv -o out_ath
Build: generate feature matrix
Command
psd build --species <ath|bna|osa|zma|other> --label labels.csv -o out_dir [options...]
Required inputs
- `--label`: CSV with columns `gene_id` and `label`
- If `--features` is not supplied, you must pass `--species`
Key options
TF mode switch (most important)
--sliding-window yes|no # default: yes
Use precomputed data (built-in species)
If intervals are default and species is one of ath/bna/osa/zma, PlantSetDelta will try to use precomputed parquet data.
If you change any interval (--tss-upstream, etc.), it triggers recompute (requires genome+gtf).
Recompute from genome + annotation
For recompute (or for species other), you must provide:
--genome-fa genome.fa
--gtf annotation.gff3 # accepts GFF3 or GTF but must contain gene features
Custom intervals
All must be multiples of 100:
--tss-upstream 1000 --tss-downstream 500
--tts-upstream 500 --tts-downstream 1000
k-mer size
--k 7 # allowed: 5, 6, 7
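As a sketch of what one k-mer feature counts (assuming simple overlapping forward-strand counting; the package's C++ backend may handle strands and normalization differently):

```python
from collections import Counter

def kmer_counts(seq, k=5):
    """Count overlapping k-mers on the forward strand."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

# A 10 bp sequence yields 10 - 5 + 1 = 6 overlapping 5-mers
counts = kmer_counts("ATGCGATGCA", k=5)
```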
Build examples (from simple to complex)
Example 1: Built-in species, default (sliding-window TF)
psd build --species osa --label labels.csv -o out_osa
Example 2: Arabidopsis strict TF (precomputed strict parquet)
psd build --species ath --label labels.csv -o out_ath_strict --sliding-window no
Example 3: Non-ath strict TF (forces center=100 model prediction)
This requires genome + annotation:
psd download --species other
psd build \
--species bna \
--label labels.csv \
--sliding-window no \
--genome-fa Brassica_napus.fa \
--gtf zs11.gff3 \
-o out_bna_strict
Example 4: Other species, default sliding-window TF (center=200 model)
psd download --species other
psd build \
--species other \
--label labels.csv \
--genome-fa genome.fa \
--gtf annotation.gff3 \
-o out_other_sliding
Example 5: Other species, strict TF (center=100 model)
psd download --species other
psd build \
--species other \
--label labels.csv \
--sliding-window no \
--genome-fa genome.fa \
--gtf annotation.gff3 \
-o out_other_strict
Example 6: Custom regulatory windows (forces recompute)
psd build \
--species ath \
--label labels.csv \
--tss-upstream 2000 --tss-downstream 500 \
--tts-upstream 500 --tts-downstream 1500 \
-o out_custom_intervals \
--sliding-window yes \
--genome-fa Arabidopsis.fa \
--gtf Arabidopsis.gff3
Any custom interval requires genome+gtf because precomputed parquets only cover the default 1500 bp windows.
Example 7: Provide your own feature matrix (skip all automatic feature engineering)
If you already have a feature CSV with gene_id plus feature columns:
psd build --label labels.csv --features my_features.csv -o out_custom_features
Notes:
- Your feature CSV must contain `gene_id`
- The build step will merge labels + features, export `raw_all_features*`, then select top features (default logic)
Train: AutoML with PyCaret
Command
psd train -d out_dir/train_data.csv -o out_dir [options...]
What psd train does
For each run (seed), PlantSetDelta performs two evaluations:
1. Training-set cross-validation (CV) comparison
   PyCaret runs k-fold cross-validation on the training partition and produces a model comparison table. The comparison is sorted by Balanced Accuracy (a custom metric added via `add_metric`).
2. Holdout test evaluation (PyCaret default split)
   Using PyCaret's default train/holdout split inside `setup()`, each candidate model is evaluated on the holdout set via `predict_model()`. A second comparison table is saved for the holdout test metrics.
This gives you both:
- a CV-based view (robust to data splits), and
- a holdout-based view (a quick sanity check on unseen data within the same dataset).
Note: The "test" results here refer to PyCaret's internal holdout set created by `setup()`, not an external test file.
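Conceptually, the CV-vs-holdout split looks like the following scikit-learn sketch (psd itself uses PyCaret; the classifier and dataset here are stand-ins):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Internal train/holdout split (analogous to PyCaret's setup() split)
X_tr, X_ho, y_tr, y_ho = train_test_split(X, y, test_size=0.3, random_state=0)

clf = RandomForestClassifier(random_state=0)

# 1) CV view: k-fold Balanced Accuracy on the training partition
cv_scores = cross_val_score(clf, X_tr, y_tr, cv=5, scoring="balanced_accuracy")

# 2) Holdout view: one score on the untouched split
clf.fit(X_tr, y_tr)
holdout = balanced_accuracy_score(y_ho, clf.predict(X_ho))
```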
Key options
- `--n-runs`: number of training runs (seeds `0..n_runs-1`; default 10)
- `--n-jobs`: number of CPU cores used by PyCaret (default 4)
- `--ml-models`: restrict the model set (repeatable). Example:
psd train -d out/train_data.csv -o out --n-runs 10 --ml-models rf --ml-models xgboost
Use a broad model pool:
psd train -d out/train_data.csv -o out --n-runs 10 --ml-models all
Outputs
All outputs are written to --out-dir.
1) Feature engineering outputs (psd build)
Main outputs used for training
- `train_data.csv`
  Final training matrix (label is last column; `gene_id` removed).
- `train_data_with_gene.csv`
  Same as `train_data.csv`, but includes the `gene_id` column for traceability.
Raw exports (for transparency and reproducibility)
Depending on your build configuration (species, intervals, TSS/TTS enabled, --slim), you may see:
- `raw_kmer_features.csv`, `raw_tf_features.csv`
  Merged k-mer and TF features, with `gene_id`.
- `raw_all_features.csv`, `raw_all_features_with_gene.csv`
  Fully merged matrix (k-mer + TF + label), with and without `gene_id`.
- If `--slim` is not used, intermediate matrices are exported as well:
  `raw_kmer_tss.csv`, `raw_kmer_tts.csv`, `raw_tf_tss.csv`, `raw_tf_tts.csv`
Notes on built-in feature selection during build
During psd build, features are filtered to:
- `top_n = max(1, int(n_samples * 0.2))`
- ranked by absolute Pearson correlation with `label`

This reduces dimensionality for classical ML training. If you want to change this behavior, modify the feature-selection section in `plantsetdelta/cli.py`.
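A pandas sketch of that selection rule (illustrative, not the package's actual code; the toy matrix below is made up):

```python
import pandas as pd

def select_top_features(df, label_col="label"):
    """Keep top_n = max(1, int(n_samples * 0.2)) features,
    ranked by absolute Pearson correlation with the label."""
    top_n = max(1, int(len(df) * 0.2))
    feats = df.drop(columns=[label_col])
    corr = feats.corrwith(df[label_col]).abs().sort_values(ascending=False)
    return corr.head(top_n).index.tolist()

# Toy matrix: f1 correlates perfectly, f3 anti-correlates, f2 is noise
df = pd.DataFrame({
    "f1": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
    "f2": [0, 0, 1, 1, 0, 0, 1, 1, 0, 0],
    "f3": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
    "label": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})
kept = select_top_features(df)  # 10 samples -> top_n = 2
```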
2) Training outputs (psd train)
psd train produces two sets of results: CV and holdout test (PyCaret default split). All are placed under --out-dir.
Model files
- `best_models/best_model_<i>.pkl`
  Best model per run (seed), where `i` starts at 1.
- `best_model.pkl`
  Final selected model across runs, chosen by CV Balanced Accuracy.
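Selection across runs takes the highest CV Balanced Accuracy; conceptually (the per-seed numbers below are hypothetical):

```python
import pandas as pd

# Hypothetical per-seed CV summaries (illustrative values only)
runs = pd.DataFrame({
    "seed": [0, 1, 2],
    "model": ["rf", "xgboost", "rf"],
    "balanced_accuracy": [0.81, 0.86, 0.79],
})

# The run with the best CV Balanced Accuracy supplies best_model.pkl
best_run = runs.loc[runs["balanced_accuracy"].idxmax()]
```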
Result tables (CV vs test clearly separated)
Per-seed tables (stored under model_results/):
- `model_results/model_comparison_cv_batch_<seed>.csv`
  Training-set cross-validation comparison for that seed.
- `model_results/model_comparison_test_batch_<seed>.csv`
  Holdout test comparison for that seed.
Aggregated tables (stored at the root of --out-dir):
- `all_model_comparison_results_cv.csv`
  Merged CV comparisons across all seeds.
- `all_model_comparison_results_test.csv`
  Merged holdout test comparisons across all seeds.
Plots (CV vs test clearly separated)
- `balanced_acc_cv_plot.pdf`, `auc_cv_plot.pdf`
  Distributions across seeds for CV metrics.
- `balanced_acc_test_plot.pdf`, `auc_test_plot.pdf`
  Distributions across seeds for holdout test metrics.
3) Feature interpretation outputs (psd top)
- `top10_features.csv`
  Top 10 features, aggregated across per-seed best models.
- `top10_lollipop.pdf`
  Lollipop plot for the top 10 features.
Advanced recipes
1) Full offline workflow on HPC (recommended)
On login node (with internet):
export PSD_DATA_DIR=/share/psd_cache
psd download --species ath
psd download --species other
psd download --resource ath_tf_bed
On compute node (no internet):
export PSD_DATA_DIR=/share/psd_cache
psd build --species other --label labels.csv --genome-fa genome.fa --gtf anno.gff3 -o out
psd train -d out/train_data.csv -o out --n-runs 10
psd top -m out/best_model.pkl -d out/train_data.csv -o out
2) Reproducible comparison: sliding vs strict TF
# Sliding-window
psd build --species ath --label labels.csv -o out_slide --sliding-window yes
# Strict-bin
psd build --species ath --label labels.csv -o out_strict --sliding-window no
Train and compare models in each output directory.
3) Reduce runtime
- Reduce `--n-runs` (e.g., 3–5) for quick iterations
- Restrict `--ml-models` to a small subset
- Consider filtering your label set to fewer genes for pilot runs
Troubleshooting
“Model not found … Please run psd download --species other”
You are trying to compute TF features for species other (or strict TF for non-ath species) but the DeeperDeepSEA model file is missing in PSD_DATA_DIR.
Fix:
psd download --species other
Ensure PSD_DATA_DIR points to the same directory on compute nodes.
“bedtools not found” / pybedtools errors
Install bedtools:
conda install -c bioconda bedtools -y
“No gene features found” / empty FASTA
Your annotation may not contain gene entries, or gene IDs are not in the expected attributes.
Fix:
- Ensure the GFF3/GTF includes `gene` lines
- Ensure gene IDs are present in `ID=` (GFF3) or `gene_id` (GTF)
Contig name mismatch
If your genome FASTA uses `chr1` but the GFF uses `1`, the extractor will skip everything.
Fix:
- Normalize contig names between FASTA and GFF/GTF before running.
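One way to normalize is to strip a leading `chr` prefix from the FASTA headers (a minimal sketch; adapt the direction of renaming to your files):

```python
def normalize_fasta_headers(lines):
    """Rewrite FASTA header lines so '>chr1' becomes '>1'."""
    out = []
    for line in lines:
        if line.startswith(">"):
            name = line[1:].split()[0]      # contig name, drop description
            if name.lower().startswith("chr"):
                name = name[3:]
            out.append(">" + name)
        else:
            out.append(line)
    return out

fixed = normalize_fasta_headers([">chr1 Arabidopsis", "ATGC", ">2", "GGTT"])
```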
TF feature matrix seems shifted / window definition confusion
This is exactly why --sliding-window exists. Use:
- `--sliding-window no` for strict-bin semantics
- `--sliding-window yes` for sliding-window semantics
Acknowledgements
- Selene SDK
- PyCaret
- Biopython
- bedtools / pybedtools