Predict the impact of promoter variants on gene expression
PromoterAI
This repository contains the source code for PromoterAI, a deep learning model for predicting the impact of promoter variants on gene expression, as described in Jaganathan, Ersaro, Novakovsky et al., Science (2025).
PromoterAI precomputed scores for all human promoter single nucleotide variants are freely available for academic and non-commercial research use. Please complete the license agreement; the download link will be shared via email shortly after submission. Scores range from –1 to 1, with negative values indicating under-expression and positive values indicating over-expression. Recommended thresholds are ±0.1, ±0.2, and ±0.5.
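As an illustration of how the recommended thresholds might be applied, here is a minimal sketch that labels a score at a chosen cutoff. The function name `classify_score`, the label strings, and the inclusive comparison at the threshold are assumptions made for this example; they are not part of the PromoterAI package.

```python
def classify_score(score, threshold=0.1):
    """Label a score in [-1, 1] relative to a symmetric threshold.

    Hypothetical helper: treating a score exactly at the threshold as a
    call is an assumption, not documented PromoterAI behaviour.
    """
    if not -1.0 <= score <= 1.0:
        raise ValueError(f"score {score} is outside the expected range [-1, 1]")
    if score <= -threshold:
        return "under-expression"
    if score >= threshold:
        return "over-expression"
    return "below threshold"


# The same score can pass a lenient threshold and fail a strict one.
for cutoff in (0.1, 0.2, 0.5):
    print(cutoff, classify_score(-0.3, threshold=cutoff))
```

A stricter threshold yields fewer but more confident calls, so the choice among ±0.1, ±0.2, and ±0.5 depends on the application.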
Installation
The simplest way to install PromoterAI for variant scoring is via:
pip install promoterai
For model training or to work directly with the source code, install PromoterAI by cloning the repository:
git clone https://github.com/Illumina/PromoterAI
cd PromoterAI
python setup.py install
PromoterAI supports both CPU and GPU execution, and has been tested on H100 (TensorFlow 2.15, CUDA 12.2, cuDNN 8.9.7) and A100 (TensorFlow 2.13, CUDA 11.4, cuDNN 8.6.0) GPUs. A quick check to confirm proper setup (especially when using a different GPU or environment) is to run:
python -c "import tensorflow"
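For a slightly more informative check, the sketch below also reports the TensorFlow version and any GPUs that TensorFlow can see. It relies only on the standard `tf.config.list_physical_devices` call; the helper name `describe_tensorflow_setup` is hypothetical.

```python
def describe_tensorflow_setup():
    """Return the TensorFlow version and visible GPUs, or the import error."""
    try:
        import tensorflow as tf
    except ImportError as exc:
        # TensorFlow is missing or broken in this environment.
        return {"installed": False, "version": None, "gpus": [], "error": str(exc)}
    gpus = tf.config.list_physical_devices("GPU")
    return {
        "installed": True,
        "version": tf.__version__,
        "gpus": [gpu.name for gpu in gpus],
        "error": None,
    }


print(describe_tensorflow_setup())
```

An empty `gpus` list with `installed` set to `True` suggests that TensorFlow will fall back to CPU execution.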
Variant scoring
To score variants, organize them into a .tsv file with the following columns: chrom, pos, ref, alt, strand. If strand cannot be specified, create separate rows for each strand and aggregate predictions. Indels must be left-normalized.
chrom pos ref alt strand
chr16 84145214 G T 1
chr16 84145333 G C 1
chr2 55232249 T G -1
chr2 55232374 C T -1
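One possible way to produce such a file programmatically, including the duplication of rows when the strand is unknown, is sketched below. The helper names `expand_strands` and `write_variant_tsv`, and the convention of passing `None` for an unknown strand, are assumptions of this example.

```python
import csv
import io

COLUMNS = ["chrom", "pos", "ref", "alt", "strand"]


def expand_strands(variants):
    """Yield one row per variant, or one row per strand if strand is None."""
    for chrom, pos, ref, alt, strand in variants:
        strands = (1, -1) if strand is None else (strand,)
        for s in strands:
            yield {"chrom": chrom, "pos": pos, "ref": ref, "alt": alt, "strand": s}


def write_variant_tsv(variants, handle):
    """Write variants as tab-separated rows with a header line."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS, delimiter="\t")
    writer.writeheader()
    writer.writerows(expand_strands(variants))


variants = [
    ("chr16", 84145214, "G", "T", 1),
    ("chr2", 55232249, "T", "G", None),  # strand unknown: written for both strands
]

buffer = io.StringIO()
write_variant_tsv(variants, buffer)
print(buffer.getvalue())
```

To write to disk instead of an in-memory buffer, pass a handle opened with `open(path, "w", newline="")`. This sketch does not left-normalize indels; that step still has to be done beforehand.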
Download the appropriate reference genome .fa file, and run the following command:
promoterai \
--model_folder path/to/model_dir \
--var_file path/to/variant_tsv \
--fasta_file path/to/genome_fa \
--input_length 20480
Scores will be added as a new column labeled score, with the output file named by appending the model folder’s basename to the variant file name.
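When a variant was scored on both strands, the per-strand scores need to be combined afterwards. The sketch below reads a scored file and groups rows by variant. How to aggregate is not specified here, so the default of keeping the score with the largest magnitude is purely an assumption, as is the helper name `aggregate_scores`; pass a different `combine` function if another rule is more appropriate. The scores in the demonstration are synthetic placeholders, not model output.

```python
import csv
import io
from collections import defaultdict


def aggregate_scores(handle, combine=lambda scores: max(scores, key=abs)):
    """Group scored rows by (chrom, pos, ref, alt) and combine their scores.

    The default combine rule (largest absolute score) is an assumption.
    """
    grouped = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        key = (row["chrom"], int(row["pos"]), row["ref"], row["alt"])
        grouped[key].append(float(row["score"]))
    return {key: combine(scores) for key, scores in grouped.items()}


# Synthetic scored output for illustration only.
scored = io.StringIO(
    "chrom\tpos\tref\talt\tstrand\tscore\n"
    "chr2\t55232249\tT\tG\t1\t0.05\n"
    "chr2\t55232249\tT\tG\t-1\t-0.30\n"
)
print(aggregate_scores(scored))
```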
Model training
To begin, download the appropriate reference genome .fa file and regulatory profile .bigWig files. Organize the .bigWig file paths and their corresponding transformations into a .tsv file, where each row represents a prediction target, with the following columns:
- fwd: path to the forward-strand .bigWig file
- rev: path to the reverse-strand .bigWig file
- xform: transformation applied to the prediction target
fwd rev xform
data/bigwig/ENCFF245ZZX.bigWig data/bigwig/ENCFF245ZZX.bigWig lambda x: np.arcsinh(np.nan_to_num(x))
data/bigwig/ENCFF279QDX.bigWig data/bigwig/ENCFF279QDX.bigWig lambda x: np.arcsinh(np.nan_to_num(x))
data/bigwig/ENCFF480GFU.bigWig data/bigwig/ENCFF480GFU.bigWig lambda x: np.arcsinh(np.nan_to_num(x))
data/bigwig/ENCFF815ONV.bigWig data/bigwig/ENCFF815ONV.bigWig lambda x: np.arcsinh(np.nan_to_num(x))
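To make the effect of the example transformation concrete, the sketch below applies it to a small array. `np.nan_to_num` replaces missing values with 0, and `np.arcsinh` compresses large signal values while staying roughly linear near 0. The input values are synthetic, and how the training scripts parse the xform column internally is not shown here.

```python
import numpy as np

# Same expression as in the xform column of the example profile file.
xform = lambda x: np.arcsinh(np.nan_to_num(x))

# Synthetic coverage values, including a missing entry.
signal = np.array([np.nan, 0.0, 3.0, 1000.0])
transformed = xform(signal)
print(transformed)
```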
In addition, create a .tsv file listing the genomic positions of interest, with the following columns: chrom, pos, strand.
chrom pos strand
chr1 11868 1
chr1 12009 1
chr1 29569 -1
chr1 17435 -1
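Before preprocessing, it can help to sanity-check the positions file. The following sketch parses the three columns and rejects strand values other than 1 and -1; the helper name `read_positions` is hypothetical, and the check reflects only the format described above, not any validation performed by the preprocessing script.

```python
import csv
import io


def read_positions(handle):
    """Parse a chrom/pos/strand TSV into (chrom, pos, strand) tuples."""
    positions = []
    reader = csv.DictReader(handle, delimiter="\t")
    # Data rows start on line 2 because line 1 is the header.
    for line_number, row in enumerate(reader, start=2):
        strand = int(row["strand"])
        if strand not in (1, -1):
            raise ValueError(f"line {line_number}: strand must be 1 or -1, got {strand}")
        positions.append((row["chrom"], int(row["pos"]), strand))
    return positions


example = io.StringIO(
    "chrom\tpos\tstrand\n"
    "chr1\t11868\t1\n"
    "chr1\t12009\t1\n"
    "chr1\t29569\t-1\n"
    "chr1\t17435\t-1\n"
)
positions = read_positions(example)
print(positions)
```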
After preparing these files, run preprocess.sh with the paths to the genome .fa file, the profile and position .tsv files, and an output folder for writing the generated TFRecord files. For multi-species training, run the preprocessing step separately for each species. Next, run train.sh, specifying the TFRecord folder(s) and an output folder for saving the trained model. After training, run finetune.sh using the trained model as input. The fine-tuned model will be saved in a new folder with _finetune appended to the original model folder name.
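When scripting around these steps, the location of the fine-tuned model can be derived from the naming rule above. This is a small sketch; the helper name `finetuned_model_folder` is hypothetical, and stripping a trailing slash before appending the suffix is an assumption about how the scripts treat the path.

```python
import os


def finetuned_model_folder(model_folder):
    """Return the folder name with _finetune appended, per the naming rule."""
    # normpath drops any trailing separator so the suffix attaches to the name.
    return os.path.normpath(model_folder) + "_finetune"


print(finetuned_model_folder("path/to/model_dir/"))
```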
Contact
- Kishore Jaganathan: kjaganathan@illumina.com
- Gherman Novakovsky: gnovakovsky@illumina.com
- Kyle Farh: kfarh@illumina.com