PromoterAI
Predict the impact of promoter variants on gene expression
This repository contains the source code for PromoterAI, a deep learning model for predicting the impact of promoter variants on gene expression, as described in Jaganathan, Ersaro, Novakovsky et al., Science (2025).
PromoterAI precomputed scores for all human promoter single nucleotide variants are freely available for academic and non-commercial research use. Please complete the license agreement; the download link will be shared via email shortly after submission. Scores range from –1 to 1, with negative values indicating under-expression and positive values indicating over-expression. Recommended thresholds are ±0.1, ±0.2, and ±0.5.
Installation
The simplest way to install PromoterAI for variant effect prediction is through:
pip install promoterai
For model training or to work directly with the source code, install PromoterAI by cloning the repository:
git clone https://github.com/Illumina/PromoterAI
cd PromoterAI
pip install -e .
PromoterAI supports both CPU and GPU execution, and has been tested on H100 (TensorFlow 2.15, CUDA 12.2, cuDNN 8.9.7) and A100 (TensorFlow 2.13, CUDA 11.4, cuDNN 8.6.0) GPUs. Functionality on other GPUs is expected but not officially tested.
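Before scoring large variant sets on a GPU machine, you can optionally confirm that TensorFlow detects the GPU. The snippet below is a generic TensorFlow check, not part of the PromoterAI API:

import tensorflow as tf

# An empty list means PromoterAI will fall back to CPU execution.
print(tf.__version__, tf.config.list_physical_devices('GPU'))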
Variant effect prediction
To score variants, organize them into a .tsv file with the following columns: chrom, pos, ref, alt, strand. If the strand is unknown, create one row per strand and aggregate the predictions. Indels must be left-normalized and must not contain special characters.
chrom pos ref alt strand
chr16 84145214 G T 1
chr16 84145333 G C 1
chr2 55232249 T G -1
chr2 55232374 C T -1
chr6 108295024 C CGG 1
chr6 108295024 CT C 1
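A variant table like the one above can be written from Python with pandas; the file name and the variants below are illustrative:

import pandas as pd

variants = pd.DataFrame(
    [['chr16', 84145214, 'G', 'T', 1],
     ['chr2', 55232249, 'T', 'G', -1],
     ['chr6', 108295024, 'CT', 'C', 1]],
    columns=['chrom', 'pos', 'ref', 'alt', 'strand']
)
# Tab-separated, with a header row matching the expected column names.
variants.to_csv('variants.tsv', sep='\t', index=False)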
Download the appropriate reference genome .fa file, then run the following command:
promoterai \
--model_folder path/to/model \
--var_file path/to/variant_tsv \
--fasta_file path/to/genome_fa \
--input_length 20480
Scores will be added as a new column labeled score, with the output file named by appending the model folder’s basename to the variant file name.
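Once scoring completes, the output can be post-processed like any tab-separated file. A minimal sketch, assuming a hypothetical output name variants_model.tsv (the actual name depends on your variant file and model folder, as described above), which flags variants at the ±0.2 threshold:

import pandas as pd

scored = pd.read_csv('variants_model.tsv', sep='\t')  # hypothetical output name
# Negative scores suggest under-expression, positive scores over-expression.
flagged = scored[scored['score'].abs() >= 0.2]
print(flagged.sort_values('score'))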
Model training and fine-tuning
Create a .tsv file listing the genomic positions of interest (e.g., promoters), with the following columns: chrom, pos, strand.
chrom pos strand
chr1 11868 1
chr1 12009 1
chr1 29569 -1
chr1 17435 -1
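One way to build such a file is to extract transcription start sites from a GENCODE-style GTF; the sketch below is illustrative and not part of the PromoterAI package (the GTF path and output name are assumptions):

import pandas as pd

gtf = pd.read_csv(
    'gencode.v44.annotation.gtf', sep='\t', comment='#', header=None,
    names=['chrom', 'source', 'feature', 'start', 'end',
           'score', 'strand', 'frame', 'attribute']
)
tx = gtf[gtf['feature'] == 'transcript'].copy()
# The TSS is the 5' end of the transcript: 'start' on the + strand, 'end' on the - strand.
tx['pos'] = tx['start'].where(tx['strand'] == '+', tx['end'])
tx['strand'] = tx['strand'].map({'+': 1, '-': -1})
tx[['chrom', 'pos', 'strand']].drop_duplicates().to_csv(
    'positions.tsv', sep='\t', index=False
)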
Download the appropriate reference genome .fa file and regulatory profile .bigWig files. Organize the .bigWig file paths and their corresponding transformations into a .tsv file, where each row represents a prediction target, with the following columns:
- fwd: path to the forward-strand .bigWig file
- rev: path to the reverse-strand .bigWig file
- xform: transformation applied to the prediction target
fwd rev xform
path/to/ENCFF245ZZX.bigWig path/to/ENCFF245ZZX.bigWig lambda x: np.arcsinh(np.nan_to_num(x))
path/to/ENCFF279QDX.bigWig path/to/ENCFF279QDX.bigWig lambda x: np.arcsinh(np.nan_to_num(x))
path/to/ENCFF480GFU.bigWig path/to/ENCFF480GFU.bigWig lambda x: np.arcsinh(np.nan_to_num(x))
path/to/ENCFF815ONV.bigWig path/to/ENCFF815ONV.bigWig lambda x: np.arcsinh(np.nan_to_num(x))
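The xform column holds a Python expression for a callable applied to the raw signal; in the examples above, NaNs are replaced with zeros and values are arcsinh-transformed to dampen large coverage peaks. The snippet below only illustrates how such an expression behaves (how promoterai.preprocess evaluates it internally may differ):

import numpy as np

xform = eval('lambda x: np.arcsinh(np.nan_to_num(x))', {'np': np})
raw = np.array([0.0, 5.0, np.nan, 250.0])
print(xform(raw))  # NaN -> 0, large values compressed by arcsinh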
Generate TFRecord files by running the following command, which can be parallelized across chromosomes for speed:
for chrom in $(cut -f1 path/to/position_tsv | sort -u | grep -v chrom)
do
python -m promoterai.preprocess \
--tfr_folder path/to/output_tfrecord \
--tss_file path/to/position_tsv \
--fasta_file path/to/genome_fa \
--bigwig_files path/to/profile_tsv \
--chrom ${chrom} \
--input_length 32768 \
--output_length 16384 \
--chunk_size 256
done
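Equivalently, the per-chromosome preprocessing jobs can be launched in parallel from Python. A minimal sketch using only the standard library and pandas; the paths and worker count are assumptions:

import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

chroms = pd.read_csv('positions.tsv', sep='\t')['chrom'].unique()

def preprocess(chrom):
    # Each job writes the TFRecord data for one chromosome.
    subprocess.run([
        sys.executable, '-m', 'promoterai.preprocess',
        '--tfr_folder', 'path/to/output_tfrecord',
        '--tss_file', 'positions.tsv',
        '--fasta_file', 'path/to/genome_fa',
        '--bigwig_files', 'path/to/profile_tsv',
        '--chrom', chrom,
        '--input_length', '32768',
        '--output_length', '16384',
        '--chunk_size', '256',
    ], check=True)

with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(preprocess, chroms))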
For multi-species training, repeat the steps above for each species, writing TFRecord files to separate folders. Use the command below to train a model on the generated TFRecord files; the --tfr_nonhuman_folders argument is optional and can be omitted for human-only training:
python -m promoterai.train \
--model_folder path/to/trained_model \
--tfr_human_folder path/to/human_tfrecord \
--tfr_nonhuman_folders path/to/mouse_tfrecord ... \
--input_length 20480 \
--output_length 4096 \
--num_blocks 24 \
--model_dim 1024 \
--batch_size 32
Fine-tune the trained model on data/annotation/finetune_gtex.tsv using the command below:
python -m promoterai.finetune \
--model_folder path/to/trained_model \
--var_file path/to/finetune_gtex_tsv \
--fasta_file path/to/genome_fa \
--input_length 20480 \
--batch_size 8
The fine-tuned model will be saved in a new folder with _finetune appended to the trained model folder name.
Contact
- Kishore Jaganathan: kjaganathan@illumina.com
- Gherman Novakovsky: gnovakovsky@illumina.com
- Kyle Farh: kfarh@illumina.com