PromoterAI

Predict the impact of promoter variants on gene expression

This repository contains the source code for PromoterAI, a deep learning model for predicting the impact of promoter variants on gene expression, as described in Jaganathan, Ersaro, Novakovsky et al., Science (2025).

PromoterAI precomputed scores for all human promoter single nucleotide variants are freely available for academic and non-commercial research use. Please complete the license agreement; the download link will be shared via email shortly after submission. Scores range from –1 to 1, with negative values indicating under-expression and positive values indicating over-expression. Recommended thresholds are ±0.1, ±0.2, and ±0.5.
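As a quick way to work with the precomputed scores, the sketch below bins variants by the recommended thresholds. It assumes the scores are distributed as a tab-separated file with a score column; the file name here is hypothetical.

import pandas as pd

# Hypothetical file name; the actual download layout may differ.
scores = pd.read_csv('promoterai_precomputed_scores.tsv', sep='\t')

# Bin variants by the recommended |score| thresholds (0.1, 0.2, 0.5).
bins = [-1, -0.5, -0.2, -0.1, 0.1, 0.2, 0.5, 1]
labels = ['strong under', 'moderate under', 'weak under', 'neutral',
          'weak over', 'moderate over', 'strong over']
scores['effect'] = pd.cut(scores['score'], bins=bins, labels=labels,
                          include_lowest=True)
print(scores['effect'].value_counts())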

Installation

The simplest way to install PromoterAI for variant effect prediction is via pip:

pip install promoterai

For model training or to work directly with the source code, install PromoterAI by cloning the repository:

git clone https://github.com/Illumina/PromoterAI
cd PromoterAI
pip install -e .

PromoterAI supports both CPU and GPU execution, and has been tested on H100 (TensorFlow 2.15, CUDA 12.2, cuDNN 8.9.7) and A100 (TensorFlow 2.13, CUDA 11.4, cuDNN 8.6.0) GPUs. Functionality on other GPUs is expected but not officially tested.

Variant effect prediction

To score variants, organize them into a .tsv file with the following columns: chrom, pos, ref, alt, strand (1 for forward, -1 for reverse). If the strand cannot be determined, create one row per strand and aggregate the resulting predictions (see the aggregation sketch after the scoring command). Indels must be left-normalized and must not contain special characters. A sketch for building this file follows the table.

chrom	pos	ref	alt	strand
chr16	84145214	G	T	1
chr16	84145333	G	C	1
chr2	55232249	T	G	-1
chr2	55232374	C	T	-1
chr6	108295024	C	CGG	1
chr6	108295024	CT	C	1
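A minimal sketch for constructing this file with pandas; the variant values are taken from the table above, and the unknown-strand example is hypothetical.

import pandas as pd

variants = pd.DataFrame(
    [('chr16', 84145214, 'G', 'T', 1),
     ('chr6', 108295024, 'C', 'CGG', 1)],
    columns=['chrom', 'pos', 'ref', 'alt', 'strand'])

# Variant of unknown strand: emit one row per strand and aggregate later.
unknown = pd.DataFrame(
    [('chr2', 55232249, 'T', 'G', s) for s in (1, -1)],
    columns=variants.columns)

pd.concat([variants, unknown]).to_csv('variants.tsv', sep='\t', index=False)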

Download the appropriate reference genome .fa file, then run the following command:

promoterai \
    --model_folder path/to/model \
    --var_file path/to/variant_tsv \
    --fasta_file path/to/genome_fa \
    --input_length 20480

Scores will be added as a new column labeled score, with the output file named by appending the model folder’s basename to the variant file name.
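For variants scored on both strands, the per-strand predictions can be aggregated after scoring, e.g. by averaging. The sketch below assumes the output follows the documented naming convention; the exact file name here is hypothetical.

import pandas as pd

# Hypothetical output name: variant file plus the model folder's basename.
pred = pd.read_csv('variants_model.tsv', sep='\t')

# Average the two per-strand predictions for each variant.
agg = pred.groupby(['chrom', 'pos', 'ref', 'alt'], as_index=False)['score'].mean()
print(agg.head())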

Model training and fine-tuning

Create a .tsv file listing the genomic positions of interest (e.g., promoters), with the following columns: chrom, pos, strand.

chrom	pos	strand
chr1	11868	1
chr1	12009	1
chr1	29569	-1
chr1	17435	-1
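One common way to obtain such positions is to take transcription start sites from a GTF annotation. The sketch below is an illustration, not part of the PromoterAI pipeline; the GENCODE file name is hypothetical.

import csv
import gzip

rows = set()
with gzip.open('gencode.v44.annotation.gtf.gz', 'rt') as f:
    for line in f:
        if line.startswith('#'):
            continue
        fields = line.rstrip('\n').split('\t')
        if fields[2] != 'transcript':
            continue
        chrom, start, end, strand = fields[0], int(fields[3]), int(fields[4]), fields[6]
        tss = start if strand == '+' else end  # TSS is the 5' end of the transcript
        rows.add((chrom, tss, 1 if strand == '+' else -1))

with open('positions.tsv', 'w', newline='') as out:
    writer = csv.writer(out, delimiter='\t')
    writer.writerow(['chrom', 'pos', 'strand'])
    writer.writerows(sorted(rows))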

Download the appropriate reference genome .fa file and regulatory profile .bigWig files. Organize the .bigWig file paths and their corresponding transformations into a .tsv file, where each row represents one prediction target, with the following columns:

  • fwd: path to the forward-strand .bigWig file
  • rev: path to the reverse-strand .bigWig file
  • xform: transformation applied to the prediction target, written as a Python lambda (see the demo after the table)

fwd	rev	xform
path/to/ENCFF245ZZX.bigWig	path/to/ENCFF245ZZX.bigWig	lambda x: np.arcsinh(np.nan_to_num(x))
path/to/ENCFF279QDX.bigWig	path/to/ENCFF279QDX.bigWig	lambda x: np.arcsinh(np.nan_to_num(x))
path/to/ENCFF480GFU.bigWig	path/to/ENCFF480GFU.bigWig	lambda x: np.arcsinh(np.nan_to_num(x))
path/to/ENCFF815ONV.bigWig	path/to/ENCFF815ONV.bigWig	lambda x: np.arcsinh(np.nan_to_num(x))
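A quick demonstration of what this transform does: np.nan_to_num maps the NaN gaps common in .bigWig coverage to 0, and np.arcsinh compresses the long right tail of coverage values while staying roughly linear near 0.

import numpy as np

xform = lambda x: np.arcsinh(np.nan_to_num(x))

coverage = np.array([np.nan, 0.0, 1.0, 10.0, 1000.0])
print(xform(coverage))  # ≈ [0, 0, 0.881, 2.998, 7.601]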

Generate TFRecord files by running the following command, which can be parallelized across chromosomes for speed (see the sketch after the loop):

# List the unique chromosomes in the position file, skipping the header row.
for chrom in $(cut -f1 path/to/position_tsv | sort -u | grep -v chrom)
do
    python -m promoterai.preprocess \
        --tfr_folder path/to/output_tfrecord \
        --tss_file path/to/position_tsv \
        --fasta_file path/to/genome_fa \
        --bigwig_files path/to/profile_tsv \
        --chrom ${chrom} \
        --input_length 32768 \
        --output_length 16384 \
        --chunk_size 256
done
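One possible way to parallelize, sketched with Python's standard library; the chromosome names and worker count are assumptions, and each worker simply runs the documented command above.

import subprocess
from concurrent.futures import ThreadPoolExecutor

chroms = [f'chr{c}' for c in list(range(1, 23)) + ['X', 'Y']]  # assumed names

def preprocess(chrom):
    # Run the documented preprocessing command for one chromosome.
    subprocess.run(
        ['python', '-m', 'promoterai.preprocess',
         '--tfr_folder', 'path/to/output_tfrecord',
         '--tss_file', 'path/to/position_tsv',
         '--fasta_file', 'path/to/genome_fa',
         '--bigwig_files', 'path/to/profile_tsv',
         '--chrom', chrom,
         '--input_length', '32768',
         '--output_length', '16384',
         '--chunk_size', '256'],
        check=True)

# The work happens in child processes, so threads suffice for fan-out.
with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(preprocess, chroms))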

For multi-species training, repeat the steps above for each species, writing TFRecord files to separate folders. Use the command below to train a model on the generated TFRecord files; --tfr_nonhuman_folders is optional and accepts one or more folders:

python -m promoterai.train \
    --model_folder path/to/trained_model \
    --tfr_human_folder path/to/human_tfrecord \
    --tfr_nonhuman_folders path/to/mouse_tfrecord ... \
    --input_length 20480 \
    --output_length 4096 \
    --num_blocks 24 \
    --model_dim 1024 \
    --batch_size 32

Fine-tune the trained model on data/annotation/finetune_gtex.tsv using the command below:

python -m promoterai.finetune \
    --model_folder path/to/trained_model \
    --var_file path/to/finetune_gtex_tsv \
    --fasta_file path/to/genome_fa \
    --input_length 20480 \
    --batch_size 8

The fine-tuned model will be saved in a new folder with _finetune appended to the trained model folder name.
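The fine-tuned model can then be used for variant scoring with the same command as above, pointing --model_folder at the new folder:

promoterai \
    --model_folder path/to/trained_model_finetune \
    --var_file path/to/variant_tsv \
    --fasta_file path/to/genome_fa \
    --input_length 20480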

Contact
