Predict the impact of promoter variants on gene expression

PromoterAI

This repository contains the source code for PromoterAI, a deep learning model for predicting the impact of promoter variants on gene expression, as described in Jaganathan, Ersaro, Novakovsky et al., Science (2025).

PromoterAI precomputed scores for all human promoter single nucleotide variants are freely available for academic and non-commercial research use. Please complete the license agreement; the download link will be shared via email shortly after submission. Scores range from –1 to 1, with negative values indicating under-expression and positive values indicating over-expression. Recommended thresholds are ±0.1, ±0.2, and ±0.5.
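
As an illustration of how the score range and thresholds above might be applied, the following minimal sketch labels a score with its predicted direction and the strictest recommended threshold it meets. The helper name `describe_score`, its return format, and the choice to treat thresholds as inclusive are assumptions made for this example; they are not part of PromoterAI.

```python
# Hypothetical helper for illustration only; not part of the PromoterAI package.
THRESHOLDS = (0.5, 0.2, 0.1)  # recommended cutoffs from the text, strictest first


def describe_score(score):
    """Return (direction, strictest threshold met or None) for a score in [-1, 1]."""
    if not -1.0 <= score <= 1.0:
        raise ValueError(f"score {score} is outside the expected range [-1, 1]")
    if score < 0:
        direction = "under-expression"
    elif score > 0:
        direction = "over-expression"
    else:
        direction = "no predicted change"
    # Assumption: a threshold counts as met when |score| >= threshold.
    passed = next((t for t in THRESHOLDS if abs(score) >= t), None)
    return direction, passed


# Made-up scores, used only to exercise the helper.
for s in (-0.62, 0.15, 0.03):
    print(s, describe_score(s))
```

Which threshold to use depends on how strict the analysis needs to be; the sketch only reports the strictest one a score satisfies.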

Installation

The simplest way to install PromoterAI for variant scoring is via:

pip install promoterai

For model training or to work directly with the source code, install PromoterAI by cloning the repository:

git clone https://github.com/Illumina/PromoterAI
cd PromoterAI
pip install .

PromoterAI supports both CPU and GPU execution, and has been tested on H100 (TensorFlow 2.15, CUDA 12.2, cuDNN 8.9.7) and A100 (TensorFlow 2.13, CUDA 11.4, cuDNN 8.6.0) GPUs. A quick check to confirm proper setup (especially when using a different GPU or environment) is to run:

python -c "import tensorflow"
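
For a slightly more informative check than a bare import, the sketch below also reports the TensorFlow version and the GPUs TensorFlow can see. It relies only on the standard `tf.config.list_physical_devices` call; the function name `check_tensorflow` and the shape of the returned dictionary are illustrative choices, not something PromoterAI provides.

```python
import importlib


def check_tensorflow():
    """Report whether TensorFlow imports and which GPUs it can see."""
    try:
        tf = importlib.import_module("tensorflow")
    except Exception as exc:  # a missing or broken install both end up here
        return {"installed": False, "error": str(exc), "gpus": []}
    gpus = [device.name for device in tf.config.list_physical_devices("GPU")]
    return {"installed": True, "version": tf.__version__, "gpus": gpus}


report = check_tensorflow()
print(report)
```

An empty `gpus` list with `installed` set to `True` suggests TensorFlow will fall back to CPU execution.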

Variant scoring

To score variants, organize them into a .tsv file with the following columns: chrom, pos, ref, alt, strand. The strand column takes the value 1 or -1. If the strand is unknown, add one row per strand for that variant and aggregate the two predictions afterward. Indels must be left-normalized.

chrom	pos	ref	alt	strand
chr16	84145214	G	T	1
chr16	84145333	G	C	1
chr2	55232249	T	G	-1
chr2	55232374	C	T	-1
chr1	64918	T	TGG	1
chr1	64918	TAA	T	1
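
A file in this layout can be written by hand or generated. The following is a minimal sketch, using only the Python standard library, that writes the five columns and emits two rows (strand 1 and -1) for any variant whose strand is given as `None`. The function names and the output path are hypothetical, and the sketch does not left-normalize indels.

```python
import csv
import os
import tempfile

COLUMNS = ["chrom", "pos", "ref", "alt", "strand"]


def expand_strands(variants):
    """Yield one row per variant, or two rows (1 and -1) when strand is None."""
    for chrom, pos, ref, alt, strand in variants:
        strands = (1, -1) if strand is None else (strand,)
        for s in strands:
            if s not in (1, -1):
                raise ValueError(f"strand must be 1 or -1, got {s!r}")
            yield [chrom, pos, ref, alt, s]


def write_variant_tsv(variants, path):
    """Write a tab-separated variant file with a header row."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(COLUMNS)
        writer.writerows(expand_strands(variants))


variants = [
    ("chr16", 84145214, "G", "T", 1),
    ("chr2", 55232249, "T", "G", None),  # strand unknown: written once per strand
]
rows = list(expand_strands(variants))
out_path = os.path.join(tempfile.gettempdir(), "variants.tsv")  # illustrative path
write_variant_tsv(variants, out_path)
```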

Download the reference genome .fa file for the assembly that your variant coordinates refer to, and run the following command:

promoterai \
    --model_folder path/to/model_dir \
    --var_file path/to/variant_tsv \
    --fasta_file path/to/genome_fa \
    --input_length 20480
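
When scoring is one step in a larger Python workflow, the same command can be assembled and launched with `subprocess`. This sketch uses only the flags shown above; the wrapper name `build_promoterai_command` is hypothetical, and the paths are the same placeholders as in the command, so the actual call is left commented out.

```python
import shlex
import subprocess


def build_promoterai_command(model_folder, var_file, fasta_file, input_length=20480):
    """Assemble the promoterai command line as an argument list."""
    return [
        "promoterai",
        "--model_folder", str(model_folder),
        "--var_file", str(var_file),
        "--fasta_file", str(fasta_file),
        "--input_length", str(input_length),
    ]


cmd = build_promoterai_command(
    "path/to/model_dir", "path/to/variant_tsv", "path/to/genome_fa"
)
print(shlex.join(cmd))

# Uncomment once the placeholder paths point at real files:
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.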

Scores will be added as a new column labeled score, with the output file named by appending the model folder’s basename to the variant file name.
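
For variants that were scored on both strands, the per-strand rows in the output need to be combined. The README does not prescribe an aggregation rule, so the sketch below makes an explicit assumption: it keeps the score with the larger absolute value for each variant. The function name is hypothetical, and the scores in the demo are made up to show the mechanics, not real PromoterAI outputs.

```python
import csv
import io
from collections import defaultdict


def aggregate_by_variant(handle):
    """Group rows by (chrom, pos, ref, alt) and keep the largest-magnitude score."""
    groups = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        key = (row["chrom"], int(row["pos"]), row["ref"], row["alt"])
        groups[key].append(float(row["score"]))
    # Assumption: the strand with the stronger predicted effect is the relevant one.
    return {key: max(scores, key=abs) for key, scores in groups.items()}


# Stand-in for a scored output file; the score values are invented.
demo = io.StringIO(
    "chrom\tpos\tref\talt\tstrand\tscore\n"
    "chr2\t55232249\tT\tG\t1\t0.02\n"
    "chr2\t55232249\tT\tG\t-1\t-0.31\n"
    "chr16\t84145214\tG\tT\t1\t0.12\n"
)
result = aggregate_by_variant(demo)
print(result)
```

Other rules, such as averaging the two strands, fit the same structure by changing the final dictionary comprehension.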

Model training

To begin, download the appropriate reference genome .fa file and regulatory profile .bigWig files. Organize the .bigWig file paths and their corresponding transformations into a .tsv file, where each row represents a prediction target, with the following columns:

  • fwd: path to the forward-strand .bigWig file
  • rev: path to the reverse-strand .bigWig file
  • xform: transformation applied to the prediction target

fwd	rev	xform
data/bigwig/ENCFF245ZZX.bigWig	data/bigwig/ENCFF245ZZX.bigWig	lambda x: np.arcsinh(np.nan_to_num(x))
data/bigwig/ENCFF279QDX.bigWig	data/bigwig/ENCFF279QDX.bigWig	lambda x: np.arcsinh(np.nan_to_num(x))
data/bigwig/ENCFF480GFU.bigWig	data/bigwig/ENCFF480GFU.bigWig	lambda x: np.arcsinh(np.nan_to_num(x))
data/bigwig/ENCFF815ONV.bigWig	data/bigwig/ENCFF815ONV.bigWig	lambda x: np.arcsinh(np.nan_to_num(x))
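
To see what the xform column does, the sketch below reads one row in this layout and applies its transformation to a small array. `np.nan_to_num` replaces NaN (for example, positions with no signal) with 0, and `np.arcsinh` compresses large values much like a logarithm while remaining defined at zero. Turning the string into a function with `eval` is an assumption about how such a column could be consumed, not a description of PromoterAI's internals, and the function name `load_profiles` is hypothetical.

```python
import csv
import io

import numpy as np


def load_profiles(handle):
    """Parse a profile table into (fwd, rev, callable) tuples."""
    targets = []
    for row in csv.DictReader(handle, delimiter="\t"):
        # eval runs arbitrary code: only use it on files you wrote yourself.
        xform = eval(row["xform"], {"np": np})
        targets.append((row["fwd"], row["rev"], xform))
    return targets


demo = io.StringIO(
    "fwd\trev\txform\n"
    "data/bigwig/ENCFF245ZZX.bigWig\tdata/bigwig/ENCFF245ZZX.bigWig\t"
    "lambda x: np.arcsinh(np.nan_to_num(x))\n"
)
fwd, rev, xform = load_profiles(demo)[0]

signal = np.array([0.0, np.nan, 1.0])  # toy values, not real bigWig data
out = xform(signal)
print(out)
```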

In addition, create a .tsv file listing the genomic positions of interest, with the following columns: chrom, pos, strand.

chrom	pos	strand
chr1	11868	1
chr1	12009	1
chr1	29569	-1
chr1	17435	-1

After preparing these files, run preprocess.sh with the paths to the genome .fa file, the profile and position .tsv files, and an output folder for writing the generated TFRecord files. For multi-species training, run the preprocessing step separately for each species. Next, run train.sh, specifying the TFRecord folder(s) and an output folder for saving the trained model. After training, run finetune.sh using the trained model as input. The fine-tuned model will be saved in a new folder with _finetune appended to the original model folder name.

Contact
