ProteinFP

End-to-end protein function prediction and drug candidate design.

Give it a UniProt ID and get back active sites, druggable pockets, allosteric sites, EC classification, GO terms, PPI partners, therapy modality recommendations, and (when AutoDock Vina is available) evolved drug candidate molecules. It works for any protein, any disease, any organism.

pip install proteinfp
proteinfp --uniprot P28593   # Trypanothione reductase (Chagas disease)
  Protein    : Trypanothione reductase
  Gene       : TPR
  Organism   : Trypanosoma cruzi
  Confidence : VERY HIGH

  Top function     : Trypanothione is the parasite analog of glutathione
  Enzyme           : yes — EC 1.8.1.12
  Pockets          : 10 (all druggability > 0.90)
  Therapy          : SMALL_MOLECULE → active site inhibitor

What it does

ProteinFP runs 13+ prediction modules in sequence, fusing their outputs into a single ranked, confidence-weighted report.

Module  What it predicts
01      AlphaFold structure + UniProt metadata
02      Surface charge, hydrophobicity, SASA
03      Catalytic residues and active site motifs
04      Druggable binding pockets (geometry + druggability score)
05      Allosteric sites (elastic network model)
06      Chemical environment of each site
07      Sequence homologs with known function (BLAST + InterPro)
08      ESM-2 protein language model embeddings (650M parameters)
09      GO term prediction (Molecular Function, Biological Process, Cellular Component)
10      Enzyme class prediction: ML ensemble (XGBoost + LightGBM + MLP, ~97% accuracy)
11      Structural analogs via Foldseek (finds same-fold proteins regardless of sequence)
12      Protein-protein interactions (STRING DB)
13      Consensus report: fuses all evidence into a ranked, confidence-scored output
14      Molecular dynamics: RMSF, flexibility, cryptic pockets (needs OpenMM)
15      De novo molecular design: evolutionary drug candidate generation (needs Vina + RDKit)
17      Post-translational modification sites and their functional consequences

GRN + SIM pipeline (disease-aware mode — requires scRNA-seq data):

Module  What it does
GRN-01  scRNA-seq preprocessing: HVG selection, QC filtering
GRN-02  GENIE3 gene regulatory network reconstruction
GRN-03  Therapy modality decision: surface vs intracellular, ADC vs small molecule
SIM-01  Tumor cell environment inference from marker gene expression
SIM-02  Protein conformational ensemble in that environment
SIM-03  Drug distribution across cell compartments
SIM-04  Binding probability under real physiological conditions
SIM-05  GRN perturbation: network-level consequences of drug binding
SIM-06  Pharmacological scoring: efficacy, selectivity, resistance risk, grade A–F

Installation

pip install proteinfp

Core pipeline (Modules 01–13, 17) works out of the box. Optional features:

pip install "proteinfp[ml]"        # ESM-2 embeddings + ML EC classifier
pip install "proteinfp[structure]" # SASA/DSSP surface analysis
pip install "proteinfp[chem]"      # De novo molecular design (RDKit)
pip install "proteinfp[grn]"       # GRN/scRNA-seq modules (scanpy)
pip install "proteinfp[sim]"       # Molecular dynamics (OpenMM)
pip install "proteinfp[all]"       # Everything

For de novo design you also need AutoDock Vina.

Check what's available on your machine:

proteinfp --check-deps
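
To give a feel for what a dependency check involves, here is a minimal sketch using only the standard library. It is not the code behind --check-deps. The import names (rdkit, scanpy, openmm) and the executable name vina are assumptions taken from the tools named above; your Vina binary may have a different name or live outside PATH.

```python
import importlib.util
import shutil

# Assumed import names for some of the optional extras listed above.
OPTIONAL_PACKAGES = {
    "chem": "rdkit",
    "grn": "scanpy",
    "sim": "openmm",
}


def check_optional_deps(packages=OPTIONAL_PACKAGES, vina_name="vina"):
    """Return a dict mapping each feature name to True/False availability."""
    status = {}
    for extra, module_name in packages.items():
        # find_spec returns None when a top-level module cannot be located.
        status[extra] = importlib.util.find_spec(module_name) is not None
    # AutoDock Vina is an external executable, so look for it on PATH.
    status["vina"] = shutil.which(vina_name) is not None
    return status


if __name__ == "__main__":
    for feature, available in check_optional_deps().items():
        print(f"{feature:<6} {'ok' if available else 'missing'}")
```

A missing entry here only means the package or binary was not found in the current environment. The built-in check remains the authoritative answer for what ProteinFP can use.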

Quick start

# Any protein, just a UniProt ID
proteinfp --uniprot P04637       # TP53 (human tumour suppressor)
proteinfp --uniprot P28593       # Trypanothione reductase (Chagas disease)
proteinfp --uniprot P9WGR1       # InhA (drug-resistant TB)

# Force re-run even if report already exists
proteinfp --uniprot P04637 --force

# With therapy decision + de novo molecule design
proteinfp --uniprot P28593 --therapy --denovo --vina /path/to/vina

# With molecular dynamics
proteinfp --uniprot P28593 --md

# Show all modules and their status
proteinfp --list-modules
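
For more than a handful of proteins it can help to drive the CLI from a script. The sketch below assembles command lines from the flags shown above and runs them one at a time. The helper names build_command and run_batch are hypothetical. Whether every flag combination is accepted, and whether the CLI signals failure through its exit code, are assumptions to verify on your installation.

```python
import subprocess


def build_command(uniprot_id, force=False, therapy=False, denovo=False,
                  vina_path=None, md=False):
    """Assemble a proteinfp argv list using only the flags shown above."""
    cmd = ["proteinfp", "--uniprot", uniprot_id]
    if force:
        cmd.append("--force")
    if therapy:
        cmd.append("--therapy")
    if denovo:
        cmd.append("--denovo")
    if vina_path is not None:
        cmd += ["--vina", str(vina_path)]
    if md:
        cmd.append("--md")
    return cmd


def run_batch(uniprot_ids, **options):
    """Run the CLI once per ID and return {uniprot_id: exit code}."""
    results = {}
    for uid in uniprot_ids:
        completed = subprocess.run(build_command(uid, **options))
        results[uid] = completed.returncode
    return results
```

For example, run_batch(["P04637", "P28593"], therapy=True) would launch two runs in sequence. Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.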

Reports are saved to data/reports/{UNIPROT}_report.json and data/reports/{UNIPROT}_report.txt.
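
The JSON report can be read back with the standard library. The sketch below assumes only the file naming given above. The report schema is not documented here, so the helper lists top-level fields instead of guessing at key names. Both function names are hypothetical.

```python
import json
from pathlib import Path


def load_report(uniprot_id, reports_dir="data/reports"):
    """Load the JSON report for one UniProt ID, or return None if absent."""
    path = Path(reports_dir) / f"{uniprot_id}_report.json"
    if not path.exists():
        return None
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


def summarize(report):
    """Map top-level fields to their Python type names, assuming no schema."""
    if isinstance(report, dict):
        return {key: type(value).__name__ for key, value in report.items()}
    return {"<root>": type(report).__name__}


if __name__ == "__main__":
    report = load_report("P28593")
    if report is None:
        print("No report found; run the pipeline first.")
    else:
        for key, kind in summarize(report).items():
            print(f"{key}: {kind}")
```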


Therapy mode

After the core pipeline runs, --therapy makes modality decisions automatically:

  • Surface protein → antibody path: ranks epitope candidates by immunogenicity and accessibility
  • Intracellular with druggable pocket → small molecule path: triggers de novo design
  • Epigenetic regulator → adds PROTAC degrader as secondary recommendation
  • Allosteric site only → allosteric small molecule

proteinfp --uniprot P28593 --therapy --denovo --vina pipeline/vina.exe
  → Primary modality : SMALL_MOLECULE
  → Confidence       : HIGH
  • Intracellular with druggable pocket P1 (vol=1800ų, drug=0.90)
  • Enzyme (EC 1.8.1.12) — active site inhibition most direct mechanism
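
The four rules listed above can be read as a small decision function. The sketch below is an illustration only, not the logic in therapy.py. The class and field names are hypothetical, every label except SMALL_MOLECULE is invented for the example, and the order in which the rules are checked is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class TargetProfile:
    """Hypothetical summary of the evidence the rules refer to."""
    is_surface: bool = False
    has_druggable_pocket: bool = False
    has_allosteric_site: bool = False
    is_epigenetic_regulator: bool = False


@dataclass
class Recommendation:
    primary: str
    secondary: list = field(default_factory=list)
    run_denovo: bool = False


def recommend(profile):
    """Apply the four rules in the order they are listed (an assumption)."""
    if profile.is_surface:
        rec = Recommendation("ANTIBODY")
    elif profile.has_druggable_pocket:
        # Intracellular target with a pocket: small molecule plus de novo design.
        rec = Recommendation("SMALL_MOLECULE", run_denovo=True)
    elif profile.has_allosteric_site:
        rec = Recommendation("ALLOSTERIC_SMALL_MOLECULE")
    else:
        rec = Recommendation("UNDECIDED")
    # Epigenetic regulators get a degrader as a secondary option.
    if profile.is_epigenetic_regulator:
        rec.secondary.append("PROTAC_DEGRADER")
    return rec
```

With this sketch, an intracellular enzyme with a druggable pocket, like the trypanothione reductase example, gets SMALL_MOLECULE with de novo design enabled.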

Disease-agnostic design

The pipeline works on any protein from any organism. To switch disease context, edit one file:

# config/disease_config.yaml
disease:
  name: "TB"
  organism: "Mycobacterium tuberculosis"
  organism_id: 83332

data:
  scrnaseq_input: "data/grn/input/your_mtb_data.csv"

driver_genes:
  - katG   # isoniazid activator (resistance mutations)
  - inhA   # isoniazid target
  - rpoB   # rifampicin target
  - gyrA   # fluoroquinolone target

Ready-to-use configs for LUAD, CRC, TB, and Leishmaniasis are included in the file.


Project structure

proteinfp/
├── config/
│   ├── config.yaml            ← paths, API thresholds, tool settings
│   └── disease_config.yaml    ← switch disease/organism here
├── pipeline/                  ← Modules 01–17
├── proteinfp/                 ← CLI package (pip install proteinfp)
│   ├── cli.py                 ← proteinfp --uniprot X
│   ├── orchestrator.py        ← runs all modules gracefully
│   ├── therapy.py             ← therapy decision + epitope + de novo
│   └── deps.py                ← optional dependency checker
├── sim/                       ← SIM-01 to SIM-07 (whole-cell simulation)
├── grn/                       ← GRN-01 to GRN-03 (gene regulatory network)
├── utils/                     ← config loader, PDB parser
├── tests/                     ← test suite (pytest)
├── validation/                ← validation against known drug-protein pairs
├── train/                     ← ML model training scripts
├── models/                    ← EC classifier ensemble (metadata only in repo)
└── pyproject.toml             ← pip install configuration

Running tests

python -m pytest tests/ -v

Reproducibility

All outputs are deterministic given the same input. Every inference step saves a JSON to data/intermediate/ so individual modules can be re-run or inspected without rerunning the full pipeline.
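
One way to check this on your own machine is to compare the JSON written by two runs. The sketch below hashes a canonical serialization so that key order and whitespace do not matter. The function names are hypothetical, and the comparison assumes the files contain no run-specific fields such as timestamps, which the description above does not confirm.

```python
import hashlib
import json


def canonical_digest(obj):
    """SHA-256 of a JSON value serialized with sorted keys and fixed separators."""
    payload = json.dumps(obj, sort_keys=True, separators=(",", ":"),
                         ensure_ascii=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def same_output(path_a, path_b):
    """True when two JSON files hold equal content, ignoring key order and spacing."""
    with open(path_a, encoding="utf-8") as a, open(path_b, encoding="utf-8") as b:
        return canonical_digest(json.load(a)) == canonical_digest(json.load(b))
```

Copying data/intermediate/ aside before a forced re-run and then calling same_output on matching files shows which module, if any, produced a different result.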


Citation

If you use ProteinFP in your research, please cite this repository. A methods paper describing the pipeline is in preparation.


License

MIT
