ProteinFP
End-to-end protein function prediction and drug candidate design.
Give it a UniProt ID. Get back active sites, druggable pockets, allosteric sites, EC classification, GO terms, PPI partners, therapy modality recommendations, and, with AutoDock Vina, evolved drug candidate molecules. For any protein, any disease, any organism.
pip install proteinfp
proteinfp --uniprot P28593 # Trypanothione reductase (Chagas disease)
Protein : Trypanothione reductase
Gene : TPR
Organism : Trypanosoma cruzi
Confidence : VERY HIGH
Top function : Trypanothione is the parasite analog of glutathione
Enzyme : yes — EC 1.8.1.12
Pockets : 10 (all druggability > 0.90)
Therapy : SMALL_MOLECULE → active site inhibitor
What it does
ProteinFP runs 13+ prediction modules in sequence, fusing their outputs into a single ranked, confidence-weighted report.
| Module | What it predicts |
|---|---|
| 01 | AlphaFold structure + UniProt metadata |
| 02 | Surface charge, hydrophobicity, SASA |
| 03 | Catalytic residues and active site motifs |
| 04 | Druggable binding pockets (geometry + druggability score) |
| 05 | Allosteric sites (elastic network model) |
| 06 | Chemical environment of each site |
| 07 | Sequence homologs with known function (BLAST + InterPro) |
| 08 | ESM-2 protein language model embeddings (650M parameters) |
| 09 | GO term prediction (Molecular Function, Biological Process, Cellular Component) |
| 10 | Enzyme class prediction — ML ensemble (XGBoost + LightGBM + MLP, ~97% accuracy) |
| 11 | Structural analogs via Foldseek (finds same-fold proteins regardless of sequence) |
| 12 | Protein-protein interactions (STRING DB) |
| 13 | Consensus report — fuses all evidence into a ranked, confidence-scored output |
| 14 | Molecular dynamics — RMSF, flexibility, cryptic pockets (needs OpenMM) |
| 15 | De novo molecular design — evolutionary drug candidate generation (needs Vina + RDKit) |
| 17 | Post-translational modification sites and their functional consequences |
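
This README does not say how Module 13 combines evidence from the other modules. As an illustration only, one common way to fuse independent confidence scores is a noisy-OR, where an annotation is rejected only if every supporting source is wrong. The function name, the tuple format, and the demo scores below are hypothetical and are not taken from ProteinFP.

```python
from collections import defaultdict
from math import prod


def fuse_evidence(evidence):
    """Combine per-module confidences for each annotation with a noisy-OR.

    `evidence` is a list of (annotation, module, confidence) tuples with
    confidence in [0, 1]. This input format is hypothetical, not the
    structure ProteinFP uses internally.
    """
    by_annotation = defaultdict(list)
    for annotation, _module, confidence in evidence:
        if not 0.0 <= confidence <= 1.0:
            raise ValueError(f"confidence out of range: {confidence}")
        by_annotation[annotation].append(confidence)

    # Noisy-OR: the annotation is wrong only if every source is wrong.
    fused = {
        annotation: 1.0 - prod(1.0 - c for c in scores)
        for annotation, scores in by_annotation.items()
    }
    # Highest fused confidence first.
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Illustrative scores only; they are not real pipeline output.
    demo = [
        ("EC 1.8.1.12", "homology", 0.80),
        ("EC 1.8.1.12", "ml_ensemble", 0.90),
        ("oxidoreductase activity", "go_prediction", 0.70),
    ]
    for annotation, score in fuse_evidence(demo):
        print(f"{annotation}: {score:.2f}")
```

A noisy-OR assumes the sources are independent, which homology and structure-based predictors often are not, so a real consensus step would likely need weights or calibration on top of this.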
GRN + SIM pipeline (disease-aware mode — requires scRNA-seq data):
| Module | What it does |
|---|---|
| GRN-01 | scRNA-seq preprocessing — HVG selection, QC filtering |
| GRN-02 | GENIE3 gene regulatory network reconstruction |
| GRN-03 | Therapy modality decision — surface vs intracellular, ADC vs small molecule |
| SIM-01 | Tumor cell environment inference from marker gene expression |
| SIM-02 | Protein conformational ensemble in that environment |
| SIM-03 | Drug distribution across cell compartments |
| SIM-04 | Binding probability under real physiological conditions |
| SIM-05 | GRN perturbation — network-level consequences of drug binding |
| SIM-06 | Pharmacological scoring — efficacy, selectivity, resistance risk, grade A–F |
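
To make the idea of a binding probability (SIM-04) concrete, here is a minimal sketch based on standard single-site equilibrium binding: a binding free energy is converted to a dissociation constant, and the occupancy follows from the free drug concentration. This is not necessarily how SIM-04 is implemented; the function names are hypothetical, and a docking score is only a rough estimate of a binding free energy.

```python
import math

GAS_CONSTANT_KCAL = 1.987204e-3  # kcal / (mol K)


def kd_from_binding_energy(delta_g_kcal, temperature_k=310.15):
    """Dissociation constant in mol/L from a binding free energy in kcal/mol.

    Uses delta_G = R * T * ln(Kd) with a 1 mol/L standard state.
    """
    return math.exp(delta_g_kcal / (GAS_CONSTANT_KCAL * temperature_k))


def fraction_bound(free_drug_molar, kd_molar):
    """Equilibrium occupancy of a single site: [L] / ([L] + Kd)."""
    if free_drug_molar < 0 or kd_molar <= 0:
        raise ValueError("concentration must be >= 0 and Kd must be > 0")
    return free_drug_molar / (free_drug_molar + kd_molar)


if __name__ == "__main__":
    # Hypothetical inputs: a -9 kcal/mol score and 1 micromolar free drug.
    kd = kd_from_binding_energy(-9.0)
    print(f"Kd ~ {kd:.2e} M, occupancy = {fraction_bound(1e-6, kd):.2f}")
```

The default temperature of 310.15 K corresponds to 37 °C. Compartment-specific pH, competition with endogenous ligands, and protein abundance are ignored here.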
Installation
pip install proteinfp
The core pipeline (Modules 01–13 and 17) works out of the box. Optional features are installed as extras (quoted so that shells such as zsh do not expand the brackets):
pip install "proteinfp[ml]" # ESM-2 embeddings + ML EC classifier
pip install "proteinfp[structure]" # SASA/DSSP surface analysis
pip install "proteinfp[chem]" # De novo molecular design (RDKit)
pip install "proteinfp[grn]" # GRN/scRNA-seq modules (scanpy)
pip install "proteinfp[sim]" # Molecular dynamics (OpenMM)
pip install "proteinfp[all]" # Everything
For de novo design you also need AutoDock Vina.
Check what's available on your machine:
proteinfp --check-deps
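If you want a similar check from your own scripts, the sketch below probes for optional Python modules and for a Vina executable on the PATH. It is not the code behind `--check-deps`. The import names are assumptions based on the libraries named in this README, and the executable name `vina` is also an assumption.

```python
import importlib.util
import shutil

# Import names assumed for the optional extras; the real extras may pull in
# different or additional packages.
OPTIONAL_MODULES = ("rdkit", "openmm", "scanpy", "xgboost", "lightgbm")


def check_optional(modules=OPTIONAL_MODULES, executables=("vina",)):
    """Report which optional Python modules and executables are present."""
    status = {}
    for name in modules:
        # find_spec returns None when a top-level module cannot be located.
        status[name] = importlib.util.find_spec(name) is not None
    for exe in executables:
        # shutil.which returns None when the executable is not on the PATH.
        status[exe] = shutil.which(exe) is not None
    return status


if __name__ == "__main__":
    for name, available in check_optional().items():
        print(f"{name:10s} {'ok' if available else 'missing'}")
```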
Quick start
# Any protein, just a UniProt ID
proteinfp --uniprot P04637 # TP53 (human tumour suppressor)
proteinfp --uniprot P28593 # Trypanothione reductase (Chagas disease)
proteinfp --uniprot P9WGR1 # InhA (drug-resistant TB)
# Force re-run even if report already exists
proteinfp --uniprot P04637 --force
# With therapy decision + de novo molecule design
proteinfp --uniprot P28593 --therapy --denovo --vina /path/to/vina
# With molecular dynamics
proteinfp --uniprot P28593 --md
# Show all modules and their status
proteinfp --list-modules
Reports are saved to data/reports/{UNIPROT}_report.json and data/reports/{UNIPROT}_report.txt.
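Because the report is plain JSON, it can be read from Python for downstream analysis. The sketch below assumes only the file location stated above and that the file holds a JSON object. The report schema is not documented in this README, so the example lists the top-level keys instead of guessing field names. `load_report` is a hypothetical helper, not part of the package.

```python
import json
from pathlib import Path


def load_report(uniprot_id, reports_dir="data/reports"):
    """Load the JSON report for one UniProt ID from the reports directory."""
    path = Path(reports_dir) / f"{uniprot_id}_report.json"
    if not path.exists():
        raise FileNotFoundError(
            f"No report at {path}; run 'proteinfp --uniprot {uniprot_id}' first"
        )
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


if __name__ == "__main__":
    report = load_report("P28593")
    # The schema is not documented here, so only list what is present.
    for key in sorted(report):
        print(key)
```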
Therapy mode
After the core pipeline runs, --therapy makes modality decisions automatically:
- Surface protein → antibody path: ranks epitope candidates by immunogenicity and accessibility
- Intracellular with druggable pocket → small molecule path: triggers de novo design
- Epigenetic regulator → adds PROTAC degrader as secondary recommendation
- Allosteric site only → allosteric small molecule
proteinfp --uniprot P28593 --therapy --denovo --vina pipeline/vina.exe
→ Primary modality : SMALL_MOLECULE
→ Confidence : HIGH
• Intracellular with druggable pocket P1 (vol=1800ų, drug=0.90)
• Enzyme (EC 1.8.1.12) — active site inhibition most direct mechanism
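The decision rules listed above can be sketched as a small rule-based function. This is an illustration, not the contents of `therapy.py`: the `TargetProfile` fields, the order in which rules are checked, and every label other than `SMALL_MOLECULE` are assumptions.

```python
from dataclasses import dataclass


@dataclass
class TargetProfile:
    """Hypothetical summary of a target; field names are illustrative."""
    is_surface: bool
    has_druggable_pocket: bool
    has_allosteric_site: bool
    is_epigenetic_regulator: bool = False


def recommend_modalities(profile):
    """Return (primary, secondary) modality labels following the rules above."""
    if profile.is_surface:
        primary = "ANTIBODY"
    elif profile.has_druggable_pocket:
        primary = "SMALL_MOLECULE"
    elif profile.has_allosteric_site:
        primary = "ALLOSTERIC_SMALL_MOLECULE"
    else:
        primary = "UNDETERMINED"

    secondary = []
    if profile.is_epigenetic_regulator:
        # A degrader is added as a secondary option, never as a replacement.
        secondary.append("PROTAC_DEGRADER")
    return primary, secondary


if __name__ == "__main__":
    enzyme = TargetProfile(
        is_surface=False, has_druggable_pocket=True, has_allosteric_site=True
    )
    print(recommend_modalities(enzyme))
```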
Disease-agnostic design
The pipeline works on any protein from any organism. To switch disease context, edit one file:
# config/disease_config.yaml
disease:
  name: "TB"
  organism: "Mycobacterium tuberculosis"
  organism_id: 83332
data:
  scrnaseq_input: "data/grn/input/your_mtb_data.csv"
driver_genes:
  - katG # isoniazid target
  - inhA # isoniazid target
  - rpoB # rifampicin target
  - gyrA # fluoroquinolone target
Ready-to-use configs for LUAD (lung adenocarcinoma), CRC (colorectal cancer), TB (tuberculosis), and leishmaniasis are included in the file.
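Before launching a long run it can help to sanity-check an edited config. The sketch below validates a config that has already been parsed into a dictionary, for example with a YAML parser such as PyYAML's `yaml.safe_load`. It assumes that `disease`, `data`, and `driver_genes` are top-level keys, which this README does not confirm. `validate_disease_config` is a hypothetical helper, not part of ProteinFP.

```python
REQUIRED_DISEASE_KEYS = ("name", "organism", "organism_id")


def validate_disease_config(config):
    """Return a list of problems found in a parsed disease config dict.

    Assumes 'disease', 'data' and 'driver_genes' are top-level keys.
    """
    problems = []
    disease = config.get("disease") or {}
    for key in REQUIRED_DISEASE_KEYS:
        if key not in disease:
            problems.append(f"disease.{key} is missing")
    if "organism_id" in disease and not isinstance(disease["organism_id"], int):
        problems.append("disease.organism_id should be an integer taxonomy ID")

    data = config.get("data") or {}
    if not data.get("scrnaseq_input"):
        problems.append("data.scrnaseq_input is missing")

    genes = config.get("driver_genes") or []
    if not genes:
        problems.append("driver_genes is empty")
    elif len(set(genes)) != len(genes):
        problems.append("driver_genes contains duplicates")
    return problems


if __name__ == "__main__":
    example = {
        "disease": {
            "name": "TB",
            "organism": "Mycobacterium tuberculosis",
            "organism_id": 83332,
        },
        "data": {"scrnaseq_input": "data/grn/input/your_mtb_data.csv"},
        "driver_genes": ["katG", "inhA", "rpoB", "gyrA"],
    }
    print(validate_disease_config(example) or "config looks consistent")
```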
Project structure
proteinfp/
├── config/
│ ├── config.yaml ← paths, API thresholds, tool settings
│ └── disease_config.yaml ← switch disease/organism here
├── pipeline/ ← Modules 01–17
├── proteinfp/ ← CLI package (pip install proteinfp)
│ ├── cli.py ← proteinfp --uniprot X
│ ├── orchestrator.py ← runs all modules gracefully
│ ├── therapy.py ← therapy decision + epitope + de novo
│ └── deps.py ← optional dependency checker
├── sim/ ← SIM-01 to SIM-07 (whole-cell simulation)
├── grn/ ← GRN-01 to GRN-03 (gene regulatory network)
├── utils/ ← config loader, PDB parser
├── tests/ ← test suite (pytest)
├── validation/ ← validation against known drug-protein pairs
├── train/ ← ML model training scripts
├── models/ ← EC classifier ensemble (metadata only in repo)
└── pyproject.toml ← pip install configuration
Running tests
python -m pytest tests/ -v
Reproducibility
All outputs are deterministic given the same input. Every inference step saves a JSON to data/intermediate/ so individual modules can be re-run or inspected without rerunning the full pipeline.
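The save-then-reuse pattern described above can be sketched as follows. This is a generic illustration of caching a step's result as JSON and recomputing it only on request, similar in spirit to the `--force` flag. The file naming scheme and the `run_cached` helper are hypothetical and are not the orchestrator's actual code.

```python
import json
from pathlib import Path


def run_cached(step_name, compute, cache_dir="data/intermediate", force=False):
    """Return the cached JSON result for a step, computing it only if needed."""
    path = Path(cache_dir) / f"{step_name}.json"
    if path.exists() and not force:
        return json.loads(path.read_text(encoding="utf-8"))
    result = compute()
    path.parent.mkdir(parents=True, exist_ok=True)
    # sort_keys keeps the file byte-identical across runs with equal results.
    path.write_text(json.dumps(result, indent=2, sort_keys=True), encoding="utf-8")
    return result


if __name__ == "__main__":
    # Hypothetical step name and payload, for illustration only.
    pockets = run_cached("P28593_module04", lambda: {"pockets": 10})
    print(pockets)
```

Writing with sorted keys makes it easy to diff intermediate files between runs when checking that a module is reproducible.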
Citation
If you use ProteinFP in your research, please cite this repository. A methods paper describing the pipeline is in preparation.
License
MIT