Deduce the protein from an EM density

protein-detective

Python package to detect proteins in EM density maps.

It uses

  • protein-quest to search, retrieve and filter protein structures from Uniprot, PDBe and AlphaFold DB.
  • powerfit to fit protein structures into an electron microscopy (EM) density map.

An example workflow:

graph LR;
    search{Search UniprotKB} --> |uniprot_accessions|fetchpdbe{Retrieve PDBe}
    search{Search UniprotKB} --> |uniprot_accessions|fetchad{Retrieve AlphaFold}
    fetchpdbe -->|mmcif_files| filter{Filter structures}
    fetchad -->|mmcif_files| filter
    filter -->|mmcif_files| powerfit
    powerfit -->|*/solutions.out| solutions{Best scoring solutions}
    solutions -->|dataframe| fitmodels{Fit models}

Install

pip install protein-detective

Or to use the latest development version:

pip install git+https://github.com/haddocking/protein-detective.git

By default OpenCL support is included, but if you want to use CUDA, you can install with:

# For CUDA version 13
pip install "protein-detective[cuda13]"
# or for CUDA version 12
pip install "protein-detective[cuda12]"

Usage

The main entry point is the protein-detective command line tool which has multiple subcommands to perform actions.

To use it programmatically, see the notebooks and API documentation.

Search Uniprot for structures

protein-detective search \
    --taxon-id 9606 \
    --reviewed \
    --subcellular-location-uniprot nucleus \
    --subcellular-location-go GO:0005634 \
    --molecular-function-go GO:0003677 \
    --limit 100 \
    ./mysession

(GO:0005634 is "Nucleus" and GO:0003677 is "DNA binding")

In the ./mysession directory, you will find a session.db file, which is a DuckDB database containing the search results.
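
Because session.db is an ordinary DuckDB file, it can be inspected from Python. The following is a minimal sketch, assuming the duckdb Python package is installed. The table names inside the session are not documented here, so the code lists them instead of assuming any. The helper names session_db_path and list_tables are hypothetical and not part of protein-detective.

```python
from pathlib import Path


def session_db_path(session_dir):
    """Return the path of the DuckDB file inside a session directory."""
    return Path(session_dir) / "session.db"


def list_tables(session_dir):
    """List the tables stored in session.db without assuming their names."""
    import duckdb  # third-party; imported lazily so the path helper works without it

    # Open read-only so inspecting the session cannot modify the search results.
    con = duckdb.connect(str(session_db_path(session_dir)), read_only=True)
    try:
        return [row[0] for row in con.execute("SHOW TABLES").fetchall()]
    finally:
        con.close()


# Usage (requires an existing session):
#     print(list_tables("./mysession"))
```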

You can also include interaction partners in the search:

protein-detective --log-level INFO search \
    --taxon-id 9606 \
    --reviewed \
    --subcellular-location-uniprot nucleus \
    --subcellular-location-go GO:0005634 \
    --molecular-function-go GO:0003677 \
    --interaction-partner-seed A8MT69 \
    --interaction-partner-exclude B1APH4 \
    --limit 100 \
    ./mysession2

This adds Q96H22, which is an interaction partner of A8MT69 in a macromolecular complex.

To retrieve a bunch of structures

protein-detective retrieve ./mysession

In the ./mysession directory, you will find the structure files retrieved from PDBe (mmCIF) and AlphaFold DB.

To filter structures

The filter step does the following:

  • For PDBe structures, writes the chain of the UniProt protein as chain A.
  • For AlphaFold structures, filters by confidence (pLDDT) threshold.
  • Filters on the number of residues in chain A.
    • For AlphaFold structures, writes new files with low-confidence residues (below the threshold) removed.
  • Filters on the number of residues in secondary structure (helices and sheets).

Also uncompresses *.cif.gz files to *.cif files for compatibility with powerfit.

protein-detective --log-level INFO filter \
    --confidence-threshold 50 \
    --min-residues 100 \
    --max-residues 1000 \
    ./mysession

# or to filter only on secondary structure having some helices
protein-detective filter mysession --abs-min-helix-residues 40
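
To make the confidence and size filters concrete, here is a minimal sketch of the idea. It is not the implementation used by protein-detective. It relies on the fact that AlphaFold models store the per-residue pLDDT in the B-factor column, and for brevity it works on PDB-format text rather than mmCIF. All function names are hypothetical, and atom_line only exists to build a small example.

```python
def atom_line(serial, resname, chain, resseq, plddt):
    """Build a fixed-width PDB ATOM record for a C-alpha atom (demo helper)."""
    return (
        f"ATOM  {serial:5d}  CA  {resname:>3} {chain}{resseq:4d}    "
        f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{plddt:6.2f}"
    )


def filter_by_plddt(pdb_text, threshold):
    """Drop ATOM records whose B-factor (pLDDT) is below the threshold.

    Returns the filtered text and the number of residues that remain.
    """
    kept, residues = [], set()
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            if float(line[60:66]) < threshold:  # columns 61-66 hold the B-factor
                continue
            # Chain ID plus residue number and insertion code identify a residue.
            residues.add((line[21], line[22:27]))
        kept.append(line)
    return "\n".join(kept), len(residues)


def passes_size_filter(n_residues, min_residues, max_residues):
    """Mirror the --min-residues / --max-residues bounds (inclusive here by assumption)."""
    return min_residues <= n_residues <= max_residues


example = "\n".join([
    atom_line(1, "MET", "A", 1, 90.0),
    atom_line(2, "GLY", "A", 2, 40.0),  # below the threshold, will be removed
    atom_line(3, "SER", "A", 3, 70.0),
])
filtered_text, n_residues = filter_by_plddt(example, threshold=50)
print(n_residues)
```

With a threshold of 50, the glycine with pLDDT 40 is dropped and two residues remain, which would then be compared against the residue-count bounds.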

Import filtered structures

If you have a directory of structures (PDB/mmCIF files, optionally gzipped), each with a single chain called A and a single UniProt accession, you can import them into a new protein-detective session with:

protein-detective import-structures ./mysession/filtered ./mysession3

Imported structures can be used to run powerfit.

Powerfit

Use powerfit to rotate and translate the prepared structures into the EM density map and score each fit.

protein-detective powerfit run ../powerfit-tutorial/ribosome-KsgA.map 13 ./mysession

This will use dask-distributed to run powerfit for each structure in parallel on multiple CPU cores or GPUs.

Run powerfits on Slurm

You can use dask-jobqueue to run the powerfits on a Slurm deployment on multiple machines on a shared filesystem.

In one terminal, install dask-jobqueue, start a Python interpreter and create the Dask cluster with:

pip install dask-jobqueue
python3
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(cores=8,
                       processes=4,
                       memory="16GB",
                       queue="normal")
print(cluster.scheduler_address)
# Prints something like: 'tcp://192.168.1.1:34059'
# Keep this Python process running until powerfits are done

In a second terminal, run the powerfits on the Dask cluster with:

protein-detective powerfit run ../powerfit-tutorial/ribosome-KsgA.map 13 docs/session1 --scheduler-address tcp://192.168.1.1:34059

How to run efficiently

Powerfit is quickest on GPU, but can also run on CPU.

To run powerfits on a GPU, use the --gpu <workers_per_gpu> option. The value of workers_per_gpu should be high enough that the GPU is fully utilized. You can start with 1 (the default) and monitor the GPU usage with nvtop. If the GPU is not 100% loaded, increase the number until there are no more valleys in the GPU usage graph.

If you have multiple GPUs, then --gpu 2 will run powerfits on all GPUs and run 2 powerfits concurrently on each GPU.

If you do not use the --gpu flag, powerfit will run on the CPU. By default each powerfit uses 1 CPU core, and multiple powerfits run in parallel according to the number of physical CPU cores available on the machine (so excluding hyperthreaded cores).

You can set --nproc <int> so that each powerfit uses that many CPU cores. This is useful if you have more CPU cores available than there are structures to fit. If the number of structures to fit is greater than the number of available CPU cores, the default (1 core per powerfit) is recommended.
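
The guidance above can be written down as a small rule of thumb. This is only a sketch: the function suggest_nproc is hypothetical, and dividing the cores evenly over the structures is an assumption made here, not something the tool does for you. The number of physical cores has to be supplied by the caller, for example from psutil.cpu_count(logical=False).

```python
def suggest_nproc(n_structures, physical_cores):
    """Suggest a value for --nproc from the number of structures and physical cores."""
    if n_structures <= 0 or physical_cores <= 0:
        raise ValueError("both arguments must be positive")
    if n_structures >= physical_cores:
        # Enough structures to keep every core busy: keep the default of 1.
        return 1
    # Fewer structures than cores: share the cores evenly (assumption).
    return physical_cores // n_structures
```

For example, 100 structures on an 8-core machine give 1, while 2 structures on the same machine give 4.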

In testing on high-end NVIDIA GPUs, the OpenCL backend was faster than the CUDA backend, so OpenCL is the default. To use CUDA instead, set --gpu-backend cuda and make sure you installed protein-detective with the appropriate CUDA extra.

For example

protein-detective powerfit run --gpu 1 --batch-size 50 --gpu-backend cuda ../powerfit-tutorial/ribosome-KsgA.map 13 ./mysession

Alternatively run powerfit yourself

You can use the protein-detective powerfit commands subcommand to print the powerfit commands instead of running them.

The commands can then be run in whatever way you prefer, like sequentially, with GNU parallel, or as a Slurm array job.

For example, to run with GNU parallel and 4 slots:

protein-detective powerfit commands ../powerfit-tutorial/ribosome-KsgA.map 13 docs/session1 > commands.txt
parallel --jobs 4 < commands.txt
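
If GNU parallel is not available, the same "at most N at a time" behaviour can be approximated with the Python standard library. This is a minimal sketch, assuming commands.txt holds one shell command per line as produced above. The helper names are hypothetical, and each line is passed to the shell unchanged, so only use it on a file you generated yourself.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def read_commands(path):
    """Read one command per non-empty line, as in commands.txt."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def run_commands(commands, jobs=4):
    """Run shell command lines with at most `jobs` running at once.

    Returns the exit codes in the same order as the input commands.
    """
    def run_one(command):
        # shell=True mimics how GNU parallel hands each line to a shell.
        return subprocess.run(command, shell=True, check=False).returncode

    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return list(pool.map(run_one, commands))


# Usage:
#     exit_codes = run_commands(read_commands("commands.txt"), jobs=4)
```

Threads are sufficient here because the work happens in the child processes; the Python side only waits for them.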

To print the top 10 solutions to the terminal, you can use:

protein-detective powerfit report docs/session1

Outputs something like:

powerfit_run_id,structure,rank,cc,fishz,relz,translation,rotation,pdb_id,pdb_file,uniprot_acc
10,A8MT69_pdb4e45.ent_B2A,1,0.432,0.463,10.091,227.18:242.53:211.83,0.0:1.0:1.0:0.0:0.0:1.0:1.0:0.0:0.0,4E45,docs/session1/single_chain/A8MT69_pdb4e45.ent_B2A.pdb,A8MT69
10,A8MT69_pdb4ne5.ent_B2A,1,0.423,0.452,10.053,227.18:242.53:214.9,0.0:-0.0:-0.0:-0.604:0.797:0.0:0.797:0.604:0.0,4NE5,docs/session1/single_chain/A8MT69_pdb4ne5.ent_B2A.pdb,A8MT69
...
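
The report is plain CSV, so it can be post-processed with the standard library. The sketch below parses the two sample rows shown above and picks the solution with the highest cc. Interpreting the colon-separated fields as an x:y:z translation and a 3x3 rotation matrix in row-major order is an assumption made for this example; the function name parse_report is hypothetical.

```python
import csv
import io

SAMPLE = """powerfit_run_id,structure,rank,cc,fishz,relz,translation,rotation,pdb_id,pdb_file,uniprot_acc
10,A8MT69_pdb4e45.ent_B2A,1,0.432,0.463,10.091,227.18:242.53:211.83,0.0:1.0:1.0:0.0:0.0:1.0:1.0:0.0:0.0,4E45,docs/session1/single_chain/A8MT69_pdb4e45.ent_B2A.pdb,A8MT69
10,A8MT69_pdb4ne5.ent_B2A,1,0.423,0.452,10.053,227.18:242.53:214.9,0.0:-0.0:-0.0:-0.604:0.797:0.0:0.797:0.604:0.0,4NE5,docs/session1/single_chain/A8MT69_pdb4ne5.ent_B2A.pdb,A8MT69
"""


def parse_report(text):
    """Parse report CSV text into a list of dictionaries with numeric fields."""
    solutions = []
    for row in csv.DictReader(io.StringIO(text)):
        values = [float(v) for v in row["rotation"].split(":")]
        solutions.append({
            "structure": row["structure"],
            "uniprot_acc": row["uniprot_acc"],
            "rank": int(row["rank"]),
            "cc": float(row["cc"]),
            "translation": tuple(float(v) for v in row["translation"].split(":")),
            # Assumption: nine values form a 3x3 matrix in row-major order.
            "rotation": [values[i:i + 3] for i in (0, 3, 6)],
        })
    return solutions


solutions = parse_report(SAMPLE)
best = max(solutions, key=lambda s: s["cc"])
print(best["structure"], best["cc"])
```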

To generate model PDB files rotated/translated to PowerFit solutions, you can use:

protein-detective powerfit fit-models docs/session1
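
The subcommands above can be chained from a script. The following is a sketch under stated assumptions, not an official driver: it only uses subcommands and flags that appear in this README, it assumes the search flags can be used in this reduced combination, and it treats the positional value after the map (13 in the examples) as the map resolution, as in PowerFit's own command line. The function names are hypothetical.

```python
import subprocess


def build_pipeline(session_dir, density_map, resolution):
    """Return the CLI invocations of the walkthrough as argument lists."""
    cli = "protein-detective"
    return [
        [cli, "search", "--taxon-id", "9606", "--reviewed", "--limit", "100", session_dir],
        [cli, "retrieve", session_dir],
        [cli, "filter", "--confidence-threshold", "50",
         "--min-residues", "100", "--max-residues", "1000", session_dir],
        [cli, "powerfit", "run", density_map, str(resolution), session_dir],
        [cli, "powerfit", "report", session_dir],
        [cli, "powerfit", "fit-models", session_dir],
    ]


def run_pipeline(commands):
    """Run each step in order and stop at the first failing step."""
    for argv in commands:
        subprocess.run(argv, check=True)


# Usage:
#     run_pipeline(build_pipeline("./mysession", "../powerfit-tutorial/ribosome-KsgA.map", 13))
```

Building the argument lists separately from running them makes it easy to print or review the steps before anything is executed.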

Contributing

For development information and contribution guidelines, please see CONTRIBUTING.md.
