
Comprehensive Python Module for Protein Data Management: Designed for streamlined integration and processing of protein information from both UniProt and PDB. Equipped with features for concurrent data fetching, robust error handling, and database synchronization.

Project description


Protein Information System (PIS)

Protein Information System (PIS) is an integrated biological information system focused on extracting, processing, and managing protein-related data. PIS consolidates data from UniProt, PDB, and GOA, enabling the efficient retrieval and organization of protein sequences, structures, and functional annotations.

The primary goal of PIS is to provide a robust framework for large-scale protein data extraction, facilitating downstream functional analysis and annotation transfer. The system is designed for high-performance computing (HPC) environments, ensuring scalability and efficiency.

📈 Current State of the Project

FANTASIA: Functional Annotation Toolkit

🧠 FANTASIA was built on top of the Protein Information System (PIS) as an advanced tool for functional protein annotation using embeddings generated by protein language models.

🔗 FANTASIA Repository

The pipeline supports high-performance computing (HPC) environments and integrates tools such as ProtT5, ESM, and CD-HIT. These models can be extended or replaced with new variants without modifying the core software: a new model only needs to be registered in PIS. This design enables scalable, modular, and reproducible GO term annotation from FASTA sequence files.

Protocol for Large-Scale Metamorphism and Multifunctionality Search

🔍 In addition, a systematic protocol has been developed for the large-scale identification of structural metamorphisms and protein multifunctionality.

🔗 Metamorphism and Multifunctionality Search Repository

This protocol leverages the full capabilities of PIS to uncover non-obvious relationships between structure and function. Structural metamorphisms are detected by filtering large-scale structural alignments between proteins with high sequence identity, identifying divergent conformations. Multifunctionality is addressed through a semantic analysis of GO annotations, computing a functional distance metric to determine the two most divergent terms within each GO category per protein.


📡 Installing the BioData Lookup Table (Two Options)

This guide shows two ways to load and use the BioData lookup table:

  1. Option A - Manually download the PostgreSQL backup from Zenodo and restore it yourself (no PIS required).
  2. Option B - Clone the Protein Information System (PIS) repository and let its helper script set everything up.

Both options end with the same result: a PostgreSQL database called BioData running with the pgvector extension enabled.


📚 Prerequisites

  • A machine with:
    - Docker installed and running.
    - At least ~25-30 GB of free disk space (the backup itself is large).
  • PostgreSQL client tools installed on your host:
    - psql, createdb, dropdb, pg_restore
    - Recommended: PostgreSQL 16+ client tools.
  • Credentials used in this guide:
    - PostgreSQL user: usuario
    - PostgreSQL password: clave
    - Database name: BioData

Adjust credentials if you use different ones.
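To avoid repeating `-h`/`-U` flags and password prompts, the credentials can be placed in the standard libpq environment variables, which psql, createdb, dropdb, and pg_restore all honor. A minimal sketch using this guide's example credentials:

```shell
# Export libpq environment variables so the PostgreSQL client tools
# pick up host, port, user, and password automatically.
export PGHOST=localhost
export PGPORT=5432
export PGUSER=usuario
export PGPASSWORD=clave

# The corresponding connection URL for the BioData database:
echo "postgresql://${PGUSER}@${PGHOST}:${PGPORT}/BioData"
```

With these set, the commands in the steps below can be run without their `-h localhost -U usuario` arguments, though the guide keeps them explicit for clarity.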


Option A - Manual Setup from Zenodo (without PIS)

1. Start the pgvector PostgreSQL container

docker run -d --name pgvectorsql \
    -e POSTGRES_USER=usuario \
    -e POSTGRES_PASSWORD=clave \
    -e POSTGRES_DB=BioData \
    -p 5432:5432 \
    pgvector/pgvector:pg16

This starts PostgreSQL with pgvector on localhost:5432.


2. Download the BioData backup from Zenodo

  1. Open the Zenodo record in your browser, for example:
     - Final-layer table: https://zenodo.org/records/17795871
     - Early+final layers table: https://zenodo.org/records/17793273
  2. In the Files section, locate the .backup file you want, e.g.:
     - BioData_Dec25_esm2_prott5_prostt5_ankh3_large_esm3c_Layer0.backup
     - BioData_Dec25_esm2_prott5_prostt5_ankh3_large_esm3c_Layers_3Frist_3Last.backup
  3. Click Download and save the file to a known location, for example:
     ~/biodata_backups/BioData_Dec25_esm2_prott5_prostt5_ankh3_large_esm3c_Layers_3Frist_3Last.backup

Do this via the browser to avoid Zenodo's cookie/redirect issues. The file should be multi-GB in size, not a few KB.
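A failed Zenodo download often leaves a few-KB HTML error page instead of the real backup. A quick size sanity check can catch this before you attempt a restore (the ~1 GB threshold below is an assumption; adjust it to the file you expect):

```shell
# Warn if the downloaded "backup" is suspiciously small (likely an HTML error page).
check_backup_size() {
    local file="$1"
    local min_bytes="${2:-1000000000}"   # assumed threshold: ~1 GB
    local size
    size=$(wc -c < "$file") || return 2
    if [ "$size" -lt "$min_bytes" ]; then
        echo "WARNING: $file is only $size bytes; re-download it from Zenodo in a browser."
        return 1
    fi
    echo "OK: $file is $size bytes."
}

# Example:
# check_backup_size ~/biodata_backups/BioData_Dec25_esm2_prott5_prostt5_ankh3_large_esm3c_Layers_3Frist_3Last.backup
```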


3. Drop and recreate the BioData database

On your host, using the PostgreSQL client tools (connecting to the Docker container):

export PGPASSWORD="clave"

# 1) Try to drop the database if it exists
dropdb -h localhost -U usuario BioData --if-exists

# 2) If there are still active connections, terminate them
psql -h localhost -U usuario -d postgres -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname = 'BioData' AND pid <> pg_backend_pid();"

dropdb -h localhost -U usuario BioData --if-exists

# 3) Final termination attempt (if needed) and drop
psql -h localhost -U usuario -d postgres \
    -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname = 'BioData';"

sleep 2
dropdb -h localhost -U usuario BioData --if-exists

# 4) Recreate BioData
createdb -h localhost -U usuario BioData
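The repeated terminate-then-drop attempts above follow a simple retry pattern. If you prefer, the same idea can be factored into a small helper (a sketch; the 2-second pause mirrors the steps above):

```shell
# Retry a command up to N times, pausing briefly between attempts.
retry() {
    local tries="$1"
    shift
    local i
    for i in $(seq 1 "$tries"); do
        "$@" && return 0
        sleep 2
    done
    return 1
}

# Example: keep retrying the drop while stray connections are terminated.
# retry 3 dropdb -h localhost -U usuario BioData --if-exists
```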

4. Enable pgvector extension

psql -h localhost -U usuario -d BioData \
    -c "CREATE EXTENSION IF NOT EXISTS vector;"

5. Restore the BioData backup

export PGPASSWORD="clave"

pg_restore -h localhost -U usuario \
    -d BioData \
    ~/biodata_backups/BioData_Dec25_esm2_prott5_prostt5_ankh3_large_esm3c_Layers_3Frist_3Last.backup

If restore succeeds, you now have the BioData database ready to use.


6. Connecting to BioData

  • Using psql:
PGPASSWORD="clave" psql -h localhost -U usuario -d BioData
  • Typical connection URL for applications:
postgresql://usuario:clave@localhost:5432/BioData

Use this string in your tools, notebook, or pipeline that needs to query the lookup table.
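If you build this URL programmatically, keep in mind that a real password containing special characters must be percent-encoded. A tiny helper that assembles the URL from its parts (plain concatenation only; encoding is left to the caller):

```shell
# Assemble a PostgreSQL connection URL from its components.
pg_url() {
    local user="$1" pass="$2" host="$3" port="$4" db="$5"
    printf 'postgresql://%s:%s@%s:%s/%s\n' "$user" "$pass" "$host" "$port" "$db"
}

pg_url usuario clave localhost 5432 BioData
# prints: postgresql://usuario:clave@localhost:5432/BioData
```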


Option B - Using the PIS Repository and Helper Script

If you also want the Protein Information System (PIS) and its automation around the database, use this method.

1. Clone the repository

cd /path/where/you/want/the/repo
git clone https://github.com/CBBIO/protein-information-system.git
cd protein-information-system

2. Set the Zenodo URL in pis_launcher_script.sh

At the top of pis_launcher_script.sh, set:

ZENODO_URL="https://zenodo.org/records/17793273/files/BioData_Dec25_esm2_prott5_prostt5_ankh3_large_esm3c_Layers_3Frist_3Last.backup?download=1"

(or the URL of the specific .backup file you want from the Files section).

The script will:

  • Derive the filename from this URL.
  • Download to the configured backup folder if it does not exist.
  • Reuse the local file on subsequent runs (no re-download).
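The filename derivation can be sketched in two steps: strip the ?download=1 query string, then take the last path component. (This is an illustration of the likely logic; the script's actual implementation may differ.)

```shell
ZENODO_URL="https://zenodo.org/records/17793273/files/BioData_Dec25_esm2_prott5_prostt5_ankh3_large_esm3c_Layers_3Frist_3Last.backup?download=1"

# Drop everything from the first '?', then keep the last path component.
FILE_NAME="$(basename "${ZENODO_URL%%\?*}")"
echo "$FILE_NAME"
# prints: BioData_Dec25_esm2_prott5_prostt5_ankh3_large_esm3c_Layers_3Frist_3Last.backup
```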

3. Run the self-check with rebase from Zenodo

From the repository root:

bash pis_launcher_script.sh --rebase-from-zenodo

This script will:

  1. Check that Docker is running.
  2. Ensure the pgvectorsql container (PostgreSQL + pgvector) and rabbitmq container exist and are running.
  3. Download the BioData backup from Zenodo (or reuse the existing file in the configured backup folder).
  4. Drop and recreate the BioData database on localhost:5432.
  5. Enable the vector extension.
  6. Run pg_restore from the downloaded backup.

If the size check fails (file looks too small), it will stop and tell you to correct ZENODO_URL or manually download the backup into the configured backup folder.

With --rebase-from-zenodo, the script focuses on the DB rebase and then exits, so you get a clean BioData database ready to use.


Script Options

Common flags for pis_launcher_script.sh:

  • --rebase-from-zenodo: Download (or reuse) the Zenodo backup and restore it.
  • --rebase-from-backup: Restore from a local backup file.
  • --zenodo-url=...: Override the Zenodo URL used for download.
  • --backup-folder=...: Folder where backups are stored/loaded.
  • --backup-file-name=...: Backup filename to use inside the backup folder.
  • --database-name=...: Target database name (default: BioData).
  • --check-services or --check-services-only: Only check Docker and container status without a restore.
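These flags compose. For example, restoring from a backup file you already have on disk might look like this (the folder and filename below are the guide's examples; adjust them to your setup):

```shell
# Build the argument list for a restore from a local backup file.
FLAGS=(
    --rebase-from-backup
    --backup-folder="$HOME/biodata_backups"
    --backup-file-name="BioData_Dec25_esm2_prott5_prostt5_ankh3_large_esm3c_Layers_3Frist_3Last.backup"
    --database-name=BioData
)

# Then run from the repository root:
# bash pis_launcher_script.sh "${FLAGS[@]}"
echo "${#FLAGS[@]} flags prepared"
```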

4. Use the database

After the script completes successfully:

  • Connect with psql as in Option A:
PGPASSWORD="clave" psql -h localhost -U usuario -d BioData
  • Or point your applications to:
postgresql://usuario:clave@localhost:5432/BioData

PIS itself can then use this database for its embedding and lookup workflows.




Get started:

To execute the full extraction process, install dependencies and run from project root:

pis

This command triggers the complete workflow, from the initial data preprocessing stages through to the final data organization and storage.

Customizing the Workflow:

You can customize the sequence of tasks executed by modifying main.py or adjusting the relevant parameters in the config.yaml file. This allows you to tailor the extraction process to meet specific research needs or to experiment with different data processing configurations.

Project details


Download files

Source Distribution

protein_information_system-3.1.1.tar.gz (55.9 kB, Source)

Built Distribution

protein_information_system-3.1.1-py3-none-any.whl (80.8 kB, Python 3)

File details

Details for the file protein_information_system-3.1.1.tar.gz.

File metadata

  • Size: 55.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.4.0 CPython/3.10.19 Linux/6.8.0-1044-azure

File hashes

Algorithm   Hash digest
SHA256      4ffc8df154eb4c0c5bec6a59178aaf03ba9a1c23b2db42085c13f64e0d49b2b8
MD5         1583ec6b0c8b627b8bbaa3a70aa39133
BLAKE2b-256 db41683bb1dd991d29a28b3e063ac25953052fbd8080981a8c0a05276ec4449b

File details

Details for the file protein_information_system-3.1.1-py3-none-any.whl.

File hashes

Algorithm   Hash digest
SHA256      1ba9aa6c7df36796944354c39e9b05e80fa26ea67daa9af0da508df12f969ae7
MD5         3c2c22290740558b594436015875fa3c
BLAKE2b-256 1b2412524baefd0a0a2a6809e891ede90d7217fac9cfd246f3c6a1f5e5935365
