A comprehensive Python module for protein data management, designed for streamlined integration and processing of protein information from UniProt and PDB, with support for concurrent data fetching, robust error handling, and database synchronization.
Project description
Protein Information System (PIS)
Protein Information System (PIS) is an integrated biological information system focused on extracting, processing, and managing protein-related data. PIS consolidates data from UniProt, PDB, and GOA, enabling the efficient retrieval and organization of protein sequences, structures, and functional annotations.
The primary goal of PIS is to provide a robust framework for large-scale protein data extraction, facilitating downstream functional analysis and annotation transfer. The system is designed for high-performance computing (HPC) environments, ensuring scalability and efficiency.
📈 Current State of the Project
FANTASIA: Functional Annotation Toolkit
🧠 FANTASIA was built on top of the Protein Information System (PIS) as an advanced tool for functional protein annotation using embeddings generated by protein language models.
The pipeline supports high-performance computing (HPC) environments and integrates tools such as ProtT5, ESM, and CD-HIT. These models can be extended or replaced with new variants without modifying the core software structure, simply by adding the new model to the PIS. This design enables scalable, modular, and reproducible GO term annotation from FASTA sequence files.
Protocol for Large-Scale Metamorphism and Multifunctionality Search
🔍 In addition, a systematic protocol has been developed for the large-scale identification of structural metamorphisms and protein multifunctionality.
🔗 Metamorphic and Multifunctionality Search Repository
This protocol leverages the full capabilities of PIS to uncover non-obvious relationships between structure and function. Structural metamorphisms are detected by filtering large-scale structural alignments between proteins with high sequence identity, identifying divergent conformations. Multifunctionality is addressed through a semantic analysis of GO annotations, computing a functional distance metric to determine the two most divergent terms within each GO category per protein.
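The most-divergent-pair step described above can be sketched as follows. This is a minimal illustration, not PIS code: `most_divergent_pair` is a hypothetical helper, and the `toy` distance below is a stand-in for a real semantic distance over the GO graph.

```python
from itertools import combinations

def most_divergent_pair(terms, distance):
    """Return the two GO terms with the largest pairwise distance.

    `terms` is a list of GO identifiers annotated to one protein within a
    single GO category; `distance` is any pairwise semantic distance function.
    """
    return max(combinations(terms, 2), key=lambda pair: distance(*pair))

# Toy stand-in distance: absolute difference of the numeric IDs.
# PIS computes a real semantic distance between GO terms instead.
toy = lambda a, b: abs(int(a.split(":")[1]) - int(b.split(":")[1]))

print(most_divergent_pair(["GO:0003674", "GO:0005215", "GO:0016209"], toy))
```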
📡 Installing the BioData Lookup Table (Two Options)
This guide shows two ways to load and use the BioData lookup table:
- Option A - Manually download the PostgreSQL backup from Zenodo and restore it yourself (no PIS required).
- Option B - Clone the Protein Information System (PIS) repository and let its helper script set everything up.
Both options end with the same result: a PostgreSQL database called BioData running with the pgvector extension enabled.
📚 Prerequisites
- A machine with:
  - Docker installed and running.
  - At least ~25-30 GB of free disk space (the backup itself is large).
- PostgreSQL client tools installed on your host:
  - `psql`, `createdb`, `dropdb`, `pg_restore`
  - Recommended: PostgreSQL 16+ client tools.
- Credentials used in this guide:
  - PostgreSQL user: `usuario`
  - PostgreSQL password: `clave`
  - Database name: `BioData`

Adjust credentials if you use different ones.
Option A - Manual Setup from Zenodo (without PIS)
1. Start the pgvector PostgreSQL container
docker run -d --name pgvectorsql \
-e POSTGRES_USER=usuario \
-e POSTGRES_PASSWORD=clave \
-e POSTGRES_DB=BioData \
-p 5432:5432 \
pgvector/pgvector:pg16
This starts PostgreSQL with pgvector on localhost:5432.
2. Download the BioData backup from Zenodo
- Open the Zenodo record in your browser, for example:
  - Final-layer table: https://zenodo.org/records/17795871
  - Early+final layers table: https://zenodo.org/records/17793273
- In the Files section, locate the `.backup` file you want, e.g.:
  - `BioData_Dec25_esm2_prott5_prostt5_ankh3_large_esm3c_Layer0.backup`
  - `BioData_Dec25_esm2_prott5_prostt5_ankh3_large_esm3c_Layers_3Frist_3Last.backup`
- Click Download and save the file to a known location, for example:
  `~/biodata_backups/BioData_Dec25_esm2_prott5_prostt5_ankh3_large_esm3c_Layers_3Frist_3Last.backup`
Do this via the browser to avoid Zenodo's cookie/redirect issues. The file should be multi-GB in size, not a few KB.
3. Drop and recreate the BioData database
On your host, using the PostgreSQL client tools (connecting to the Docker container):
export PGPASSWORD="clave"
# 1) Terminate any active connections to BioData
psql -h localhost -U usuario -d postgres \
  -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname = 'BioData' AND pid <> pg_backend_pid();"
# 2) Drop the database if it exists
dropdb -h localhost -U usuario --if-exists BioData
# 3) Recreate BioData
createdb -h localhost -U usuario BioData
4. Enable pgvector extension
psql -h localhost -U usuario -d BioData \
-c "CREATE EXTENSION IF NOT EXISTS vector;"
5. Restore the BioData backup
export PGPASSWORD="clave"
pg_restore -h localhost -U usuario \
-d BioData \
~/biodata_backups/BioData_Dec25_esm2_prott5_prostt5_ankh3_large_esm3c_Layers_3Frist_3Last.backup
If restore succeeds, you now have the BioData database ready to use.
6. Connecting to BioData
- Using `psql`:
PGPASSWORD="clave" psql -h localhost -U usuario -d BioData
- Typical connection URL for applications:
postgresql://usuario:clave@localhost:5432/BioData
Use this string in your tools, notebook, or pipeline that needs to query the lookup table.
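Since the pieces of this URL are the same credentials used throughout the guide, it can be parsed or verified programmatically with the standard library:

```python
from urllib.parse import urlparse

url = "postgresql://usuario:clave@localhost:5432/BioData"
parts = urlparse(url)

print(parts.username)          # usuario
print(parts.password)          # clave
print(parts.hostname)          # localhost
print(parts.port)              # 5432
print(parts.path.lstrip("/"))  # BioData
```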
Option B - Using the PIS Repository and Helper Script
If you also want the Protein Information System (PIS) and its automation around the database, use this method.
1. Clone the repository
cd /path/where/you/want/the/repo
git clone https://github.com/CBBIO/protein-information-system.git
cd protein-information-system
2. Set the Zenodo URL in pis_launcher_script.sh
At the top of pis_launcher_script.sh, set:
ZENODO_URL="https://zenodo.org/records/17793273/files/BioData_Dec25_esm2_prott5_prostt5_ankh3_large_esm3c_Layers_3Frist_3Last.backup?download=1"
(or the URL of the specific .backup you want from the Files section.)
The script will:
- Derive the filename from this URL.
- Download to the configured backup folder if it does not exist.
- Reuse the local file on subsequent runs (no re-download).
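The filename derivation described above can be sketched roughly as follows; this is an illustration of the behavior, not the script's actual code, and the function name is hypothetical:

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

def filename_from_zenodo_url(url: str) -> str:
    """Derive the backup filename from a Zenodo file URL.

    Drops the query string (e.g. ?download=1) and keeps the last path segment.
    """
    return PurePosixPath(urlparse(url).path).name

url = ("https://zenodo.org/records/17793273/files/"
       "BioData_Dec25_esm2_prott5_prostt5_ankh3_large_esm3c_Layers_3Frist_3Last.backup"
       "?download=1")
print(filename_from_zenodo_url(url))
```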
3. Run the self-check with rebase from Zenodo
From the repository root:
bash pis_launcher_script.sh --rebase-from-zenodo
This script will:
- Check that Docker is running.
- Ensure the `pgvectorsql` container (PostgreSQL + pgvector) and the `rabbitmq` container exist and are running.
- Download the BioData backup from Zenodo (or reuse the existing file in the configured backup folder).
- Drop and recreate the `BioData` database on `localhost:5432`.
- Enable the `vector` extension.
- Run `pg_restore` from the downloaded backup.
If the size check fails (file looks too small), it will stop and tell you to correct ZENODO_URL or manually download the backup into the configured backup folder.
With `--rebase-from-zenodo`, the script focuses on the DB rebase and then exits, so you get a clean BioData database ready to use.
Script Options
Common flags for pis_launcher_script.sh:
- `--rebase-from-zenodo`: Download (or reuse) the Zenodo backup and restore it.
- `--rebase-from-backup`: Restore from a local backup file.
- `--zenodo-url=...`: Override the Zenodo URL used for download.
- `--backup-folder=...`: Folder where backups are stored/loaded.
- `--backup-file-name=...`: Backup filename to use inside the backup folder.
- `--database-name=...`: Target database name (default: `BioData`).
- `--check-services` or `--check-services-only`: Only check Docker and container status without a restore.
4. Use the database
After the script completes successfully:
- Connect with `psql` as in Option A:
PGPASSWORD="clave" psql -h localhost -U usuario -d BioData
- Or point your applications to:
postgresql://usuario:clave@localhost:5432/BioData
PIS itself can then use this database for its embedding and lookup workflows.
Get started:
To execute the full extraction process, install dependencies and run from project root:
pis
This command will trigger the complete workflow, starting from the initial data preprocessing stages and continuing through to the final data organization and storage.
Customizing the Workflow:
You can customize the sequence of tasks executed by modifying main.py or adjusting the relevant parameters in the config.yaml file. This allows you to tailor the extraction process to meet specific research needs or to experiment with different data processing configurations.
File details
Details for the file protein_information_system-3.1.2.tar.gz.
File metadata
- Download URL: protein_information_system-3.1.2.tar.gz
- Upload date:
- Size: 55.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.4.0 CPython/3.10.19 Linux/6.8.0-1044-azure
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `104247fdf8f78c5e84942114726b75d6cc94bf873f910d727a963e253c6320de` |
| MD5 | `dc5bbbf3eabc4cb87cb7ccfb4afbf48d` |
| BLAKE2b-256 | `d9de12af67605ee1142d0c75144849dc51a382b622cb4bcdba6fee1d50f9716d` |
File details
Details for the file protein_information_system-3.1.2-py3-none-any.whl.
File metadata
- Download URL: protein_information_system-3.1.2-py3-none-any.whl
- Upload date:
- Size: 80.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.4.0 CPython/3.10.19 Linux/6.8.0-1044-azure
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `d86fd61a1b59ad5eaeda1da7c4cc86c6b0800af2c59ee690d5745998c3314683` |
| MD5 | `ea1541adcce8a13f2d7424510fd97bd7` |
| BLAKE2b-256 | `3434adcbb4f7ab230dd9da1caabeb38f6576021f18d5f85952f58c49d2629ab5` |