
Comprehensive Python Module for Protein Data Management: Designed for streamlined integration and processing of protein information from both UniProt and PDB. Equipped with features for concurrent data fetching, robust error handling, and database synchronization.

Project description


Protein Information System (PIS)

Protein Information System (PIS) is an integrated biological information system focused on extracting, processing, and managing protein-related data. PIS consolidates data from UniProt, PDB, and GOA, enabling the efficient retrieval and organization of protein sequences, structures, and functional annotations.

The primary goal of PIS is to provide a robust framework for large-scale protein data extraction, facilitating downstream functional analysis and annotation transfer. The system is designed for high-performance computing (HPC) environments, ensuring scalability and efficiency.

📈 Current State of the Project

FANTASIA: Functional Annotation Toolkit

🧠 FANTASIA was built on top of the Protein Information System (PIS) as an advanced tool for functional protein annotation using embeddings generated by protein language models.

🔗 FANTASIA Repository

The pipeline supports high-performance computing (HPC) environments and integrates tools such as ProtT5, ESM, and CD-HIT. These models can be extended or replaced with new variants without modifying the core software: a new model only needs to be registered in PIS. This design enables scalable, modular, and reproducible GO term annotation from FASTA sequence files.

Protocol for Large-Scale Metamorphism and Multifunctionality Search

🔍 In addition, a systematic protocol has been developed for the large-scale identification of structural metamorphisms and protein multifunctionality.

🔗 Metamorphism and Multifunctionality Search Repository

This protocol leverages the full capabilities of PIS to uncover non-obvious relationships between structure and function. Structural metamorphisms are detected by filtering large-scale structural alignments between proteins with high sequence identity, identifying divergent conformations. Multifunctionality is addressed through a semantic analysis of GO annotations, computing a functional distance metric to determine the two most divergent terms within each GO category per protein.


📡 Installing the BioData Lookup Table (Two Options)

This guide shows two ways to load and use the BioData lookup table:

  1. Option A - Manually download the PostgreSQL backup from Zenodo and restore it yourself (no PIS required).
  2. Option B - Clone the Protein Information System (PIS) repository and let its helper script set everything up.

Both options end with the same result: a PostgreSQL database called BioData running with the pgvector extension enabled.


📚 Prerequisites

  • A machine with:
    - Docker installed and running.
    - At least ~25-30 GB of free disk space (the backup itself is large).
  • PostgreSQL client tools installed on your host:
    - psql, createdb, dropdb, pg_restore
    - Recommended: PostgreSQL 16+ client tools.
  • Credentials used in this guide:
    - PostgreSQL user: usuario
    - PostgreSQL password: clave
    - Database name: BioData

Adjust credentials if you use different ones.
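To avoid repeating `-h`/`-U` flags and password prompts, the credentials can be placed in the standard libpq environment variables, which psql, createdb, dropdb, and pg_restore all honor. A minimal sketch using this guide's example credentials:

```shell
# Export libpq environment variables so the PostgreSQL client tools
# pick up host, port, user, and password automatically.
export PGHOST=localhost
export PGPORT=5432
export PGUSER=usuario
export PGPASSWORD=clave

# The corresponding connection URL for the BioData database:
echo "postgresql://${PGUSER}@${PGHOST}:${PGPORT}/BioData"
```

With these set, the commands in the steps below can be run without their `-h localhost -U usuario` arguments, though the guide keeps them explicit for clarity.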


Option A - Manual Setup from Zenodo (without PIS)

1. Start the pgvector PostgreSQL container

docker run -d --name pgvectorsql \
    -e POSTGRES_USER=usuario \
    -e POSTGRES_PASSWORD=clave \
    -e POSTGRES_DB=BioData \
    -p 5432:5432 \
    pgvector/pgvector:pg16

This starts PostgreSQL with pgvector on localhost:5432.


2. Download the BioData backup from Zenodo

  1. Open the Zenodo record in your browser, for example:
     - Final-layer table: https://zenodo.org/records/17795871
     - Early+final layers table: https://zenodo.org/records/17793273
  2. In the Files section, locate the .backup file you want, e.g.:
     - BioData_Dec25_esm2_prott5_prostt5_ankh3_large_esm3c_Layer0.backup
     - BioData_Dec25_esm2_prott5_prostt5_ankh3_large_esm3c_Layers_3Frist_3Last.backup
  3. Click Download and save the file to a known location, for example:
     ~/biodata_backups/BioData_Dec25_esm2_prott5_prostt5_ankh3_large_esm3c_Layers_3Frist_3Last.backup

Do this via the browser to avoid Zenodo's cookie/redirect issues. The file should be multi-GB in size, not a few KB.
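A failed Zenodo download often leaves a few-KB HTML error page instead of the real backup. A quick size sanity check can catch this before you attempt a restore (the ~1 GB threshold below is an assumption; adjust it to the file you expect):

```shell
# Warn if the downloaded "backup" is suspiciously small (likely an HTML error page).
check_backup_size() {
    local file="$1"
    local min_bytes="${2:-1000000000}"   # assumed threshold: ~1 GB
    local size
    size=$(wc -c < "$file") || return 2
    if [ "$size" -lt "$min_bytes" ]; then
        echo "WARNING: $file is only $size bytes; re-download it from Zenodo in a browser."
        return 1
    fi
    echo "OK: $file is $size bytes."
}

# Example:
# check_backup_size ~/biodata_backups/BioData_Dec25_esm2_prott5_prostt5_ankh3_large_esm3c_Layers_3Frist_3Last.backup
```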


3. Drop and recreate the BioData database

On your host, using the PostgreSQL client tools (connecting to the Docker container):

export PGPASSWORD="clave"

# 1) Try to drop the database if it exists
dropdb -h localhost -U usuario BioData --if-exists

# 2) If there are still active connections, terminate them
psql -h localhost -U usuario -d postgres -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname = 'BioData' AND pid <> pg_backend_pid();"

dropdb -h localhost -U usuario BioData --if-exists

# 3) Final termination attempt (if needed) and drop
psql -h localhost -U usuario -d postgres \
    -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname = 'BioData';"

sleep 2
dropdb -h localhost -U usuario BioData --if-exists

# 4) Recreate BioData
createdb -h localhost -U usuario BioData
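The repeated terminate-then-drop attempts above follow a simple retry pattern. If you prefer, the same idea can be factored into a small helper (a sketch; the 2-second pause mirrors the steps above):

```shell
# Retry a command up to N times, pausing briefly between attempts.
retry() {
    local tries="$1"
    shift
    local i
    for i in $(seq 1 "$tries"); do
        "$@" && return 0
        sleep 2
    done
    return 1
}

# Example: keep retrying the drop while stray connections are terminated.
# retry 3 dropdb -h localhost -U usuario BioData --if-exists
```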

4. Enable pgvector extension

psql -h localhost -U usuario -d BioData \
    -c "CREATE EXTENSION IF NOT EXISTS vector;"

5. Restore the BioData backup

export PGPASSWORD="clave"

pg_restore -h localhost -U usuario \
    -d BioData \
    ~/biodata_backups/BioData_Dec25_esm2_prott5_prostt5_ankh3_large_esm3c_Layers_3Frist_3Last.backup

If restore succeeds, you now have the BioData database ready to use.


6. Connecting to BioData

  • Using psql:
PGPASSWORD="clave" psql -h localhost -U usuario -d BioData
  • Typical connection URL for applications:
postgresql://usuario:clave@localhost:5432/BioData

Use this string in your tools, notebook, or pipeline that needs to query the lookup table.
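If you build this URL programmatically, keep in mind that a real password containing special characters must be percent-encoded. A tiny helper that assembles the URL from its parts (plain concatenation only; encoding is left to the caller):

```shell
# Assemble a PostgreSQL connection URL from its components.
pg_url() {
    local user="$1" pass="$2" host="$3" port="$4" db="$5"
    printf 'postgresql://%s:%s@%s:%s/%s\n' "$user" "$pass" "$host" "$port" "$db"
}

pg_url usuario clave localhost 5432 BioData
# prints: postgresql://usuario:clave@localhost:5432/BioData
```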


Option B - Using the PIS Repository and Helper Script

If you also want the Protein Information System (PIS) and its automation around the database, use this method.

1. Clone the repository

cd /path/where/you/want/the/repo
git clone https://github.com/CBBIO/protein-information-system.git
cd protein-information-system

2. Set the Zenodo URL in pis_launcher_script.sh

At the top of pis_launcher_script.sh, set:

ZENODO_URL="https://zenodo.org/records/17793273/files/BioData_Dec25_esm2_prott5_prostt5_ankh3_large_esm3c_Layers_3Frist_3Last.backup?download=1"

(or the URL of the specific .backup file you want from the Files section).

The script will:

  • Derive the filename from this URL.
  • Download to the configured backup folder if it does not exist.
  • Reuse the local file on subsequent runs (no re-download).
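The filename derivation can be sketched in two steps: strip the ?download=1 query string, then take the last path component. (This is an illustration of the likely logic; the script's actual implementation may differ.)

```shell
ZENODO_URL="https://zenodo.org/records/17793273/files/BioData_Dec25_esm2_prott5_prostt5_ankh3_large_esm3c_Layers_3Frist_3Last.backup?download=1"

# Drop everything from the first '?', then keep the last path component.
FILE_NAME="$(basename "${ZENODO_URL%%\?*}")"
echo "$FILE_NAME"
# prints: BioData_Dec25_esm2_prott5_prostt5_ankh3_large_esm3c_Layers_3Frist_3Last.backup
```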

3. Run the self-check with rebase from Zenodo

From the repository root:

bash pis_launcher_script.sh --rebase-from-zenodo

This script will:

  1. Check that Docker is running.
  2. Ensure the pgvectorsql container (PostgreSQL + pgvector) and rabbitmq container exist and are running.
  3. Download the BioData backup from Zenodo (or reuse the existing file in the configured backup folder).
  4. Drop and recreate the BioData database on localhost:5432.
  5. Enable the vector extension.
  6. Run pg_restore from the downloaded backup.

If the size check fails (file looks too small), it will stop and tell you to correct ZENODO_URL or manually download the backup into the configured backup folder.

With --rebase-from-zenodo, the script focuses on the DB rebase and then exits, so you get a clean BioData database ready to use.


Script Options

Common flags for pis_launcher_script.sh:

  • --rebase-from-zenodo: Download (or reuse) the Zenodo backup and restore it.
  • --rebase-from-backup: Restore from a local backup file.
  • --zenodo-url=...: Override the Zenodo URL used for download.
  • --backup-folder=...: Folder where backups are stored/loaded.
  • --backup-file-name=...: Backup filename to use inside the backup folder.
  • --database-name=...: Target database name (default: BioData).
  • --check-services or --check-services-only: Only check Docker and container status without a restore.
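These flags compose. For example, restoring from a backup file you already have on disk might look like this (the folder and filename below are the guide's examples; adjust them to your setup):

```shell
# Build the argument list for a restore from a local backup file.
FLAGS=(
    --rebase-from-backup
    --backup-folder="$HOME/biodata_backups"
    --backup-file-name="BioData_Dec25_esm2_prott5_prostt5_ankh3_large_esm3c_Layers_3Frist_3Last.backup"
    --database-name=BioData
)

# Then run from the repository root:
# bash pis_launcher_script.sh "${FLAGS[@]}"
echo "${#FLAGS[@]} flags prepared"
```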

4. Use the database

After the script completes successfully:

  • Connect with psql as in Option A:
PGPASSWORD="clave" psql -h localhost -U usuario -d BioData
  • Or point your applications to:
postgresql://usuario:clave@localhost:5432/BioData

PIS itself can then use this database for its embedding and lookup workflows.




Get started:

To execute the full extraction process, install dependencies and run from project root:

pis

This command triggers the complete workflow, from the initial data preprocessing stages through to the final data organization and storage.

Customizing the Workflow:

You can customize the sequence of tasks executed by modifying main.py or adjusting the relevant parameters in the config.yaml file. This allows you to tailor the extraction process to meet specific research needs or to experiment with different data processing configurations.

Project details


Download files

Source Distribution

protein_information_system-3.1.1.tar.gz (55.9 kB, Source)

Built Distribution

protein_information_system-3.1.1-py3-none-any.whl (80.8 kB, Python 3)

File details

Details for the file protein_information_system-3.1.1.tar.gz.

File metadata

  • Size: 55.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.4.0 CPython/3.10.19 Linux/6.8.0-1044-azure

File hashes

Algorithm   Hash digest
SHA256      4ffc8df154eb4c0c5bec6a59178aaf03ba9a1c23b2db42085c13f64e0d49b2b8
MD5         1583ec6b0c8b627b8bbaa3a70aa39133
BLAKE2b-256 db41683bb1dd991d29a28b3e063ac25953052fbd8080981a8c0a05276ec4449b

File details

Details for the file protein_information_system-3.1.1-py3-none-any.whl.

File hashes

Algorithm   Hash digest
SHA256      1ba9aa6c7df36796944354c39e9b05e80fa26ea67daa9af0da508df12f969ae7
MD5         3c2c22290740558b594436015875fa3c
BLAKE2b-256 1b2412524baefd0a0a2a6809e891ede90d7217fac9cfd246f3c6a1f5e5935365
