
pdbe_sifts

Python package to run the SIFTS (Structure Integration with Function, Taxonomy and Sequences) pipeline locally.

Developed at EMBL-EBI by the PDBe team.


What is SIFTS?

SIFTS provides residue-level mappings between structures and sequences. This package automates the full pipeline:

  1. Build a reference sequence database (MMseqs2 or BLASTP)
  2. Align structure sequences against it to identify the best match per chain (≥ 90% identity) according to the SIFTS scoring function
  3. Generate precise residue- and segment-level structure-sequence mappings via local alignment (FASTA36 lalign36)
  4. Store results in a DuckDB database and per-entry CSV files
  5. Export mappings back into annotated mmCIF files

The pipeline also works on non-UniProt and non-PDB entries; in that case, only the adjusted score is used to rank the hits.
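The per-chain selection in step 2 can be sketched as follows. The ≥ 90% identity cut-off comes from the list above; the hit fields, function name, and scores are illustrative placeholders, not the package's actual API:

```python
# Illustrative sketch of per-chain best-hit selection (not pdbe_sifts' API):
# keep hits with >= 90% sequence identity, then take the top-scoring one per chain.

def best_hit_per_chain(hits, min_identity=90.0):
    """hits: iterable of dicts with 'chain', 'identity' and 'score' keys."""
    best = {}
    for hit in hits:
        if hit["identity"] < min_identity:
            continue  # below the SIFTS identity cut-off
        current = best.get(hit["chain"])
        if current is None or hit["score"] > current["score"]:
            best[hit["chain"]] = hit
    return best

hits = [
    {"chain": "A", "accession": "P00963", "identity": 99.1, "score": 412.0},
    {"chain": "A", "accession": "Q9XYZ1", "identity": 88.0, "score": 430.0},  # filtered out
    {"chain": "B", "accession": "P00963", "identity": 95.5, "score": 388.0},
]
print(best_hit_per_chain(hits))
```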


Installation

System dependencies

The following binaries must be installed and available on PATH:

Tool · Purpose · Install
MMseqs2 · Fast global sequence search · conda install -c conda-forge mmseqs2
FASTA36 (lalign36) · Local pairwise alignment · conda install -c bioconda fasta3
BLAST+ · Optional alternative to MMseqs2 · conda install -c bioconda blast
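A quick way to confirm these binaries are reachable before running the pipeline. This is a generic PATH check using only the standard library, not part of pdbe_sifts; the binary names are the usual ones shipped by the conda packages above:

```python
import shutil

def missing_tools(tools):
    """Return the subset of tool names not found on PATH."""
    return [t for t in tools if shutil.which(t) is None]

required = ["mmseqs", "lalign36"]  # add "blastp" if using the BLAST+ backend
missing = missing_tools(required)
if missing:
    print("Missing binaries:", ", ".join(missing))
```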

A. Install using micromamba (recommended)

# Create environment from file
micromamba env create -f environment.yml

# Activate environment
micromamba activate pdbe_sifts

# Install pdbe_sifts package in editable mode
pip install -e .

# Or install directly
pip install pdbe_sifts

B. Install using uv (fast alternative when only the pdbe_sifts Python package is needed)

1. Install uv (if not already installed)

# macOS/Linux
curl -LsSf https://astral.sh/uv/install.sh | sh

# Windows
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"

# Alternative via pip
pip install uv

2. Clone the repository

git clone https://github.com/PDBeurope/SIFTS
cd SIFTS

3. Create virtual environment and install dependencies

# Create a virtual environment and install all dependencies
uv sync

# This will:
# - Create a .venv directory
# - Install Python 3.10 if needed
# - Install all dependencies from pyproject.toml
# - Lock versions in uv.lock

4. Activate the virtual environment

# macOS/Linux
source .venv/bin/activate

# Windows
.venv\Scripts\activate

Requirements: Python ≥ 3.10 · 16 GB RAM minimum (32 GB+ recommended for large datasets)


Quick Start

1 — Initialise your config

pdbe_sifts init
# → creates ~/.config/pdbe_sifts/config.yaml
# → downloads the NCBI taxonomy database (~70 MB, first run only)

Edit the config to set your paths (base_dir, nobackup_dir, target_db once you have built it, etc.). You can also configure the alignment parameters there.
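The keys named above would look roughly like this in config.yaml. The values are placeholders; check the generated file for the authoritative layout and the full set of keys:

```yaml
# ~/.config/pdbe_sifts/config.yaml (illustrative values)
base_dir: /data/sifts
nobackup_dir: /scratch/sifts
target_db: /data/sifts/my_db/target_db   # set after running build_db
```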

2 — Build a reference database

pdbe_sifts build_db \
  -i uniprot_sprot.fasta \
  -o ./my_db \
  -t taxonomy_mapping.tsv   # TSV: sequence_id <tab> tax_id
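The taxonomy mapping file is a plain two-column TSV with the layout shown in the comment above (sequence_id, tab, tax_id). A minimal way to generate one; the sequence IDs and taxon IDs here are placeholders:

```python
import csv

# sequence_id -> NCBI tax_id; example values are placeholders
mapping = {
    "sp|P00963|ASNA_ECOLI": 83333,
    "sp|P69905|HBA_HUMAN": 9606,
}

with open("taxonomy_mapping.tsv", "w", newline="") as fh:
    writer = csv.writer(fh, delimiter="\t")
    for seq_id, tax_id in mapping.items():
        writer.writerow([seq_id, tax_id])
```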

3 — Run structure to sequence matching

# Single CIF entry
pdbe_sifts sequence_match -i 1abc.cif -o ./results -d ./my_db/target_db

# Batch (one mmCIF path per line)
pdbe_sifts sequence_match -i entries.txt -o ./results -d ./my_db/target_db --threads 8

At this step you can also provide a CSV file to speed up the scoring function. Each row of this CSV must contain: row_num, uniprot_accession, dataset (Swiss-Prot or TrEMBL), PDB cross-references, and annotation score.
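A minimal sketch of such an annotation CSV, using the five columns listed above. The file name, accessions, and scores are placeholders:

```python
import csv

rows = [
    # row_num, uniprot_accession, dataset, pdb_cross_references, annotation_score
    [1, "P00963", "Swiss-Prot", "1abc;2xyz", 5],
    [2, "A0A000XYZ1", "TrEMBL", "", 1],
]

with open("annotation_scores.csv", "w", newline="") as fh:
    csv.writer(fh).writerows(rows)
```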

Produces hits.duckdb (scored candidates) and hits.tsv (raw alignment hits) — the sequence candidates per structure entity.

4 — Generate SIFTS segments and residue mappings

# With DuckDB hits (from structure to sequence matching step)
pdbe_sifts segments -i 1abc.cif.gz -o ./segments -d hits.duckdb

# Manual structure-sequence mapping (chain:accession)
pdbe_sifts segments -i 1abc.cif.gz -o ./segments -m "A:P00963,B:P00963"

# Custom FASTA mapping (headers: >{structure_id}|{auth_asym_id}|{sequence_id})
pdbe_sifts segments -i 1abc.cif.gz -o ./segments -m custom_seqs.fasta

Produces per-entry gzip-compressed CSV files under {output_dir}/.
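The custom FASTA accepted by -m uses the >{structure_id}|{auth_asym_id}|{sequence_id} header layout shown above. Writing one by hand looks like this; the IDs and sequences are placeholders:

```python
# Placeholder records: (structure_id, auth_asym_id, sequence_id) -> sequence
records = {
    ("1abc", "A", "P00963"): "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
    ("1abc", "B", "P00963"): "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
}

with open("custom_seqs.fasta", "w") as fh:
    for (structure_id, auth_asym_id, sequence_id), seq in records.items():
        # Header format from the example above: >{structure_id}|{auth_asym_id}|{sequence_id}
        fh.write(f">{structure_id}|{auth_asym_id}|{sequence_id}\n{seq}\n")
```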

5 — Load segment data into DuckDB

pdbe_sifts db_load -i ./segments/ -d hits.duckdb

Bulk-loads the segment and residue CSVs produced in step 4 into the sifts_xref_segment and sifts_xref_residue tables of the DuckDB file.

6 — Annotate mmCIF files with residue-level mappings and SIFTS data

# Reading from DuckDB (after step 5)
pdbe_sifts sifts2mmcif \
  -i 1abc.cif.gz \
  -o ./sifts_mmcif \
  -d hits.duckdb

# Or reading segment CSVs directly (skip step 5)
pdbe_sifts sifts2mmcif \
  -i 1abc.cif.gz \
  -o ./sifts_mmcif \
  -s ./segments/

CLI Reference

Command · Description
pdbe_sifts init · Copy the default config to ~/.config/pdbe_sifts/config.yaml and initialise the NCBI taxonomy DB
pdbe_sifts show · Print the fully resolved configuration
pdbe_sifts update_ncbi · Force-update the local NCBI taxonomy database (ete4)
pdbe_sifts build_db · Build a reference sequence database (MMseqs2 or BLASTP) from a FASTA file
pdbe_sifts fasta_build · Extract entity sequences from mmCIF files and write a FASTA
pdbe_sifts sequence_match · Align structure sequences against the reference DB; score and store hits in DuckDB
pdbe_sifts segments · Generate SIFTS mappings for a single mmCIF entry
pdbe_sifts db_load · Bulk-load segment/residue CSVs from segment generation into DuckDB
pdbe_sifts sifts2mmcif · Inject SIFTS mappings into mmCIF files, producing annotated copies
pdbe_sifts update_ccd_mapping · Check whether the remote CCD file is newer than the cached three-to-one-letter mapping CSV and regenerate it if so
pdbe_sifts seq2seq · Align the canonical deposited sequence against the coordinate sequence

Useful Classes

The pipeline classes can be used directly in Python scripts without going through the CLI.

TargetDb — Build a reference sequence database

from pdbe_sifts.sequence_match.target_database import TargetDb

TargetDb(
    input_path="uniprot_sprot.fasta",
    output_path="./my_db/target_db",
    tax_mapping_file="taxonomy.tsv",
    tool="mmseqs",   # or "blastp"
    threads=8,
).run()

FastaBuilder — Extract sequences from mmCIF files

from pdbe_sifts.sifts_fasta_builder import FastaBuilder

fasta_path = FastaBuilder(
    input_path="1abc.cif",   # or .cif.gz, or a .txt file listing CIF paths
    out_dir="./fasta/",
    threads=4,
).build()

SiftsSequenceMatch — Run the alignment and scoring pipeline

from pdbe_sifts.sifts_sequence_match import SiftsSequenceMatch

SiftsSequenceMatch(
    input_file="1abc.cif",   # or .fasta, or a .txt list of CIF paths
    out_dir="./results/",
    db_file="./my_db/target_db",
    tool="mmseqs",           # or "blastp"
    threads=8,
).process()
# → writes hits.duckdb and hits_<entry>.tsv to out_dir

SiftsAlign — Generate per-entry segment and residue mappings

from pdbe_sifts.sifts_segments_generation import SiftsAlign

# Mode 1: use scored hits from sequence_match
sa = SiftsAlign(
    cif_file="1abc.cif",
    out_dir="./segments/",
    db_conn_str="hits.duckdb",
)

# Mode 2: provide a manual mapping (accessions or custom FASTA)
sa = SiftsAlign(
    cif_file="1abc.cif",
    out_dir="./segments/",
    unp_mode="A:P00963,B:P00963",   # or path to a FASTA file
)

sa.process_entry("1abc")
if sa.conn:
    sa.conn.close()
# → writes {out_dir}/1abc_seg.csv.gz
#           {out_dir}/1abc_res.csv.gz

SiftsDB — Bulk-load segment CSVs into DuckDB

import duckdb
from pdbe_sifts.database.sifts_db_wrapper import SiftsDB

conn = duckdb.connect("hits.duckdb")
SiftsDB(conn).bulk_load_from_entries("./segments/")
conn.close()

Outputs

Global mappings

File · Format · Content
hits.duckdb · DuckDB · Scored sequence accession candidates per structure entity
hits_<entry>.tsv · TSV · Raw MMseqs2 / BLASTP alignment hits

Segment generation

Per entry, under {output_dir}:

File · Format · Content
{entry}_seg.csv.gz · CSV (gzip) · One row per contiguous aligned range (structure ↔ sequence positions, identity, conflicts, chimera flag)
{entry}_res.csv.gz · CSV (gzip) · One row per mapped structure residue (auth seq id, sequence position, one-letter codes, observed flag)
{entry}_nf90_seg.csv.gz · CSV (gzip) · NF90 variant of the segment file (written when applicable)

After running db_load, results are available in DuckDB tables sifts_xref_segment and sifts_xref_residue.
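The per-entry outputs are ordinary gzip-compressed CSVs and can be inspected with the standard library alone. A small sketch; the file path follows the naming above, while the column names seen in the rows are whatever the file actually contains:

```python
import csv
import gzip
from pathlib import Path

def read_residue_rows(path):
    """Yield one dict per row from a gzip-compressed CSV (e.g. {entry}_res.csv.gz)."""
    with gzip.open(path, "rt", newline="") as fh:
        yield from csv.DictReader(fh)

res_file = Path("./segments/1abc_res.csv.gz")
if res_file.exists():
    for row in read_residue_rows(res_file):
        print(row)  # one mapped structure residue per row
```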


Environment Variables

Variable · Default · Description
SIFTS_LOG_LEVEL · INFO · Logging verbosity: DEBUG, INFO, WARNING, ERROR, CRITICAL
SIFTS_N_PROC · auto · Number of internal threads per worker (lalign36 jobs). Override manually to cap CPU use.
SIFTS_NO_CACHE_ALL · unset · If set (any value), disables the UniProt pickle cache and always fetches from the REST API.
SLURM_CPUS_PER_TASK · unset · Detected automatically on SLURM clusters. Used by get_allocated_cpus() to set the thread count when running under a SLURM job allocation.
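Variables like these are typically read with a default fallback. For example, mirroring the documented SIFTS_LOG_LEVEL default of INFO in your own wrapper script (the variable name comes from the table; the script itself is illustrative):

```python
import logging
import os

# Read SIFTS_LOG_LEVEL, falling back to the documented default of INFO
level_name = os.environ.get("SIFTS_LOG_LEVEL", "INFO").upper()
logging.basicConfig(level=getattr(logging, level_name, logging.INFO))
logging.getLogger("my_sifts_wrapper").info("logging at %s", level_name)
```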

Project Structure

src/pdbe_sifts/
├── cli.py                         # CLI entry point (pdbe_sifts command)
├── sifts_sequence_match.py       # Global mapping pipeline (SiftsSequenceMatch)
├── sifts_segments_generation.py   # Single-entry segment generation (SiftsAlign)
├── sifts_fasta_builder.py         # Extract sequences from mmCIF → FASTA (FastaBuilder)
├── sifts_database_loader.py       # Standalone bulk-loader script (wraps SiftsDB)
├── config/                        # OmegaConf configuration loading — defines load_config()
├── base/
│   ├── paths.py                   # All configuration getters (imports load_config() from config/)
│   ├── utils.py                   # UniProt fetch, CPU helpers, SiftsAction
│   ├── log.py                     # Logging setup (StreamHandler, coloredlogs)
│   └── exceptions.py              # All custom exceptions (centralised)
├── database/
│   └── sifts_db_wrapper.py        # SiftsDB: DuckDB schema + bulk loader
├── mmcif/                         # mmCIF parsing (Entry, Chain, Entity, Residue, ChemComp)
├── sequence_match/
│   ├── target_database.py         # Build MMseqs2 / BLAST reference database (TargetDb)
│   ├── mmseqs_search.py           # MMseqs2 easy-search wrapper
│   ├── blastp.py                  # BLASTP wrapper
│   └── sequence_match_parser.py  # Parse TSV hits, score, store in DuckDB
├── segments_generation/
│   └── alignment/                 # lalign36 wrapper, isoform alignment, residue mapping
├── sifts_to_mmcif/                # Inject SIFTS data back into mmCIF files
├── unp/
│   └── unp.py                     # UniProt REST client, pickle cache, isoform handling
└── data/
    └── default_config.yaml        # Default configuration template (all tuneable params)

Authors

EMBL-EBI PDBe team: Adam Bellaiche, Preeti Choudhary, Sreenath Sasidharan Nair, Jennifer Fleming, Sameer Velankar

License

Apache-2.0
