
pdbe_sifts

Python package to run the SIFTS (Structure Integration with Function, Taxonomy and Sequences) pipeline locally.

Developed at EMBL-EBI by the PDBe team.


What is SIFTS?

SIFTS provides residue-level mappings between structures and sequences. This package automates the full pipeline:

  1. Build a reference sequence database (MMseqs2 or BLASTP)
  2. Align structure sequences against it to identify the best match per chain (≥ 90% identity) according to the SIFTS scoring function
  3. Generate precise residue- and segment-level structure-sequence mappings via local alignment (FASTA36 lalign36)
  4. Store results in a DuckDB database and per-entry CSV files
  5. Export mappings back into annotated mmCIF files

The pipeline also works on non-UniProt and non-PDB entries; in that case, only the adjusted score is used to rank the hits.
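The per-chain selection in step 2 can be sketched as follows. The ≥ 90% identity cut-off comes from the list above; the hit fields, function name, and scores are illustrative placeholders, not the package's actual API:

```python
# Illustrative sketch of per-chain best-hit selection (not pdbe_sifts' API):
# keep hits with >= 90% sequence identity, then take the top-scoring one per chain.

def best_hit_per_chain(hits, min_identity=90.0):
    """hits: iterable of dicts with 'chain', 'identity' and 'score' keys."""
    best = {}
    for hit in hits:
        if hit["identity"] < min_identity:
            continue  # below the SIFTS identity cut-off
        current = best.get(hit["chain"])
        if current is None or hit["score"] > current["score"]:
            best[hit["chain"]] = hit
    return best

hits = [
    {"chain": "A", "accession": "P00963", "identity": 99.1, "score": 412.0},
    {"chain": "A", "accession": "Q9XYZ1", "identity": 88.0, "score": 430.0},  # filtered out
    {"chain": "B", "accession": "P00963", "identity": 95.5, "score": 388.0},
]
print(best_hit_per_chain(hits))
```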


Installation

System dependencies

The following binaries must be installed and available on PATH:

Tool · Purpose · Install
MMseqs2 · Fast global sequence search · conda install -c conda-forge mmseqs2
FASTA36 (lalign36) · Local pairwise alignment · conda install -c bioconda fasta3
BLAST+ · Optional alternative to MMseqs2 · conda install -c bioconda blast
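A quick way to confirm these binaries are reachable before running the pipeline. This is a generic PATH check using only the standard library, not part of pdbe_sifts; the binary names are the usual ones shipped by the conda packages above:

```python
import shutil

def missing_tools(tools):
    """Return the subset of tool names not found on PATH."""
    return [t for t in tools if shutil.which(t) is None]

required = ["mmseqs", "lalign36"]  # add "blastp" if using the BLAST+ backend
missing = missing_tools(required)
if missing:
    print("Missing binaries:", ", ".join(missing))
```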

A. Install using micromamba (recommended)

# Create environment from file
micromamba env create -f environment.yml

# Activate environment
micromamba activate pdbe_sifts

# Install pdbe_sifts package in editable mode
pip install -e .

# Or install directly
pip install pdbe_sifts

B. Install using uv (fast alternative when only the pdbe_sifts Python package is needed)

1. Install uv (if not already installed)

# macOS/Linux
curl -LsSf https://astral.sh/uv/install.sh | sh

# Windows
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"

# Alternative via pip
pip install uv

2. Clone the repository

git clone https://github.com/PDBeurope/SIFTS
cd SIFTS

3. Create virtual environment and install dependencies

# Create a virtual environment and install all dependencies
uv sync

# This will:
# - Create a .venv directory
# - Install Python 3.10 if needed
# - Install all dependencies from pyproject.toml
# - Lock versions in uv.lock

4. Activate the virtual environment

# macOS/Linux
source .venv/bin/activate

# Windows
.venv\Scripts\activate

Requirements: Python ≥ 3.10 · 16 GB RAM minimum (32 GB+ recommended for large datasets)


Quick Start

1 — Initialise your config

pdbe_sifts init
# → creates ~/.config/pdbe_sifts/config.yaml
# → downloads the NCBI taxonomy database (~70 MB, first run only)

Edit the config to set your paths (base_dir, nobackup_dir, target_db once you have built it, etc.). You can also configure the alignment parameters there.
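The keys named above would look roughly like this in config.yaml. The values are placeholders; check the generated file for the authoritative layout and the full set of keys:

```yaml
# ~/.config/pdbe_sifts/config.yaml (illustrative values)
base_dir: /data/sifts
nobackup_dir: /scratch/sifts
target_db: /data/sifts/my_db/target_db   # set after running build_db
```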

2 — Build a reference database

pdbe_sifts build_db \
  -i uniprot_sprot.fasta \
  -o ./my_db \
  -t taxonomy_mapping.tsv   # TSV: sequence_id <tab> tax_id
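The taxonomy mapping file is a plain two-column TSV with the layout shown in the comment above (sequence_id, tab, tax_id). A minimal way to generate one; the sequence IDs and taxon IDs here are placeholders:

```python
import csv

# sequence_id -> NCBI tax_id; example values are placeholders
mapping = {
    "sp|P00963|ASNA_ECOLI": 83333,
    "sp|P69905|HBA_HUMAN": 9606,
}

with open("taxonomy_mapping.tsv", "w", newline="") as fh:
    writer = csv.writer(fh, delimiter="\t")
    for seq_id, tax_id in mapping.items():
        writer.writerow([seq_id, tax_id])
```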

3 — Run structure to sequence matching

# Single CIF entry
pdbe_sifts sequence_match -i 1abc.cif -o ./results -d ./my_db/target_db

# Batch (one mmCIF path per line)
pdbe_sifts sequence_match -i entries.txt -o ./results -d ./my_db/target_db --threads 8

At this step you can also provide a CSV file to speed up the scoring function. Each row of this CSV must contain: row_num, uniprot_accession, dataset (Swiss-Prot or TrEMBL), PDB cross-references, and annotation score.
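A minimal sketch of such an annotation CSV, using the five columns listed above. The file name, accessions, and scores are placeholders:

```python
import csv

rows = [
    # row_num, uniprot_accession, dataset, pdb_cross_references, annotation_score
    [1, "P00963", "Swiss-Prot", "1abc;2xyz", 5],
    [2, "A0A000XYZ1", "TrEMBL", "", 1],
]

with open("annotation_scores.csv", "w", newline="") as fh:
    csv.writer(fh).writerows(rows)
```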

Produces hits.duckdb (scored candidates) and hits.tsv (raw alignment hits) — the sequence candidates per structure entity.

4 — Generate SIFTS segments and residue mappings

# With DuckDB hits (from structure to sequence matching step)
pdbe_sifts segments -i 1abc.cif.gz -o ./segments -d hits.duckdb

# Manual structure-sequence mapping (chain:accession)
pdbe_sifts segments -i 1abc.cif.gz -o ./segments -m "A:P00963,B:P00963"

# Custom FASTA mapping (headers: >{structure_id}|{auth_asym_id}|{sequence_id})
pdbe_sifts segments -i 1abc.cif.gz -o ./segments -m custom_seqs.fasta

Produces per-entry gzip-compressed CSV files under {output_dir}/.
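The custom FASTA accepted by -m uses the >{structure_id}|{auth_asym_id}|{sequence_id} header layout shown above. Writing one by hand looks like this; the IDs and sequences are placeholders:

```python
# Placeholder records: (structure_id, auth_asym_id, sequence_id) -> sequence
records = {
    ("1abc", "A", "P00963"): "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
    ("1abc", "B", "P00963"): "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
}

with open("custom_seqs.fasta", "w") as fh:
    for (structure_id, auth_asym_id, sequence_id), seq in records.items():
        # Header format from the example above: >{structure_id}|{auth_asym_id}|{sequence_id}
        fh.write(f">{structure_id}|{auth_asym_id}|{sequence_id}\n{seq}\n")
```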

5 — Load segment data into DuckDB

pdbe_sifts db_load -i ./segments/ -d hits.duckdb

Bulk-loads the segment and residue CSVs produced in step 4 into the sifts_xref_segment and sifts_xref_residue tables of the DuckDB file.

6 — Annotate mmCIF files with residue-level mappings and SIFTS data

# Reading from DuckDB (after step 5)
pdbe_sifts sifts2mmcif \
  -i 1abc.cif.gz \
  -o ./sifts_mmcif \
  -d hits.duckdb

# Or reading segment CSVs directly (skip step 5)
pdbe_sifts sifts2mmcif \
  -i 1abc.cif.gz \
  -o ./sifts_mmcif \
  -s ./segments/

CLI Reference

Command · Description
pdbe_sifts init · Copy the default config to ~/.config/pdbe_sifts/config.yaml and initialise the NCBI taxonomy DB
pdbe_sifts show · Print the fully resolved configuration
pdbe_sifts update_ncbi · Force-update the local NCBI taxonomy database (ete4)
pdbe_sifts build_db · Build a reference sequence database (MMseqs2 or BLASTP) from a FASTA file
pdbe_sifts fasta_build · Extract entity sequences from mmCIF files and write a FASTA
pdbe_sifts sequence_match · Align structure sequences against the reference DB; score and store hits in DuckDB
pdbe_sifts segments · Generate SIFTS mappings for a single mmCIF entry
pdbe_sifts db_load · Bulk-load segment/residue CSVs from segment generation into DuckDB
pdbe_sifts sifts2mmcif · Inject SIFTS mappings into mmCIF files, producing annotated copies
pdbe_sifts update_ccd_mapping · Check whether the remote CCD file is newer than the cached three-to-one-letter mapping CSV and regenerate it if so
pdbe_sifts seq2seq · Align the canonical deposited sequence against the coordinate sequence

Useful Classes

The pipeline classes can be used directly in Python scripts without going through the CLI.

TargetDb — Build a reference sequence database

from pdbe_sifts.sequence_match.target_database import TargetDb

TargetDb(
    input_path="uniprot_sprot.fasta",
    output_path="./my_db/target_db",
    tax_mapping_file="taxonomy.tsv",
    tool="mmseqs",   # or "blastp"
    threads=8,
).run()

FastaBuilder — Extract sequences from mmCIF files

from pdbe_sifts.sifts_fasta_builder import FastaBuilder

fasta_path = FastaBuilder(
    input_path="1abc.cif",   # or .cif.gz, or a .txt file listing CIF paths
    out_dir="./fasta/",
    threads=4,
).build()

SiftsSequenceMatch — Run the alignment and scoring pipeline

from pdbe_sifts.sifts_sequence_match import SiftsSequenceMatch

SiftsSequenceMatch(
    input_file="1abc.cif",   # or .fasta, or a .txt list of CIF paths
    out_dir="./results/",
    db_file="./my_db/target_db",
    tool="mmseqs",           # or "blastp"
    threads=8,
).process()
# → writes hits.duckdb and hits_<entry>.tsv to out_dir

SiftsAlign — Generate per-entry segment and residue mappings

from pdbe_sifts.sifts_segments_generation import SiftsAlign

# Mode 1: use scored hits from sequence_match
sa = SiftsAlign(
    cif_file="1abc.cif",
    out_dir="./segments/",
    db_conn_str="hits.duckdb",
)

# Mode 2: provide a manual mapping (accessions or custom FASTA)
sa = SiftsAlign(
    cif_file="1abc.cif",
    out_dir="./segments/",
    unp_mode="A:P00963,B:P00963",   # or path to a FASTA file
)

sa.process_entry("1abc")
if sa.conn:
    sa.conn.close()
# → writes {out_dir}/1abc_seg.csv.gz
#           {out_dir}/1abc_res.csv.gz

SiftsDB — Bulk-load segment CSVs into DuckDB

import duckdb
from pdbe_sifts.database.sifts_db_wrapper import SiftsDB

conn = duckdb.connect("hits.duckdb")
SiftsDB(conn).bulk_load_from_entries("./segments/")
conn.close()

Outputs

Global mappings

File · Format · Content
hits.duckdb · DuckDB · Scored sequence accession candidates per structure entity
hits_<entry>.tsv · TSV · Raw MMseqs2 / BLASTP alignment hits

Segment generation

Per entry, under {output_dir}:

File · Format · Content
{entry}_seg.csv.gz · CSV (gzip) · One row per contiguous aligned range (structure ↔ sequence positions, identity, conflicts, chimera flag)
{entry}_res.csv.gz · CSV (gzip) · One row per mapped structure residue (auth seq id, sequence position, one-letter codes, observed flag)
{entry}_nf90_seg.csv.gz · CSV (gzip) · NF90 variant of the segment file (written when applicable)

After running db_load, results are available in DuckDB tables sifts_xref_segment and sifts_xref_residue.
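The per-entry outputs are ordinary gzip-compressed CSVs and can be inspected with the standard library alone. A small sketch; the file path follows the naming above, while the column names seen in the rows are whatever the file actually contains:

```python
import csv
import gzip
from pathlib import Path

def read_residue_rows(path):
    """Yield one dict per row from a gzip-compressed CSV (e.g. {entry}_res.csv.gz)."""
    with gzip.open(path, "rt", newline="") as fh:
        yield from csv.DictReader(fh)

res_file = Path("./segments/1abc_res.csv.gz")
if res_file.exists():
    for row in read_residue_rows(res_file):
        print(row)  # one mapped structure residue per row
```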


Environment Variables

Variable · Default · Description
SIFTS_LOG_LEVEL · INFO · Logging verbosity: DEBUG, INFO, WARNING, ERROR, CRITICAL
SIFTS_N_PROC · auto · Number of internal threads per worker (lalign36 jobs). Override manually to cap CPU use.
SIFTS_NO_CACHE_ALL · unset · If set (any value), disables the UniProt pickle cache and always fetches from the REST API.
SLURM_CPUS_PER_TASK · unset · Detected automatically on SLURM clusters. Used by get_allocated_cpus() to set the thread count when running under a SLURM job allocation.
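Variables like these are typically read with a default fallback. For example, mirroring the documented SIFTS_LOG_LEVEL default of INFO in your own wrapper script (the variable name comes from the table; the script itself is illustrative):

```python
import logging
import os

# Read SIFTS_LOG_LEVEL, falling back to the documented default of INFO
level_name = os.environ.get("SIFTS_LOG_LEVEL", "INFO").upper()
logging.basicConfig(level=getattr(logging, level_name, logging.INFO))
logging.getLogger("my_sifts_wrapper").info("logging at %s", level_name)
```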

Project Structure

src/pdbe_sifts/
├── cli.py                         # CLI entry point (pdbe_sifts command)
├── sifts_sequence_match.py       # Global mapping pipeline (SiftsSequenceMatch)
├── sifts_segments_generation.py   # Single-entry segment generation (SiftsAlign)
├── sifts_fasta_builder.py         # Extract sequences from mmCIF → FASTA (FastaBuilder)
├── sifts_database_loader.py       # Standalone bulk-loader script (wraps SiftsDB)
├── config/                        # OmegaConf configuration loading — defines load_config()
├── base/
│   ├── paths.py                   # All configuration getters (imports load_config() from config/)
│   ├── utils.py                   # UniProt fetch, CPU helpers, SiftsAction
│   ├── log.py                     # Logging setup (StreamHandler, coloredlogs)
│   └── exceptions.py              # All custom exceptions (centralised)
├── database/
│   └── sifts_db_wrapper.py        # SiftsDB: DuckDB schema + bulk loader
├── mmcif/                         # mmCIF parsing (Entry, Chain, Entity, Residue, ChemComp)
├── sequence_match/
│   ├── target_database.py         # Build MMseqs2 / BLAST reference database (TargetDb)
│   ├── mmseqs_search.py           # MMseqs2 easy-search wrapper
│   ├── blastp.py                  # BLASTP wrapper
│   └── sequence_match_parser.py  # Parse TSV hits, score, store in DuckDB
├── segments_generation/
│   └── alignment/                 # lalign36 wrapper, isoform alignment, residue mapping
├── sifts_to_mmcif/                # Inject SIFTS data back into mmCIF files
├── unp/
│   └── unp.py                     # UniProt REST client, pickle cache, isoform handling
└── data/
    └── default_config.yaml        # Default configuration template (all tuneable params)

Authors

EMBL-EBI PDBe team: Adam Bellaiche, Preeti Choudhary, Sreenath Sasidharan Nair, Jennifer Fleming, Sameer Velankar

License

Apache-2.0
