pdb_cpp is a Python library for simple operations on PDB coordinate files.

pdb_cpp

pdb_cpp is a structural bioinformatics toolkit with a C++ core and Python API for fast PDB/mmCIF parsing, atom selection, sequence/structure alignment, TM-score, and DockQ evaluation.

What is included

  • Read/write .pdb, .cif, .pqr, and .gro files
  • Atom/residue/chain selections (including geometric within queries)
  • Sequence extraction and pairwise sequence alignment
  • Sequence-based structural superposition and chain-permutation alignment
  • TM-align/TM-score through the bundled USalign/TM-align core
  • DockQ metrics (DockQ, Fnat, Fnonnat, LRMS, iRMS, rRMS)
  • Hydrogen bond detection (Baker & Hubbard geometric method, no explicit H required)
  • Secondary structure assignment
  • Core geometric helpers (e.g., distance matrix)

Installation

From PyPI

python -m pip install pdb-cpp

From source

git clone https://github.com/samuelmurail/pdb_cpp
cd pdb_cpp
python -m pip install -e .

For development, install the requirements and run the test suite:

python -m pip install -r requirements.txt
pytest

Quick start

from pdb_cpp import Coor

# Load from local file
coor = Coor("tests/input/1y0m.cif")

# Or fetch by PDB ID (mmCIF is downloaded and cached)
coor_pdb = Coor(pdb_id="1y0m")

# Or use the RCSB helper for explicit structure choices
from pdb_cpp import rcsb

bio_assembly = rcsb.load("5a9z", structure="biological_assembly", assembly_id=1)
asym_unit = rcsb.load("5a9z", structure="asymmetric_unit")

print(coor.model_num)        # number of models
print(coor.get_aa_seq())     # chain -> sequence

# Write selection/structure back to disk
coor.write("out_structure.pdb")

Selection language (including complex selections)

select_atoms() supports boolean logic, numeric comparisons, and spatial queries:

# Backbone residues 6..58 from chain A
sel_1 = coor.select_atoms("backbone and chain A and residue >= 6 and residue <= 58")

# Interface-like query: atoms of chain A within 5 Å of chain B
sel_2 = coor.select_atoms("chain A and within 5.0 of chain B")

# Combination with negation
sel_3 = coor.select_atoms("name CA and not within 5.0 of resname HOH")

# Numeric filters
sel_4 = coor.select_atoms("resname HOH and x >= 20.0")

Common keywords: name, resname, chain, resid, residue, x, y, z, beta, occ, protein, backbone, noh, within.
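
To make the semantics of a within query concrete, here is a minimal pure-Python sketch of a distance filter. It does not use pdb_cpp: the within() helper and the toy atom dictionaries are hypothetical, and treating the cutoff as inclusive is an assumption.

```python
import math


def within(atoms, reference, cutoff):
    """Keep atoms that lie within `cutoff` Å of at least one reference atom."""
    return [
        atom
        for atom in atoms
        if any(math.dist(atom["xyz"], ref["xyz"]) <= cutoff for ref in reference)
    ]


# Hypothetical toy atoms; real coordinates would come from a structure file.
chain_a = [
    {"name": "CA", "xyz": (0.0, 0.0, 0.0)},
    {"name": "CB", "xyz": (10.0, 0.0, 0.0)},
]
chain_b = [{"name": "CA", "xyz": (3.0, 3.0, 0.0)}]

# Equivalent in spirit to "chain A and within 5.0 of chain B"
near_b = within(chain_a, chain_b, 5.0)
print([atom["name"] for atom in near_b])  # ['CA']
```

A brute-force scan like this costs O(N x M) distance evaluations, which is one reason a compiled core is attractive for large structures.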

Sequence alignment and structure superposition

from pdb_cpp import Coor, alignment, core, analysis

coor_1 = Coor("tests/input/1u85.pdb")
coor_2 = Coor("tests/input/1ubd.pdb")

seq_1 = coor_1.get_aa_seq()["A"]
seq_2 = coor_2.get_aa_seq()["C"]
aln_1, aln_2, score = alignment.align_seq(seq_1, seq_2)
alignment.print_align_seq(aln_1, aln_2)

# Get atom correspondences and align coordinates in-place
idx_1, idx_2 = core.get_common_atoms(coor_1, coor_2, chain_1=["A"], chain_2=["C"])
core.coor_align(coor_1, coor_2, idx_1, idx_2, frame_ref=0)

# RMSD after alignment
rmsd_values = analysis.rmsd(coor_1, coor_2, index_list=[idx_1, idx_2])
print(rmsd_values[0])
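
The pairwise sequence alignment step can be pictured as a global dynamic-programming alignment. The following is a minimal Needleman-Wunsch sketch in plain Python. The needleman_wunsch() function and its match/mismatch/gap scores are illustrative assumptions; the scoring scheme actually used by alignment.align_seq may differ (for example a substitution matrix with affine gap penalties).

```python
def needleman_wunsch(seq_a, seq_b, match=1, mismatch=-1, gap=-2):
    """Global alignment with a linear gap penalty (illustrative scores)."""
    n, m = len(seq_a), len(seq_b)
    # score[i][j] = best score for aligning seq_a[:i] with seq_b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align two residues
                score[i - 1][j] + gap,      # gap in seq_b
                score[i][j - 1] + gap,      # gap in seq_a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    aln_a, aln_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                aln_a.append(seq_a[i - 1])
                aln_b.append(seq_b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            aln_a.append(seq_a[i - 1])
            aln_b.append("-")
            i -= 1
        else:
            aln_a.append("-")
            aln_b.append(seq_b[j - 1])
            j -= 1
    return "".join(reversed(aln_a)), "".join(reversed(aln_b)), score[n][m]


aligned_a, aligned_b, aln_score = needleman_wunsch("ACDEFG", "ACEFG")
print(aligned_a)  # ACDEFG
print(aligned_b)  # AC-EFG
print(aln_score)  # 3
```

Aligned columns without gaps are the residue pairs that a sequence-based superposition can then use as atom correspondences.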

For multi-chain complexes with uncertain chain mapping, use chain permutation:

rmsds, index_mappings = alignment.align_chain_permutation(coor_1, coor_2)
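
The idea behind chain permutation can be sketched without pdb_cpp: try every assignment of mobile chains to reference chains and keep the one with the lowest RMSD. The rmsd() and best_chain_mapping() helpers below are hypothetical, assume equally sized chains, and skip the superposition step that a real implementation would presumably perform for each candidate mapping.

```python
import itertools
import math


def rmsd(xyz_1, xyz_2):
    """Root-mean-square deviation between two equally long coordinate lists."""
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(xyz_1, xyz_2))
    return math.sqrt(squared / len(xyz_1))


def best_chain_mapping(ref, mob):
    """Return (rmsd, {ref_chain: mob_chain}) for the best chain permutation."""
    ref_chains = sorted(ref)
    ref_xyz = [p for chain in ref_chains for p in ref[chain]]
    best = None
    for perm in itertools.permutations(sorted(mob)):
        mob_xyz = [p for chain in perm for p in mob[chain]]
        value = rmsd(ref_xyz, mob_xyz)
        if best is None or value < best[0]:
            best = (value, dict(zip(ref_chains, perm)))
    return best


# Toy dimer whose chain labels are swapped in the mobile structure.
ref = {"A": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], "B": [(5.0, 0.0, 0.0), (6.0, 0.0, 0.0)]}
mob = {"X": [(5.0, 0.0, 0.0), (6.0, 0.0, 0.0)], "Y": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]}

best_rmsd, mapping = best_chain_mapping(ref, mob)
print(best_rmsd, mapping)  # 0.0 {'A': 'Y', 'B': 'X'}
```

The number of permutations grows factorially with the chain count, so exhaustive search is only practical for small complexes.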

TM-align / TM-score

from pdb_cpp import Coor
from pdb_cpp.core import tmalign_ca

ref = Coor("tests/input/1y0m.cif")
mob = Coor("tests/input/1ubd.pdb")

result = tmalign_ca(ref, mob, chain_1=["A"], chain_2=["C"], mm=1)

print(result.L_ali)  # aligned length
print(result.rmsd)   # RMSD on aligned residues
print(result.TM1)    # TM-score normalized by structure 1
print(result.TM2)    # TM-score normalized by structure 2

If you use the USalign/TM-align functionality in pdb_cpp, please cite:

  • Chengxin Zhang, Morgan Shine, Anna Marie Pyle, Yang Zhang (2022) Nat Methods. 19(9), 1109-1115.
  • Chengxin Zhang, Anna Marie Pyle (2022) iScience. 25(10), 105218.
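
To illustrate why TM1 and TM2 can differ, here is a minimal sketch of the standard TM-score formula applied to a fixed set of aligned-residue distances. The d0() and tm_score() helpers are hypothetical and do not call the bundled USalign code; the 0.5 Å value used for very short chains is an assumption.

```python
def d0(l_norm):
    """Length-dependent distance scale of the TM-score."""
    if l_norm <= 21:
        return 0.5  # assumed value for very short chains
    return 1.24 * (l_norm - 15) ** (1.0 / 3.0) - 1.8


def tm_score(distances, l_norm):
    """TM-score for aligned-pair distances (Å), normalized by `l_norm` residues."""
    scale = d0(l_norm)
    return sum(1.0 / (1.0 + (d / scale) ** 2) for d in distances) / l_norm


# 100 perfectly superposed residue pairs between a 100- and a 200-residue chain.
distances = [0.0] * 100
print(tm_score(distances, 100))  # 1.0 (normalized by the shorter chain)
print(tm_score(distances, 200))  # 0.5 (normalized by the longer chain)
```

The same alignment therefore yields two scores, one per normalization length, which is what result.TM1 and result.TM2 report.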

Secondary structure helper:

from pdb_cpp import TMalign

ss = TMalign.compute_secondary_structure(ref)
print(ss[0]["A"])  # DSSP-like secondary structure string for chain A

DockQ scoring

from pdb_cpp import Coor, analysis

model = Coor("tests/input/1rxz_colabfold_model_1.pdb")
native = Coor("tests/input/1rxz.pdb")

# Chain roles can be inferred automatically,
# or provided explicitly with rec_chains/lig_chains arguments.
scores = analysis.dockQ(model, native)

print(scores["DockQ"][0])
print(scores["Fnat"][0], scores["Fnonnat"][0])
print(scores["LRMS"][0], scores["iRMS"][0], scores["rRMS"][0])

If you use DockQ scoring in pdb_cpp, please cite:

  • DockQ, DOI: 10.1093/bioinformatics/btae586
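
The DockQ value combines Fnat, iRMS, and LRMS into a single number between 0 and 1. The sketch below applies the published DockQ formula to precomputed metrics; the dockq() helper name is hypothetical and is not part of the pdb_cpp API.

```python
def dockq(fnat, irms, lrms, d_irms=1.5, d_lrms=8.5):
    """Combine Fnat, iRMS (Å) and LRMS (Å) into a DockQ value."""

    def scaled(rms, d):
        # Maps an RMSD in [0, inf) onto (0, 1]; equals 0.5 when rms == d.
        return 1.0 / (1.0 + (rms / d) ** 2)

    return (fnat + scaled(irms, d_irms) + scaled(lrms, d_lrms)) / 3.0


print(dockq(1.0, 0.0, 0.0))  # 1.0 for a perfect model
print(round(dockq(0.5, 1.5, 8.5), 3))  # 0.5
```

Scores are commonly binned as incorrect (below 0.23), acceptable (0.23 to 0.49), medium (0.49 to 0.80), and high quality (0.80 and above).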

Hydrogen bond detection

pdb_cpp.hbond identifies hydrogen bonds using the Baker & Hubbard geometric criteria. Hydrogen positions are reconstructed algebraically when not present in the file, so no pre-processing step is required.

from pdb_cpp import Coor
from pdb_cpp import hbond

coor = Coor("tests/input/2rri.cif")

# One list of HBond objects per model frame
all_bonds = hbond.hbonds(coor)
print(f"Model 0: {len(all_bonds[0])} H-bonds")

# Inspect bond geometry
b = all_bonds[0][0]
print(f"Donor  : {b.donor_chain}{b.donor_resid} {b.donor_heavy_name}")
print(f"Acceptor: {b.acceptor_chain}{b.acceptor_resid} {b.acceptor_name}")
print(f"d(D..A) = {b.dist_DA:.2f} Å  ∠DHA = {b.angle_DHA:.1f}°")

# Cross-selection: protein donors to nucleic-acid acceptors
rna_bonds = hbond.hbonds(coor, donor_sel="protein", acceptor_sel="nucleic")

Default cutoffs follow Baker & Hubbard (1984): dist_DA_cutoff=3.5 Å, dist_HA_cutoff=2.5 Å, angle_cutoff=90°.
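
The three cutoffs can be read as a simple geometric test on a donor, hydrogen, and acceptor triplet. The sketch below is a plain-Python illustration, not the pdb_cpp implementation: the is_hbond() and angle_deg() helpers are hypothetical, and it assumes the distances are upper bounds and the D-H...A angle is a lower bound.

```python
import math


def angle_deg(a, b, c):
    """Angle at point b (degrees) formed by the points a-b-c."""
    v1 = [p - q for p, q in zip(a, b)]
    v2 = [p - q for p, q in zip(c, b)]
    dot = sum(x * y for x, y in zip(v1, v2))
    cos = dot / (math.hypot(*v1) * math.hypot(*v2))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))


def is_hbond(donor, hydrogen, acceptor,
             dist_DA_cutoff=3.5, dist_HA_cutoff=2.5, angle_cutoff=90.0):
    """Apply the distance and angle criteria to one candidate triplet."""
    return (
        math.dist(donor, acceptor) <= dist_DA_cutoff
        and math.dist(hydrogen, acceptor) <= dist_HA_cutoff
        and angle_deg(donor, hydrogen, acceptor) >= angle_cutoff
    )


donor, hydrogen = (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)
print(is_hbond(donor, hydrogen, (2.9, 0.0, 0.0)))  # True: linear, short
print(is_hbond(donor, hydrogen, (0.5, 2.0, 0.0)))  # False: angle below 90°
```

When hydrogens are absent from the file, the hydrogen position passed to such a test has to be reconstructed from the heavy-atom geometry first.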

Geometry utilities

from pdb_cpp import Coor, geom

coor = Coor("tests/input/1y0m.cif")
ca = coor.select_atoms("name CA")
dist = geom.distance_matrix(ca, ca)
print(dist.shape)
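
For reference, a distance matrix holds the Euclidean distance between every pair of points from two sets, so two sets of N and M atoms give an N x M result. The pure-Python distance_matrix() helper below is a hypothetical illustration on toy coordinates, not the compiled geom.distance_matrix.

```python
import math


def distance_matrix(xyz_1, xyz_2):
    """Return a len(xyz_1) x len(xyz_2) nested list of pairwise distances."""
    return [[math.dist(p, q) for q in xyz_2] for p in xyz_1]


# Toy coordinates standing in for C-alpha positions.
ca_xyz = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 12.0)]
matrix = distance_matrix(ca_xyz, ca_xyz)

print(len(matrix), len(matrix[0]))  # 3 3
print(matrix[0][1])                 # 5.0
print(matrix[0][2])                 # 12.0
```

When both inputs are the same selection, the matrix is symmetric with zeros on the diagonal.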

Benchmarks

Benchmark scripts are described in benchmark/README.md and cover:

  • DockQ vs pdb_cpp implementation
  • I/O read/write speed
  • Common operation speed comparisons (pdb_cpp, pdb_numpy, biopython, biotite)

Notes for contributors (C++ core)

When adding C++ features:

  1. Add implementation files in src/pdb_cpp/_core/
  2. Register sources in setup.py
  3. Expose bindings in src/pdb_cpp/_core/pybind.cpp
  4. Reinstall extension (pip install -e . --no-build-isolation) and run tests
