Build local CATH domain databases and generate structure-based domain fingerprints.

domain-finger-print

domain-finger-print is a Python package for:

  • building a local CATH domain database from official domain-only structures
  • generating structure-only fingerprints against that database with Foldseek

The package scope is intentionally narrow: it builds databases and writes fingerprint npz files. Visualization is not part of the package API.

What It Does

  • Downloads official CATH classification files
  • Downloads an official nonredundant CATH domain-only PDB archive (S20 or S40)
  • Extracts one PDB file per domain into a local structures/ directory
  • Excludes whole-chain entries (domain IDs ending in 00) by default, so the local database contains only chopped domains
  • Builds a SQLite metadata index for downstream search
  • Generates fixed-width fingerprints over the full CATH superfamily vocabulary
  • Stores qTM and tTM (TM-scores normalized by query length and by target length, respectively) separately, plus a stacked [qTM || tTM] matrix, in one compressed npz

Install

pip install -e . --no-build-isolation

Build a CATH Database

dfp build-db cath --out-dir ./cath_s20_db --redundancy 20

Useful flags:

  • --redundancy 20|40
  • --version latest-release
  • --keep-archive
  • --include-whole-chain
  • --force

Generate Fingerprints

Single query:

dfp fingerprint \
  --db ./cath_s20_db \
  --query ./queries/my_protein.pdb \
  --out ./results/my_protein_fingerprint.npz

Directory of queries:

dfp fingerprint \
  --db ./cath_s20_db \
  --query-dir ./queries \
  --glob "*.pdb" \
  --out ./results/fingerprints_full.npz \
  --workers 96 \
  --prefilter-max-seqs 100

Useful flags:

  • --foldseek tools/foldseek/bin/foldseek
  • --foldseek-db ./foldseek_db/cath_s20
  • --foldseek-gpu
  • --workers 96
  • --prefilter-max-seqs 100
  • --foldseek-sensitivity 9.5
  • --min-domain-length 0
  • --min-aligned-length 60
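
For scripted runs, the directory invocation above can be assembled from Python. This is a minimal sketch that only uses flags shown in the examples; the helper name build_fingerprint_argv is hypothetical and is not part of the package.

```python
import shlex
import subprocess


def build_fingerprint_argv(db, query_dir, out, glob="*.pdb",
                           workers=None, prefilter_max_seqs=None):
    """Assemble a `dfp fingerprint` command line for a directory of queries.

    Hypothetical helper: it only concatenates flags that appear in the
    examples above and does not validate them against the CLI.
    """
    argv = [
        "dfp", "fingerprint",
        "--db", str(db),
        "--query-dir", str(query_dir),
        "--glob", glob,
        "--out", str(out),
    ]
    # Optional flags are appended only when a value is supplied.
    if workers is not None:
        argv += ["--workers", str(workers)]
    if prefilter_max_seqs is not None:
        argv += ["--prefilter-max-seqs", str(prefilter_max_seqs)]
    return argv


argv = build_fingerprint_argv(
    "./cath_s20_db", "./queries", "./results/fingerprints_full.npz",
    workers=96, prefilter_max_seqs=100,
)
print(shlex.join(argv))  # inspect the command before running it

# Uncomment to execute; requires dfp to be installed and on PATH.
# subprocess.run(argv, check=True)
```

Passing the arguments as a list avoids shell quoting problems with the glob pattern.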

Output Format

The package writes one compressed npz containing:

  • query_labels
  • q_feature_labels
  • t_feature_labels
  • stacked_feature_labels
  • q_matrix
  • t_matrix
  • stacked_matrix
  • metadata_json

The feature space is fixed by the CATH superfamily vocabulary, so every run against the same database produces fingerprints of the same dimensionality.
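
To illustrate what a fixed feature space implies, the sketch below maps a sparse list of hits onto a dense vector with one slot per superfamily. It is not the package's implementation: taking the maximum score per superfamily and filling unmatched slots with zero are assumptions made for the example, and the three superfamily IDs stand in for the full vocabulary.

```python
import numpy as np


def dense_fingerprint(vocabulary, hits):
    """Map (superfamily, score) hits onto a fixed-width vector.

    Assumptions for illustration: the best score per superfamily is kept,
    and superfamilies without a hit stay at 0.0.
    """
    index = {superfamily: i for i, superfamily in enumerate(vocabulary)}
    vector = np.zeros(len(vocabulary), dtype=np.float32)
    for superfamily, score in hits:
        i = index[superfamily]
        vector[i] = max(vector[i], score)
    return vector


# Illustrative vocabulary; a real database covers every CATH superfamily.
vocabulary = ["1.10.8.10", "2.40.50.140", "3.40.50.300"]

with_hits = dense_fingerprint(
    vocabulary, [("3.40.50.300", 0.62), ("3.40.50.300", 0.48)]
)
without_hits = dense_fingerprint(vocabulary, [])

# Both vectors have the same width regardless of how many hits were found.
print(with_hits.shape, without_hits.shape)
```

Because the width depends only on the vocabulary, fingerprints from separate runs against the same database can be compared column by column.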

The stacked fingerprint is:

[qTM over all superfamilies || tTM over all superfamilies]

Schema details are documented in docs/fingerprint_npz_schema.md.
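
A minimal sketch of reading the output with NumPy follows. The array names come from the list above; the shapes, the label strings, and the assumption that metadata_json holds a single JSON string are illustrative only, so consult the schema document for the authoritative layout. A small synthetic file stands in for real output so the snippet runs on its own.

```python
import json
import os
import tempfile

import numpy as np


def load_fingerprints(path):
    """Read every array from a fingerprint npz into a plain dict."""
    # allow_pickle=False fails loudly if the file holds object arrays.
    with np.load(path, allow_pickle=False) as npz:
        return {key: npz[key] for key in npz.files}


# Synthetic stand-in for a real output file (2 queries, 2 superfamilies).
# Shapes and label formats here are assumptions, not the documented schema.
q = np.array([[0.7, 0.0], [0.1, 0.5]], dtype=np.float32)
t = np.array([[0.6, 0.0], [0.2, 0.4]], dtype=np.float32)
demo_path = os.path.join(tempfile.mkdtemp(), "demo_fingerprint.npz")
np.savez_compressed(
    demo_path,
    query_labels=np.array(["query_a", "query_b"]),
    q_feature_labels=np.array(["q_sf1", "q_sf2"]),
    t_feature_labels=np.array(["t_sf1", "t_sf2"]),
    stacked_feature_labels=np.array(["q_sf1", "q_sf2", "t_sf1", "t_sf2"]),
    q_matrix=q,
    t_matrix=t,
    stacked_matrix=np.hstack([q, t]),
    metadata_json=np.array(json.dumps({"note": "synthetic example"})),
)

data = load_fingerprints(demo_path)
for key, array in data.items():
    print(key, array.shape, array.dtype)

# The stacked matrix should equal the q and t blocks placed side by side.
stacked_ok = np.array_equal(
    data["stacked_matrix"], np.hstack([data["q_matrix"], data["t_matrix"]])
)
metadata = json.loads(data["metadata_json"].item())
```

To inspect a real file, pass its path (for example ./results/my_protein_fingerprint.npz) to load_fingerprints instead of the synthetic one.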

Output Layout

cath_s20_db/
├── db_info.json
├── downloads/
├── metadata.sqlite
└── structures/
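
A quick sanity check of a built database directory can be sketched as follows. The entry names are taken from the layout above; the helper names are hypothetical, and the table listing relies only on SQLite's built-in sqlite_master catalog, so it assumes nothing about the package's metadata schema.

```python
import sqlite3
from pathlib import Path

# Entries shown in the layout above.
EXPECTED_ENTRIES = ("db_info.json", "downloads", "metadata.sqlite", "structures")


def missing_entries(db_dir):
    """Return the expected entries that are absent from db_dir."""
    root = Path(db_dir)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]


def list_tables(sqlite_path):
    """List table names in a SQLite file without assuming any schema."""
    # Note: sqlite3.connect creates an empty file if the path does not
    # exist, so call missing_entries first on an unverified directory.
    connection = sqlite3.connect(str(sqlite_path))
    try:
        rows = connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        connection.close()
    return [row[0] for row in rows]


db_dir = Path("./cath_s20_db")
missing = missing_entries(db_dir)
if missing:
    print("missing:", ", ".join(missing))
else:
    print("tables:", list_tables(db_dir / "metadata.sqlite"))
```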

Notes

  • CATH latest-release provides nonredundant domain-only PDB archives for S20 and S40.
  • By default the builder removes CATH entries whose domain number is 00, because those represent whole-chain entries without domain chopping.
  • Foldseek is used directly for structure search and scoring; the package does not currently expose TM-align reranking.
  • Visualization, PCA, UMAP, and heatmaps are kept out of the installable package on purpose.
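
The whole-chain rule in the notes above can be sketched as a simple filter. This assumes the common seven-character CATH domain ID form (four-character PDB code, chain ID, two-digit domain number); the function names and the example IDs are illustrative and are not taken from the package.

```python
def is_whole_chain(domain_id):
    """True when the trailing two-digit domain number is 00."""
    return domain_id.endswith("00")


def select_domains(domain_ids, include_whole_chain=False):
    """Drop whole-chain entries unless include_whole_chain is set.

    Mirrors the default behaviour described in the notes and the
    --include-whole-chain flag; a sketch, not the builder's code.
    """
    return [
        domain_id
        for domain_id in domain_ids
        if include_whole_chain or not is_whole_chain(domain_id)
    ]


# Illustrative IDs: one whole-chain entry and two chopped domains.
domain_ids = ["1abcA00", "1abcB01", "1abcB02"]

chopped_only = select_domains(domain_ids)
everything = select_domains(domain_ids, include_whole_chain=True)
print(chopped_only)
print(everything)
```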
