Build local CATH domain databases and generate structure-based domain fingerprints.

domain-finger-print

domain-finger-print is a Python package for:

  • building a local CATH domain database from official domain-only structures
  • generating structure-only fingerprints against that database with Foldseek

The package scope is intentionally narrow: it builds databases and writes fingerprint npz files. Visualization is not part of the package API.

What It Does

  • Downloads official CATH classification files
  • Downloads an official nonredundant CATH domain-only PDB archive (S20 or S40)
  • Extracts one PDB file per domain into a local structures/ directory
  • Excludes whole-chain entries (domain IDs ending in 00) by default, so the local database contains only chopped domains
  • Builds a SQLite metadata index for downstream search
  • Generates fixed-width fingerprints over the full target CATH domain vocabulary
  • Stores one compact fingerprint matrix where each hit score is (qTM + tTM) / 2, the mean of the query-normalized and target-normalized TM-scores

Install

pip install domain-finger-print

Foldseek must be installed separately and available as foldseek on PATH, or passed with --foldseek.
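
A quick way to confirm that the binary is discoverable before running dfp is sketched below. It uses only the Python standard library; the helper name find_foldseek is hypothetical and is not part of domain-finger-print.

```python
import shutil
from typing import Optional


def find_foldseek(explicit_path: Optional[str] = None) -> Optional[str]:
    """Return the resolved Foldseek executable path, or None if not found.

    Hypothetical helper: it mirrors the "on PATH, or passed explicitly" rule
    but is not how the package itself resolves the binary.
    """
    candidate = explicit_path or "foldseek"
    # shutil.which accepts both bare command names and explicit paths.
    return shutil.which(candidate)


if __name__ == "__main__":
    resolved = find_foldseek()
    if resolved is None:
        print("foldseek not found on PATH; pass its location with --foldseek")
    else:
        print(f"foldseek found at {resolved}")
```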

First-Time Setup

Configure Foldseek, download CATH S20/S40, and prebuild Foldseek indexes:

dfp init

This writes a config file to:

~/.config/domain_finger_print/config.json

Useful setup variants:

dfp init --redundancy 20
dfp init --redundancy 40 --default-redundancy 40
dfp init --redundancy both --data-dir ~/.cache/domain_finger_print
dfp init --foldseek /path/to/foldseek

Build a CATH Database

dfp build-db cath --out-dir ./cath_s20_db --redundancy 20

Useful flags:

  • --redundancy 20|40
  • --version latest-release
  • --keep-archive
  • --include-whole-chain
  • --force

Generate Fingerprints

Single query:

dfp fingerprint \
  --query ./queries/my_protein.pdb \
  --out ./results/my_protein_fingerprint.npz

If --db is omitted, dfp fingerprint uses the configured default database from dfp init. If no config exists, it automatically initializes CATH S20 first.

Directory of queries:

dfp fingerprint \
  --query-dir ./queries \
  --glob "*.pdb" \
  --recursive \
  --out ./results/fingerprints_full.npz \
  --workers 96 \
  --prefilter-max-seqs 100

Python API:

from domain_finger_print import collect_query_paths, generate_fingerprints

query_paths = collect_query_paths(query_dir="./queries", recursive=True)
generate_fingerprints(
    query_paths=query_paths,
    db_root="./cath_s20_db",
    out_path="./results/fingerprints.npz",
    workers=8,
)

Switch between configured CATH redundancy levels:

dfp fingerprint \
  --redundancy 40 \
  --query-dir ./queries \
  --out ./results/fingerprints_s40.npz

Useful flags:

  • --db ./cath_s20_db
  • --redundancy 20|40
  • --config ~/.config/domain_finger_print/config.json
  • --foldseek tools/foldseek/bin/foldseek
  • --foldseek-db ./foldseek_db/cath_s20
  • --foldseek-gpu
  • --workers 96
  • --prefilter-max-seqs 100
  • --recursive
  • --foldseek-sensitivity 9.5
  • --foldseek-verbosity 0
  • --min-domain-length 0
  • --min-aligned-length 60

Output Format

The package writes one compressed npz containing:

  • query_labels
  • feature_labels
  • fingerprint_matrix
  • metadata_json

The feature space is fixed by target CATH domain ID, with one column per domain in the database. This means every run against the same database has the same dimensionality.
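
Reading the file back with NumPy might look like the sketch below. The four array names come from the list above; how metadata_json is encoded is an assumption here (a JSON string stored in a single-element array), so check docs/fingerprint_npz_schema.md before relying on it. The function name load_fingerprints is illustrative.

```python
import json

import numpy as np


def load_fingerprints(npz_path):
    """Load the four documented arrays from a fingerprint npz file."""
    with np.load(npz_path, allow_pickle=False) as data:
        query_labels = data["query_labels"]
        feature_labels = data["feature_labels"]
        matrix = data["fingerprint_matrix"]
        # Assumption: metadata_json is a single-element array holding a JSON string.
        metadata = json.loads(data["metadata_json"].item())
    return query_labels, feature_labels, matrix, metadata
```

With a real output file, matrix.shape would be (number of queries, number of target domains), and matrix[i] is the fingerprint for query_labels[i].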

The fingerprint score is:

tm_score = (qTM + tTM) / 2

Each target CATH domain has its own column. Hits are not pooled by superfamily, so this keeps finer structural detail than a superfamily-level fingerprint. Target domains not returned by Foldseek, or filtered by --min-aligned-length, are stored as 0. The domain ID for each column is stored in metadata_json["feature_domain_ids"].
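
The scoring and zero-filling described above can be sketched as follows. This is an illustration of the idea, not the package's implementation: the function name, the hit tuple layout, the strict "shorter than the threshold" comparison, and keeping the best score when a target appears more than once are all assumptions.

```python
import numpy as np


def build_fingerprint_row(hits, feature_domain_ids, min_aligned_length=60):
    """Build one fixed-width fingerprint row.

    hits: iterable of (target_domain_id, qtm, ttm, aligned_length) tuples.
    feature_domain_ids: ordered list of every target domain ID (one per column).
    """
    column_of = {domain_id: i for i, domain_id in enumerate(feature_domain_ids)}
    # Targets with no surviving hit stay at 0.
    row = np.zeros(len(feature_domain_ids), dtype=np.float32)
    for target_id, qtm, ttm, aligned_length in hits:
        col = column_of.get(target_id)
        if col is None or aligned_length < min_aligned_length:
            continue  # unknown target, or alignment too short
        score = (qtm + ttm) / 2.0
        # Assumption: keep the best score if a target is hit more than once.
        row[col] = max(row[col], score)
    return row
```

Because the row length depends only on feature_domain_ids, two queries scored against the same database always produce vectors that can be compared column by column.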

Schema details are documented in docs/fingerprint_npz_schema.md.

Output Layout

cath_s20_db/
├── db_info.json
├── downloads/
├── metadata.sqlite
└── structures/
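
The table layout of metadata.sqlite is not described here, so one way to explore it is to ask SQLite for its own schema. The sketch below uses only the standard library sqlite3 module and makes no assumption about table or column names; list_tables is a hypothetical helper name.

```python
import sqlite3


def list_tables(sqlite_path):
    """Return {table_name: [column names]} for every table in a SQLite file."""
    conn = sqlite3.connect(sqlite_path)
    try:
        tables = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
        schema = {}
        for (name,) in tables:
            # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
            columns = conn.execute(f'PRAGMA table_info("{name}")').fetchall()
            schema[name] = [column[1] for column in columns]
        return schema
    finally:
        conn.close()
```

Calling list_tables("./cath_s20_db/metadata.sqlite") after a build would show which tables and columns are available for downstream queries.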

Notes

  • CATH latest-release provides nonredundant domain-only PDB archives for S20 and S40.
  • CATH also publishes S35/S60 domain list files, but not matching nonredundant domain-only PDB archives in the same latest-release directory.
  • By default the builder removes CATH entries whose domain number is 00, because those represent whole-chain entries without domain chopping.
  • Foldseek is used directly for structure search and scoring; the package does not currently expose TM-align reranking.
  • Visualization, PCA, UMAP, and heatmaps are kept out of the installable package on purpose.
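
The whole-chain rule in the notes above can be illustrated with a small filter. It assumes the standard seven-character CATH domain ID (four-character PDB code, one chain character, two-digit domain number); the function names are hypothetical and only mirror the behavior of the --include-whole-chain flag.

```python
def is_whole_chain(domain_id: str) -> bool:
    """True if a CATH domain ID has domain number 00 (an unchopped chain)."""
    return len(domain_id) == 7 and domain_id.endswith("00")


def filter_domains(domain_ids, include_whole_chain=False):
    """Drop whole-chain entries unless explicitly requested."""
    if include_whole_chain:
        return list(domain_ids)
    return [d for d in domain_ids if not is_whole_chain(d)]


if __name__ == "__main__":
    print(filter_domains(["1oaiA00", "1cukA01", "1cukA02"]))
```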
