Build local CATH domain databases and generate structure-based domain fingerprints.
domain-finger-print
domain-finger-print is a Python package for:
- building a local CATH domain database from official domain-only structures
- generating structure-only fingerprints against that database with Foldseek
The package scope is intentionally narrow: it builds databases and writes fingerprint npz files. Visualization is not part of the package API.
What It Does
- Downloads official CATH classification files
- Downloads an official nonredundant CATH domain-only PDB archive (S20 or S40)
- Extracts one PDB file per domain into a local structures/ directory
- Excludes ...00 whole-chain entries by default, so the local DB is chopped-domain focused
- Builds a SQLite metadata index for downstream search
- Generates fixed-width fingerprints over the full target CATH domain vocabulary
- Stores one compact fingerprint matrix where each hit score is (qTM + tTM) / 2
Install
pip install domain-finger-print
Foldseek must be installed separately and available as foldseek on PATH, or passed with --foldseek.
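Before running any command, it can help to confirm that Foldseek is actually discoverable. The following is a minimal sketch using only the Python standard library; `find_foldseek` is a hypothetical helper name and is not part of the package API.

```python
import shutil


def find_foldseek(explicit_path=None):
    """Return a usable Foldseek executable path, or None if none is found."""
    # An explicit path plays the role of --foldseek; otherwise search PATH.
    candidate = explicit_path or "foldseek"
    return shutil.which(candidate)


location = find_foldseek()
print(location or "foldseek not found on PATH; install it or pass --foldseek")
```

`shutil.which` accepts either a bare command name or a path to an executable, so the same helper covers both ways of pointing at Foldseek.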
First-Time Setup
Configure Foldseek, download CATH S20/S40, and prebuild Foldseek indexes:
dfp init
This writes a config file to:
~/.config/domain_finger_print/config.json
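To inspect what dfp init wrote, the config can be read like any other JSON file. This is a sketch only: the key names inside the file are not documented here, so the example lists whatever keys are present instead of assuming any. `load_config` is a hypothetical helper, not a package function.

```python
import json
from pathlib import Path

DEFAULT_CONFIG = Path.home() / ".config" / "domain_finger_print" / "config.json"


def load_config(path=DEFAULT_CONFIG):
    """Return the parsed config, or None if the file does not exist."""
    path = Path(path).expanduser()
    if not path.is_file():
        return None
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


config = load_config()
if config is None:
    print("No config found; run `dfp init` first.")
else:
    # Key names are not assumed; just show what the file contains.
    print(sorted(config))
```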
Useful setup variants:
dfp init --redundancy 20
dfp init --redundancy 40 --default-redundancy 40
dfp init --redundancy both --data-dir ~/.cache/domain_finger_print
dfp init --foldseek /path/to/foldseek
Build a CATH Database
dfp build-db cath --out-dir ./cath_s20_db --redundancy 20
Useful flags:
- --redundancy 20|40
- --version latest-release
- --keep-archive
- --include-whole-chain
- --force
Generate Fingerprints
Single query:
dfp fingerprint \
  --query ./queries/my_protein.pdb \
  --out ./results/my_protein_fingerprint.npz
If --db is omitted, dfp fingerprint uses the configured default database from dfp init. If no config exists, it automatically initializes CATH S20 first.
Directory of queries:
dfp fingerprint \
  --query-dir ./queries \
  --glob "*.pdb" \
  --recursive \
  --out ./results/fingerprints_full.npz \
  --workers 96 \
  --prefilter-max-seqs 100
Python API:
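from domain_finger_print import collect_query_paths, generate_fingerprints

# Gather query structures under ./queries, descending into subdirectories.
query_paths = collect_query_paths(query_dir="./queries", recursive=True)

# Search the local CATH database and write one fingerprint npz.
generate_fingerprints(
    query_paths=query_paths,
    db_root="./cath_s20_db",
    out_path="./results/fingerprints.npz",
    workers=8,
)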
Switch between configured CATH redundancy levels (S20 or S40):
dfp fingerprint \
  --redundancy 40 \
  --query-dir ./queries \
  --out ./results/fingerprints_s40.npz
Useful flags:
- --db ./cath_s20_db
- --redundancy 20|40
- --config ~/.config/domain_finger_print/config.json
- --foldseek tools/foldseek/bin/foldseek
- --foldseek-db ./foldseek_db/cath_s20
- --foldseek-gpu
- --workers 96
- --prefilter-max-seqs 100
- --recursive
- --foldseek-sensitivity 9.5
- --foldseek-verbosity 0
- --min-domain-length 0
- --min-aligned-length 60
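When fingerprinting is driven from a pipeline script, it can be convenient to assemble the command as an argument list. The sketch below uses only flags shown above; `fingerprint_command` and its keyword names are hypothetical and exist only for this illustration.

```python
import shlex


def fingerprint_command(query_dir, out_path, workers=8, redundancy=None,
                        recursive=False, min_aligned_length=None):
    """Assemble a `dfp fingerprint` argument list from flags listed above."""
    cmd = ["dfp", "fingerprint", "--query-dir", str(query_dir),
           "--out", str(out_path), "--workers", str(workers)]
    if redundancy is not None:
        cmd += ["--redundancy", str(redundancy)]
    if recursive:
        cmd.append("--recursive")
    if min_aligned_length is not None:
        cmd += ["--min-aligned-length", str(min_aligned_length)]
    return cmd


cmd = fingerprint_command("./queries", "./results/fingerprints_s40.npz",
                          redundancy=40, recursive=True)
print(shlex.join(cmd))
# To execute it (needs dfp and Foldseek installed):
#   import subprocess; subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems with paths that contain spaces.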
Output Format
The package writes one compressed npz containing:
- query_labels
- feature_labels
- fingerprint_matrix
- metadata_json
Feature space is fixed by target CATH domain ID. This means every run against the same database has the same dimensionality.
The fingerprint score is:
tm_score = (qTM + tTM) / 2
Each target CATH domain has its own column. Hits are not pooled by superfamily, so this keeps finer structural detail than a superfamily-level fingerprint. Target domains not returned by Foldseek, or filtered by --min-aligned-length, are stored as 0.
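The scoring and zero-filling rule can be illustrated with a short sketch. This is not the package's implementation: the hit tuple layout, the domain IDs, and the choice to keep the best score when a target appears more than once are all assumptions, and 60 is used for the length cutoff only because it is the example value shown for --min-aligned-length.

```python
def fingerprint_row(hits, feature_domain_ids, min_aligned_length=60):
    """Build one fixed-width row from (domain_id, qTM, tTM, aligned_length) hits."""
    column = {domain_id: i for i, domain_id in enumerate(feature_domain_ids)}
    row = [0.0] * len(feature_domain_ids)
    for domain_id, qtm, ttm, aligned_length in hits:
        if aligned_length < min_aligned_length or domain_id not in column:
            continue  # filtered or unknown targets stay at 0
        score = (qtm + ttm) / 2
        # Assumption: a repeated target keeps its best score.
        row[column[domain_id]] = max(row[column[domain_id]], score)
    return row


features = ["1abcA01", "1abcA02", "2xyzB01"]  # made-up domain IDs
hits = [("1abcA01", 0.8, 0.6, 120),  # kept: averages to about 0.7
        ("2xyzB01", 0.9, 0.7, 40)]   # dropped: aligned length below the cutoff
row = fingerprint_row(hits, features)
print(row)
```

The row always has one entry per target domain, which is what keeps the dimensionality identical across runs against the same database.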
The domain ID for each column is stored in metadata_json["feature_domain_ids"].
Schema details are documented in docs/fingerprint_npz_schema.md.
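Reading the output needs only NumPy and the standard library. The sketch below first writes a tiny stand-in file with the four documented keys so that it runs on its own; the values, the dtypes, and the assumption that metadata_json is stored as a single JSON-encoded string are illustrative guesses, so check the schema document for the real layout.

```python
import json
import os
import tempfile

import numpy as np

# Write a toy file with the four documented keys (all values are made up).
path = os.path.join(tempfile.mkdtemp(), "toy_fingerprint.npz")
domain_ids = ["1abcA01", "1abcA02", "2xyzB01"]
np.savez_compressed(
    path,
    query_labels=np.array(["query_a", "query_b"]),
    feature_labels=np.array(domain_ids),
    fingerprint_matrix=np.array([[0.7, 0.0, 0.2], [0.0, 0.9, 0.0]], dtype=np.float32),
    metadata_json=np.array(json.dumps({"feature_domain_ids": domain_ids})),
)


def top_hits(npz_path):
    """Return {query_label: (domain_id, score)} for each query's best column."""
    with np.load(npz_path) as data:
        matrix = data["fingerprint_matrix"]
        queries = data["query_labels"]
        # Assumption: metadata_json is a 0-d array holding one JSON string.
        metadata = json.loads(data["metadata_json"].item())
    ids = metadata["feature_domain_ids"]
    best = matrix.argmax(axis=1)
    return {str(q): (ids[j], float(matrix[i, j]))
            for i, (q, j) in enumerate(zip(queries, best))}


result = top_hits(path)
print(result)
```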
Output Layout
cath_s20_db/
├── db_info.json
├── downloads/
├── metadata.sqlite
└── structures/
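A quick sanity check of a built database directory can be sketched with pathlib. The helper name `missing_db_entries` is hypothetical, and whether downloads/ is always kept (for example without --keep-archive) is not stated here, so treat a missing downloads/ as informational.

```python
from pathlib import Path

EXPECTED_FILES = ("db_info.json", "metadata.sqlite")
EXPECTED_DIRS = ("downloads", "structures")


def missing_db_entries(db_root):
    """List expected top-level entries that are absent from a database directory."""
    root = Path(db_root)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    missing += [name + "/" for name in EXPECTED_DIRS if not (root / name).is_dir()]
    return missing


print(missing_db_entries("./cath_s20_db") or "layout looks complete")
```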
Notes
- CATH latest-release provides nonredundant domain-only PDB archives for S20 and S40.
- CATH also publishes S35/S60 domain list files, but not matching nonredundant domain-only PDB archives in the same latest-release directory.
- By default the builder removes CATH entries whose domain number is 00, because those represent whole-chain entries without domain chopping.
- Foldseek is used directly for structure search and scoring; the package does not currently expose TM-align reranking.
- Visualization, PCA, UMAP, and heatmaps are kept out of the installable package on purpose.
File details
Details for the file domain_finger_print-0.1.0a4.tar.gz.
File metadata
- Download URL: domain_finger_print-0.1.0a4.tar.gz
- Upload date:
- Size: 22.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1513d0b68d4f2887928baa32b3f650aeab46b57d809498b81d2d2aca94700fef |
| MD5 | 7bbfed8b1bd119750e73cc9e5067de2d |
| BLAKE2b-256 | bd27a3c15da22e8f76eddd5dbd778f90e6b123c0958bf980c718e4c7b7385131 |
File details
Details for the file domain_finger_print-0.1.0a4-py3-none-any.whl.
File metadata
- Download URL: domain_finger_print-0.1.0a4-py3-none-any.whl
- Upload date:
- Size: 19.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 167b9f154af7bb2df5583e7d16c30d7fbf851b20578aa071d13d241c385cc36d |
| MD5 | 552372833c104839e2b12aa98f9837f3 |
| BLAKE2b-256 | 127a4d8b8a12b09ca804a72beaec7774dce5d8bd752e23ba8309ba72d1190af9 |