Build local CATH domain databases and generate structure-based domain fingerprints.
domain-finger-print
domain-finger-print is a Python package for:
- building a local CATH domain database from official domain-only structures
- generating structure-only fingerprints against that database with Foldseek
The package scope is intentionally narrow: it builds databases and writes fingerprint npz files. Visualization is not part of the package API.
What It Does
- Downloads official CATH classification files
- Downloads an official nonredundant CATH domain-only PDB archive (S20 or S40)
- Extracts one PDB file per domain into a local structures/ directory
- Excludes ...00 whole-chain entries by default, so the local DB is chopped-domain focused
- Builds a SQLite metadata index for downstream search
- Generates fixed-width fingerprints over the full CATH superfamily vocabulary
- Stores qTM and tTM separately, plus a stacked [qTM || tTM] matrix, in one compressed npz
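The whole-chain exclusion in the list above can be sketched in a few lines. This is an illustration, not the package's implementation: it assumes the usual 7-character CATH domain ID form (4-character PDB code, chain character, 2-digit domain number), and the helper names is_whole_chain and keep_chopped_domains are hypothetical.

```python
import re

# Assumption: CATH domain IDs look like "1cukA01":
# 4-character PDB code + 1 chain character + 2-digit domain number.
_DOMAIN_ID = re.compile(r"^[0-9A-Za-z]{4}[0-9A-Za-z]\d{2}$")


def is_whole_chain(domain_id: str) -> bool:
    """Return True when the domain number is 00 (whole chain, no chopping)."""
    if _DOMAIN_ID.match(domain_id) is None:
        raise ValueError(f"not a 7-character CATH domain ID: {domain_id!r}")
    return domain_id.endswith("00")


def keep_chopped_domains(domain_ids, include_whole_chain=False):
    """Drop ...00 entries unless the caller opts in to keeping them."""
    if include_whole_chain:
        return list(domain_ids)
    return [d for d in domain_ids if not is_whole_chain(d)]


if __name__ == "__main__":
    ids = ["1oaiA00", "1cukA01", "1cukA02"]
    print(keep_chopped_domains(ids))
    print(keep_chopped_domains(ids, include_whole_chain=True))
```

The include_whole_chain argument mirrors the spirit of the --include-whole-chain flag; how the builder actually applies it is not shown here.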
Install
pip install -e . --no-build-isolation
Build a CATH Database
dfp build-db cath --out-dir ./cath_s20_db --redundancy 20
Useful flags:
- --redundancy 20|40
- --version latest-release
- --keep-archive
- --include-whole-chain
- --force
Generate Fingerprints
Single query:
dfp fingerprint \
--db ./cath_s20_db \
--query ./queries/my_protein.pdb \
--out ./results/my_protein_fingerprint.npz
Directory of queries:
dfp fingerprint \
--db ./cath_s20_db \
--query-dir ./queries \
--glob "*.pdb" \
--out ./results/fingerprints_full.npz \
--workers 96 \
--prefilter-max-seqs 100
Useful flags:
- --foldseek tools/foldseek/bin/foldseek
- --foldseek-db ./foldseek_db/cath_s20
- --foldseek-gpu
- --workers 96
- --prefilter-max-seqs 100
- --foldseek-sensitivity 9.5
- --min-domain-length 0
- --min-aligned-length 60
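When fingerprinting is driven from a Python script, the directory-of-queries command can be assembled as an argument list and passed to subprocess. The sketch below only uses flags shown in this README; the function names build_fingerprint_cmd and run_fingerprint are hypothetical, and it assumes the dfp executable is on PATH.

```python
import subprocess


def build_fingerprint_cmd(db, query_dir, out, pattern="*.pdb",
                          workers=None, prefilter_max_seqs=None):
    """Assemble the argv list for `dfp fingerprint` in directory mode."""
    cmd = [
        "dfp", "fingerprint",
        "--db", str(db),
        "--query-dir", str(query_dir),
        "--glob", pattern,
        "--out", str(out),
    ]
    # Optional flags are appended only when the caller sets them.
    if workers is not None:
        cmd += ["--workers", str(workers)]
    if prefilter_max_seqs is not None:
        cmd += ["--prefilter-max-seqs", str(prefilter_max_seqs)]
    return cmd


def run_fingerprint(**kwargs):
    """Run the command and raise CalledProcessError on a non-zero exit."""
    subprocess.run(build_fingerprint_cmd(**kwargs), check=True)


if __name__ == "__main__":
    print(build_fingerprint_cmd("./cath_s20_db", "./queries",
                                "./results/fingerprints_full.npz", workers=8))
```

Passing a list rather than a shell string avoids quoting problems with the glob pattern.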
Output Format
The package writes one compressed npz containing:
- query_labels
- q_feature_labels
- t_feature_labels
- stacked_feature_labels
- q_matrix
- t_matrix
- stacked_matrix
- metadata_json
The feature space is fixed by the CATH superfamily vocabulary, so every run against the same database produces fingerprints of the same dimensionality.
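A minimal sketch of reading the output with NumPy follows. The array names come from the list above; everything else is an assumption: that label arrays are plain string arrays (so allow_pickle=False works), and that metadata_json holds a single JSON string. The helper name load_fingerprints is hypothetical, and docs/fingerprint_npz_schema.md remains the authoritative description.

```python
import json

import numpy as np

EXPECTED_KEYS = (
    "query_labels", "q_feature_labels", "t_feature_labels",
    "stacked_feature_labels", "q_matrix", "t_matrix",
    "stacked_matrix", "metadata_json",
)


def load_fingerprints(path):
    """Load every expected array and decode the metadata JSON string."""
    with np.load(path, allow_pickle=False) as npz:
        missing = [k for k in EXPECTED_KEYS if k not in npz.files]
        if missing:
            raise KeyError(f"missing arrays: {missing}")
        data = {k: npz[k] for k in EXPECTED_KEYS}
    # Assumption: metadata_json is a single-element array holding JSON text.
    raw = data["metadata_json"].item()
    if isinstance(raw, bytes):
        raw = raw.decode("utf-8")
    data["metadata"] = json.loads(raw)
    return data


if __name__ == "__main__":
    fp = load_fingerprints("./results/my_protein_fingerprint.npz")
    print(fp["q_matrix"].shape, fp["t_matrix"].shape, fp["stacked_matrix"].shape)
```

If the labels turn out to be stored as object arrays, np.load would need allow_pickle=True, which should only be used on files you trust.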
The stacked fingerprint is:
[qTM over all superfamilies || tTM over all superfamilies]
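The stacking described here is a column-wise concatenation. The sketch below shows the idea on toy numbers; it assumes q_matrix and t_matrix both have shape (n_queries, n_superfamilies), and the function name stack_fingerprints is hypothetical rather than part of the package API.

```python
import numpy as np


def stack_fingerprints(q_matrix, t_matrix):
    """Return [qTM || tTM]: q columns first, then t columns, per query row."""
    q = np.asarray(q_matrix, dtype=float)
    t = np.asarray(t_matrix, dtype=float)
    if q.ndim != 2 or q.shape != t.shape:
        raise ValueError(f"expected equal 2-D shapes, got {q.shape} and {t.shape}")
    return np.concatenate([q, t], axis=1)


if __name__ == "__main__":
    # Toy values: 2 queries scored against 3 superfamilies.
    q = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.3]]
    t = [[0.7, 0.2, 0.0], [0.1, 0.6, 0.4]]
    stacked = stack_fingerprints(q, t)
    print(stacked.shape)  # (2, 6)
```

With N superfamilies, each stacked row therefore has 2N entries, and column j and column N + j refer to the same superfamily.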
Schema details are documented in docs/fingerprint_npz_schema.md.
Output Layout
cath_s20_db/
├── db_info.json
├── downloads/
├── metadata.sqlite
└── structures/
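To check a freshly built database without relying on its internal schema, the layout above can be inspected with the standard library. This sketch assumes only the file names shown in the tree; the table names inside metadata.sqlite and the contents of db_info.json are not documented here, so the code just lists whatever it finds. The function name describe_db is hypothetical.

```python
import json
import sqlite3
from pathlib import Path


def describe_db(db_dir):
    """Summarize db_info.json, the SQLite table names, and the structure count."""
    root = Path(db_dir)
    sqlite_path = root / "metadata.sqlite"
    # Check first: sqlite3.connect would silently create a missing file.
    if not sqlite_path.is_file():
        raise FileNotFoundError(sqlite_path)
    info = json.loads((root / "db_info.json").read_text(encoding="utf-8"))
    con = sqlite3.connect(sqlite_path)
    try:
        rows = con.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        con.close()
    n_files = sum(1 for p in (root / "structures").rglob("*") if p.is_file())
    return {
        "info": info,
        "tables": [r[0] for r in rows],
        "n_structure_files": n_files,
    }


if __name__ == "__main__":
    print(describe_db("./cath_s20_db"))
```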
Notes
- CATH latest-release provides nonredundant domain-only PDB archives for S20 and S40.
- By default the builder removes CATH entries whose domain number is 00, because those represent whole-chain entries without domain chopping.
- Foldseek is used directly for structure search and scoring; the package does not currently expose TM-align reranking.
- Visualization, PCA, UMAP, and heatmaps are kept out of the installable package on purpose.
Download files
File details
Details for the file domain_finger_print-0.1.0a1.tar.gz.
File metadata
- Download URL: domain_finger_print-0.1.0a1.tar.gz
- Upload date:
- Size: 18.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8a171afd863609852e3a0e072b864b823cd3dd465a4951407b58072ad67651e9 |
| MD5 | 11897a863fb4f81c60f44702909e8df0 |
| BLAKE2b-256 | 4f3ce9f7b2d3776454657ca896ef9b5ad6ccc2e3f22fdf6f6cad7c2c1db78091 |
File details
Details for the file domain_finger_print-0.1.0a1-py3-none-any.whl.
File metadata
- Download URL: domain_finger_print-0.1.0a1-py3-none-any.whl
- Upload date:
- Size: 15.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 291ca93c2de6dace4f9505805ea1fe69014c24477fe2e4dc452789ea19006aa2 |
| MD5 | 72cff1d308afd2a2011403552ec29b81 |
| BLAKE2b-256 | ef553309b26fb890dbe986aa97bf47d9f3f96d0e53d0e0af2351eab765e7de0b |