Sequence-based programmatic access to the AlphaFold Protein Structure Database
afdb-query
Sequence-based programmatic access to the AlphaFold Protein Structure Database (AFDB). Query a protein by its amino-acid sequence, then pull per-residue pLDDT — including "the first n values" — without hand-rolling URL derivation and JSON fetching.
Install
    pip install afdb-query
Quickstart
    from afdb_query import AlphaFold

    with AlphaFold() as af:
        hits = af.search(sequence)  # Tier 1: list[Structure], in AFDB's returned order
        s = hits[0]
        s.global_plddt       # mean pLDDT for the model (cheap, from the summary)
        s.sequence_identity  # 1.0 == exact match, < 1.0 == near hit
        s.uniprot_accession  # e.g. "P12345", or None
        p = s.plddt()        # Tier 2: per-residue pLDDT (fetched once, then cached)
        p.scores             # full per-residue list[float]
        p.first(50)          # first 50 values, or all of them if the model is shorter
search raises InvalidSequenceError for sequences that cannot be queried
(internal stop *, shorter than 20 residues, or non-standard amino acids), and
returns [] when AFDB has no entry for a valid sequence.
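
The three rejection rules above can be approximated locally before any request is made. The following is a minimal sketch, not the package's own validation: the function name `precheck_sequence`, the 20-letter alphabet used for "standard", the upper-casing, and the tolerance for a single trailing `*` are all assumptions made for illustration.

```python
# Hypothetical local pre-check mirroring the rules stated in the README.
STANDARD_AA = frozenset("ACDEFGHIKLMNPQRSTVWY")  # assumption: the 20 canonical residues
MIN_LENGTH = 20  # from the README: shorter sequences cannot be queried


def precheck_sequence(sequence: str) -> list[str]:
    """Return the reasons a sequence would likely be rejected (empty list if none)."""
    problems = []
    seq = sequence.strip().upper()  # assumption: case and surrounding whitespace are ignored
    if seq.endswith("*"):
        seq = seq[:-1]  # assumption: only an *internal* stop is an error
    if "*" in seq:
        problems.append("internal stop")
    if len(seq) < MIN_LENGTH:
        problems.append("too short")
    if set(seq) - STANDARD_AA - {"*"}:
        problems.append("non-standard residues")
    return problems
```

A check like this is only a convenience for filtering large inputs; `InvalidSequenceError` raised by `search` remains the authoritative signal.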
Results come back in AFDB's returned order (ranked by sequence identity). Note that
hits[0] is not guaranteed to be the canonical AF-<accession>-F1 model — for
some sequences a multi-chain or AB-INITIO model ranks first — so pick the hit whose
model_identifier you want if you need a specific entry.
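
Choosing a specific entry from the hit list might look like the sketch below. `Hit` is a hypothetical stand-in for the package's `Structure` so the example runs on its own, and `pick_model` is an illustrative helper, not part of the package. It relies only on the `model_identifier` attribute and the `AF-<accession>-F1` naming mentioned above.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Hit:
    """Hypothetical stand-in for the package's Structure objects."""
    model_identifier: str
    sequence_identity: float


def pick_model(hits: Sequence[Hit], accession: str) -> Optional[Hit]:
    """Return the hit named AF-<accession>-F1, or None if it is not in the list."""
    wanted = f"AF-{accession}-F1"
    for hit in hits:
        if hit.model_identifier == wanted:
            return hit
    return None
```

Returning `None` rather than falling back to `hits[0]` keeps a missing canonical model visible to the caller.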
Batch lookups
search_many runs many sequences concurrently with resumable on-disk caching:
    report = af.search_many(
        [{"id": "rec1", "sequence": seq1}, {"id": "rec2", "sequence": seq2}],
        out_dir="afdb_cache",
        concurrency=6,
        plddt_first_n=50,  # optional: also save the first 50 per-residue pLDDT per hit
    )
    # report -> {"total":..., "hits":..., "misses":..., "errors":..., "skipped":..., ...}
- You supply a generic id per sequence; it keys the cache file and maps back to your own records.
- out_dir/summaries/{id}.json stores each hit (a 404 miss stores {"structures": []}); existing files are left untouched, so re-runs resume.
- With plddt_first_n set, out_dir/plddt/{id}.json stores the raw first-n per-residue pLDDT array for the selected structure.
- Real HTTP errors are counted but not saved, so they retry on the next run.
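
The cache layout above can be inspected without the package. This is a rough sketch under one labeled assumption: that hit files carry a non-empty "structures" list, by analogy with the documented miss file {"structures": []}. The helper name `scan_cache` and the "pending" label are made up for illustration.

```python
import json
from pathlib import Path


def scan_cache(out_dir, ids):
    """Classify record ids by what out_dir/summaries/{id}.json contains."""
    summaries = Path(out_dir) / "summaries"
    status = {"hits": [], "misses": [], "pending": []}
    for rec_id in ids:
        path = summaries / f"{rec_id}.json"
        if not path.exists():
            # No summary yet: never run, or an HTTP error that was not saved.
            status["pending"].append(rec_id)
            continue
        data = json.loads(path.read_text())
        # Assumption: hits store a non-empty "structures" list; misses store [].
        key = "hits" if data.get("structures") else "misses"
        status[key].append(rec_id)
    return status
```

Ids that land in "pending" are the ones a re-run would actually fetch, since existing summary files are left untouched.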
Picking the right structure (full_length=True)
By default search_many caches pLDDT for structures[0] — whatever AFDB ranks
first. That is not always the canonical single-chain model: for some sequences
a multi-chain or AB-INITIO model (e.g. twice the residue count) ranks first, so
structures[0] would give you the wrong per-residue array.
Pass full_length=True to require that the cached structure has
sequence_identity == 1.0 and a per-residue length equal to your query length:
    report = af.search_many(
        [{"id": "rec1", "sequence": seq1, "accession": "P12345"}],  # accession optional
        out_dir="afdb_cache",
        plddt_first_n=9999999,  # store the whole array; slice locally later
        full_length=True,
    )
- Among exact-length, exact-sequence hits the optional per-record accession wins (AF-<accession>-F1); otherwise selection falls back to canonical -F1 over numeric models, then highest global_plddt, deterministically.
- A record whose hits include no exact-length match is counted under no_full_length (its summary is still written, so re-runs resume) and no pLDDT is cached.
- A hit chosen by fallback while more than one exact-sequence model matched is counted under ambiguous. Distinct proteins can share an identical sequence across organisms yet have different pLDDT, so supply accession when the specific model matters.
- Because the residue count is only knowable from the confidence JSON, this mode fetches confidence (and may fetch more than one model) per record.
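
The selection rules above can be sketched as a small pure function. This is an interpretation, not the package's code: `Candidate` and its `n_residues` field are hypothetical stand-ins (the real residue count comes from the confidence JSON), "canonical -F1" is read as "identifier ends in -F1", the final tie-break on the identifier is an assumption added to make the result deterministic, and the "selected" label is invented. Only `no_full_length` and `ambiguous` are names taken from the text.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Candidate:
    """Hypothetical stand-in for one AFDB hit plus its per-residue length."""
    model_identifier: str
    sequence_identity: float
    global_plddt: float
    n_residues: int


def select_full_length(
    candidates: Sequence[Candidate],
    query_length: int,
    accession: Optional[str] = None,
) -> Tuple[Optional[Candidate], str]:
    # Keep only exact-sequence hits whose length equals the query length.
    exact = [
        c for c in candidates
        if c.sequence_identity == 1.0 and c.n_residues == query_length
    ]
    if not exact:
        return None, "no_full_length"
    # A supplied accession wins outright when its canonical model is present.
    if accession is not None:
        wanted = f"AF-{accession}-F1"
        for c in exact:
            if c.model_identifier == wanted:
                return c, "selected"
    # Fallback: prefer -F1 identifiers, then highest global pLDDT,
    # then the identifier itself (assumed tie-break).
    best = min(
        exact,
        key=lambda c: (
            not c.model_identifier.endswith("-F1"),
            -c.global_plddt,
            c.model_identifier,
        ),
    )
    return best, ("ambiguous" if len(exact) > 1 else "selected")
```

Returning the status alongside the choice mirrors how the report counts `ambiguous` and `no_full_length` records separately from ordinary hits.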
Note: resumability keys on the summary file. If you run once without plddt_first_n and again with it, already-cached records are skipped and their pLDDT is not back-filled.
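
Records affected by this can be found by comparing the two cache directories. The sketch below assumes the documented layout (out_dir/summaries/{id}.json and out_dir/plddt/{id}.json) and, as a labeled assumption, that hit summaries hold a non-empty "structures" list. The helper name `ids_missing_plddt` is made up.

```python
import json
from pathlib import Path


def ids_missing_plddt(out_dir):
    """List ids that have a hit summary on disk but no cached pLDDT file."""
    root = Path(out_dir)
    summaries = root / "summaries"
    plddt = root / "plddt"
    if not summaries.is_dir():
        return []
    missing = []
    for path in summaries.glob("*.json"):
        data = json.loads(path.read_text())
        if not data.get("structures"):
            continue  # a recorded miss: there is no pLDDT to cache
        if not (plddt / path.name).exists():
            missing.append(path.stem)
    return sorted(missing)
```

Since existing summary files are what make a record count as done, removing the summaries for these ids before re-running with plddt_first_n set would presumably trigger a fresh fetch; that is an inference from the resume behavior, not something the README states. With full_length=True, records counted under no_full_length also appear in this list by design.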
Not (yet) supported
- UniProt-accession lookup (sequence-only for now)
- PAE (Predicted Aligned Error)
- Statistics helpers (the package returns raw values; downstream math is yours)
File details
Details for the file afdb_query-0.2.0.tar.gz.
File metadata
- Download URL: afdb_query-0.2.0.tar.gz
- Upload date:
- Size: 26.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 23810afbabab6c30f76ab2a013ce822bbe2e467669abea2bf91037fb274c2698 |
| MD5 | 3f6e4676cf131a8d143cf4d806a3c871 |
| BLAKE2b-256 | 8c80f62671ebbbde086addcb020f36dbb5e01cdee51c4e1adc352b43b5aa0340 |
File details
Details for the file afdb_query-0.2.0-py3-none-any.whl.
File metadata
- Download URL: afdb_query-0.2.0-py3-none-any.whl
- Upload date:
- Size: 11.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c80a2de8b148da2898b4d58b8360a85d82c89f73e4fd1d992a3134616121b7cb |
| MD5 | 13d93a0b5186f3add370b2d59d4319f3 |
| BLAKE2b-256 | c433481f61edfe5ac401cf0bf48ce48f77fcf9ae838d8875ed4df0fec15ca2fa |