Sequence-based programmatic access to the AlphaFold Protein Structure Database

afdb-query

Sequence-based programmatic access to the AlphaFold Protein Structure Database (AFDB). Query a protein by its amino-acid sequence, then pull per-residue pLDDT without hand-rolling URL derivation and JSON fetching.

Install

pip install afdb-query

Quickstart

from afdb_query import AlphaFold

with AlphaFold() as af:
    hits = af.search(sequence)        # Tier 1: list[Structure], in AFDB's returned order
    s = hits[0]

    s.global_plddt        # mean pLDDT AFDB reports for the model (from the summary)
    s.sequence_identity   # 1.0 == exact match
    s.oligomeric_state    # "MONOMER", "HOMODIMER", ...
    s.uniprot_accession   # e.g. "P12345", or None

    p = s.plddt()         # Tier 2: per-residue pLDDT (fetched once, then cached)
    p.scores              # full per-residue list[float]
    p.residue_numbers     # parallel residue numbers — do not assume 1..N

search raises InvalidSequenceError for sequences that cannot be queried (internal stop *, shorter than 20 residues, or non-standard amino acids), and returns [] when AFDB has no entry for a valid sequence. Every HTTP or transport failure is raised as AFDBHTTPError, so catching AFDBError is enough — you never need to import httpx.
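
The three rejection reasons can be sketched as a standalone pre-check. This is an illustrative re-implementation, not the library's code: the name classify_sequence is hypothetical, the labels are borrowed from the search_many report keys, and both the order of the checks and the treatment of a single trailing * as acceptable are assumptions.

```python
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")
MIN_LENGTH = 20  # minimum queryable length stated above


def classify_sequence(sequence):
    """Return None if the sequence looks queryable, else a reason label."""
    seq = sequence.strip().upper()
    # Assumption: one trailing '*' is a terminal stop and is ignored.
    body = seq[:-1] if seq.endswith("*") else seq
    if "*" in body:
        return "internal_stop"
    if len(body) < MIN_LENGTH:
        return "too_short"
    if set(body) - STANDARD_AA:
        return "nonstandard_aa"
    return None
```

Running a check like this before a batch lets you count rejected sequences by reason without spending a request on them.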

Choosing a structure

A sequence query can match several structures, and they are not interchangeable:

hits[0] is not guaranteed to be the canonical AF-<accession>-F1 model. For some sequences a multi-chain or numeric AB-INITIO model ranks first.
global_plddt (confidence_avg_local_score) is averaged over the whole deposited model. For a HOMODIMER or HETERODIMER the average spans every chain, not just the one matching your query, so it is not comparable with a monomer's.

select_group applies three preference tiers — caller's accession, then monomers over complexes, then canonical -F1 over numeric ids — and returns everything still tied:

from afdb_query import AlphaFold, select_group, mean_global_plddt, is_monomer

with AlphaFold() as af:
    doc = af.fetch_summary(sequence)           # None when AFDB has no entry
    group = select_group(doc["structures"]) if doc else []

    if not all(is_monomer(s) for s in group):
        ...                                   # your call: skip, or use it knowingly

    plddt = mean_global_plddt(group)           # average across the group, not one member
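
To make the tier logic concrete, here is a minimal sketch of how three preference tiers could be applied in order. It is not the library's implementation: select_group_sketch is a hypothetical name, the dictionary keys oligomeric_state and uniprot_accession are assumed to mirror the Structure attributes shown in the Quickstart, and the identifiers are placeholders.

```python
def select_group_sketch(structures, accession=None):
    """Apply three preference tiers in order; return everything still tied."""

    def is_canonical(s):
        mid = s.get("model_identifier", "")
        return mid.startswith("AF-") and mid.endswith("-F1")

    tiers = [
        lambda s: accession is not None and s.get("uniprot_accession") == accession,
        lambda s: s.get("oligomeric_state") == "MONOMER",
        is_canonical,
    ]
    candidates = list(structures)
    for prefers in tiers:
        preferred = [s for s in candidates if prefers(s)]
        if preferred:  # a tier only narrows the group; it never empties it
            candidates = preferred
    # Sorting is for reproducible output only; it is not a ranking.
    return sorted(candidates, key=lambda s: s.get("model_identifier", ""))


hits = [
    {"model_identifier": "AF-P11111-F1", "oligomeric_state": "MONOMER", "uniprot_accession": "P11111"},
    {"model_identifier": "AF-Q22222-F1", "oligomeric_state": "MONOMER", "uniprot_accession": "Q22222"},
    {"model_identifier": "0000123", "oligomeric_state": "HOMODIMER", "uniprot_accession": None},
]
group = select_group_sketch(hits)
```

Here group keeps both monomers because nothing separates them. A preference that matches nothing is skipped rather than applied, which is why an all-complex input comes back unchanged instead of empty.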

No tier reads the confidence scores. Ranking candidates by pLDDT and returning the winner is not merely arbitrary, it is biased: the expected maximum of N draws rises with N, so a comparison whose two arms match different numbers of candidates gets different inflation on each side.

There is no fourth tier, and that is deliberate. After the three tiers a tie is common: 25.5% of records on a real cache. Breaking it by any rule at all throws away N−1 predictions of the same sequence for nothing. select_group keeps them and mean_global_plddt averages across them: unbiased for the same reason an arbitrary pick is (neither consults the value), and with lower variance because it uses every prediction (strictly lower unless the predictions are perfectly correlated).
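
The bias and variance claims can be checked with a small simulation. This is a sketch on synthetic numbers only: the scores are independent Gaussian draws around a made-up true mean of 80, whereas real predictions of one sequence are correlated, which shrinks the effect without reversing it.

```python
import random
import statistics


def simulate(n_candidates, trials=20000, true_mean=80.0, sd=5.0, seed=0):
    """Compare three ways of reducing n_candidates scores to one number."""
    rng = random.Random(seed)
    by_max, by_first, by_mean = [], [], []
    for _ in range(trials):
        draws = [rng.gauss(true_mean, sd) for _ in range(n_candidates)]
        by_max.append(max(draws))                 # rank by score, keep the winner
        by_first.append(draws[0])                 # arbitrary pick, ignores the value
        by_mean.append(statistics.fmean(draws))   # average across the group
    return {
        "max": statistics.fmean(by_max),
        "first": statistics.fmean(by_first),
        "mean": statistics.fmean(by_mean),
        "sd_first": statistics.pstdev(by_first),
        "sd_mean": statistics.pstdev(by_mean),
    }


results = {n: simulate(n) for n in (1, 2, 5)}
```

In results, the "max" entry climbs as the candidate count grows, while "first" and "mean" stay near the true mean, and "sd_mean" is smaller than "sd_first" whenever there is more than one candidate.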

Tied candidates are nearly always one protein reached through different UniProt entries: identical-sequence orthologs, isoform duplicates, TrEMBL redundancy. On a real cache 4,288 of 4,289 tied sets were entirely full-length -F1 models, so their per-residue arrays are the same length and can also be averaged elementwise.

Do not take group[0]. The ordering is for reproducible output only; treating it as "the" structure reintroduces exactly the arbitrary choice this API exists to avoid.

Length is not visible to the tiers

coverage == 1.0 means the query is fully covered by the model, not that the model is the query's size — that is how an 860-residue multi-chain entry reports coverage 1.0 against a 430-residue query. Filtering to monomers removes that case, but an ortholog carrying the query sequence plus a few extra terminal residues passes every visible tier, and then breaks two things: positional slicing (an offset from the query's amino-acid length indexes the wrong residues) and the group average (its mean spans residues your query does not contain).

The summary carries no residue count, so filter_by_length takes the lengths you learned from fetching per-residue confidence:

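from afdb_query import filter_by_length

# fetched maps model_identifier -> per-residue scores from an earlier fetch.
# Members missing from it get no entry and are dropped as unknown length.
lengths = {
    s["model_identifier"]: len(fetched[s["model_identifier"]])
    for s in group
    if s["model_identifier"] in fetched
}
kept, dropped = filter_by_length(group, lengths, expected_length=len(sequence))
if dropped:
    print(f"{len(dropped)} of {len(group)} matched entries had the wrong or an unknown length")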

Unknown lengths are dropped, never assumed to conform, and dropped members are returned rather than discarded so the loss is reportable.
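
The drop-unknowns rule is easy to state in code. The following is a minimal sketch of the behaviour described above, not the library's implementation; filter_by_length_sketch is a hypothetical name and the identifiers and lengths are made up.

```python
def filter_by_length_sketch(group, lengths, expected_length):
    """Split group into (kept, dropped) by per-residue array length."""
    kept, dropped = [], []
    for s in group:
        # .get returns None for an unknown length, which never equals
        # expected_length, so unknowns land in dropped.
        if lengths.get(s["model_identifier"]) == expected_length:
            kept.append(s)
        else:
            dropped.append(s)
    return kept, dropped


group = [
    {"model_identifier": "AF-P11111-F1"},
    {"model_identifier": "AF-Q22222-F1"},
    {"model_identifier": "AF-R33333-F1"},
]
lengths = {"AF-P11111-F1": 430, "AF-Q22222-F1": 433}  # third member never fetched
kept, dropped = filter_by_length_sketch(group, lengths, expected_length=430)
```

Returning both lists keeps the accounting closed: every member of group ends up in exactly one of them.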

select_group does not filter, either. "No usable structure" and "a group whose average spans a complex" need different handling, and only the caller knows which its analysis can tolerate — so test is_monomer on the members and decide.

Batch lookups

search_many runs many sequences concurrently with resumable on-disk caching:

report = af.search_many(
    [{"id": "rec1", "sequence": seq1}, {"id": "rec2", "sequence": seq2}],
    out_dir="afdb_cache",
    concurrency=6,
)

{
  "total":    2,
  "skipped":  0,                                                   # already cached
  "filtered": {"internal_stop": 0, "too_short": 0, "nonstandard_aa": 0, "total": 0},
  "queried":  {"hits": 2, "misses": 0, "errors": 0, "total": 2},
}

total == skipped + filtered["total"] + queried["total"]; no count appears twice.

You supply a generic id per sequence; it keys the cache file and maps back to your own records.
out_dir/summaries/{id}.json stores each hit (a 404 miss stores {"structures": []}); existing files are left untouched, so re-runs resume.
Real HTTP errors are counted but not saved, so they retry on the next run.
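
The caching rules above can be sketched as a small sequential loop. This is an illustration under stated assumptions, not search_many itself: it omits concurrency and the sequence filter, cache_summaries and fake_fetch are hypothetical names, and OSError stands in for whatever exception the real client raises.

```python
import json
import tempfile
from pathlib import Path


def cache_summaries(records, out_dir, fetch):
    """fetch(sequence) returns a dict (hit) or None (404 miss), or raises OSError."""
    summaries = Path(out_dir) / "summaries"
    summaries.mkdir(parents=True, exist_ok=True)
    queried = {"hits": 0, "misses": 0, "errors": 0, "total": 0}
    skipped = 0
    for rec in records:
        path = summaries / f"{rec['id']}.json"
        if path.exists():  # existing files are left untouched
            skipped += 1
            continue
        queried["total"] += 1
        try:
            doc = fetch(rec["sequence"])
        except OSError:  # counted but not saved, so the next run retries it
            queried["errors"] += 1
            continue
        if doc is None:
            queried["misses"] += 1
            doc = {"structures": []}  # a miss is cached as an empty result
        else:
            queried["hits"] += 1
        path.write_text(json.dumps(doc))
    return {"total": len(records), "skipped": skipped, "queried": queried}


def fake_fetch(sequence):
    """Stand-in for the network call: one hit, one miss, one failure."""
    if sequence == "HIT":
        return {"structures": [{"model_identifier": "AF-P11111-F1"}]}
    if sequence == "MISS":
        return None
    raise OSError("simulated transport failure")


records = [
    {"id": "a", "sequence": "HIT"},
    {"id": "b", "sequence": "MISS"},
    {"id": "c", "sequence": "ERR"},
]
with tempfile.TemporaryDirectory() as tmp:
    first = cache_summaries(records, tmp, fake_fetch)
    second = cache_summaries(records, tmp, fake_fetch)
```

On the second pass the hit and the miss are skipped because their files exist, and only the failed record is queried again.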

search_many fetches summaries only. It does not choose a structure and it does not fetch per-residue confidence: which of several exact-sequence matches answers your question is a property of your analysis, not of AFDB, and making that choice inside a batch runner would hide it. Run select_group over the cached summaries and pass its members to fetch_plddt_many when you need per-residue data.

Per-residue pLDDT in bulk

fetch_plddt_many caches the full per-residue array for each structure you chose:

from afdb_query import (
    AlphaFold, select_group, fetch_plddt_many, load_plddt, mean_per_residue,
)
import json, pathlib

cache = pathlib.Path("afdb_cache")

# summaries were cached earlier by search_many. One record per GROUP MEMBER: the id
# keys the cache file, so members of the same group must not collide on it.
records, members = [], {}
for f in (cache / "summaries").glob("*.json"):
    group = select_group(json.loads(f.read_text()).get("structures") or [])
    members[f.stem] = []
    for s in group:
        rid = f"{f.stem}::{s['model_identifier']}"
        members[f.stem].append(rid)
        records.append({"id": rid, "model_url": s["model_url"]})

with AlphaFold() as af:
    fetch_plddt_many(af, records, cache)

# Read one record's group back -- with the SAME composite ids that were written -- and
# average it. mean_per_residue raises if the members disagree in length.
plddts = [p for rid in members["rec1"] if (p := load_plddt(cache, rid))]
consensus = mean_per_residue([p.scores for p in plddts])

plddts[0].mean()             # one definition of "mean pLDDT"
plddts[0].mean(start=42)     # ...and of a region of it

Resumability keys on afdb_cache/plddt/{id}.json, not on the summary — so running summaries first and residues later back-fills every already-cached record instead of skipping it.

Mean pLDDT, and why it takes bounds

mean_plddt(scores, start=None, stop=None) is the single definition. AFDB reports a pre-computed confidence_avg_local_score; other predictors give you only the array. A codebase that reads the field in one place and averages in another has two definitions of its headline number and no way to tell them apart.

mean_plddt never rounds. AFDB's field arrives rounded server-side, so mean_plddt over the same array can differ from it in the last digit. Round when you format.
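
A single bounded, unrounded mean might look like the sketch below. It is not the library's code: mean_plddt_sketch is a hypothetical name, start and stop are assumed to be Python slice positions into the array rather than residue numbers, raising on an empty region is an assumption, and the scores are made up.

```python
from statistics import fmean


def mean_plddt_sketch(scores, start=None, stop=None):
    """Unrounded mean over scores[start:stop]; raises on an empty region."""
    region = scores[start:stop]
    if not region:
        raise ValueError("empty region")
    return fmean(region)


scores = [90.12, 85.47, 70.31, 55.08]
whole = mean_plddt_sketch(scores)           # full-array mean
tail = mean_plddt_sketch(scores, start=2)   # mean of a sub-range
reported = round(whole, 2)                  # round only when formatting
```

Keeping the rounded value separate from the computed one means later comparisons and averages never inherit a formatting decision.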

Bounds exist because global means are not comparable between a protein and a sub-range of itself. Drop a disordered N-terminal tail and the mean rises purely because low scorers left the set. shared_suffix_means(long, short) does the length-controlled version:

from afdb_query import shared_suffix_means

shared_suffix_means(canonical.scores, truncated.scores)
# {"offset": 71,
#  "shared_long": ...,   # mean over the residues both have — same count each side
#  "shared_short": ...,
#  "displaced": ...}     # mean over the residues only the longer one has
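
A length-controlled comparison of this kind can be sketched in a few lines. The version below is an assumption-laden illustration rather than the library's implementation: shared_suffix_means_sketch is a hypothetical name, the shorter array is assumed to be the longer one minus N-terminal residues (so the two align at the C-terminal end), and the scores are invented.

```python
from statistics import fmean


def shared_suffix_means_sketch(long, short):
    """Compare two arrays over the residues both contain (the shared suffix)."""
    offset = len(long) - len(short)
    if offset < 0:
        raise ValueError("first argument must be the longer array")
    if not short:
        raise ValueError("short array is empty")
    return {
        "offset": offset,
        "shared_long": fmean(long[offset:]),   # same residue count as short
        "shared_short": fmean(short),
        "displaced": fmean(long[:offset]) if offset else None,
    }


canonical = [30.0, 40.0, 90.0, 92.0, 94.0]  # two low-confidence N-terminal residues
truncated = [88.0, 91.0, 93.0]
result = shared_suffix_means_sketch(canonical, truncated)
```

The global mean of canonical is dragged down by its first two residues, while shared_long and shared_short compare like with like.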

residue_index and is_contiguous are there because per-residue arrays are parallel to residueNumber, and treating array position as residue number is an assumption, not a guarantee.
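
To show why array position and residue number can disagree, here is a small sketch with made-up numbering. The real signatures of residue_index and is_contiguous are not shown in this README, so the two helpers below are hypothetical stand-ins with assumed arguments.

```python
def is_contiguous_sketch(residue_numbers):
    """True when each residue number is exactly one more than the previous."""
    return all(b - a == 1 for a, b in zip(residue_numbers, residue_numbers[1:]))


def residue_index_sketch(residue_numbers, residue_number):
    """Array position of a residue number; raises ValueError if it is absent."""
    return residue_numbers.index(residue_number)


residue_numbers = [5, 6, 7, 10, 11]  # starts at 5, with a gap after 7
scores = [91.0, 88.5, 79.0, 60.2, 58.9]

pos = residue_index_sketch(residue_numbers, 10)
score_for_residue_10 = scores[pos]
```

With this numbering, residue 10 sits at array position 3, so indexing scores by residue number directly would read the wrong entry or run off the end.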

Averaging a group's arrays

mean_per_residue(arrays, expected_length=None) builds the consensus array for a group select_group returned. It raises on ragged input rather than zip-truncating: unequal lengths make the index set non-rectangular, so position i stops denoting the same residue in every member and every downstream region mean silently describes a comparison that does not exist.

Pass expected_length=len(sequence) to additionally require the arrays be the query's size, and use filter_by_length to drop non-conforming members first.
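
The raise-on-ragged behaviour can be sketched as follows. This is a minimal illustration, not the library's code: mean_per_residue_sketch is a hypothetical name and the exact exception type and messages are assumptions.

```python
from statistics import fmean


def mean_per_residue_sketch(arrays, expected_length=None):
    """Elementwise mean of equal-length arrays; raises instead of truncating."""
    arrays = [list(a) for a in arrays]
    if not arrays:
        raise ValueError("no arrays to average")
    lengths = {len(a) for a in arrays}
    if len(lengths) != 1:
        raise ValueError(f"ragged input: lengths {sorted(lengths)}")
    if expected_length is not None and lengths != {expected_length}:
        raise ValueError(f"expected length {expected_length}, got {sorted(lengths)}")
    # zip is safe here only because the lengths were checked first.
    return [fmean(column) for column in zip(*arrays)]


consensus = mean_per_residue_sketch([[80.0, 90.0], [60.0, 70.0]])
```

Checking lengths before zipping is the whole point: a bare zip would silently stop at the shortest member.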

The commutation this rests on — mean over region of (elementwise mean) equals mean over group of (region mean) — holds because both are unweighted means over the same rectangular index set, and is asserted in the test suite.
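
The commutation can be verified numerically on a small rectangular group. The scores and region bounds below are invented for illustration, and the check uses plain Python rather than the library's functions.

```python
from math import isclose
from statistics import fmean

group = [
    [30.0, 40.0, 90.0, 92.0],
    [35.0, 45.0, 88.0, 91.0],
    [32.0, 41.0, 93.0, 95.0],
]
start, stop = 2, 4  # region as array positions

# Order 1: average across members first, then over the region.
consensus = [fmean(column) for column in zip(*group)]
region_of_consensus = fmean(consensus[start:stop])

# Order 2: average each member over the region, then across members.
mean_of_region_means = fmean(fmean(a[start:stop]) for a in group)

commutes = isclose(region_of_consensus, mean_of_region_means)
```

Both orders sum the same rectangle of values and divide by the same count, which is why they agree. With ragged arrays the two index sets differ and the equality no longer holds in general.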

Not (yet) supported

UniProt-accession lookup (sequence-only for now)
PAE (Predicted Aligned Error)

Development

pip install -e ".[dev]"
ruff check . && ruff format --check .
pytest                    # unit suite; 100% branch coverage is enforced
pytest -m integration     # live AFDB tests (network required)

The integration suite includes a ground-truth check that the per-residue pLDDT in AFDB's confidence JSON matches the B-factor column of the deposited mmCIF — two independently generated files — so upstream API drift surfaces here rather than in your results.

Project details

Release history

0.4.0 (this version): Jul 31, 2026
0.2.1: Jul 13, 2026
0.2.0: Jun 2, 2026
0.1.0: Jun 2, 2026

