Rust-first structural bioinformatics toolkit for loading, aligning, analyzing, and preparing structures
proteon
Python bindings to the proteon structural bioinformatics toolkit (Rust core +
pdbtbx I/O). The package is a thin wrapper over proteon-connector with an
opinionated surface for corpus generation, structure preparation, alignment,
SASA/DSSP, hydrogen placement, MSA/template retrieval, supervision-tensor
export, and an experimental structural search stack.
Quickstart: generate a training corpus in 5 lines
from pathlib import Path
import proteon.corpus_smoke as corpus_smoke

corpus_smoke.build_local_corpus_smoke_release(
    list(Path("my_pdbs/").glob("*.pdb")),
    out_dir="corpus_v0",
    release_id="demo-v0",
    split_ratios={"train": 0.8, "val": 0.1, "test": 0.1},
    n_threads=-1,
    overwrite=True,
)
That call handles: parse → batch-prepare (hydrogens, minimization) → per-chain
expansion → sequence-supervision export → structure-supervision export
(AF2 contract) → deterministic hash-split → top-level corpus manifest →
validation report. Failures are captured as machine-readable FailureRecord
rows, not silently dropped.
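The capture-instead-of-drop behavior described above can be sketched in plain Python. This is only an illustration of the pattern: the FailureRow dataclass, its fields, and the process_all helper are hypothetical names and do not reflect proteon's actual FailureRecord schema.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class FailureRow:
    """Hypothetical failure row; NOT proteon's real FailureRecord schema."""
    record_id: str
    stage: str
    error_type: str
    message: str


def process_all(record_ids, stage, fn):
    """Apply fn to each record id and collect failures instead of dropping them."""
    results, failures = {}, []
    for rid in record_ids:
        try:
            results[rid] = fn(rid)
        except Exception as exc:  # record the failure rather than swallowing it
            failures.append(FailureRow(rid, stage, type(exc).__name__, str(exc)))
    return results, failures


def to_jsonl(failures):
    """Serialize failure rows as one JSON object per line."""
    return "\n".join(json.dumps(asdict(f), sort_keys=True) for f in failures)


def fake_parse(rid):
    """Stand-in for a real parser: fails on ids that start with 'bad'."""
    if rid.startswith("bad"):
        raise ValueError(f"cannot parse {rid}")
    return len(rid)


results, failures = process_all(["1crn_A", "bad_entry"], "parse", fake_parse)
```

Every input ends up either in results or in failures, so a downstream report can account for each record.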
Output tree:
corpus_v0/
├── corpus/{corpus_release_manifest,validation_report}.json
├── prepared/{prepared_structures.jsonl, supervision_release/...}
├── sequence/{release_manifest.json, examples/{manifest.json,examples.jsonl,tensors.npz}}
└── training/{release_manifest.json, training_examples.jsonl, training.parquet}
The training layer is emitted as Parquet: one row per example with a
ragged residue axis and no outer padding. The writer is chunked by row group
(default 512 examples per group), so peak memory stays bounded regardless of
corpus size. The earlier padded-NPZ training format was removed because its
max_len × n_examples × fields allocation exceeded practical memory at a
few thousand chains.
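To make the memory argument concrete, here is a small back-of-the-envelope calculation for a single field. The chain lengths are invented for illustration (not measured on any corpus), and float32 storage is an assumption.

```python
def padded_bytes(lengths, per_residue_floats, itemsize=4):
    """Bytes for one dense (N, max_len, ...) array padded to the longest chain."""
    return len(lengths) * max(lengths) * per_residue_floats * itemsize


def ragged_bytes(lengths, per_residue_floats, itemsize=4):
    """Bytes when each example stores only its own residues."""
    return sum(lengths) * per_residue_floats * itemsize


# Illustrative lengths (assumed): many short chains plus one long outlier.
lengths = [150] * 2999 + [5000]
atom37_xyz = 37 * 3  # floats per residue for an atom37 coordinate field

padded = padded_bytes(lengths, atom37_xyz)
ragged = ragged_bytes(lengths, atom37_xyz)
print(f"padded: {padded / 1e9:.2f} GB, ragged: {ragged / 1e9:.2f} GB")
```

With these assumed lengths the padded layout needs about 6.66 GB for one coordinate field, while the ragged layout needs about 0.20 GB, because a single long chain sets max_len for every row.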
A reference artifact (manifests only, large tensors stripped) lives at
examples/sample_corpus_v0/.
Training artifact (Parquet)
The training release is a single training.parquet file, one row per
example. Ragged residue axis uses Arrow list<...>; per-position fixed
dimensions (atom count, coordinate axes, rigid-group frames) use nested
FixedSizeList so readers can reshape losslessly. Any Arrow-compatible
tool — polars, DuckDB, pandas, pyarrow, torch via pyarrow — can read it
without a proteon-specific adapter.
Streaming reader for downstream training loops:
from proteon.training_example import iter_training_examples

for ex in iter_training_examples("corpus_v0/training", split="train"):
    positions = ex.structure.all_atom_positions  # (L, 37, 3) numpy
    phi = ex.structure.phi                       # (L,) numpy
    ...
The split= argument applies predicate pushdown: row groups whose split column
provably doesn't match are skipped by the Parquet reader. The public API has
no framework coupling; wrap the iterator in whatever Dataset your trainer
expects.
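One framework-free way to wrap a streaming iterator is a lazy batching generator. The sketch below uses a stand-in generator in place of iter_training_examples so it runs without proteon installed; the batched helper and the example dict layout are hypothetical.

```python
from itertools import islice


def batched(examples, batch_size):
    """Yield lists of up to batch_size items from any iterable, lazily."""
    it = iter(examples)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


# Stand-in for iter_training_examples(...): any iterable of examples works.
fake_examples = ({"record_id": f"rec_{i}", "length": 10 + i} for i in range(5))

batches = list(batched(fake_examples, batch_size=2))
batch_sizes = [len(b) for b in batches]
```

Because the residue axis is ragged, a real training loop would still need a collate step (padding or bucketing by length) after batching.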
Supervision tensors (AF2 contract)
The structure_supervision.v0 release writes tensors.npz with padded
batch-major arrays covering:
| Field | Shape | Purpose |
|---|---|---|
| aatype | (N, L) | 20-class residue identity |
| residue_index | (N, L) | author residue numbering |
| seq_mask | (N, L) | valid-residue mask |
| all_atom_positions | (N, L, 37, 3) | atom37 coordinates |
| all_atom_mask | (N, L, 37) | atom37 presence |
| atom14_gt_positions | (N, L, 14, 3) | atom14 ground truth |
| atom14_gt_exists, atom14_atom_exists, atom14_atom_is_ambiguous | (N, L, 14) | atom14 masks |
| residx_atom14_to_atom37, residx_atom37_to_atom14 | (N, L, ...) | index maps |
| pseudo_beta, pseudo_beta_mask | (N, L, 3), (N, L) | CB-or-CA pseudo atoms |
| phi, psi, omega + masks | (N, L) | backbone torsions |
| chi_angles, chi_mask | (N, L, 4) | sidechain torsions |
| rigidgroups_gt_frames, rigidgroups_gt_exists, rigidgroups_group_exists, rigidgroups_group_is_ambiguous | (N, L, 8, ...) | 8-group rigid-body frames |
Format is framework-neutral: NumPy .npz + JSONL metadata. Read with
np.load(...) + json.loads(...) — no torch/jax dependency. Consumers build
their own Dataset / DataLoader on top. Loaders in the package
(load_structure_supervision_examples, load_training_examples) return
dicts of np.ndarray.
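The relationship between padded batch-major arrays and seq_mask can be illustrated with plain lists. This is a toy sketch of the general padding-plus-mask convention, not proteon's exporter; pad_batch and masked_mean are hypothetical helpers.

```python
def pad_batch(sequences, pad_value=0):
    """Pad ragged per-chain lists to (N, L) and return a matching 0/1 mask."""
    max_len = max(len(s) for s in sequences)
    padded, mask = [], []
    for s in sequences:
        n_pad = max_len - len(s)
        padded.append(list(s) + [pad_value] * n_pad)
        mask.append([1] * len(s) + [0] * n_pad)
    return padded, mask


def masked_mean(values, mask):
    """Mean over valid positions only, so padding never contributes."""
    total = sum(v * m for row_v, row_m in zip(values, mask)
                for v, m in zip(row_v, row_m))
    count = sum(m for row in mask for m in row)
    return total / count


aatype = [[3, 7, 1], [5, 2]]  # toy residue-type indices for two chains
padded, seq_mask = pad_batch(aatype)
mean_valid = masked_mean(padded, seq_mask)
```

Any reduction over the L axis should be weighted by the mask in this way; otherwise padded positions leak into losses and statistics.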
Split strategies
build_local_corpus_smoke_release accepts three mutually-exclusive modes:
- Default (no split args): all records → train except the last → val. Only useful for the smallest smoke paths.
- split_ratios={"train": 0.8, "val": 0.1, "test": 0.1}: deterministic blake2b hash-split on record_id. Same record always lands in the same split across runs, corpus sizes, and input orderings. Ratios may be unnormalized; they're renormalized to 1.0.
- split_assignments={"1crn_A": "train", "1ake_A": "val", ...}: explicit per-record-id mapping. Must cover every (expanded) record_id.
Chosen strategy is recorded in corpus_release_manifest.json → provenance.split_strategy.
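A deterministic hash-split of the kind described for split_ratios can be sketched with the standard library. The digest size, byte order, and bucket mapping below are assumptions chosen for illustration; proteon's exact scheme may differ, so this sketch will not necessarily reproduce its assignments.

```python
import hashlib


def hash_split(record_id, ratios):
    """Map a record id to a split name via a blake2b hash and cumulative ratios."""
    total = sum(ratios.values())
    digest = hashlib.blake2b(record_id.encode("utf-8"), digest_size=8).digest()
    # Value in the unit interval that depends only on the id itself.
    u = int.from_bytes(digest, "big") / 2**64
    names = list(ratios)
    cumulative = 0.0
    for name in names:
        cumulative += ratios[name] / total  # renormalize on the fly
        if u < cumulative:
            return name
    return names[-1]  # guard against float rounding at the upper edge


ratios = {"train": 8, "val": 1, "test": 1}  # unnormalized on purpose
ids = [f"rec_{i}_A" for i in range(1000)]
assignment = {rid: hash_split(rid, ratios) for rid in ids}
```

Because the split depends only on the id, adding records or shuffling the input never moves an existing record to a different split.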
Multi-chain handling
Structure/sequence supervision v0 is chain-scoped. The smoke-release helper
expands each loaded structure into one record per chain, so a multi-chain
PDB like 1ake (chains A + B) becomes two records (1ake_A, 1ake_B)
with separate supervision rows, torsions, and rigidgroup frames. Single-chain
structures pass through unchanged with record_id preserved.
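The naming rule for per-chain expansion can be written down in a few lines. This is a sketch of the rule as stated above; expand_record_ids is a hypothetical helper, not a proteon function.

```python
def expand_record_ids(record_id, chain_ids):
    """Return one record id per chain; single-chain ids pass through unchanged."""
    if len(chain_ids) <= 1:
        return [record_id]
    return [f"{record_id}_{chain}" for chain in chain_ids]


multi = expand_record_ids("1ake", ["A", "B"])
single = expand_record_ids("1crn", ["A"])
```

The expanded ids are the ones that an explicit split_assignments mapping has to cover.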
Install
Published package path:
pip install proteon
Local checkout path:
cd proteon-connector
maturin develop --release
pip install -e ../packages/proteon/
Search DB serving
The structural search APIs are experimental. They are intended for local prototyping, corpus plumbing, and benchmarks while Proteon's Foldseek-style retrieval layer is still being hardened.
Persisted search DBs now default to writing both:
- the canonical Parquet corpus files
- the eager compiled serving layout used for repeated low-latency queries
Typical path:
import proteon
db = proteon.build_search_db(["1crn.pdb", "1ubq.pdb"], out="search_db", k=6)
hits = proteon.search(proteon.load("1crn.pdb"), "search_db", rerank=False)
If you explicitly want Parquet-only storage, keep it lazy on purpose:
proteon.save_search_db(db, "search_db", write_compiled=False)
lazy = proteon.load_search_db("search_db", prefer_compiled=False)
If you're reopening an older Parquet-only DB and want it upgraded in place
without a separate compile_search_db() call, use:
db = proteon.load_search_db("search_db", auto_compile_missing=True)
# or:
hits = proteon.search(query, "search_db", auto_compile_missing=True)
Known v0 characteristics
- rigidgroup_frame_fraction and chi_angle_fraction on raw PDBs run ~40–50% because sidechain atoms and many hydrogens are absent until batch_prepare fills them in. The validation report exposes both fractions so downstream trainers can filter on completeness.
- The split primitive is record-id hashing, not MMseqs2-clustered redundancy removal. Homologues can land in the same or different splits depending on the id. For leakage-free splits use split_assignments from an external cluster file.
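One way to derive a split_assignments mapping from an external cluster file is to hash the cluster representative instead of the record id, so that all members of a cluster share a split. The sketch assumes a two-column TSV of representative and member ids (the layout of the cluster TSV that MMseqs2 easy-cluster writes) and assumes the member ids already match the expanded record ids; assignments_from_clusters is a hypothetical helper.

```python
import csv
import hashlib
import io


def assignments_from_clusters(tsv_text, ratios):
    """Build {record_id: split} so every cluster lands in exactly one split."""
    total = sum(ratios.values())
    names = list(ratios)
    assignments = {}
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        representative, member = row[0], row[1]
        # Hash the representative so all cluster members share one split.
        digest = hashlib.blake2b(representative.encode("utf-8"),
                                 digest_size=8).digest()
        u = int.from_bytes(digest, "big") / 2**64
        cumulative, chosen = 0.0, names[-1]
        for name in names:
            cumulative += ratios[name] / total
            if u < cumulative:
                chosen = name
                break
        assignments[member] = chosen
    return assignments


# Toy cluster file: 1ake_A and 1ake_B fall in one cluster, 1crn_A in another.
cluster_tsv = "1ake_A\t1ake_A\n1ake_A\t1ake_B\n1crn_A\t1crn_A\n"
split_assignments = assignments_from_clusters(
    cluster_tsv, {"train": 0.8, "val": 0.1, "test": 0.1}
)
```

The resulting dict can then be passed as split_assignments, provided it covers every expanded record_id in the corpus.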
Smaller building blocks
If the 5-line helper isn't the right shape, the same pipeline is available layer-by-layer:
| Function | Layer |
|---|---|
| batch_load_tolerant / batch_load_tolerant_with_rescue | PDB/mmCIF intake |
| batch_prepare | hydrogens + FF-aware minimization |
| build_structure_supervision_dataset_from_prepared | per-chain AF2-contract tensors |
| build_sequence_dataset | per-chain sequence + optional MSA/templates |
| build_training_release | join + split assignment |
| build_corpus_release_manifest | top-level manifest + failure aggregation |
| validate_corpus_release | post-build validation report |
Each function can be called directly, so you can run any subset of the pipeline.