Unified data processing for AlphaFold3-like models

UniAF3

Prepare inputs and process outputs for AlphaFold3-like models, including AlphaFold3, Boltz, Chai-1, and Protenix-v1.

UniAF3 provides a unified YAML-based input format that serves as a common intermediate representation for converting between different AlphaFold3-family structure prediction models. The format supports specifying molecular sequences, restraints, and inference parameters in a single configuration file.

Feature Support

The following table summarizes feature support across all models:

Feature UniAF3 AlphaFold3 AF3 Server Boltz Chai-1 Protenix
Sequences
Protein chains
DNA chains
RNA chains
Ligands (CCD) ✅ (limited set) ✅ (single CCD only) ⚠️ (converted to SMILES) ✅ (multi-CCD supported)
Ligands (SMILES)
Ligands (file path)
Ligands (user CCD) ✅ (user-provided CCD)
Multi-CCD ligands
Glycans ✅ (Chai notation) ⚠️ (as multi-CCD ligands with bonds) ⚠️ (single sugar only) ⚠️ (as multi-CCD ligand)
Ions ✅ (as CCD ligand) ✅ (as CCD ligand) ✅ (dedicated type) ✅ (as CCD ligand) ✅ (dedicated type)
Homomeric copies ✅ (via id list) ✅ (via id list) ✅ (via count) ✅ (via id list) ❌ (separate entities) ✅ (via count)
Modifications
Protein PTMs ✅ (limited CCD set) ✅ (inline CCD)
DNA modifications ✅ (limited CCD set) ✅ (inline CCD)
RNA modifications ✅ (limited CCD set) ✅ (inline CCD)
Cyclic polymers ✅ (Boltz-specific)
MSA & Templates
Custom MSA ✅ (via msa_dir) ✅ (inline or path) ✅ (CSV or A3M) ✅ (via msa_directory) ✅ (path)
Paired MSA ✅ (CSV key column)
Structural templates ✅ (mmCIF) ✅ (CIF/PDB) ✅ (via server) ✅ (A3M/HHR)
Restraints
Covalent bonds
Contact restraints
Pocket restraints
Inference Parameters
Random seeds ✅ (can be empty) ❌ (CLI arg) ✅ (single seed) ❌ (CLI arg)
Recycling steps ❌ (CLI arg) ❌ (CLI arg) ❌ (CLI arg)
Diffusion steps ❌ (CLI arg) ❌ (CLI arg) ❌ (CLI arg)
Diffusion samples ❌ (CLI arg) ❌ (CLI arg) ❌ (CLI arg)
Affinity prediction ✅ (Boltz-specific)

Legend: ✅ = fully supported, ⚠️ = partially supported / lossy conversion, ❌ = not supported

CLI Usage

Validate a config

Validate an input config file and print its contents:

uniaf3 validate INPUT_CONFIG_FILE [--format FORMAT]

Arguments:

  • INPUT_CONFIG_FILE — Path to the config file to validate (required).

Options:

  • --format, -f — Format of the input config file (default: uniaf3). Supported values: uniaf3, alphafold3, alphafold3server, boltz, chai, protenix.

Examples:

# Validate a UniAF3 config
uniaf3 validate input.yaml

# Validate a Boltz config
uniaf3 validate boltz_input.yaml --format boltz

# Validate an AlphaFold3 JSON
uniaf3 validate af3_input.json -f alphafold3

For Chai-1 configs, if a .restraints or .csv file with the same stem exists alongside the FASTA file, it will be loaded automatically.
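
The sidecar lookup described above can be pictured with a short sketch. The helper name `find_chai_restraints` is hypothetical and not part of the uniaf3 API, and the order in which the two suffixes are tried is an assumption:

```python
from pathlib import Path
from typing import Optional


def find_chai_restraints(fasta_path: str) -> Optional[Path]:
    """Return a sibling .restraints or .csv file that shares the FASTA stem, if any.

    Hypothetical helper for illustration only; uniaf3 may resolve this differently.
    """
    fasta = Path(fasta_path)
    # Assumption: .restraints is preferred over .csv when both exist.
    for suffix in (".restraints", ".csv"):
        candidate = fasta.with_suffix(suffix)
        if candidate.is_file():
            return candidate
    return None
```

For `complex.fasta`, this looks for `complex.restraints` and then `complex.csv` in the same directory.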

Convert between formats

Convert an input config file from one format to another:

uniaf3 convert INPUT_CONFIG_FILE OUTPUT_DIR [PREFIX] [--from-format FORMAT] [--to-format FORMAT]

Arguments:

  • INPUT_CONFIG_FILE — Path to the input config file (required).
  • OUTPUT_DIR — Directory for the output config file(s) (required).
  • PREFIX — Prefix for output file name(s). Defaults to the input file name without extension.

Options:

  • --from-format, -f — Source format (default: uniaf3).
  • --to-format, -t — Target format (default: alphafold3).

Examples:

# UniAF3 → AlphaFold3
uniaf3 convert input.yaml output_dir/ --from-format uniaf3 --to-format alphafold3

# Boltz → Chai-1
uniaf3 convert boltz_input.yaml output_dir/ --from-format boltz --to-format chai

# AF3 → Protenix
uniaf3 convert af3_input.json output_dir/ --from-format alphafold3 --to-format protenix
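
To convert many configs at once, the documented command can be driven from Python. This is a minimal sketch that only assembles the `uniaf3 convert` invocation shown above; the helper names `build_convert_cmd` and `convert_all` are hypothetical, and the sketch assumes `uniaf3` is on `PATH`:

```python
import subprocess
from pathlib import Path
from typing import List


def build_convert_cmd(config: Path, out_dir: Path, src: str, dst: str) -> List[str]:
    """Assemble the documented `uniaf3 convert` call as an argv list."""
    return [
        "uniaf3", "convert", str(config), str(out_dir),
        "--from-format", src, "--to-format", dst,
    ]


def convert_all(config_dir: str, out_dir: str,
                src: str = "uniaf3", dst: str = "alphafold3") -> None:
    """Convert every *.yaml file in config_dir, stopping at the first failure."""
    for config in sorted(Path(config_dir).glob("*.yaml")):
        # check=True raises CalledProcessError if the CLI exits non-zero.
        subprocess.run(build_convert_cmd(config, Path(out_dir), src, dst), check=True)
```

Because PREFIX is omitted, each output file takes its name from the input file, as described under Arguments.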

Input Format

UniAF3 configs are written in YAML. The top-level structure is:

sequences:
  - # Polymer, Ligand, or Glycan entries
covalent_bonds:   # Optional
  - # CovalentBond entries
contact_restraints:   # Optional
  - # ContactRestraint entries
pocket_restraints:   # Optional
  - # PocketRestraint entries
aux:   # Optional, inference parameters
  seeds:
    - 42
  num_trunk_recycles: 3
  num_diffn_timesteps: 200
  num_diffn_samples: 5
  num_trunk_samples: 1

Sequences

Each entry in the sequences list must be one of the following five types: protein, DNA, RNA, ligand, or glycan.

Protein

Proteins use the ProteinSeq schema (which extends Polymer) and support MSA directories and structural templates.

- polymer_type: protein
  id: A                         # or [A, B] for homomeric copies
  sequence: MVLSPADKTNVK       # Standard 1-letter amino acid codes
  description: "My protein"     # Optional description
  modifications:                # Optional PTMs
    - ccd: HY3                  # CCD code of modification
      position: 1               # 1-based residue index
  msa_dir: path/to/msa/         # Optional, directory containing MSA files
  templates:                    # Optional structural templates
    - path: template.cif        # Path to mmCIF or PDB file
      query_idx: [0, 1, 2]      # 0-based query residue indices
      template_idx: [0, 1, 2]   # 0-based template residue indices
      query_chains: [A]         # Optional, chain IDs in query
      template_chains: [A]      # Optional, chain IDs in template
      boltz_enable_force: false  # Boltz-specific: enforce template
      boltz_template_threshold: null  # Boltz-specific: deviation threshold (Å)
  boltz_cyclic: false           # Boltz-specific: cyclic polymer flag

MSA Directory Structure:

The msa_dir field points to a directory with the following expected structure:

msa_dir/
  a3ms/
    {seq_hash}.single.a3m    # Unpaired MSA
    {seq_hash}.pair.a3m      # Paired MSA (optional)

Where {seq_hash} is the SHA-256 hex digest of the protein sequence. This follows the Chai-1 MSA search output convention.
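
As an illustration of the naming scheme, the expected file paths could be derived as follows. The helper name `msa_paths` is hypothetical, and the sketch assumes the digest is taken over the sequence string exactly as written in the config (ASCII, no whitespace or case normalization):

```python
import hashlib
from pathlib import Path
from typing import Tuple


def msa_paths(msa_dir: str, sequence: str) -> Tuple[Path, Path]:
    """Return the (unpaired, paired) A3M paths expected under msa_dir.

    Hypothetical helper; the exact bytes that uniaf3 hashes are an assumption.
    """
    seq_hash = hashlib.sha256(sequence.encode("ascii")).hexdigest()
    a3m_dir = Path(msa_dir) / "a3ms"
    return a3m_dir / f"{seq_hash}.single.a3m", a3m_dir / f"{seq_hash}.pair.a3m"


single, pair = msa_paths("path/to/msa", "MVLSPADKTNVK")
print(single.name)  # 64 hex characters followed by .single.a3m
```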

DNA

- polymer_type: dna
  id: C
  sequence: GATTACA        # Only A, T, G, C allowed
  modifications:           # Optional
    - ccd: 6OG
      position: 1

RNA

- polymer_type: rna
  id: D
  sequence: AGCU           # Only A, U, G, C allowed
  modifications:           # Optional
    - ccd: 2MG
      position: 1

Ligand

Ligands must specify exactly one of ccd (a list of CCD codes) or smiles:

# CCD ligand (single or multi-CCD)
- id: E
  ccd:
    - ATP

# Multi-CCD ligand (e.g., glycan as ligand)
- id: F
  ccd:
    - NAG
    - BMA

# SMILES ligand
- id: G
  smiles: "CC(=O)OC1C[NH+]2CCC1CC2"

Glycan

Glycans use Chai-1's glycan notation (modified CCD codes with bond information):

- id: H
  chai_str: "NAG(4-1 NAG(4-1 BMA(3-1 MAN)(6-1 MAN)))"
  description: "Branched glycan"

For single sugars without bonds: chai_str: NAG
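
To make the nesting in the branched example easier to read, here is a small recursive parser for strings of this shape. It is a sketch, not the parser used by UniAF3 or Chai-1: the `Sugar` class and function names are hypothetical, and bond labels such as `4-1` are kept as opaque strings rather than interpreted.

```python
import re
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Sugar:
    ccd: str
    # Each child is (bond label such as "4-1", child sugar).
    children: List[Tuple[str, "Sugar"]] = field(default_factory=list)


def parse_glycan(chai_str: str) -> Sugar:
    """Parse e.g. 'NAG(4-1 NAG(4-1 BMA))' into a tree of Sugar nodes."""
    tokens = re.findall(r"\(|\)|\d+-\d+|[A-Za-z0-9]+", chai_str)
    if not tokens:
        raise ValueError("empty glycan string")
    pos = 0

    def parse_node() -> Sugar:
        nonlocal pos
        node = Sugar(tokens[pos])
        pos += 1
        # Every "(" opens one branch: a bond label followed by a child node.
        while pos < len(tokens) and tokens[pos] == "(":
            bond = tokens[pos + 1]
            pos += 2
            child = parse_node()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("unbalanced parentheses in glycan string")
            pos += 1
            node.children.append((bond, child))
        return node

    root = parse_node()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens in glycan string")
    return root


def ccd_codes(node: Sugar) -> List[str]:
    """List CCD codes in depth-first order."""
    codes = [node.ccd]
    for _, child in node.children:
        codes.extend(ccd_codes(child))
    return codes
```

For the branched example above, `ccd_codes(parse_glycan(...))` yields `["NAG", "NAG", "BMA", "MAN", "MAN"]`, and the BMA node carries two children with bond labels `3-1` and `6-1`.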

Chain IDs

Chain IDs (id field) serve as unique identifiers for each entity. They can be:

  • A single string: id: A
  • A list of strings for homomeric copies: id: [A, B, C]

Chain IDs are used to reference entities in restraints. When converting to models that use count-based copies (AF3 Server, Protenix), the number of IDs in the list determines the copy count.
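
The relationship between an id list and a copy count can be sketched as below. The function names are hypothetical and only illustrate the rule stated above:

```python
from typing import List, Union


def chain_ids(entry_id: Union[str, List[str]]) -> List[str]:
    """Normalize the id field to a list of chain IDs."""
    return [entry_id] if isinstance(entry_id, str) else list(entry_id)


def copy_count(entry_id: Union[str, List[str]]) -> int:
    """Number of homomeric copies implied by the id field."""
    return len(chain_ids(entry_id))


print(copy_count("A"))              # 1
print(copy_count(["A", "B", "C"]))  # 3
```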

The chain ID naming convention follows standard spreadsheet-style ordering: A, B, ..., Z, AA, AB, AC, ..., AZ, BA, BB, ...

This is generated by the int_to_letters() function (1-indexed): int_to_letters(1) → A, int_to_letters(27) → AA, int_to_letters(28) → AB.

Note: The open-source AlphaFold3 documentation uses a "reverse spreadsheet style" ordering (AA, BA, CA, ...). UniAF3 standardizes on the conventional spreadsheet ordering for internal consistency across all adapters.
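
A minimal sketch of conventional spreadsheet-style numbering is shown below. It reproduces the documented examples for `int_to_letters()`, but it is an illustrative reimplementation, not the package's source:

```python
def int_to_letters(n: int) -> str:
    """Map 1 -> A, 26 -> Z, 27 -> AA, 28 -> AB (bijective base 26)."""
    if n < 1:
        raise ValueError("n must be >= 1")
    letters = ""
    while n > 0:
        # Subtract 1 so that 1..26 map to A..Z with no zero digit.
        n, rem = divmod(n - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return letters


print([int_to_letters(i) for i in (1, 26, 27, 28)])  # ['A', 'Z', 'AA', 'AB']
```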

Restraints

Covalent Bonds

Specify covalent bonds between atoms from different entities:

covalent_bonds:
  - atom1:
      chain_id: A           # Entity ID
      residue_idx: 5        # 1-based residue index (0 for ligands)
      atom_name: CG         # Atom name (e.g., CA, N, SG)
      residue_name: P       # Optional, for validation
    atom2:
      chain_id: E           # Entity ID
      residue_idx: 1        # 1-based position within ligand
      atom_name: C04        # Atom name in the ligand
      residue_name: null    # Not required for ligands
    description: "Optional description"

Notes:

  • atom_name is required for both atoms.
  • residue_name is used by Chai-1 for validation and restraint formatting.
  • For ligands, residue_idx is typically 1 for single-CCD or SMILES ligands.
  • Ligand atom names follow RDKit naming conventions.

Contact Restraints

Distance restraints between two atoms/residues:

contact_restraints:
  - token1:
      chain_id: A
      residue_idx: 10       # 1-based, or 0 if atom_name is used for ligands
      atom_name: null        # Optional for polymers, required for ligands
      residue_name: K        # Optional, for validation
    token2:
      chain_id: C
      residue_idx: 5
      atom_name: null
      residue_name: null
    max_distance: 8.0        # Maximum distance in Å (must be 4-20 Å)
    min_distance: 0.0        # Minimum distance in Å (Protenix only)
    boltz_enable_force: true  # Boltz-specific: enforce with potential

Notes:

  • max_distance must be between 4.0 and 20.0 Å (Boltz requirement, applied universally).
  • min_distance is only used by Protenix.
  • AF3 and AF3 Server do not support contact restraints.
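
The distance rules can be expressed as a small check. This is a sketch with a hypothetical function name; treating the 4.0 and 20.0 Å bounds as inclusive is an assumption:

```python
def check_contact_distances(max_distance: float, min_distance: float = 0.0) -> None:
    """Raise ValueError if the distances violate the documented limits."""
    # Assumption: both ends of the 4.0-20.0 range are allowed.
    if not 4.0 <= max_distance <= 20.0:
        raise ValueError("max_distance must be between 4.0 and 20.0 Å")
    if max_distance <= min_distance:
        raise ValueError("max_distance must be greater than min_distance")


check_contact_distances(8.0)        # passes silently
check_contact_distances(6.0, 2.0)   # passes silently
```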

Pocket Restraints

Specify a binding pocket where a binder chain interacts with specific contact residues:

pocket_restraints:
  - binder_chain: E          # ID of the chain binding to the pocket
    contact_tokens:           # List of residues forming the pocket
      - chain_id: A
        residue_idx: 10
        atom_name: null       # For polymers; use atom_name for ligands
        residue_name: K
      - chain_id: A
        residue_idx: 15
        atom_name: null
        residue_name: G
    max_distance: 6.0         # Maximum distance in Å (4-20 Å)
    min_distance: 0.0         # Protenix only
    boltz_enable_force: false  # Boltz-specific: enforce with potential

Notes:

  • Contact tokens must NOT be on the same chain as binder_chain.
  • Protenix supports only a single pocket constraint per job.
  • AF3 and AF3 Server do not support pocket restraints.

Inference Parameters

The aux field contains optional inference parameters:

aux:
  num_trunk_recycles: 3         # Default: 3
  num_diffn_timesteps: 200      # Default: 200
  num_diffn_samples: 5          # Default: 5
  num_trunk_samples: 1          # Default: 1
  name: "job_name"              # Optional, used in AF3 Server
  boltz_affinity_binder_chain: D  # Boltz-specific: affinity binder chain ID

Seeds

Seeds are stored in aux.seeds as a list of integer random seeds:

aux:
  seeds:
    - 42
    - 123

  • AF3 uses all seeds directly.
  • Chai-1 uses only the first seed; additional seeds are applied via num_trunk_samples.
  • Boltz and Protenix do not store seeds in their config format; default [42] is used on import.
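
The per-model behaviour listed above can be summarized in a sketch. The function names are hypothetical, the format names are taken from the CLI options, and AF3 Server is left out because its seed handling is not described here:

```python
from typing import List, Optional

DEFAULT_IMPORT_SEEDS = [42]  # default stated above for Boltz and Protenix imports


def seeds_written_to_config(seeds: List[int], target: str) -> List[int]:
    """Seeds that end up in the target model's config file."""
    if target == "alphafold3":
        return list(seeds)       # all seeds are used directly
    if target == "chai":
        return list(seeds[:1])   # only the first seed is used
    if target in ("boltz", "protenix"):
        return []                # these formats do not store seeds
    raise ValueError(f"unhandled target format: {target}")


def seeds_on_import(source: str, stored: Optional[List[int]]) -> List[int]:
    """Seeds placed in aux.seeds when importing a config."""
    if source in ("boltz", "protenix") or not stored:
        return list(DEFAULT_IMPORT_SEEDS)
    return list(stored)
```

Falling back to the default when any other format carries no seeds is an assumption made for this sketch.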

Validation Rules

The UniAF3 schema enforces these validation rules:

  1. At least one sequence must be provided.
  2. Modification positions must be within the sequence length.
  3. Ligands must specify exactly one of ccd or smiles.
  4. Covalent bond atoms must have non-null atom_name.
  5. Contact restraints require max_distance between 4.0 and 20.0 Å, and max_distance > min_distance.
  6. Pocket restraint contact tokens must not be on the same chain as binder_chain.
  7. Restraint atoms must reference valid chain IDs, and residue indices must be within the sequence length.
  8. Residue names in restraints (when provided) are validated against the sequence.
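
A few of these rules (1, 2, 3, and 6) can be illustrated on a config that has already been parsed into plain dictionaries. This is a sketch, not the UniAF3 schema: the function name is hypothetical, and telling entry types apart by which keys are present is an assumption.

```python
from typing import Any, Dict, List


def validate_config(config: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable rule violations (empty if none found)."""
    errors: List[str] = []
    sequences = config.get("sequences") or []
    if not sequences:  # rule 1
        errors.append("at least one sequence is required")
    for entry in sequences:
        if "polymer_type" in entry:
            length = len(entry["sequence"])
            for mod in entry.get("modifications") or []:
                if not 1 <= mod["position"] <= length:  # rule 2
                    errors.append(
                        f"{entry['id']}: modification position {mod['position']} "
                        f"is outside 1..{length}"
                    )
        elif "chai_str" not in entry:
            # Assumption: anything that is neither a polymer nor a glycan is a ligand.
            has_ccd = entry.get("ccd") is not None
            has_smiles = entry.get("smiles") is not None
            if has_ccd == has_smiles:  # rule 3
                errors.append(f"{entry['id']}: specify exactly one of ccd or smiles")
    for pocket in config.get("pocket_restraints") or []:
        for token in pocket["contact_tokens"]:
            if token["chain_id"] == pocket["binder_chain"]:  # rule 6
                errors.append(
                    f"pocket contact token on binder chain {pocket['binder_chain']}"
                )
    return errors
```

Collecting messages instead of raising on the first failure makes it possible to report every problem in a config in one pass.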

Complete Example

sequences:
  - polymer_type: protein
    id: [A, B]
    sequence: MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLS
    msa_dir: dummy_msa/
    modifications:
      - ccd: HY3
        position: 1
    description: Hemoglobin subunit
  - polymer_type: dna
    id: C
    sequence: GATTACA
  - id: D
    ccd:
      - ATP
  - id: E
    smiles: "CC(=O)OC1C[NH+]2CCC1CC2"
  - id: F
    chai_str: NAG
    description: Example glycan

covalent_bonds:
  - atom1:
      chain_id: B
      residue_idx: 2
      atom_name: CA
      residue_name: V
    atom2:
      chain_id: D
      residue_idx: 1
      atom_name: C04
      residue_name: null

contact_restraints:
  - token1:
      chain_id: A
      residue_idx: 5
      atom_name: CG
      residue_name: P
    token2:
      chain_id: B
      residue_idx: 5
      atom_name: null
      residue_name: P
    max_distance: 8.0
    boltz_enable_force: true

pocket_restraints:
  - binder_chain: D
    max_distance: 6.0
    contact_tokens:
      - chain_id: A
        residue_idx: 10
        atom_name: null
        residue_name: N
      - chain_id: B
        residue_idx: 3
        atom_name: null
        residue_name: L

aux:
  seeds:
    - 42
    - 123
  num_trunk_recycles: 3
  num_diffn_timesteps: 200
  num_diffn_samples: 5
  num_trunk_samples: 1
  boltz_affinity_binder_chain: D

Model-specific Documentation

For detailed documentation on each model's native input format, see:
