A research-oriented data toolkit for training biomolecular deep-learning foundation models

License: BSD 3-Clause

atomworks is an open-source platform that maximizes research velocity for biomolecular modeling tasks. Much like how Torchvision enables rapid prototyping within the vision domain, and Torchaudio within the audio domain, AtomWorks aims to accelerate development and experimentation within biomolecular modeling.

⚠️ Notice: We are currently finalizing some cleanup work within our repositories. Please expect the APIs (e.g., function and class names, inputs and outputs) to stabilize within the next week. Thank you for your patience!

If you're looking for the models themselves (e.g., RF3, MPNN) that integrate with AtomWorks, rather than the underlying framework, check out ModelForge.

💡 Note: Not sure where to start? We've made some examples in the AtomWorks documentation that work through several helpful scenarios; a full tutorial is under construction!

AtomWorks is composed of two symbiotic libraries:

  • atomworks.io: A universal Python toolkit for parsing, cleaning, manipulating, and converting biological data (structures, sequences, small molecules). Built on the biotite API, it seamlessly loads and exports between standard formats like mmCIF, PDB, FASTA, SMILES, MOL, and more. Broadly useful for anyone who works with structural data for biomolecules.
  • atomworks.ml: Advanced dataset featurization and sampling for deep learning workflows that uses atomworks.io as its structural backbone. We provide a comprehensive, pre-built and well-tested set of Transforms for common tasks that can be easily composed into full deep-learning pipelines; users may also create their own Transforms for custom operations.

For more detail on the motivation for and applications of AtomWorks, please see the preprint.

AtomWorks is built atop biotite: We are grateful to the Biotite developers for maintaining such a high-quality and flexible toolkit, and hope that our package will prove a helpful addition to the broader biotite community.


atomworks.io

A general-purpose Python toolkit for cleaning, standardizing, and manipulating biomolecular structure files, built atop biotite.

atomworks.io lets you:

  • Parse, convert, and clean any common biological file (structure or sequence). This includes, for example, identifying and removing leaving groups, correcting bond order after nucleophilic addition, fixing charges, parsing covalent geometries, and appropriately handling structures with multiple occupancies and ligands at symmetry centers.
  • Transform all data to a consistent AtomArray representation for further analysis or machine learning applications, regardless of initial source
  • Model missing atoms (those implied by the sequence but not represented in the coordinates) and initialize entity- and instance-level annotations (see the glossary for more detail on our composable naming conventions)

We have found atomworks.io to be generally useful to a broad bioinformatics and protein design audience; in many cases, atomworks.io can replace bespoke scripts and manual curation, enabling researchers to spend more time testing hypotheses and less time juggling dozens of tools and dependencies.


atomworks.ml

Modular, component-based library for dataset featurization within biomolecular deep learning workflows

atomworks.ml provides:

  • A library of pre-built, well-tested Transforms that can be slotted into novel pipelines
  • An extensible framework, integrated with atomworks.io, to write Transforms for arbitrary use cases
  • Pre-built datasets and samplers suitable for most model training scenarios

Within the AtomWorks paradigm, the output of each Transform is not an opaque dictionary with model-specific tensors but instead an updated version of our atom-level structural representation (Biotite's AtomArray). Operations within – and between – pipelines thus maintain a common vocabulary of inputs and outputs.
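
To make the "common vocabulary" idea concrete, here is a minimal, library-free sketch of the pattern described above. It does not use the atomworks.ml API: `ToyStructure`, `remove_hydrogens`, `annotate_is_carbon`, and `compose` are hypothetical names invented for illustration, and `ToyStructure` is only a stand-in for Biotite's AtomArray. The point is that every step takes a structure and returns an updated structure, so steps can be chained in any order.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class ToyStructure:
    """Hypothetical stand-in for an atom-level structure (NOT Biotite's AtomArray)."""

    elements: tuple[str, ...]
    # Per-atom annotations: name -> one value per atom.
    annotations: dict[str, tuple] = field(default_factory=dict)


# In this sketch, a "transform" is any callable mapping a structure to a structure.
Transform = Callable[[ToyStructure], ToyStructure]


def remove_hydrogens(s: ToyStructure) -> ToyStructure:
    """Drop hydrogen atoms, filtering every per-atom annotation the same way."""
    keep = [e != "H" for e in s.elements]
    elements = tuple(e for e, k in zip(s.elements, keep) if k)
    annotations = {
        name: tuple(v for v, k in zip(values, keep) if k)
        for name, values in s.annotations.items()
    }
    return ToyStructure(elements, annotations)


def annotate_is_carbon(s: ToyStructure) -> ToyStructure:
    """Add a per-atom boolean annotation instead of emitting a separate tensor."""
    flags = tuple(e == "C" for e in s.elements)
    return ToyStructure(s.elements, {**s.annotations, "is_carbon": flags})


def compose(*transforms: Transform) -> Transform:
    """Chain transforms; this works because inputs and outputs share one type."""

    def pipeline(s: ToyStructure) -> ToyStructure:
        for transform in transforms:
            s = transform(s)
        return s

    return pipeline


pipeline = compose(remove_hydrogens, annotate_is_carbon)
out = pipeline(ToyStructure(elements=("N", "C", "H", "C", "O", "H")))
print(out.elements)                  # ('N', 'C', 'C', 'O')
print(out.annotations["is_carbon"])  # (False, True, True, False)
```

Because each step returns the same kind of object it received, a new step can be inserted anywhere in the chain without changing its neighbours.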

We have found that atomworks.ml dramatically reduces the overhead of starting, and completing, many ML projects; research topics that once took months now achieve signal within weeks if not days, accelerating the pace of innovation.


When to use atomworks.io vs atomworks.ml?

  • Use atomworks.io when you:

    • Need to parse/clean/convert between biological file formats (mmCIF, PDB, FASTA, etc.)
    • Want a unified structural representation to plug into any downstream analysis or modeling
    • Need structural operations like adding missing atoms, filtering ligands/solvents, or assembly generation
  • Use atomworks.ml when you:

    • Need to featurize entire datasets for deep learning
    • Want ready-made sampling and batching utilities for training pipelines
    • Already use atomworks.io and want a seamless bridge to ML-ready feature engineering

Installation

Note: AtomWorks requires Python >= 3.11 and dotenv

pip install atomworks # base installation version without torch (for only atomworks.io)
pip install "atomworks[ml]" # with torch and ML dependencies (for atomworks.io plus atomworks.ml)
pip install "atomworks[dev]" # with development dependencies
pip install "atomworks[openbabel]" # with [Open Babel](https://openbabel.org/) and its dependencies
pip install "atomworks[ml,openbabel,dev]" # with all dependencies

Running more than one of these commands only adds to the installed dependencies; it does not install multiple copies of atomworks.

If you are using uv for package management, you can install atomworks with:

uv pip install "atomworks[ml,openbabel,dev]"

For more advanced setup options (including how to run workflows via Apptainer containers), see the full documentation.
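
After installing, it can be handy to confirm which optional pieces are actually present. The following is a small sketch using only the standard library; `check_environment` is a hypothetical helper, and the import names `torch` and `openbabel` are assumptions about what the `ml` and `openbabel` extras provide.

```python
import sys
from importlib.util import find_spec

# Minimum Python version stated in the installation note.
MIN_PYTHON = (3, 11)

# Assumed top-level import name for each optional extra.
OPTIONAL_MODULES = {"ml": "torch", "openbabel": "openbabel"}


def check_environment() -> dict[str, bool]:
    """Report whether the interpreter and key packages look usable."""
    report = {
        "python": sys.version_info >= MIN_PYTHON,
        "atomworks": find_spec("atomworks") is not None,
    }
    for extra, module in OPTIONAL_MODULES.items():
        # find_spec locates a top-level package without importing it.
        report[extra] = find_spec(module) is not None
    return report


for name, ok in check_environment().items():
    print(f"{name}: {'ok' if ok else 'missing'}")
```

A "missing" entry for `ml` or `openbabel` simply means that extra has not been installed; the base atomworks.io functionality does not need it.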


Getting started

This section explains how to get started with atomworks and gives a quick guide to using some atomworks.io features to parse PDB files. To learn more about the features in atomworks.io and atomworks.ml, see the external documentation.

To parse a structure file from the PDB (parse = load, clean, and annotate relevant metadata such as entities, molecules, etc.), use the parse function:

Note: To run the code in this section you will need to download the 3nez.cif.gz file yourself. See the examples for how to download files from the PDB within a Python script.

from atomworks.io.parser import parse
from biotite.structure import AtomArrayStack

result = parse(filename="3nez.cif.gz")

asym_unit: AtomArrayStack = result["asym_unit"]
assemblies: dict[str, AtomArrayStack] = result["assemblies"]

for chain_id, info in result["chain_info"].items():
    print(chain_id, info["processed_entity_canonical_sequence"])

The output of parse includes:

  • chain_info — Sequences/metadata for each chain
  • ligand_info — Ligand annotation & metrics
  • asym_unit — Structure (AtomArrayStack)
  • assemblies — Built biological assemblies (each is its own AtomArrayStack)
  • metadata — Experimental and source information

See usage examples for more examples of the use of parse(). All of the provided examples make use of this method. See API reference documentation for more information on this method.
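
Since the result of parse is a plain dictionary, small helpers over it are easy to write and to test without a structure file. The sketch below computes the sequence length per chain from the `chain_info` entry, using the `processed_entity_canonical_sequence` key shown above. `chain_sequence_lengths` is a hypothetical helper, the mock data is made up, and treating a missing or `None` sequence as length 0 (for example for non-polymer chains) is an assumption.

```python
from typing import Any, Mapping

# Key used in the parse example above.
SEQUENCE_KEY = "processed_entity_canonical_sequence"


def chain_sequence_lengths(result: Mapping[str, Any]) -> dict[str, int]:
    """Map each chain ID in result["chain_info"] to its sequence length."""
    lengths: dict[str, int] = {}
    for chain_id, info in result["chain_info"].items():
        # Assumption: chains without a sequence may lack the key or hold None.
        sequence = info.get(SEQUENCE_KEY) or ""
        lengths[chain_id] = len(sequence)
    return lengths


# Mock data shaped like the parse output described above (not real 3nez content).
mock_result = {
    "chain_info": {
        "A": {SEQUENCE_KEY: "MKTAYIAK"},
        "B": {SEQUENCE_KEY: None},
    }
}
print(chain_sequence_lengths(mock_result))  # {'A': 8, 'B': 0}
```

With a real file, passing the dictionary returned by `parse(filename="3nez.cif.gz")` in place of `mock_result` should work the same way, provided the output layout matches the list above.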

If you just want to load a file, you can use the load_any function:

from atomworks.io.utils.io_utils import load_any
from biotite.structure import AtomArray

# model=1 loads only the first model rather than a stack of all models in the file
atom_array: AtomArray = load_any("3nez.cif.gz", model=1)

Contribution

We welcome improvements!

Please see the contributors guide in the full documentation for contribution guidelines.

Acknowledgments

We thank Hope Woods and Rachel Clune from the Rosetta Commons for their partnership and collaboration on the codebase, documentation, tutorials, and examples.

Citation

If you make use of AtomWorks in your research, please cite:

N. Corley*, S. Mathis*, R. Krishna*, M. S. Bauer, T. R. Thompson, W. Ahern, M. W. Kazman, R. I. Brent, K. Didi, A. Kubaney, L. McHugh, A. Nagle, A. Favor, M. Kshirsagar, P. Sturmfels, Y. Li, J. Butcher, B. Qiang, L. L. Schaaf, R. Mitra, K. Campbell, O. Zhang, R. Weissman, I. R. Humphreys, Q. Cong, J. Funk, S. Sonthalia, P. Lio, D. Baker, F. DiMaio, "Accelerating Biomolecular Modeling with AtomWorks and RF3," bioRxiv, August 2025. doi: 10.1101/2025.08.14.670328

If you use BibTeX, here is the Google Scholar-formatted citation:

@article{corley2025accelerating,
  title={Accelerating Biomolecular Modeling with AtomWorks and RF3},
  author={Corley, Nathaniel and Mathis, Simon and Krishna, Rohith and Bauer, Magnus S and Thompson, Tuscan R and Ahern, Woody and Kazman, Maxwell W and Brent, Rafael I and Didi, Kieran and Kubaney, Andrew and others},
  journal={bioRxiv},
  pages={2025--08},
  year={2025},
  publisher={Cold Spring Harbor Laboratory}
}
