A high-level programming language for generative biology

Proto Language


Welcome! This repository contains the open-source implementation of proto-language, a Python package for designing biological sequences (DNA, RNA, and proteins) through constraint-based optimization. A design is specified as a set of constraints, and the framework runs a propose–score–refine loop to search for sequences that satisfy them, drawing on a large suite of computational biology and biological AI tools to score candidates.

proto-language is built on top of the proto-tools execution layer, so each computationally intensive tool (structure predictors, protein language models, inverse folding, sequence and structure aligners, gene annotation, and more) runs in its own automatically managed, isolated environment. Programs can run locally or as hosted optimization runs through the proto-client Python SDK.

proto-language is open source under the MIT license. Contributions are welcome!

Setup

Step 1: Install the package

The package requires Python 3.10 or later and pip:

pip install git+https://github.com/evo-design/proto-language.git

The system tools needed to build standalone tool environments (git, curl, gcc, make, cmake) are provisioned automatically on first use through proto-tools' shared foundation environment, so no manual setup is necessary.

[!NOTE] A direct PyPI install (pip install proto-language) is planned.

[!NOTE] Contributors should instead use the editable installation described in CONTRIBUTING.md.

Step 2: Configure storage (optional)

All persistent data (model weights, tool environments, micromamba) is stored under PROTO_HOME, which defaults to ~/.proto/ and is inherited from proto-tools.

To customize the storage location (recommended for laboratory and HPC environments):

# Add to your shell profile:
export PROTO_HOME=/path/to/your/proto_home

To override only the model-weights location, set PROTO_MODEL_CACHE (for example, export PROTO_MODEL_CACHE=/path/to/shared/weights). See notes/filesystem.md for all options.
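
The precedence described above can be sketched in a few lines of Python. This is an illustration only, not the resolution code that proto-tools actually uses: the function name resolve_storage and the "models" subdirectory are assumptions made for the example, while PROTO_HOME, PROTO_MODEL_CACHE, and the ~/.proto/ default come from the text above.

```python
import os
from pathlib import Path


def resolve_storage(env=None):
    """Return (proto_home, model_cache) using the precedence described above.

    Hypothetical helper for illustration; not part of proto-language.
    """
    env = os.environ if env is None else env

    # PROTO_HOME falls back to ~/.proto when unset (the documented default).
    proto_home = Path(env.get("PROTO_HOME") or "~/.proto").expanduser()

    # PROTO_MODEL_CACHE overrides only the weights location.
    # ASSUMPTION: the "models" subdirectory is a placeholder, not the real layout.
    override = env.get("PROTO_MODEL_CACHE")
    model_cache = Path(override).expanduser() if override else proto_home / "models"

    return proto_home, model_cache


if __name__ == "__main__":
    home, cache = resolve_storage()
    print(f"PROTO_HOME        -> {home}")
    print(f"model weights dir -> {cache}")
```

Running the script in a shell before and after exporting the variables is a quick way to confirm that the profile changes took effect.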

Step 3: Gated model access (optional)

Some generators and constraints load gated models (for example ESM3, AlphaGenome, and AlphaFold3) that require accepting a license and authenticating with Hugging Face. After accepting each model's terms, set HF_TOKEN in the environment. See proto-tools/README.md for the full procedure and the list of gated models.
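
A small preflight check can catch a missing token before a long run starts. The sketch below only inspects the HF_TOKEN environment variable named above; the helper name hf_token_available is hypothetical, and the check does not verify that the token is valid or that the model terms were accepted.

```python
import os


def hf_token_available(env=None):
    """Return True when HF_TOKEN is set to a non-empty value.

    Hypothetical helper for illustration; it does not contact Hugging Face.
    """
    env = os.environ if env is None else env
    return bool(env.get("HF_TOKEN", "").strip())


if __name__ == "__main__":
    if hf_token_available():
        print("HF_TOKEN is set.")
    else:
        print("HF_TOKEN is not set; gated models will likely fail to load.")
```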

[!TIP] Setup is complete. See the Quickstart to run a program from end to end.

Quickstart

Working programs are provided under examples/:

  • examples/scripts/ — runnable Python programs, ranging from a minimal end-to-end example (toy.py) to broader workloads.
  • examples/jsons/ — declarative JSON program definitions (the optimization_stages schema). These illustrate program structure and are not loaded by a Python consumer.

Architecture

The framework is built around seven primitives in proto_language/core/ — three data containers, three pluggable interfaces, and one orchestrator:

  • Sequence — a typed string (DNA, RNA, or protein) together with optional logits, a folded structure, and namespaced metadata. The atomic unit of design.
  • Segment — a single design region. It holds the proposal Sequences for that region and the surviving result Sequences after scoring.
  • Construct — an ordered list of Segments that concatenate into a full biological construct (for example, a promoter plus a coding region; a multi-chain protein; or a designed gene).
  • Constraint (registered via @constraint) — scores a Sequence against a target property, returning a score and namespaced metadata, and may optionally provide gradients.
  • Generator (registered via @generator) — proposes new Sequences for a Segment.
  • Optimizer (registered via @optimizer) — a search strategy that drives the propose–score–refine loop.
  • Program — the top-level orchestrator. It owns the Construct and composes one or more Optimizer stages.

All three pluggable interfaces share a BaseConfig Pydantic configuration pattern and declare parameters with ConfigField.
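
To make the relationships between these primitives concrete, here is a toy model written with standard-library dataclasses. Nothing in it is imported from proto_language: the class fields, the registry dictionary, the decorator signature, and the gc_content constraint are all assumptions chosen for illustration, and the real package uses Pydantic-based configuration rather than the plain keyword argument shown here.

```python
from dataclasses import dataclass, field


@dataclass
class Sequence:
    """Stand-in for the Sequence container: a typed string plus metadata."""
    residues: str
    kind: str  # "dna", "rna", or "protein"
    metadata: dict = field(default_factory=dict)


@dataclass
class Segment:
    """Stand-in for a design region with proposals and surviving results."""
    proposals: list = field(default_factory=list)
    results: list = field(default_factory=list)


@dataclass
class Construct:
    """Stand-in for an ordered list of Segments."""
    segments: list = field(default_factory=list)

    def full_sequence(self):
        # Concatenate the first result of each segment, in segment order.
        return "".join(s.results[0].residues for s in self.segments if s.results)


# ASSUMPTION: a plain dict registry; the real @constraint decorator may differ.
CONSTRAINTS = {}


def constraint(name):
    """Hypothetical registration decorator: stores the function under `name`."""
    def register(fn):
        CONSTRAINTS[name] = fn
        return fn
    return register


@constraint("gc_content")
def gc_content(seq, target=0.5):
    """Score closeness of GC fraction to `target`; return (score, metadata)."""
    gc = sum(base in "GC" for base in seq.residues) / max(len(seq.residues), 1)
    return 1.0 - abs(gc - target), {"gc_content": {"gc_fraction": gc}}


if __name__ == "__main__":
    promoter = Segment(results=[Sequence("GGCC", "dna")])
    coding = Segment(results=[Sequence("ATAT", "dna")])
    construct = Construct([promoter, coding])
    full = Sequence(construct.full_sequence(), "dna")
    score, meta = CONSTRAINTS["gc_content"](full)
    print(full.residues, score, meta)
```

The point of the sketch is the division of labor: containers hold data, a decorator makes a scoring function discoverable by name, and the scoring function returns both a number and metadata keyed by its own namespace.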

The optimization loop

Program.run() iterates through its optimizer stages. Each stage performs the following steps:

  1. The Optimizer requests proposal Sequences from its Generator for one or more Segments.
  2. Each Constraint evaluates the proposals and records its score and metadata on the proposal Sequences.
  3. The Optimizer aggregates the constraint scores and selects survivors. These become the Segment's result Sequences and feed into the next iteration or the next stage.
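
The three steps above can be sketched as a self-contained toy loop over DNA strings. This is not the package's optimizer: the mutation-based proposal function, the two scoring functions, the mean aggregation, and the choice to keep previous survivors in the candidate pool are all assumptions made to keep the example short and runnable.

```python
import random

ALPHABET = "ACGT"


def propose(parent, rng, n=8):
    """Generator stand-in: return n single-point mutants of `parent`."""
    children = []
    for _ in range(n):
        pos = rng.randrange(len(parent))
        children.append(parent[:pos] + rng.choice(ALPHABET) + parent[pos + 1:])
    return children


def gc_score(seq, target=0.5):
    """Constraint stand-in: closeness of GC fraction to `target`."""
    gc = sum(base in "GC" for base in seq) / len(seq)
    return 1.0 - abs(gc - target)


def homopolymer_score(seq, max_run=3):
    """Constraint stand-in: 1.0 if no run exceeds max_run, else 0.0."""
    run = longest = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if prev == cur else 1
        longest = max(longest, run)
    return 1.0 if longest <= max_run else 0.0


def aggregate(seq, constraints):
    """Mean of all constraint scores (one possible aggregation)."""
    return sum(fn(seq) for fn in constraints) / len(constraints)


def run_stage(start, constraints, iterations=50, keep=2, seed=0):
    """Optimizer stand-in: repeat propose, score, select."""
    rng = random.Random(seed)
    survivors = [start]
    for _ in range(iterations):
        # 1. Propose: mutate every current survivor.
        proposals = [c for parent in survivors for c in propose(parent, rng)]
        # 2. Score: evaluate proposals and current survivors with every constraint.
        scored = [(aggregate(s, constraints), s) for s in proposals + survivors]
        # 3. Select: keep the top-scoring candidates for the next iteration.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        survivors = [s for _, s in scored[:keep]]
    return survivors


if __name__ == "__main__":
    constraints = [gc_score, homopolymer_score]
    best = run_stage("AAAAAAAAAAAA", constraints)[0]
    print(best, aggregate(best, constraints))
```

Because the previous survivors stay in the candidate pool, the best aggregate score never decreases between iterations. A real optimizer may make a different trade-off, for example accepting worse candidates to escape local optima.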

When the program finishes, Program.export(path=...) writes a directory containing tables for sequences, constraints, constructs, and optimization steps, a FASTA file, and an assets/ sidecar directory.
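
The exported FASTA file can be inspected without extra dependencies. The parser below is a minimal sketch that assumes conventional FASTA text (header lines starting with ">" followed by one or more sequence lines); the function name parse_fasta is hypothetical, and the file name inside the export directory is not specified here, so the example works on text that you read yourself, for instance with pathlib.Path(...).read_text().

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a {header: sequence} dict.

    Hypothetical helper for illustration; not part of proto-language.
    """
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Start a new record keyed by the full header line.
            header = line[1:].strip()
            records[header] = []
        elif header is not None:
            # Sequence lines may wrap; collect the chunks and join them later.
            records[header].append(line)
    return {name: "".join(chunks) for name, chunks in records.items()}


if __name__ == "__main__":
    example = ">design_1\nACGTAC\nGT\n>design_2\nMKV\n"
    for name, seq in parse_fasta(example).items():
        print(name, len(seq), seq)
```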

Development & Contributing

See CONTRIBUTING.md for developer setup, code style, testing, and agent conventions.
