ProFam: Open-Source Protein Family Language Modelling for Fitness Prediction and Design

ProFam is an open-source toolkit for training, scoring, and generating protein sequences with protein family language models (pfLMs). It packages ProFam-1, a 251M-parameter pfLM, together with open training and inference workflows, a downloadable pretrained checkpoint, and an open dataset release for reproducible experimentation.

Installation

From PyPI

Install ProFam as a standard Python package:

uv pip install profam

or

pip install profam

From Source

If you want the full repository workflows, example data, and inference scripts:

git clone https://github.com/alex-hh/profam.git
cd profam
uv sync
uv run profam-download-checkpoint

Optional installs:

  • Development tooling: uv sync --group dev
  • FlashAttention 2: uv sync --extra flash-attn

If you run into CUDA or flash-attn issues, see Installation Details.

Quickstart

Verify the installed package

uv run --with profam --no-project -- python -c "import profam; print(profam.__version__)"

Run a lightweight training example

The bundled example config uses the small dataset under data/train_example:

uv run profam-train experiment=train_profam_example logger=null_logger

Download the pretrained checkpoint

uv run profam-download-checkpoint

Main Workflows

  • Train: train a ProFam model with Hydra configs
    uv run profam-train
  • Example training: run a lightweight smoke test on example data
    uv run profam-train experiment=train_profam_example logger=null_logger
  • Model summary: print a model architecture summary
    uv run profam-model-summary
  • Download checkpoint: fetch the pretrained ProFam-1 checkpoint
    uv run profam-download-checkpoint
  • Generate sequences: sample new sequences from family prompts
    uv run profam-generate-sequences ...
  • Score sequences: score candidate sequences with family context
    uv run profam-score-sequences ...

The packaged CLI now covers the main package entrypoints, including training, checkpoint download, sequence generation, and sequence scoring.

Input Sequence Formats

ProFam supports:

  • Unaligned FASTA for standard protein sequence inputs
  • Aligned / MSA-style files such as A2M/A3M content with gaps and insertions

For profam-score-sequences, we recommend providing an aligned MSA file because sequence weighting is used to encourage diversity when subsampling prompt sequences. Even when aligned inputs are provided, the standard ProFam model converts them into unaligned gap-free sequences before the forward pass.
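
Sequence weighting schemes vary across tools; as a rough illustration of the general technique (an assumption, not ProFam's exact scheme), a common approach down-weights each sequence by the number of similar sequences in the MSA, so that subsampling favours diverse prompts:

import numpy as np

def msa_sequence_weights(msa, identity_threshold=0.8):
    # msa: list of equal-length aligned sequences (strings).
    arr = np.array([list(s) for s in msa])
    weights = np.ones(len(arr))
    for i in range(len(arr)):
        # Fraction of identical columns between sequence i and every sequence.
        identity = (arr == arr[i]).mean(axis=1)
        # Down-weight sequences with many close neighbours (self included).
        weights[i] = 1.0 / np.sum(identity >= identity_threshold)
    return weights / weights.sum()  # normalised sampling distribution

print(msa_sequence_weights(["MKT-A", "MKT-A", "QRSVA"]))  # [0.25 0.25 0.5]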

During preprocessing:

  • gaps (- and alignment-like .) are removed
  • lowercase insertions are converted to uppercase
  • U -> C and O -> K (selenocysteine and pyrrolysine mapped to their standard analogues)
  • remaining out-of-vocabulary characters map to [UNK] only when allow_unk=true
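
A minimal sketch of these rules (illustrative only; the actual implementation lives in profam/data/processors/preprocessing.py, and the vocabulary and rejection behaviour assumed here may differ):

def preprocess_sequence(seq, allow_unk=False):
    vocab = set("ACDEFGHIKLMNPQRSTVWY")  # assumed 20-letter vocabulary
    out = []
    for ch in seq:
        if ch in "-.":  # remove gap and alignment characters
            continue
        ch = ch.upper()  # uppercase lowercase insertions
        ch = {"U": "C", "O": "K"}.get(ch, ch)  # U -> C, O -> K
        if ch in vocab:
            out.append(ch)
        elif allow_unk:
            out.append("[UNK]")  # out-of-vocabulary -> [UNK]
        else:
            # behaviour when allow_unk is false is an assumption here
            raise ValueError(f"out-of-vocabulary character: {ch!r}")
    return "".join(out)

print(preprocess_sequence("mk-ta.UOx", allow_unk=True))  # MKTACK[UNK]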

Training

Run a lightweight example

configs/experiment/train_profam_example.yaml is configured to run on the bundled example data:

uv run profam-train experiment=train_profam_example logger=null_logger

Train with the ProFam-Atlas dataset

Training data for ProFam can be downloaded as part of the open ProFam-Atlas dataset release.

The default configuration in configs/train.yaml is compatible with the latest ProFam-Atlas release:

uv run profam-train

Citation

If you use ProFam in your work, please cite the preprint:

@article{wells2025profam,
  title = {ProFam: Open-Source Protein Family Language Modelling for Fitness Prediction and Design},
  author = {Wells, Jude and Hawkins Hooker, Alex and Livne, Micha and Lin, Weining and Miller, David and Dallago, Christian and Bordin, Nicola and Paige, Brooks and Rost, Burkhard and Orengo, Christine and Heinzinger, Michael},
  journal = {bioRxiv},
  year = {2025},
  doi = {10.64898/2025.12.19.695431},
  url = {https://www.biorxiv.org/content/10.64898/2025.12.19.695431v1}
}

Installation Details

CPU-only installation

uv sync
uv pip install torch --index-url https://download.pytorch.org/whl/cpu

FlashAttention 2

We recommend installing FlashAttention 2 for faster scoring and generation. For training, it is strongly recommended because ProFam uses sequence packing with batch_size=1 and no padding.

If you need to train without FlashAttention, update the configuration to set data.pack_to_max_tokens=null.
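
For example, as a one-off override in the style of the other commands in this README (assuming data.pack_to_max_tokens is overridable from the command line, as is standard with Hydra):

uv run profam-train data.pack_to_max_tokens=null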

uv sync --extra flash-attn
python -c "import flash_attn; print(flash_attn.__version__)"

Troubleshooting: conda fallback

If a matching flash-attn wheel is unavailable and a source build is required, this conda-based fallback is often the easiest route:

conda create -n pfenv python=3.11 -y
conda activate pfenv

conda install -c conda-forge ninja packaging -y
conda install -c nvidia cuda-toolkit=12.4 -y

pip install profam

# install a CUDA-enabled PyTorch build (adjust the cuXXX tag and index-url
# so the wheel's CUDA version is compatible with the toolkit installed above)
pip install torch==2.5.1+cu121 torchvision==0.20.1+cu121 --index-url https://download.pytorch.org/whl/cu121

pip install setuptools wheel packaging psutil numpy
pip install flash-attn==2.5.6 --no-build-isolation

python -c "import flash_attn; print(flash_attn.__version__)"

Development

We use pre-commit to format code and pytest to run tests.

Pull requests automatically have pre-commit and pytest run against them, and will only be approved once all checks pass.

Before submitting a pull request, run the checks locally with:

uv run --group dev pre-commit run --all-files

and

uv run --group dev pytest -k 'not example'

Pull requests adding complex new features or making significant changes or additions should be accompanied by tests in the tests/ directory.
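
For example, a minimal test file (the file and test names below are hypothetical) might look like:

# tests/test_smoke.py
import profam

def test_version_is_exposed():
    # The package exposes __version__ (used in the Quickstart above).
    assert isinstance(profam.__version__, str)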

Concepts

Data loading

ProFam uses text memmap datasets for fast random access over large corpora:

  • profam/data/text_memmap_datasets.py: generic memory-mapped line access + index building (*.idx.{npy,info})
  • profam/data/builders/family_text_memmap_datasets.py: ProFam-Atlas-specific datasets built on top of the memmap layer
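
The idea, in a rough standalone sketch (not the profam API; the real index files *.idx.{npy,info} and dataset classes differ): build a line-offset index once, then read any line in O(1) via mmap without loading the file into memory.

import mmap
import numpy as np

def build_line_index(path):
    # Record the byte offset of each line start, plus the final end offset.
    offsets = [0]
    with open(path, "rb") as f:
        for line in f:
            offsets.append(offsets[-1] + len(line))
    return np.asarray(offsets, dtype=np.int64)

def read_line(path, offsets, i):
    # Memory-map the file and slice out line i without reading the rest.
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm[offsets[i]:offsets[i + 1]].decode().rstrip("\n")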

ProFam-Atlas on-disk format (.mapping / .sequences)

The ProFam-Atlas dataset is distributed as paired files:

  • *.mapping: family id + indices into one or more *.sequences files
    • Format:
      • Line 1: >FAMILY_ID
      • Line 2+: sequences_filename:idx0,idx1,idx2,...
    • Important: *.mapping files must not have a trailing newline at end-of-file.
  • *.sequences: FASTA-like accessions + sequences
    • Format (repeated):
      • >ACCESSION ...
      • SEQUENCE
    • Important: *.sequences files should have a final trailing newline.

See README_ProFam_atlas.md for examples and additional details.
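
For orientation, a minimal illustrative record in this format (the family ID, filename, accessions, and sequences below are all made up):

>FAM_EXAMPLE
shard_000.sequences:0,2

And the corresponding shard_000.sequences entries (note the final trailing newline):

>ACC_000 hypothetical accession
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ
>ACC_001 hypothetical accession
MADEEKLPPGWEKRMSRSSGRVYYFNHITNASQ
>ACC_002 hypothetical accession
GWEKRMSRSSGRVYYFNHITNASQWERPSGNSS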

How it’s loaded

At a high level, training loads one protein family at a time by:

  1. Reading a family record from MappingProteinFamilyMemmapDataset (a memmapped *.mapping dataset)
  2. Fetching the referenced sequences from SequencesProteinFamilyMemmapDataset (memmapped *.sequences files)
  3. Building a ProteinDocument and preprocessing it (see profam/data/processors/preprocessing.py)
  4. Encoding with ProFamTokenizer and forming batches (optionally with packing)
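
As a toy sketch of step 1's record format (this is not the MappingProteinFamilyMemmapDataset API; the function name below is hypothetical):

def parse_mapping_record(record):
    # Line 1: >FAMILY_ID; lines 2+: sequences_filename:idx0,idx1,...
    lines = record.splitlines()
    family_id = lines[0].removeprefix(">")
    refs = []
    for line in lines[1:]:
        filename, idx_str = line.split(":", 1)
        refs.append((filename, [int(i) for i in idx_str.split(",")]))
    return family_id, refs

family_id, refs = parse_mapping_record(">FAM_EXAMPLE\nshard_000.sequences:0,2")
print(family_id, refs)  # FAM_EXAMPLE [('shard_000.sequences', [0, 2])]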

Converting FASTA → text memmap

If you have a directory of per-family FASTA files and want to create *.mapping / *.sequences files for training, see:

  • data_creation_scripts/fasta_to_text_memmap.py
