Make your own protein language model, faster than ever.
From FASTA to Foundation model — Fast.
🚀 Ship a protein language model without writing a training loop. nanoPLM gives you a batteries‑included CLI, reproducible data workflows, and simple YAML files to control everything.
🧬 What makes nanoPLM different?
- Control everything with simple YAML files: prepare your data and pretrain your model, all from YAML.
- Data you can trust: Using Data Version Control (DVC) under the hood.
- Scale sensibly: Multi‑GPU ready.
🛠️ Install
Clone the repo and then
pip install .
PyPI package coming soon!
🤖 Zero‑to‑model in 4 commands
1. Get data YAML file
nanoplm data get-yaml
You'll get two files, params.yaml and dvc.yaml. Edit params.yaml if you want to change the defaults.
nanoPLM uses DVC under the hood, so your data versions are tracked.
The generated params.yaml is shown under Data Preparation YAML below.
2. Prepare your data
Use the command below to prepare your data for pLM pretraining (you'll get train and val FASTA files):
nanoplm data from-yaml
By default, this uses params.yaml in your current directory. To use a different file, pass its path (relative or absolute) as an argument, like:
nanoplm data from-yaml <path/to/params.yaml>
If you also want to prepare your data for knowledge distillation, add the --distillation flag.
This runs two extra stages that compute teacher embeddings for the train and val files.
nanoplm data from-yaml --distillation
📊 Now your data is ready! Let's start the training.
3. Get a pretrain YAML file
nanoplm pretrain get-yaml
This writes the pretraining YAML to your current directory. Prefer a different folder? Use:
nanoplm pretrain get-yaml <output/dir>
4. Start your pretraining
nanoplm pretrain from-yaml
By default, this uses pretrain.yaml in your current directory. To use a different file, pass its path (relative or absolute) as an argument.
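If you would rather drive the four steps from a script, a minimal Python sketch could look like the one below. The subcommands and the --distillation flag are the ones shown above; the helper names build_commands and run_all are hypothetical, and passing a custom path together with --distillation in one call is an assumption, since the steps above only show them separately.

```python
import subprocess


def build_commands(params_path=None, pretrain_path=None, distillation=False):
    """Return the four nanoPLM CLI invocations as argv lists, in order."""
    prepare = ["nanoplm", "data", "from-yaml"]
    if params_path is not None:
        prepare.append(str(params_path))
    if distillation:
        # Assumption: the flag can be combined with a custom path.
        prepare.append("--distillation")

    pretrain = ["nanoplm", "pretrain", "from-yaml"]
    if pretrain_path is not None:
        pretrain.append(str(pretrain_path))

    return [
        ["nanoplm", "data", "get-yaml"],      # 1. write params.yaml and dvc.yaml
        prepare,                              # 2. prepare train/val FASTA files
        ["nanoplm", "pretrain", "get-yaml"],  # 3. write the pretraining YAML
        pretrain,                             # 4. start pretraining
    ]


def run_all(**kwargs):
    """Run each step in order, stopping at the first failure."""
    for argv in build_commands(**kwargs):
        print("$", " ".join(argv))
        subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Dry run: only print the commands. Call run_all() to execute them.
    for argv in build_commands():
        print(" ".join(argv))
```

In practice you would usually pause between steps to edit the generated YAML files, so the dry run is mostly useful as a checklist.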
Data Preparation YAML
data_params:
  seqs_num: 20000
  min_seq_len: 20
  max_seq_len: 512
  val_ratio: 0.1
  device: "auto"
  shuffle_backend: "biopython" # or "seqkit" (faster, but you need to install it)
  shuffle: true
  shuffle_seed: 24
  # If you want to skip some sequences
  filter_skip_n: 0
  # These are only needed for KNOWLEDGE DISTILLATION, no need to change them if you want to do pretraining only
  teacher_model: "prott5"
  embed_calc_batch_size: 4
  train_shards: 5
  val_shards: 2
# Data directories
data_dirs:
  url: "https://ftp.uniprot.org/pub/databases/uniprot/knowledgebase/complete/uniprot_sprot.fasta.gz"
  # swissprot: "https://ftp.uniprot.org/pub/databases/uniprot/knowledgebase/complete/uniprot_sprot.fasta.gz"
  # trembl: "https://ftp.uniprot.org/pub/databases/uniprot/knowledgebase/complete/uniprot_trembl.fasta.gz"
  # uniref50: "https://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref50/uniref50.fasta.gz"
  # uniref90: "https://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref90/uniref90.fasta.gz"
  # uniref100: "https://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref100/uniref100.fasta.gz"
  compressed_fasta: "output/data/raw/uniref50.fasta.gz"
  extracted_fasta: "output/data/raw/uniref50.fasta"
  shuffled_fasta: "output/data/raw/uniref50_shuffled.fasta"
  filtered_fasta: "output/data/filter/uniref50_filtered.fasta"
  splitted_fasta_dir: "output/data/split"
  # These dirs are only used for KNOWLEDGE DISTILLATION, no need to change them if you want to do pretraining only
  kd_train_dir: "output/data/kd_dataset/train"
  kd_val_dir: "output/data/kd_dataset/val"
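To make the data parameters concrete, here is a rough Python sketch of what a shuffle, filter, and split pass over a FASTA file could look like. It is not nanoPLM's implementation: the function names are hypothetical, the stage order is inferred from the shuffled, filtered, and split paths above, the length bounds are assumed to be inclusive, and seqs_num is assumed to cap the number of sequences kept after filtering.

```python
import random


def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def prepare(records, seqs_num, min_seq_len, max_seq_len, val_ratio, shuffle_seed):
    """Shuffle, length-filter, cap, and split records into (train, val)."""
    records = list(records)
    # A seeded shuffle keeps the split reproducible across runs.
    random.Random(shuffle_seed).shuffle(records)
    # Assumption: both length bounds are inclusive.
    kept = [r for r in records if min_seq_len <= len(r[1]) <= max_seq_len]
    # Assumption: seqs_num limits how many filtered sequences are used.
    kept = kept[:seqs_num]
    n_val = int(len(kept) * val_ratio)
    return kept[n_val:], kept[:n_val]


if __name__ == "__main__":
    toy = [(f"seq{i}", "A" * 25) for i in range(10)]
    toy += [("too_short", "A" * 5), ("too_long", "A" * 600)]
    train, val = prepare(toy, seqs_num=20000, min_seq_len=20,
                         max_seq_len=512, val_ratio=0.1, shuffle_seed=24)
    print(len(train), "train /", len(val), "val")
```

Under these assumptions, the default seqs_num of 20000 with val_ratio of 0.1 would give at most 18,000 training and 2,000 validation sequences.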
Pretraining YAML
# Pretraining configuration for nanoPLM
model:
  hidden_size: 1024
  intermediate_size: 2048
  num_hidden_layers: 16
  num_attention_heads: 16
  vocab_size: 29
  mlp_activation: "swiglu"
  mlp_dropout: 0.0
  mlp_bias: False
  attention_bias: False
  attention_dropout: 0.0
  classifier_activation: "gelu"
pretraining:
  # Dataset
  # Note: these paths are RELATIVE to where you RUN the command NOT the YAML file.
  train_fasta: "output/data/split/train.fasta"
  val_fasta: "output/data/split/val.fasta"
  # Output model path
  ckp_dir: "output/pretraining_checkpoints"
  # Hyperparameters
  max_length: 512
  batch_size: 32
  num_epochs: 10
  # Info for lazy dataset loading
  # True: Low memory usage, tokenize on-demand (slower iteration, faster startup)
  # False: High memory usage, tokenize all sequences at once (faster iteration, slower startup)
  lazy_dataset: False
  warmup_ratio: 0.05
  optimizer: "adamw" # adamw, stable_adamw
  adam_beta1: 0.9
  adam_beta2: 0.999
  adam_epsilon: 1e-8
  learning_rate: 3e-6
  weight_decay: 0.0
  gradient_accumulation_steps: 1
  mlm_probability: 0.3
  mask_replace_prob: 0.8
  random_token_prob: 0.1
  keep_probability: 0.1
  logging_steps_percentage: 0.01 # 100 logging in total
  eval_steps_percentage: 0.025 # 40 evaluations in total
  save_steps_percentage: 0.1 # 10 saves in total
  seed: 42
  num_workers: 0
  multi_gpu: False
  world_size: 1 # Use "auto" if you want to use all available GPUs
  run_name: "nanoplm-pretraining"
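The four masking settings (mlm_probability, mask_replace_prob, random_token_prob, keep_probability) look like the usual BERT-style masked language modeling scheme. Assuming that is how nanoPLM uses them, the sketch below shows the idea on a toy sequence. The function name, the "<mask>" token string, and the 20-letter alphabet are illustrative choices, not taken from nanoPLM.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
MASK = "<mask>"  # placeholder token name (an assumption)


def mask_sequence(tokens, rng, mlm_probability=0.3,
                  mask_replace_prob=0.8, random_token_prob=0.1):
    """Return (corrupted, labels); labels are None at unselected positions."""
    corrupted, labels = [], []
    for tok in tokens:
        # Each position is selected for prediction with mlm_probability.
        if rng.random() >= mlm_probability:
            corrupted.append(tok)
            labels.append(None)
            continue
        labels.append(tok)  # the model must recover the original token
        roll = rng.random()
        if roll < mask_replace_prob:
            corrupted.append(MASK)
        elif roll < mask_replace_prob + random_token_prob:
            corrupted.append(rng.choice(AMINO_ACIDS))
        else:
            # Remaining share (keep_probability): leave the token unchanged.
            corrupted.append(tok)
    return corrupted, labels


if __name__ == "__main__":
    rng = random.Random(42)
    corrupted, labels = mask_sequence(list("MKTAYIAKQRQISFVK"), rng)
    print(" ".join(corrupted))
```

With the defaults, roughly 30% of positions are selected; of those, about 80% become the mask token, 10% a random residue, and 10% stay as they are, which is why the three inner probabilities sum to 1.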
Tip: Paths are resolved relative to where you run the command (not where the YAML lives).
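The percentage-style settings in the pretraining YAML (logging_steps_percentage, eval_steps_percentage, save_steps_percentage, warmup_ratio) are fractions of the total number of optimizer steps. The arithmetic can be sketched as below. This is an illustration under assumptions rather than nanoPLM's code: the function name is hypothetical, and it assumes one optimizer step consumes batch_size × gradient_accumulation_steps × world_size sequences.

```python
import math


def schedule(num_train_seqs, batch_size, num_epochs,
             gradient_accumulation_steps=1, world_size=1,
             logging_pct=0.01, eval_pct=0.025, save_pct=0.1,
             warmup_ratio=0.05):
    """Turn percentage-style settings into absolute optimizer-step counts."""
    # Assumption: sequences consumed by one optimizer step.
    seqs_per_step = batch_size * gradient_accumulation_steps * world_size
    steps_per_epoch = math.ceil(num_train_seqs / seqs_per_step)
    total_steps = steps_per_epoch * num_epochs

    def interval(pct):
        # Clamp to 1 so tiny runs never get an interval of 0 steps.
        return max(1, round(total_steps * pct))

    logging_steps = interval(logging_pct)
    eval_steps = interval(eval_pct)
    save_steps = interval(save_pct)
    return {
        "total_steps": total_steps,
        "warmup_steps": round(total_steps * warmup_ratio),
        "logging_steps": logging_steps,
        "eval_steps": eval_steps,
        "save_steps": save_steps,
        "num_logs": total_steps // logging_steps,
        "num_evals": total_steps // eval_steps,
        "num_saves": total_steps // save_steps,
    }


if __name__ == "__main__":
    for key, value in schedule(16000, 32, 10).items():
        print(f"{key}: {value}")
```

For example, 16,000 training sequences with the default batch_size of 32 and num_epochs of 10 give 5,000 optimizer steps, so logging every 50 steps, evaluating every 125, and saving every 500 match the 100, 40, and 10 totals noted in the YAML comments. Other dataset sizes will land close to those totals but not always exactly on them.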
Requirements
- Python 3.10+
- macOS or Linux
- GPU recommended (CPU is fine for tiny tests)
Contributing
PRs welcome. If you’re unsure where to start, open an issue with your use‑case.
Like it? Star it.
If nanoPLM saved you time, a star helps others find it and keeps development going.