nanoplm

Make your own protein language model, faster than ever.


Maintainers

heispv

From FASTA to Foundation model — Fast.

🚀 Ship a protein language model without writing a training loop. nanoPLM gives you a batteries‑included CLI, reproducible data workflows, and simple YAML files to control everything.

🧬 What makes nanoPLM different?

Control everything with simple YAML files: prepare your data and pretrain your model from YAML.
Data you can trust: versioned with Data Version Control (DVC) under the hood.
Scale sensibly: Multi‑GPU ready.

🛠️ Install

Clone the repo, then run:

pip install .

PyPI package coming soon!

🤖 Zero‑to‑model in 4 commands

1. Get data YAML file

nanoplm data get-yaml

You'll get two files: params.yaml and dvc.yaml. Edit params.yaml if you want to change the defaults.

nanoPLM uses DVC under the hood, so your data versions are tracked.

The YAML file you get for data preparation is shown in the Data Preparation YAML section below.

2. Prepare your data

Use the command below to prepare your data for pLM pretraining. You'll get train and val FASTA files.

nanoplm data from-yaml

By default, this uses params.yaml in your current directory. You can optionally pass a different path (relative or absolute), for example: nanoplm data from-yaml <path/to/params.yaml>

To also prepare your data for knowledge distillation, add the --distillation flag. This adds two extra stages that calculate teacher embeddings for the train and val files.

nanoplm data from-yaml --distillation

📊 Now your data is ready! Let's start the training.

3. Get a pretrain YAML file

nanoplm pretrain get-yaml

This writes the pretraining YAML to your current directory. Prefer a different folder? Use: nanoplm pretrain get-yaml <output/dir>

4. Start your pretraining

nanoplm pretrain from-yaml

By default, this uses pretrain.yaml in your current directory. You can optionally specify a different path argument (relative or absolute) if needed.
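
If you would rather drive the four commands from a script, the sketch below assembles them as argument lists. It only uses the subcommands, the optional path arguments, and the --distillation flag shown above. The helper name build_pipeline and its parameters are hypothetical, and passing a path together with --distillation in that order is an assumption rather than documented behaviour.

```python
import shlex
from typing import List, Optional


def build_pipeline(
    params_path: Optional[str] = None,
    pretrain_yaml_dir: Optional[str] = None,
    pretrain_yaml_path: Optional[str] = None,
    distillation: bool = False,
) -> List[List[str]]:
    """Return the four nanoplm commands, in order, as argument lists.

    build_pipeline is a hypothetical helper, not part of nanoPLM.
    """
    # 1. Write params.yaml and dvc.yaml to the current directory.
    get_data_yaml = ["nanoplm", "data", "get-yaml"]

    # 2. Prepare the data; the path argument is optional.
    prepare = ["nanoplm", "data", "from-yaml"]
    if params_path:
        prepare.append(params_path)
    if distillation:
        # Assumption: the flag can follow the optional path argument.
        prepare.append("--distillation")

    # 3. Write the pretraining YAML, optionally into another folder.
    get_pretrain_yaml = ["nanoplm", "pretrain", "get-yaml"]
    if pretrain_yaml_dir:
        get_pretrain_yaml.append(pretrain_yaml_dir)

    # 4. Start pretraining; the path argument is optional.
    pretrain = ["nanoplm", "pretrain", "from-yaml"]
    if pretrain_yaml_path:
        pretrain.append(pretrain_yaml_path)

    return [get_data_yaml, prepare, get_pretrain_yaml, pretrain]


# Print the default command sequence without executing anything.
for cmd in build_pipeline():
    print(shlex.join(cmd))
```

To actually run the steps, each list can be passed to subprocess.run(cmd, check=True) so the script stops at the first failing stage.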

Data Preparation YAML

data_params:
  seqs_num: 20000
  min_seq_len: 20
  max_seq_len: 512
  val_ratio: 0.1

  device: "auto"
  
  shuffle_backend: "biopython" # or "seqkit" (faster, but you need to install it)
  shuffle: true
  shuffle_seed: 24

  # If you want to skip some sequences
  filter_skip_n: 0

  # These are only needed for KNOWLEDGE DISTILLATION, no need to change them if you want to do pretraining only
  teacher_model: "prott5"
  embed_calc_batch_size: 4
  train_shards: 5
  val_shards: 2

# Data directories
data_dirs:
  url: "https://ftp.uniprot.org/pub/databases/uniprot/knowledgebase/complete/uniprot_sprot.fasta.gz"
  # swissprot: "https://ftp.uniprot.org/pub/databases/uniprot/knowledgebase/complete/uniprot_sprot.fasta.gz"
  # trembl: "https://ftp.uniprot.org/pub/databases/uniprot/knowledgebase/complete/uniprot_trembl.fasta.gz"
  # uniref50: "https://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref50/uniref50.fasta.gz"
  # uniref90: "https://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref90/uniref90.fasta.gz"
  # uniref100: "https://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref100/uniref100.fasta.gz"
  compressed_fasta: "output/data/raw/uniref50.fasta.gz"
  extracted_fasta: "output/data/raw/uniref50.fasta"

  shuffled_fasta: "output/data/raw/uniref50_shuffled.fasta"

  filtered_fasta: "output/data/filter/uniref50_filtered.fasta"
  splitted_fasta_dir: "output/data/split"

  # These dirs are only used for KNOWLEDGE DISTILLATION, no need to change them if you want to do pretraining only
  kd_train_dir: "output/data/kd_dataset/train"
  kd_val_dir: "output/data/kd_dataset/val"
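
To make the effect of min_seq_len, max_seq_len, seqs_num, val_ratio, and shuffle_seed concrete, here is a minimal pure-Python sketch of a shuffle, filter, and split pass. It is not nanoPLM's implementation: the function names are hypothetical, the length bounds are assumed to be inclusive, the stage order is inferred from the shuffled_fasta, filtered_fasta, and splitted_fasta_dir entries, and filter_skip_n is not modelled.

```python
import random
from typing import List, Tuple

Record = Tuple[str, str]  # (header, sequence)


def parse_fasta(text: str) -> List[Record]:
    """Parse FASTA text into (header, sequence) pairs."""
    records: List[Record] = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def filter_and_split(
    records: List[Record],
    min_seq_len: int = 20,
    max_seq_len: int = 512,
    seqs_num: int = 20000,
    val_ratio: float = 0.1,
    shuffle_seed: int = 24,
) -> Tuple[List[Record], List[Record]]:
    """Shuffle, keep sequences within the length bounds, then split.

    Assumption: bounds are inclusive and val_ratio is applied to the
    sequences that survive filtering.
    """
    shuffled = list(records)
    random.Random(shuffle_seed).shuffle(shuffled)  # seeded, so repeatable
    kept = [r for r in shuffled if min_seq_len <= len(r[1]) <= max_seq_len]
    kept = kept[:seqs_num]
    n_val = int(len(kept) * val_ratio)
    return kept[n_val:], kept[:n_val]  # (train, val)


demo = ">a\n" + "M" * 10 + "\n>b\n" + "ACDE" * 10 + "\n>c\n" + "K" * 600 + "\n"
train, val = filter_and_split(parse_fasta(demo))
print(len(train), len(val))  # prints "1 0": only the 40-residue record passes
```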

Pretraining YAML

# Pretraining configuration for nanoPLM

model:
  hidden_size: 1024
  intermediate_size: 2048
  num_hidden_layers: 16
  num_attention_heads: 16
  vocab_size: 29
  mlp_activation: "swiglu"
  mlp_dropout: 0.0
  mlp_bias: False
  attention_bias: False
  attention_dropout: 0.0
  classifier_activation: "gelu"

pretraining:
  # Dataset
  # Note: these paths are RELATIVE to where you RUN the command NOT the YAML file.
  train_fasta: "output/data/split/train.fasta"
  val_fasta: "output/data/split/val.fasta"

  # Output model path
  ckp_dir: "output/pretraining_checkpoints"

  # Hyperparameters
  max_length: 512
  batch_size: 32
  num_epochs: 10
  # Info for lazy dataset loading
  # True: Low memory usage, tokenize on-demand (slower iteration, faster startup)
  # False: High memory usage, tokenize all sequences at once (faster iteration, slower startup)
  lazy_dataset: False
  warmup_ratio: 0.05
  optimizer: "adamw" # adamw, stable_adamw
  adam_beta1: 0.9
  adam_beta2: 0.999
  adam_epsilon: 1e-8
  learning_rate: 3e-6
  weight_decay: 0.0
  gradient_accumulation_steps: 1
  mlm_probability: 0.3
  mask_replace_prob: 0.8
  random_token_prob: 0.1
  keep_probability: 0.1
  logging_steps_percentage: 0.01 # 100 logging events in total
  eval_steps_percentage: 0.025 # 40 evaluations in total
  save_steps_percentage: 0.1 # 10 checkpoint saves in total
  seed: 42
  num_workers: 0
  multi_gpu: False
  world_size: 1 # Use "auto" if you want to use all available GPUs
  run_name: "nanoplm-pretraining"
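
The four masking settings (mlm_probability, mask_replace_prob, random_token_prob, keep_probability) look like the usual BERT-style masked language modelling recipe. The sketch below shows how such settings are commonly applied. It is an illustration under that assumption, not nanoPLM's code: mask_tokens is a hypothetical function, the mask token id used in the demo is made up, special tokens are not excluded, and -100 is used as the ignore label because that is the default ignore_index of PyTorch's cross-entropy loss.

```python
import math
import random
from typing import List, Optional, Tuple

IGNORE_INDEX = -100  # positions with this label do not contribute to the loss


def mask_tokens(
    token_ids: List[int],
    mask_token_id: int,
    vocab_size: int,
    mlm_probability: float = 0.3,
    mask_replace_prob: float = 0.8,
    random_token_prob: float = 0.1,
    keep_probability: float = 0.1,
    rng: Optional[random.Random] = None,
) -> Tuple[List[int], List[int]]:
    """Return (inputs, labels) after BERT-style masking."""
    total = mask_replace_prob + random_token_prob + keep_probability
    if not math.isclose(total, 1.0):
        raise ValueError("replace, random and keep probabilities must sum to 1")
    rng = rng or random.Random()
    inputs = list(token_ids)
    labels = [IGNORE_INDEX] * len(inputs)
    for i, tok in enumerate(token_ids):
        # Select each position for prediction with probability mlm_probability.
        if rng.random() >= mlm_probability:
            continue
        labels[i] = tok
        roll = rng.random()
        if roll < mask_replace_prob:
            inputs[i] = mask_token_id  # replace with the mask token
        elif roll < mask_replace_prob + random_token_prob:
            inputs[i] = rng.randrange(vocab_size)  # replace with a random token
        # Otherwise keep the original token (the keep_probability share).
    return inputs, labels


# Demo with a hypothetical mask id of 4 and the vocab_size of 29 from the YAML.
inputs, labels = mask_tokens(list(range(5, 25)), mask_token_id=4,
                             vocab_size=29, rng=random.Random(42))
print(inputs)
print(labels)
```

With the defaults above, roughly 30% of positions are selected for prediction, and of those about 80% become the mask token, 10% a random token, and 10% stay unchanged.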

Tip: Paths are resolved relative to where you run the command (not where the YAML lives).
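
The *_steps_percentage settings in the pretraining YAML are fractions of the total number of optimizer steps, which is why 0.01, 0.025, and 0.1 are annotated as 100, 40, and 10 events. The sketch below works through that arithmetic. The function training_schedule is hypothetical, and the way total steps are derived from dataset size, batch_size, gradient_accumulation_steps, and world_size is an assumption about a typical training setup, not a description of nanoPLM's internals.

```python
import math
from typing import Dict


def training_schedule(
    num_sequences: int,
    batch_size: int = 32,
    num_epochs: int = 10,
    gradient_accumulation_steps: int = 1,
    world_size: int = 1,
    warmup_ratio: float = 0.05,
    logging_steps_percentage: float = 0.01,
    eval_steps_percentage: float = 0.025,
    save_steps_percentage: float = 0.1,
) -> Dict[str, int]:
    """Turn percentage-based settings into concrete step counts."""
    # Sequences consumed per optimizer step across all GPUs.
    effective_batch = batch_size * gradient_accumulation_steps * world_size
    steps_per_epoch = math.ceil(num_sequences / effective_batch)
    total_steps = steps_per_epoch * num_epochs

    def every(pct: float) -> int:
        # Interval in steps; at least 1 so very short runs still log.
        return max(1, round(total_steps * pct))

    log_every = every(logging_steps_percentage)
    eval_every = every(eval_steps_percentage)
    save_every = every(save_steps_percentage)
    return {
        "effective_batch": effective_batch,
        "total_steps": total_steps,
        "warmup_steps": round(total_steps * warmup_ratio),
        "logging_every": log_every,
        "eval_every": eval_every,
        "save_every": save_every,
        "num_logs": total_steps // log_every,
        "num_evals": total_steps // eval_every,
        "num_saves": total_steps // save_every,
    }


# Hypothetical training set of 32,000 sequences with the YAML defaults.
for name, value in training_schedule(32_000).items():
    print(f"{name}: {value}")
```

For this example the run has 10,000 steps, so logging happens every 100 steps, evaluation every 250, and checkpointing every 1,000, which reproduces the 100, 40, and 10 totals noted in the YAML comments.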

Requirements

Python 3.10+
macOS or Linux
GPU recommended (CPU is fine for tiny tests)

Contributing

PRs welcome. If you’re unsure where to start, open an issue with your use‑case.

Like it? Star it.

If nanoPLM saved you time, a star helps others find it and keeps development going.

Release history

This version

0.1.0

Oct 6, 2025

Download files

Source Distribution

nanoplm-0.1.0.tar.gz (68.1 kB view details)

Uploaded Oct 6, 2025 Source

Built Distribution

nanoplm-0.1.0-py3-none-any.whl (65.6 kB view details)

Uploaded Oct 6, 2025 Python 3

File details

Details for the file nanoplm-0.1.0.tar.gz.

File metadata

Download URL: nanoplm-0.1.0.tar.gz
Upload date: Oct 6, 2025
Size: 68.1 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for nanoplm-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`c69c6d8fb70f6617d145d48147fde18d90f1c49d12ba13937af7ebb87b184d41`
MD5	`68222f5db9754d9bb39dc620a0121f7f`
BLAKE2b-256	`2276e1a386238fbf4710bf5ee2fa1084adefff16d57bba921aacd10338fd3225`

Provenance

The following attestation bundles were made for nanoplm-0.1.0.tar.gz:

Publisher: publish-to-pypi.yml on heispv/nanoplm

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: nanoplm-0.1.0.tar.gz
- Subject digest: c69c6d8fb70f6617d145d48147fde18d90f1c49d12ba13937af7ebb87b184d41
- Sigstore transparency entry: 584978378
- Sigstore integration time: Oct 6, 2025
Source repository:
- Permalink: heispv/nanoplm@fdcff68dd272d9fd5ab6ea27b6bcef9712fc2ed6
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/heispv
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-to-pypi.yml@fdcff68dd272d9fd5ab6ea27b6bcef9712fc2ed6
- Trigger Event: push

File details

Details for the file nanoplm-0.1.0-py3-none-any.whl.

File metadata

Download URL: nanoplm-0.1.0-py3-none-any.whl
Upload date: Oct 6, 2025
Size: 65.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for nanoplm-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`6994481e4e638a410c65d08a22e2409476c880ad09e6761d17f8401567b5c9a6`
MD5	`2e2841d432557142dcba356c8f466a8a`
BLAKE2b-256	`190291c64b6b4149289453dcdada6ddfa59de098437c122c2ae3a1a4def46731`

Provenance

The following attestation bundles were made for nanoplm-0.1.0-py3-none-any.whl:

Publisher: publish-to-pypi.yml on heispv/nanoplm

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: nanoplm-0.1.0-py3-none-any.whl
- Subject digest: 6994481e4e638a410c65d08a22e2409476c880ad09e6761d17f8401567b5c9a6
- Sigstore transparency entry: 584978379
- Sigstore integration time: Oct 6, 2025
Source repository:
- Permalink: heispv/nanoplm@fdcff68dd272d9fd5ab6ea27b6bcef9712fc2ed6
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/heispv
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-to-pypi.yml@fdcff68dd272d9fd5ab6ea27b6bcef9712fc2ed6
- Trigger Event: push
