ALTAModel SFT — instruction-tuned Kinyarwanda language models from YaliLabs.

alta-models-sft (internal)

Monorepo for the ALTA SFT runtime package and its training pipeline. This README is for internal use — anyone with repo access. The public-facing PyPI README is PYPI_README.md and ships with the wheel.

Confidential. Training data, internal benchmarks, and unpublished checkpoints must never be checked in. See .gitignore for excluded paths.


What's in this repo

alta-models-sft/
├── src/alta_models_sft/          ← Runtime package (the only thing shipped to PyPI)
│   ├── modeling/                 ← Model architecture (RoPE, GQA, SwiGLU, blocks)
│   ├── inference/                ← ALTAChat, ChatML, sampling, masking
│   ├── hub.py                    ← Local + Hub model resolution
│   ├── cli.py                    ← `alta-sft` CLI
│   └── server.py                 ← FastAPI server (extra dep)
│
├── training/                     ← Training pipeline (stays in repo)
│   ├── train.py                  ← Main training entry point
│   ├── config.py                 ← All hyperparameters
│   ├── dataset.py                ← SFT dataset + ChatML masking + collator
│   ├── builder.py                ← Wraps ALTAModel for training
│   ├── checkpoint.py             ← TopK manager, save/load
│   ├── distributed.py            ← DDP setup
│   ├── deduplicate.py            ← MinHash + LSH dedup
│   ├── build_multiturn.py        ← Multi-turn synthesis from single-turn data
│   ├── resource_monitor.py       ← GPU/CPU/RAM telemetry
│   └── ...
│
├── scripts/                      ← Operational tools (never shipped)
│   ├── test_inference.py         ← 8-subcommand model tester
│   ├── export_for_release.py     ← Training checkpoint → release directory
│   └── upload_to_hub.sh          ← Safe Hub upload with validation
│
├── tests/                        ← pytest suite
├── .github/workflows/            ← CI: tests + PyPI release
├── pyproject.toml                ← Package metadata (controls what ships)
├── README.md                     ← THIS file (internal)
└── PYPI_README.md                ← Public README (gets bundled into the wheel)

Important: The wheel only includes src/alta_models_sft/. The [tool.hatch.build.targets.wheel] section in pyproject.toml enforces this, and CI fails if training/, scripts/, or tests/ leak in.


First-time setup

git clone git@github.com:yalilabs/alta-models-sft.git
cd alta-models-sft

python -m venv .venv
source .venv/bin/activate
pip install -e ".[all]"

Verify everything works:

pytest                                  # all tests should pass
ruff check src tests                    # lint should be clean
alta-sft --version                      # CLI installed
python -m training.train --help         # training importable

You also need:

huggingface-cli login                   # for Hub uploads
# Optional: set HF_TOKEN in your shell for non-interactive use

Day-to-day workflows

Training a new SFT model

1. Prepare data

Training data goes in ./data/ (gitignored). Supported per-sample formats — any mix works in one JSONL:

{"question": "...", "answer": "..."}
{"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
{"instruction": "...", "input": "...", "output": "..."}
{"document": "...", "summary": "..."}
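As a rough illustration of how mixed records like these could be brought to one shape, here is a minimal sketch that maps each format onto a messages list. The helper name `to_messages` is hypothetical, and the handling of `input` and of `document`/`summary` pairs is an assumption; training/dataset.py may do this differently.

```python
from typing import Any


def to_messages(sample: dict[str, Any]) -> list[dict[str, str]]:
    """Map one JSONL record onto a list of chat messages (hypothetical helper)."""
    if "messages" in sample:
        return list(sample["messages"])
    if "question" in sample and "answer" in sample:
        user, assistant = sample["question"], sample["answer"]
    elif "instruction" in sample and "output" in sample:
        # Assumption: a non-empty "input" is appended to the instruction.
        extra = sample.get("input", "")
        user = f"{sample['instruction']}\n\n{extra}" if extra else sample["instruction"]
        assistant = sample["output"]
    elif "document" in sample and "summary" in sample:
        # Assumption: the document is the user turn and the summary is the target.
        user, assistant = sample["document"], sample["summary"]
    else:
        raise ValueError(f"Unrecognized sample keys: {sorted(sample)}")
    return [
        {"role": "user", "content": user},
        {"role": "assistant", "content": assistant},
    ]
```

Raising on unknown keys, rather than skipping silently, makes a malformed line visible at load time.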

Recommended preprocessing pipeline:

# 1. Deduplicate (writes <output>.jsonl + .report.txt + .duplicates.jsonl + .stats.json)
python -m training.deduplicate \
    --input ./data/raw.jsonl \
    --output ./data/clean \
    --threshold 0.85

# 2. Synthesize multi-turn samples (helps with conversational coherence)
python -m training.build_multiturn \
    --input ./data/clean.jsonl \
    --output ./data/training.jsonl \
    --multiturn_ratio 0.3 \
    --max_chain_length 3

# 3. Hold out a validation split (shuffle once so the two files cannot overlap)
shuf ./data/training.jsonl > ./data/shuffled.jsonl
head -n 1000 ./data/shuffled.jsonl > ./data/testing.jsonl
tail -n +1001 ./data/shuffled.jsonl > ./data/training_split.jsonl
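To give a feel for what the --threshold value in step 1 might mean, the sketch below estimates Jaccard similarity between two texts with MinHash signatures. It assumes the threshold is a Jaccard estimate over word shingles. The shingle size, the number of hash functions, and all function names are illustrative, and the LSH bucketing that training/deduplicate.py uses to avoid comparing every pair is not shown.

```python
import hashlib


def shingles(text: str, n: int = 3) -> set[str]:
    """Word n-grams of the text; texts shorter than n words yield one shingle."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash_signature(text: str, num_perm: int = 128) -> list[int]:
    """For each seeded hash function, keep the minimum hash over all shingles."""
    items = shingles(text)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "big")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(salt + item.encode("utf-8"), digest_size=8).digest(), "big"
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of signature positions that agree, which approximates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)
```

Under this reading, a pair whose estimate reaches 0.85 would be treated as near-duplicates and one of the two dropped.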

2. Run training

Single GPU:

python -m training.train \
    --pretrained_dir ./pretrained/alta_base \
    --train_data ./data/training_split.jsonl \
    --val_data ./data/testing.jsonl \
    --output_dir ./sft_output

Multi-GPU (DDP via torchrun):

torchrun --nproc_per_node=4 -m training.train \
    --pretrained_dir ./pretrained/alta_base \
    --train_data ./data/training_split.jsonl \
    --val_data ./data/testing.jsonl \
    --output_dir ./sft_output
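torchrun starts one process per GPU and tells each process who it is through the RANK, LOCAL_RANK, and WORLD_SIZE environment variables. The sketch below shows how a training script can read them and fall back to single-process values under plain python. The helper name `ddp_env` is hypothetical and is not taken from training/distributed.py.

```python
import os


def ddp_env() -> dict[str, int]:
    """Read the variables torchrun exports, defaulting to a single process."""
    return {
        "rank": int(os.environ.get("RANK", "0")),
        "local_rank": int(os.environ.get("LOCAL_RANK", "0")),
        "world_size": int(os.environ.get("WORLD_SIZE", "1")),
    }


if __name__ == "__main__":
    env = ddp_env()
    is_distributed = env["world_size"] > 1
    print(f"rank {env['rank']} of {env['world_size']}, distributed={is_distributed}")
```

Launched without torchrun, none of these variables is set, so the script sees a world size of 1.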

Resume from a previous checkpoint:

python -m training.train \
    --resume ./sft_output/checkpoints/alta_epoch003_step1500_loss1.8234.pt \
    --train_data ./data/training_split.jsonl \
    --val_data ./data/testing.jsonl \
    --output_dir ./sft_output
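The checkpoint filename above appears to encode epoch, step, and loss. Assuming that naming pattern holds for every checkpoint (it is inferred from this one example only), a small helper can pick the lowest-loss file to resume from. The names `parse_checkpoint_name` and `best_checkpoint` are hypothetical.

```python
import re
from typing import NamedTuple

# Pattern inferred from the single example filename; treat it as an assumption.
CKPT_PATTERN = re.compile(r"alta_epoch(\d+)_step(\d+)_loss(\d+\.\d+)\.pt$")


class CheckpointInfo(NamedTuple):
    epoch: int
    step: int
    loss: float


def parse_checkpoint_name(path: str) -> CheckpointInfo:
    """Extract epoch, step, and loss from a checkpoint path."""
    match = CKPT_PATTERN.search(path)
    if match is None:
        raise ValueError(f"Not a recognized checkpoint name: {path}")
    return CheckpointInfo(int(match.group(1)), int(match.group(2)), float(match.group(3)))


def best_checkpoint(paths: list[str]) -> str:
    """Return the path whose filename carries the lowest loss."""
    return min(paths, key=lambda p: parse_checkpoint_name(p).loss)
```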

Hyperparameters live in training/config.py. Common ones can be overridden via CLI:

python -m training.train ... \
    --epochs 5 \
    --batch_size 16 \
    --target_lr 1e-5 \
    --max_seq_len 2048 \
    --grad_accum_steps 4
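When changing --batch_size or --grad_accum_steps, it helps to keep the effective batch size in mind. The arithmetic below assumes --batch_size is per process, as is usual with DDP; check training/config.py to confirm.

```python
def effective_batch_size(batch_size: int, grad_accum_steps: int, world_size: int = 1) -> int:
    """Samples contributing to one optimizer step across all processes."""
    return batch_size * grad_accum_steps * world_size


print(effective_batch_size(16, 4))     # single GPU: 64
print(effective_batch_size(16, 4, 4))  # 4 GPUs via torchrun: 256
```

Moving from one GPU to four without touching the flags therefore quadruples the effective batch, which may call for revisiting the learning rate.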

3. Monitor

tensorboard --logdir ./sft_output/tensorboard
tail -f ./sft_output/logs/train_rank0.log

Watch for: val_loss decreasing each epoch, train_loss not diverging from val_loss, sample generations becoming coherent. The expected val_loss range after epoch 1 is in config.py (expected_val_loss_at_epoch_1).

4. When training finishes

train.py automatically calls save_pretrained() on the best model. The output is at ./sft_output/alta_sft_final/:

sft_output/alta_sft_final/
├── config.json                   # includes model_format_version
└── model.safetensors             # safetensors format, ready to distribute

This directory is already in the distribution format — you can load it immediately:

alta-sft chat --model ./sft_output/alta_sft_final

Testing a trained model

scripts/test_inference.py has 8 subcommands. Run from repo root.

# 1. Quick smoke test (3 prompts, <1 min) — ALWAYS run this first
python scripts/test_inference.py smoke --model ./sft_output/alta_sft_final

# 2. Full prompt suite (writes JSON report)
python scripts/test_inference.py suite \
    --model ./sft_output/alta_sft_final \
    --output ./results/run_$(date +%Y%m%d_%H%M).json

# 3. Interactive REPL for qualitative exploration
python scripts/test_inference.py chat \
    --model ./sft_output/alta_sft_final --stream

# 4. Single prompt
python scripts/test_inference.py single \
    --model ./sft_output/alta_sft_final \
    --prompt "Sobanura amateka y'u Rwanda" --stream

# 5. Multi-turn conversation test (catches memory bugs)
python scripts/test_inference.py multiturn --model ./sft_output/alta_sft_final

# 6. Sampling comparison (same prompt, different configs)
python scripts/test_inference.py compare \
    --model ./sft_output/alta_sft_final \
    --prompt "Mwiriwe!" \
    --configs '[{"temperature":0.3},{"temperature":0.8,"top_p":0.95}]'

# 7. Mask ablation (loads model twice — with/without non-Kinyarwanda mask)
python scripts/test_inference.py mask_ablation \
    --model ./sft_output/alta_sft_final --prompt "Bite?"

# 8. Throughput benchmark
python scripts/test_inference.py bench \
    --model ./sft_output/alta_sft_final --num_prompts 20 --device cuda --dtype bfloat16

Promotion criteria before releasing a checkpoint publicly:

  • smoke passes (no crashes, non-empty responses)
  • suite has zero crashes; spot-check that responses in at least 3 categories look reasonable
  • multiturn shows the model uses prior context (doesn't repeat introductions)
  • mask_ablation shows the model produces clean Kinyarwanda even without the mask (a real fluency check)
  • bench throughput is within expected range for the target hardware

Exporting for distribution

train.py already saves in the distribution format, so this step is only needed if you want to:

  • Bundle a tokenizer into the directory
  • Tag the export with a release version string
  • Convert an old .pt checkpoint to safetensors

python scripts/export_for_release.py \
    --checkpoint ./sft_output/alta_sft_final \
    --output ./release/alta-base-sft-v1.0 \
    --version v1.0 \
    --include_tokenizer \
    --tokenizer yalilabs/alta-tokenizer

Output:

release/alta-base-sft-v1.0/
├── config.json                   # with release_version + release_date metadata
├── model.safetensors
├── tokenizer.json                # bundled
├── special_tokens_map.json
├── tokenizer_config.json
└── README.md                     # auto-generated model card

Uploading weights to Hugging Face

Use upload_to_hub.sh — it validates everything (auth, repo existence, load test, tag collision) before uploading.

# Standard release
./scripts/upload_to_hub.sh \
    --model_dir ./release/alta-base-sft-v1.0 \
    --repo yalilabs/alta-base-sft \
    --version v1.0

# First-time release of a new model (creates repo if missing)
./scripts/upload_to_hub.sh \
    --model_dir ./release/alta-base-sft-v0.9 \
    --repo yalilabs/alta-base-sft \
    --version v0.9 \
    --private --create_repo

# CI-friendly (no prompts)
./scripts/upload_to_hub.sh \
    --model_dir ./release/alta-base-sft-v1.0 \
    --repo yalilabs/alta-base-sft \
    --version v1.0 --yes

# Dry-run to validate without uploading
./scripts/upload_to_hub.sh \
    --model_dir ./release/alta-base-sft-v1.0 \
    --repo yalilabs/alta-base-sft \
    --version v1.0 --dry_run

After upload, always verify by clearing the cache and loading fresh:

rm -rf ~/.cache/huggingface/hub/models--yalilabs--alta-base-sft
alta-sft chat --model yalilabs/alta-base-sft --revision v1.0
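The directory removed above follows the Hub cache naming scheme, where a model repo id is stored under a folder named `models--` plus the id with `/` replaced by `--`. A minimal sketch for computing that path, assuming the default cache root (it moves if the cache location has been overridden, for example through HF_HOME); the helper name is hypothetical:

```python
from pathlib import Path
from typing import Optional


def hub_cache_dir(repo_id: str, cache_root: Optional[Path] = None) -> Path:
    """Path of the cached copy of a model repo under the Hub cache root."""
    root = cache_root or Path.home() / ".cache" / "huggingface" / "hub"
    return root / ("models--" + repo_id.replace("/", "--"))


print(hub_cache_dir("yalilabs/alta-base-sft"))
```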

Cutting a runtime package release

The package on PyPI versions independently of model weights. Bump the package version only when the runtime code changes — not when only weights change.

When to bump:

| Change | Bump |
|---|---|
| Bug fix in inference / CLI / server | patch (0.1.0 → 0.1.1) |
| New CLI flag, new optional arg, new public function | minor (0.1.0 → 0.2.0) |
| Removed function, renamed class, changed default behavior | major (0.1.0 → 1.0.0) |
| Breaking change to config.json schema | bump MODEL_FORMAT_MAX in _version.py AND major bump |
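The bump rules can be written down as a few lines of Python. This is only a sketch of the SemVer arithmetic in the table; the function name `bump` is illustrative and the repo does not necessarily ship such a helper.

```python
def bump(version: str, part: str) -> str:
    """Return the next version after a patch, minor, or major bump."""
    major, minor, patch = (int(piece) for piece in version.split("."))
    if part == "major":
        return f"{major + 1}.0.0"
    if part == "minor":
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"part must be 'major', 'minor', or 'patch', got {part!r}")


print(bump("0.1.0", "patch"))  # 0.1.1
print(bump("0.1.0", "minor"))  # 0.2.0
print(bump("0.1.0", "major"))  # 1.0.0
```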

Steps:

  1. Update src/alta_models_sft/_version.py:

    __version__ = "0.2.0"
    
  2. Update CHANGELOG.md (top of file):

    ## [0.2.0] - 2026-06-15
    ### Added
    - Stream support for `alta-sft generate`
    ### Fixed
    - KV cache overflow on 4096-token contexts
    
  3. Commit, tag, push:

    git add . && git commit -m "Release 0.2.0"
    git tag v0.2.0
    git push origin main --tags
    
  4. GitHub Actions takes over: .github/workflows/release.yml builds the wheel, verifies training code is excluded, and publishes to PyPI via trusted publishing.

  5. Verify on PyPI:

    pip install -U alta-models-sft
    alta-sft --version          # should show 0.2.0
    

Architecture: how training and the package share code

The single most important design decision in this repo: the model architecture is defined exactly once, in src/alta_models_sft/modeling/model.py. Training and inference both import from there.

                       ┌──────────────────────────────────────────┐
                       │  src/alta_models_sft/modeling/model.py  │
                       │  ALTAModel — single definition           │
                       └──────────────────────┬───────────────────┘
                                              │
                  ┌───────────────────────────┼───────────────────────────┐
                  │                           │                           │
                  ▼                           ▼                           ▼
       training/train.py        src/alta_models_sft/inference        external users
       (calls init_weights,     (ALTAChat.from_pretrained)            via `pip install`
        gradient ckpt,           — no init, no training paths
        chunked CE loss)

The model class has both training capabilities (chunked CE loss, weight init, gradient checkpointing toggles) and inference paths (KV-cached generation, safetensors loading). Inference users never invoke the training methods — they're just there, unused.

Why this matters: there's zero possibility of architecture drift between training-time and inference-time code. The shape of every tensor, the order of operations, the special tokens — all guaranteed identical.

Don't add a training_model.py that re-implements parts of the architecture. Don't copy modeling code into training/. If training needs something the model doesn't have, add it to the model class with a flag and document why.
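The pattern can be illustrated with a toy stand-in. `TinyModel` below is not the real ALTAModel, and its fields are invented for the example. The point is that training-only behavior sits behind a flag on the one shared class, so both callers run the same forward computation.

```python
from dataclasses import dataclass


@dataclass
class TinyModel:
    """Toy stand-in for a model class that is defined once and imported everywhere."""

    hidden_size: int = 8
    # Training-only switch; inference code leaves it at the default.
    gradient_checkpointing: bool = False

    def forward(self, tokens: list[int]) -> list[int]:
        # The computation does not depend on who constructed the model.
        return [(token * 31 + self.hidden_size) % 101 for token in tokens]


def build_for_training() -> TinyModel:
    """What a training entry point would do: same class, flag switched on."""
    return TinyModel(gradient_checkpointing=True)


def build_for_inference() -> TinyModel:
    """What the runtime would do: same class, defaults untouched."""
    return TinyModel()


same = build_for_training().forward([1, 2, 3]) == build_for_inference().forward([1, 2, 3])
print(same)  # True
```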


Versioning policy

Two version numbers, kept independent:

  1. Package version (src/alta_models_sft/_version.py → __version__)

    • Versions the inference runtime, CLI, server.
    • Follows SemVer.
    • Released to PyPI.
  2. Model revision (Hugging Face tags: v1.0, v1.1, v2.0-instruct, etc.)

    • Versions the actual weights.
    • Released to Hugging Face Hub.

The runtime checks the model's model_format_version against its supported range (MODEL_FORMAT_MIN..MODEL_FORMAT_MAX). If incompatible, loading fails with a clear error pointing at the fix.
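A compatibility check of this kind could look roughly like the sketch below. The constant values, the exception name, and the assumption that model_format_version is an integer are all placeholders; the real definitions live in _version.py and the loading code.

```python
MODEL_FORMAT_MIN = 1  # placeholder value, not the real constant
MODEL_FORMAT_MAX = 1  # placeholder value, not the real constant


class IncompatibleModelFormat(RuntimeError):
    """Raised when a checkpoint's format is outside the supported range."""


def check_model_format(config: dict, lo: int = MODEL_FORMAT_MIN, hi: int = MODEL_FORMAT_MAX) -> int:
    """Validate config['model_format_version'] and return it."""
    version = config.get("model_format_version")
    if version is None:
        raise IncompatibleModelFormat("config.json has no model_format_version")
    if version < lo:
        raise IncompatibleModelFormat(
            f"model format {version} is older than the supported minimum {lo}; "
            "re-export the checkpoint"
        )
    if version > hi:
        raise IncompatibleModelFormat(
            f"model format {version} is newer than the supported maximum {hi}; "
            "upgrade alta-models-sft"
        )
    return version
```

Separate messages for "too old" and "too new" let the error point at the right fix: re-export the weights in one case, upgrade the package in the other.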

Rule of thumb: users in production should pin both:

# Shell: pin the runtime package
pip install "alta-models-sft==0.1.0"

# Python: pin the model revision
ALTAChat.from_pretrained("yalilabs/alta-base-sft", revision="v1.0")

Operations

Running CI locally before pushing

ruff check src tests
pytest --cov=alta_models_sft

# Build the wheel and verify training code is NOT included
python -m build
python -m zipfile -l dist/*.whl | grep -E "^(training/|scripts/|tests/)"
# ↑ Should print nothing. If anything prints, fix pyproject.toml.
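The same leak check can be done from Python with the standard zipfile module, which is handy on machines without grep. The helper name `leaked_paths` is illustrative, and the demo builds a tiny in-memory archive in place of a real dist/*.whl.

```python
import io
import zipfile

FORBIDDEN_PREFIXES = ("training/", "scripts/", "tests/")


def leaked_paths(wheel) -> list[str]:
    """Return archive members that sit under a directory that must not ship."""
    with zipfile.ZipFile(wheel) as archive:
        return [name for name in archive.namelist() if name.startswith(FORBIDDEN_PREFIXES)]


if __name__ == "__main__":
    # Build a small in-memory archive standing in for a wheel file.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("alta_models_sft/__init__.py", "")
        archive.writestr("training/train.py", "")
    buffer.seek(0)
    print(leaked_paths(buffer))  # ['training/train.py']
```

An empty list means the wheel is clean. `leaked_paths` also accepts a filesystem path to a wheel.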

Common gotchas

  • DDP runs need torchrun. Plain python -m training.train only uses one GPU even on multi-GPU machines.
  • Tokenizer/model vocab mismatch. If you change the tokenizer, you must re-pretrain — SFT can't recover from a vocab mismatch.
  • max_seq_len truncation drops assistant turns. Long multi-turn samples that exceed max_seq_len get truncated from the right, which may remove the supervised target. The dataset logs this; check the filter breakdown.
  • PyPI is forever. Never re-publish the same version number with different content. If 0.2.0 has a bug, release 0.2.1.
  • HF Hub tags should also be immutable in practice. Don't re-tag v1.0 — release v1.0.1.
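The truncation gotcha is easier to see with numbers. The sketch below assumes the common convention of labeling non-assistant tokens with -100 so they are ignored by the loss; the function names and token ids are made up and do not come from training/dataset.py.

```python
IGNORE_INDEX = -100  # conventional "no loss" label for PyTorch cross-entropy


def build_labels(segments: list[tuple[str, list[int]]]) -> tuple[list[int], list[int]]:
    """Concatenate turns; only assistant tokens keep their ids as labels."""
    input_ids: list[int] = []
    labels: list[int] = []
    for role, tokens in segments:
        input_ids.extend(tokens)
        labels.extend(tokens if role == "assistant" else [IGNORE_INDEX] * len(tokens))
    return input_ids, labels


def supervised_after_truncation(labels: list[int], max_seq_len: int) -> int:
    """Count labels that still contribute to the loss after right truncation."""
    return sum(label != IGNORE_INDEX for label in labels[:max_seq_len])


# A 6-token user turn followed by a 4-token assistant turn.
_, labels = build_labels([("user", [1, 2, 3, 4, 5, 6]), ("assistant", [7, 8, 9, 10])])
print(supervised_after_truncation(labels, 10))  # 4
print(supervised_after_truncation(labels, 8))   # 2
print(supervised_after_truncation(labels, 6))   # 0
```

In the last case the sample keeps its full length budget but has nothing left to learn from, which is why the filter breakdown is worth checking.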

Where to look when things break

| Symptom | First place to check |
|---|---|
| Training crashes immediately | ./sft_output/logs/train_rank0.log (usually a data-format issue) |
| Training loss stuck high | Tokenizer/vocab mismatch, or mask_user_tokens config is wrong |
| Sample generations are garbage | Try the mask_ablation test; verify ChatML format matches training |
| PyPI upload fails | Check _version.py matches the git tag; check trusted publishing config |
| HF upload fails auth | huggingface-cli whoami (token may have expired) |
| Model loads on Hub but not locally | Run python -c "import alta_models_sft; print(alta_models_sft.__version__)" to verify install |

Contacts

  • Training questions: #alta-training Slack channel
  • Infra / Hub uploads: #ml-platform
  • Public releases: tag @releases in #alta-models

License

The runtime package is Apache 2.0 (see LICENSE). Training data, internal benchmarks, and unpublished checkpoints are internal only and must not be checked into this repo.
