ALTAModel SFT — instruction-tuned Kinyarwanda language models from YaliLabs.
ALTA is a family of language models built Kinyarwanda-first — the tokenizer, training data, and inference are optimized for Kinyarwanda rather than treated as an afterthought to English. This package gives you a clean, dependency-light runtime for chatting with ALTA models in Python or from the command line.
Installation
pip install alta-models-sft
That's it. The package pulls in torch, transformers, huggingface_hub, and safetensors — nothing else by default.
For the optional FastAPI server (alta-sft serve):
pip install "alta-models-sft[serve]"
Quick start
from alta_models_sft import ALTAChat
chat = ALTAChat.from_pretrained("yalilabs/alta-base-sft")
print(chat.chat("Mwiriwe! Ushobora kumbwira amateka y'u Rwanda?"))
Or from the terminal:
alta-sft chat --model yalilabs/alta-base-sft --stream
That's the whole thing. Below is everything you'd want to do with it.
Available models
| Model | Parameters | Context (tokens) | Description |
|---|---|---|---|
| yalilabs/alta-base-sft | ~110M | 4,096 | Base instruction-tuned model |
See huggingface.co/yalilabs for the full list. In production, pin to a specific revision:
chat = ALTAChat.from_pretrained("yalilabs/alta-base-sft", revision="v1.0")
Inference cookbook
Everything below uses the same ALTAChat class. Copy-paste any block to try it.
1. Basic chat (single turn)
from alta_models_sft import ALTAChat
chat = ALTAChat.from_pretrained("yalilabs/alta-base-sft")
response = chat.chat("Sobanura ubumenyi bw'ikoranabuhanga.")
print(response)
2. Multi-turn conversation (with memory)
The model remembers prior turns. Just keep calling chat():
chat = ALTAChat.from_pretrained(
    "yalilabs/alta-base-sft",
    use_memory=True,
    max_history_turns=8,
)
chat.chat("Mwiriwe! Nitwa Schadrack.")
chat.chat("Witwa nde?") # uses the previous turn as context
chat.chat("Wansubize mu magambo make.")
chat.reset() # clear history
chat.set_memory(False) # disable memory entirely
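To make max_history_turns concrete, here is a minimal sketch of how a bounded conversation memory could behave. It is not ALTAChat's actual implementation: the TurnHistory class, the "User:" / "Assistant:" prompt layout, and the placeholder replies are all assumptions made for illustration.

```python
from collections import deque


class TurnHistory:
    """Hypothetical bounded conversation memory (not ALTAChat's real code)."""

    def __init__(self, max_turns=8, enabled=True):
        self.enabled = enabled
        # A deque with maxlen silently drops the oldest turn once it is full.
        self._turns = deque(maxlen=max_turns)

    def add(self, user, assistant):
        """Record one completed turn, unless memory is disabled."""
        if self.enabled:
            self._turns.append((user, assistant))

    def reset(self):
        """Forget every stored turn."""
        self._turns.clear()

    def build_context(self, new_message):
        """Return the stored turns followed by the new user message."""
        lines = []
        for user, assistant in self._turns:
            lines.append(f"User: {user}")
            lines.append(f"Assistant: {assistant}")
        lines.append(f"User: {new_message}")
        return "\n".join(lines)


history = TurnHistory(max_turns=2)
history.add("Mwiriwe! Nitwa Schadrack.", "<reply 1>")
history.add("Witwa nde?", "<reply 2>")
history.add("Sobanura izuba.", "<reply 3>")  # evicts the first turn
print(history.build_context("Wansubize mu magambo make."))
```

With max_turns=2 the first turn has already been evicted when the context is built, which is the trade-off a small max_history_turns makes: shorter prompts, but the model can no longer see early details such as the user's name.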
3. GPU + bfloat16 for speed
chat = ALTAChat.from_pretrained(
    "yalilabs/alta-base-sft",
    device="cuda",
    dtype="bfloat16",  # "float32" | "bfloat16" | "float16"
)
4. Streaming output (token-by-token)
chat = ALTAChat.from_pretrained("yalilabs/alta-base-sft", device="cuda")
# Tokens print to stdout as they're generated.
# The full response is also returned at the end.
response = chat.chat(
    "Sobanura amateka y'u Rwanda mu magambo make.",
    stream=True,
)
5. Tuning the sampler
# More focused / factual
response = chat.chat(
    "Ni iki Kigali?",
    temperature=0.3, top_p=0.85, top_k=40,
)
# More creative
response = chat.chat(
    "Andika inkuru ngufi y'amateka.",
    temperature=0.8, top_p=0.95, top_k=50,
)
# Longer outputs
response = chat.chat(
    "Sobanura uburezi mu Rwanda.",
    max_new_tokens=1024,
    repetition_penalty=1.05,
)
| Parameter | Default | What it does |
|---|---|---|
| temperature | 0.5 | Lower = focused, higher = creative |
| top_p | 0.85 | Nucleus sampling threshold (1.0 disables) |
| top_k | 40 | Keep only top-k candidates (0 disables) |
| repetition_penalty | 1.05 | Penalize repeated tokens (1.0 disables) |
| max_new_tokens | 512 | Maximum tokens to generate |
| stream | False | Print tokens as they're generated |
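If you want intuition for how temperature, top_k, and top_p interact, the sketch below applies them to a toy list of logits in pure Python. It is a generic illustration of these sampling techniques, not ALTA's sampler: the function names, the order of operations, and the toy logits are assumptions, and repetition_penalty is left out for brevity.

```python
import math
import random


def filter_distribution(logits, temperature=0.5, top_k=40, top_p=0.85):
    """Return {token_id: probability} after temperature, top-k, and top-p."""
    # Temperature: divide the logits, then softmax. Lower values sharpen
    # the distribution. (temperature must be > 0 in this sketch.)
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    ranked = sorted(
        ((i, e / total) for i, e in enumerate(exps)),
        key=lambda pair: pair[1],
        reverse=True,
    )
    # Top-k: keep only the k most likely tokens (0 disables).
    if top_k > 0:
        ranked = ranked[:top_k]
    # Top-p: keep the smallest prefix whose cumulative probability
    # reaches top_p (1.0 disables).
    if top_p < 1.0:
        kept, cumulative = [], 0.0
        for token_id, prob in ranked:
            kept.append((token_id, prob))
            cumulative += prob
            if cumulative >= top_p:
                break
        ranked = kept
    # Renormalize the survivors so they sum to 1.
    mass = sum(prob for _, prob in ranked)
    return {token_id: prob / mass for token_id, prob in ranked}


def sample(logits, rng=random, **kwargs):
    """Draw one token id from the filtered distribution."""
    dist = filter_distribution(logits, **kwargs)
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]


toy_logits = [2.0, 1.0, 0.5, -1.0]
print(filter_distribution(toy_logits, temperature=0.3, top_k=40, top_p=0.85))
print(filter_distribution(toy_logits, temperature=0.8, top_k=50, top_p=0.95))
```

On these toy logits the "focused" settings leave a single candidate, while the "creative" settings keep three, which is why the second preset produces more varied text.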
6. Loading from a local directory
from_pretrained accepts any local path — useful if you've downloaded weights manually:
# Relative path
chat = ALTAChat.from_pretrained("./my_local_model")
# Absolute path
chat = ALTAChat.from_pretrained("/opt/models/alta-base-sft")
# Home directory
chat = ALTAChat.from_pretrained("~/models/alta")
The same code works for both local paths and Hub repos — no branching required.
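For the curious, a loader can tell the two cases apart with a simple directory check. The sketch below shows one plausible way to do it; resolve_source is a hypothetical helper, and whether from_pretrained expands "~" or resolves paths in exactly this way is an assumption.

```python
import os


def resolve_source(name_or_path):
    """Classify a model reference as a local directory or a Hub repo id."""
    # Expand "~" and make the path absolute before checking the filesystem.
    expanded = os.path.abspath(os.path.expanduser(name_or_path))
    if os.path.isdir(expanded):
        return ("local", expanded)
    # Anything that is not an existing directory is treated as a Hub repo id.
    return ("hub", name_or_path)


print(resolve_source("."))
print(resolve_source("yalilabs/alta-base-sft"))
```

One consequence of this kind of check is that a local folder whose relative path happens to match a repo id would win over the Hub, so an explicit absolute path is the safer choice in scripts.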
7. Private repos (authentication)
import os
os.environ["HF_TOKEN"] = "hf_xxxxxxxxxxxx"
chat = ALTAChat.from_pretrained("yalilabs/alta-private-model")
# Or pass the token directly
chat = ALTAChat.from_pretrained("yalilabs/alta-private-model", token="hf_...")
8. Batch inference (process many prompts)
An ALTAChat instance tracks a single conversation. To run independent prompts, reset it between calls:
prompts = [
    "Mwiriwe!",
    "Bite, witwa nde?",
    "Sobanura izuba.",
    "Kuki amazi ari ingenzi?",
]
chat = ALTAChat.from_pretrained("yalilabs/alta-base-sft", device="cuda")
results = []
for p in prompts:
    chat.reset()  # so prompts don't influence each other
    results.append(chat.chat(p, max_new_tokens=128))
for prompt, response in zip(prompts, results):
    print(f"Q: {prompt}\nA: {response}\n")
9. Custom system prompt
By default, the model uses a Kinyarwanda assistant persona. To override:
chat = ALTAChat.from_pretrained(
    "yalilabs/alta-base-sft",
    system_prompt="Uri umwarimu w'amateka. Subiza nk'umwarimu.",
)
10. Debugging: disable token masking
The model masks out non-Kinyarwanda Unicode (CJK, Arabic, etc.) by default. To see raw model output:
chat = ALTAChat.from_pretrained(
    "yalilabs/alta-base-sft",
    mask_non_kinyarwanda=False,  # not recommended for production
)
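As a rough picture of what token masking does, the sketch below sets the logit of any token containing non-Latin letters to negative infinity, so it can never be sampled. This is an illustration under stated assumptions, not the package's code: the Latin-only rule, the is_latin_text and mask_logits helpers, and the four-token vocabulary are all hypothetical, and ALTA's real allow-list may differ.

```python
import unicodedata


def is_latin_text(piece):
    """True if every letter in `piece` belongs to the Latin script.

    Uses the Unicode character name as a rough script test. Digits,
    whitespace, and punctuation are always allowed.
    """
    for ch in piece:
        if ch.isalpha() and not unicodedata.name(ch, "").startswith("LATIN"):
            return False
    return True


def mask_logits(logits, vocab):
    """Replace the logit of every disallowed token with -inf."""
    return [
        logit if is_latin_text(vocab[i]) else float("-inf")
        for i, logit in enumerate(logits)
    ]


# Toy four-token vocabulary (hypothetical; not ALTA's tokenizer).
vocab = ["amazi", " u Rwanda", "水", "ماء"]
print(mask_logits([1.2, 0.7, 0.9, 0.4], vocab))
```

Because a token with a logit of -inf gets zero probability after softmax, masking happens before sampling and does not change the relative odds of the tokens that remain.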
Command-line interface
The package installs an alta-sft command. Three subcommands cover most needs.
Interactive chat
alta-sft chat --model yalilabs/alta-base-sft --stream
In-session: /reset clears memory, /quit exits.
One-shot generation
alta-sft generate "Sobanura ubumenyi bw'ikoranabuhanga" \
    --model yalilabs/alta-base-sft \
    --temperature 0.5 \
    --max_new_tokens 256 \
    --stream
HTTP server (FastAPI)
pip install "alta-models-sft[serve]"
alta-sft serve --model yalilabs/alta-base-sft --host 0.0.0.0 --port 8000
# Health check
curl http://localhost:8000/health
# Chat
curl -X POST http://localhost:8000/chat \
    -H 'Content-Type: application/json' \
    -d '{"message": "Mwiriwe!", "temperature": 0.5, "max_new_tokens": 128}'
Interactive API docs are at http://localhost:8000/docs.
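If you'd rather call the server from Python than from curl, a standard-library client is enough. The sketch below mirrors the curl example: the /chat path and the message, temperature, and max_new_tokens fields come from that example, while the build_chat_request and send helper names are hypothetical. The response schema isn't documented here, so the sketch returns the decoded JSON without assuming any field names.

```python
import json
import urllib.request


def build_chat_request(message, base_url="http://localhost:8000",
                       temperature=0.5, max_new_tokens=128):
    """Build a POST request for the /chat endpoint shown in the curl example."""
    payload = {
        "message": message,
        "temperature": temperature,
        "max_new_tokens": max_new_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url}/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the decoded JSON body unchanged."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


request = build_chat_request("Mwiriwe!")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# With `alta-sft serve` running locally, send it with:
#     print(send(request))
```

Building the request separately from sending it makes the payload easy to inspect or log before it reaches the server.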
Common CLI flags
--model REPO_OR_PATH     Hub repo or local directory (required)
--revision REV           Pin to a Hub tag / branch / SHA
--device DEVICE          cpu | cuda | cuda:N
--dtype DTYPE            float32 | bfloat16 | float16
--temperature FLOAT      Sampling temperature
--top_p FLOAT            Nucleus sampling
--top_k INT              Top-k filtering
--max_new_tokens INT     Max tokens to generate
--no_memory              Disable multi-turn memory
--stream                 Token-by-token output
Run alta-sft --help or alta-sft chat --help for the full list.
Production deployment
Docker
FROM python:3.11-slim
RUN pip install --no-cache-dir "alta-models-sft[serve]"
ENV ALTA_MODEL=yalilabs/alta-base-sft \
    ALTA_REVISION=v1.0 \
    ALTA_DEVICE=cpu \
    ALTA_DTYPE=float32
# Pre-download weights at build time → fast cold-start
RUN python -c "from alta_models_sft import ALTAChat; \
    ALTAChat.from_pretrained('${ALTA_MODEL}', revision='${ALTA_REVISION}')"
EXPOSE 8000
CMD ["uvicorn", "alta_models_sft.server:app", "--host", "0.0.0.0", "--port", "8000"]
Version pinning
The runtime and the model are versioned independently. Pin both:
pip install "alta-models-sft==0.1.0"
chat = ALTAChat.from_pretrained("yalilabs/alta-base-sft", revision="v1.0")
Every published model carries a model_format_version. The runtime refuses to load incompatible formats with a clear error — so a user pinning alta-models-sft==0.1.0 can never accidentally load a checkpoint that needs a newer runtime.
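The check described above can be pictured as a single comparison at load time. The sketch below is only an illustration: model_format_version and the ModelFormatError name appear in this README, but the integer version scheme, the SUPPORTED_FORMAT_VERSION constant, the check_format helper, and the error wording are assumptions.

```python
class ModelFormatError(RuntimeError):
    """Illustrative stand-in for the runtime's format error."""


# Hypothetical: the newest checkpoint format this runtime understands.
SUPPORTED_FORMAT_VERSION = 1


def check_format(config, supported=SUPPORTED_FORMAT_VERSION):
    """Return the checkpoint's format version, or raise if it is too new."""
    if "model_format_version" not in config:
        raise ModelFormatError("checkpoint does not declare model_format_version")
    found = int(config["model_format_version"])
    if found > supported:
        raise ModelFormatError(
            f"checkpoint uses format v{found}, but this runtime supports up to "
            f"v{supported}; try `pip install -U alta-models-sft` or pin an "
            f"older model revision"
        )
    return found


print(check_format({"model_format_version": 1}))
try:
    check_format({"model_format_version": 2})
except ModelFormatError as err:
    print(f"refused: {err}")
```

Failing early with an actionable message is the point of the design: a mismatched pair of runtime and checkpoint stops at load time instead of producing subtly wrong output.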
Troubleshooting
Model produces non-Kinyarwanda characters (CJK / Arabic)
Token masking is on by default and should prevent this. Make sure you haven't passed mask_non_kinyarwanda=False or the --no_mask CLI flag.
"Could not load tokenizer"
Pass it explicitly:
chat = ALTAChat.from_pretrained(
    "yalilabs/alta-base-sft",
    tokenizer_name="yalilabs/alta-tokenizer",
)
ModelFormatError on load
Your installed alta-models-sft is too old for the model's format version. Upgrade:
pip install -U alta-models-sft
Or pin to a model revision compatible with your installed runtime.
Out of memory on GPU
Use bfloat16:
chat = ALTAChat.from_pretrained(
    "yalilabs/alta-base-sft",
    device="cuda", dtype="bfloat16",
)
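A quick back-of-the-envelope calculation shows why the dtype matters. The sketch below estimates the size of the weights alone for a model of roughly 110M parameters (the approximate figure from the model table). The weight_megabytes helper is hypothetical, and the estimate ignores activations, the attention cache, and CUDA's own overhead, so real usage will be higher.

```python
# Bytes needed to store one parameter in each supported dtype.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}


def weight_megabytes(num_params, dtype):
    """Approximate size of the weights alone, in MB (10**6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: ~{weight_megabytes(110_000_000, dtype):.0f} MB")
```

This prints about 440 MB for float32 and about 220 MB for either 16-bit dtype, so switching to bfloat16 halves the weight footprint.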
Slow first generation
The first call always pays a one-time cost (CUDA kernel autotuning, tokenizer warm-up). Subsequent calls are much faster. The FastAPI server pre-warms on startup to avoid this on first request.
License
Apache 2.0 — free for commercial and non-commercial use.
Citation
@software{alta_models_sft_2026,
author = {YaliLabs},
title = {ALTA Models — SFT: Instruction-tuned Kinyarwanda Language Models},
year = {2026},
url = {https://pypi.org/project/alta-models-sft/},
version = {0.1.0},
}
Built by YaliLabs for Kinyarwanda speakers worldwide