Skip to main content

ALTAModel SFT — instruction-tuned Kinyarwanda language models from YaliLabs.

Project description

ALTA Models — SFT

Instruction-tuned Kinyarwanda language models from YaliLabs

PyPI version Python License: Apache 2.0 Hugging Face


ALTA is a family of language models built Kinyarwanda-first — the tokenizer, training data, and inference are optimized for Kinyarwanda rather than treated as an afterthought to English. This package gives you a clean, dependency-light runtime for chatting with ALTA models in Python or from the command line.

Installation

pip install alta-models-sft

That's it. The package pulls in torch, transformers, huggingface_hub, and safetensors — nothing else by default.

For the optional FastAPI server (alta-sft serve):

pip install "alta-models-sft[serve]"

Quick start

from alta_models_sft import ALTAChat

chat = ALTAChat.from_pretrained("yalilabs/alta-base-sft")
print(chat.chat("Mwiriwe! Ushobora kumbwira amateka y'u Rwanda?"))

Or from the terminal:

alta-sft chat --model yalilabs/alta-base-sft --stream

That's the whole thing. Below is everything you'd want to do with it.

Available models

Model Parameters Context Description
yalilabs/alta-base-sft ~110M 4,096 Base instruction-tuned model

See huggingface.co/yalilabs for the full list. In production, pin to a specific revision:

chat = ALTAChat.from_pretrained("yalilabs/alta-base-sft", revision="v1.0")

Inference cookbook

Everything below uses the same ALTAChat class. Copy-paste any block to try it.

1. Basic chat (single turn)

from alta_models_sft import ALTAChat

chat = ALTAChat.from_pretrained("yalilabs/alta-base-sft")
response = chat.chat("Sobanura ubumenyi bw'ikoranabuhanga.")
print(response)

2. Multi-turn conversation (with memory)

The model remembers prior turns. Just keep calling chat():

chat = ALTAChat.from_pretrained(
    "yalilabs/alta-base-sft",
    use_memory=True,
    max_history_turns=8,
)

chat.chat("Mwiriwe! Nitwa Schadrack.")
chat.chat("Witwa nde?")                # uses the previous turn as context
chat.chat("Wansubize mu magambo make.")

chat.reset()                           # clear history
chat.set_memory(False)                 # disable memory entirely

3. GPU + bfloat16 for speed

chat = ALTAChat.from_pretrained(
    "yalilabs/alta-base-sft",
    device="cuda",
    dtype="bfloat16",                  # "float32" | "bfloat16" | "float16"
)

4. Streaming output (token-by-token)

chat = ALTAChat.from_pretrained("yalilabs/alta-base-sft", device="cuda")

# Tokens print to stdout as they're generated.
# The full response is also returned at the end.
response = chat.chat(
    "Sobanura amateka y'u Rwanda mu magambo make.",
    stream=True,
)

5. Tuning the sampler

# More focused / factual
response = chat.chat(
    "Ni iki Kigali?",
    temperature=0.3, top_p=0.85, top_k=40,
)

# More creative
response = chat.chat(
    "Andika inkuru ngufi y'amateka.",
    temperature=0.8, top_p=0.95, top_k=50,
)

# Longer outputs
response = chat.chat(
    "Sobanura uburezi mu Rwanda.",
    max_new_tokens=1024,
    repetition_penalty=1.05,
)
Parameter Default What it does
temperature 0.5 Lower = focused, higher = creative
top_p 0.85 Nucleus sampling threshold (1.0 disables)
top_k 40 Keep only top-k candidates (0 disables)
repetition_penalty 1.05 Penalize repeated tokens (1.0 disables)
max_new_tokens 512 Maximum tokens to generate
stream False Print tokens as they're generated

6. Loading from a local directory

from_pretrained accepts any local path — useful if you've downloaded weights manually:

# Relative path
chat = ALTAChat.from_pretrained("./my_local_model")

# Absolute path
chat = ALTAChat.from_pretrained("/opt/models/alta-base-sft")

# Home directory
chat = ALTAChat.from_pretrained("~/models/alta")

The same code works for both local paths and Hub repos — no branching required.

7. Private repos (authentication)

import os
os.environ["HF_TOKEN"] = "hf_xxxxxxxxxxxx"
chat = ALTAChat.from_pretrained("yalilabs/alta-private-model")

# Or pass the token directly
chat = ALTAChat.from_pretrained("yalilabs/alta-private-model", token="hf_...")

8. Batch inference (process many prompts)

ALTAChat is single-conversation. For independent prompts, reset between calls:

prompts = [
    "Mwiriwe!",
    "Bite, witwa nde?",
    "Sobanura izuba.",
    "Kuki amazi ari ingenzi?",
]

chat = ALTAChat.from_pretrained("yalilabs/alta-base-sft", device="cuda")
results = []
for p in prompts:
    chat.reset()                       # so prompts don't influence each other
    results.append(chat.chat(p, max_new_tokens=128))

for prompt, response in zip(prompts, results):
    print(f"Q: {prompt}\nA: {response}\n")

9. Custom system prompt

By default, the model uses a Kinyarwanda assistant persona. To override:

chat = ALTAChat.from_pretrained(
    "yalilabs/alta-base-sft",
    system_prompt="Uri umwarimu w'amateka. Subiza nk'umwarimu.",
)

10. Debugging: disable token masking

The model masks out non-Kinyarwanda Unicode (CJK, Arabic, etc.) by default. To see raw model output:

chat = ALTAChat.from_pretrained(
    "yalilabs/alta-base-sft",
    mask_non_kinyarwanda=False,        # not recommended for production
)

Command-line interface

The package installs an alta-sft command. Three subcommands cover most needs.

Interactive chat

alta-sft chat --model yalilabs/alta-base-sft --stream

In-session: /reset clears memory, /quit exits.

One-shot generation

alta-sft generate "Sobanura ubumenyi bw'ikoranabuhanga" \
    --model yalilabs/alta-base-sft \
    --temperature 0.5 \
    --max_new_tokens 256 \
    --stream

HTTP server (FastAPI)

pip install "alta-models-sft[serve]"
alta-sft serve --model yalilabs/alta-base-sft --host 0.0.0.0 --port 8000
# Health check
curl http://localhost:8000/health

# Chat
curl -X POST http://localhost:8000/chat \
  -H 'Content-Type: application/json' \
  -d '{"message": "Mwiriwe!", "temperature": 0.5, "max_new_tokens": 128}'

Interactive API docs are at http://localhost:8000/docs.

Common CLI flags

--model REPO_OR_PATH    Hub repo or local directory (required)
--revision REV          Pin to a Hub tag / branch / SHA
--device DEVICE         cpu | cuda | cuda:N
--dtype DTYPE           float32 | bfloat16 | float16
--temperature FLOAT     Sampling temperature
--top_p FLOAT           Nucleus sampling
--top_k INT             Top-k filtering
--max_new_tokens INT    Max tokens to generate
--no_memory             Disable multi-turn memory
--stream                Token-by-token output

Run alta-sft --help or alta-sft chat --help for the full list.

Production deployment

Docker

FROM python:3.11-slim
RUN pip install --no-cache-dir "alta-models-sft[serve]"
ENV ALTA_MODEL=yalilabs/alta-base-sft \
    ALTA_REVISION=v1.0 \
    ALTA_DEVICE=cpu \
    ALTA_DTYPE=float32
# Pre-download weights at build time → fast cold-start
RUN python -c "from alta_models_sft import ALTAChat; \
    ALTAChat.from_pretrained('${ALTA_MODEL}', revision='${ALTA_REVISION}')"
EXPOSE 8000
CMD ["uvicorn", "alta_models_sft.server:app", "--host", "0.0.0.0", "--port", "8000"]

Version pinning

The runtime and the model version independently. Pin both:

pip install "alta-models-sft==0.1.0"
chat = ALTAChat.from_pretrained("yalilabs/alta-base-sft", revision="v1.0")

Every published model carries a model_format_version. The runtime refuses to load incompatible formats with a clear error — so a user pinning alta-models-sft==0.1.0 can never accidentally load a checkpoint that needs a newer runtime.

Troubleshooting

Model produces non-Kinyarwanda characters (CJK / Arabic)

Token masking is on by default and should prevent this. Make sure you haven't passed mask_non_kinyarwanda=False or the --no_mask CLI flag.

"Could not load tokenizer"

Pass it explicitly:

chat = ALTAChat.from_pretrained(
    "yalilabs/alta-base-sft",
    tokenizer_name="yalilabs/alta-tokenizer",
)
ModelFormatError on load

Your installed alta-models-sft is older than the model's format. Upgrade:

pip install -U alta-models-sft

Or pin to a model revision compatible with your installed runtime.

Out of memory on GPU

Use bfloat16:

chat = ALTAChat.from_pretrained(
    "yalilabs/alta-base-sft",
    device="cuda", dtype="bfloat16",
)
Slow first generation

The first call always pays a one-time cost (CUDA kernel autotuning, tokenizer warm-up). Subsequent calls are much faster. The FastAPI server pre-warms on startup to avoid this on first request.

License

Apache 2.0 — free for commercial and non-commercial use.

Citation

@software{alta_models_sft_2026,
  author  = {YaliLabs},
  title   = {ALTA Models — SFT: Instruction-tuned Kinyarwanda Language Models},
  year    = {2026},
  url     = {https://pypi.org/project/alta-models-sft/},
  version = {0.1.0},
}

Built by YaliLabs for Kinyarwanda speakers worldwide

Website · Models on 🤗

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

alta_models_sft-1.1.1.tar.gz (20.3 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

alta_models_sft-1.1.1-py3-none-any.whl (27.1 kB view details)

Uploaded Python 3

File details

Details for the file alta_models_sft-1.1.1.tar.gz.

File metadata

  • Download URL: alta_models_sft-1.1.1.tar.gz
  • Upload date:
  • Size: 20.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for alta_models_sft-1.1.1.tar.gz
Algorithm Hash digest
SHA256 99b31f14ef7165c34792a961215776778e3ea9f6e507d6d8681a87e77c5ed38a
MD5 a681081b858c04f14e4fae3363b0a340
BLAKE2b-256 f7e47f407150fe2a7a50381d8bc9f6b6f44e4e26482d12f3d122dccf039fd0f0

See more details on using hashes here.

File details

Details for the file alta_models_sft-1.1.1-py3-none-any.whl.

File metadata

File hashes

Hashes for alta_models_sft-1.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 b1b08d51fa86700b00d841de3881f2c0a0628be5a247f7d75963c98d26c989d1
MD5 f9c1e28b608f0626ae398c8964f99cdd
BLAKE2b-256 e239a373155cec4b2cdad1944aff8378b20dd1c5091661e5bc3f4e6f22d5e082

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page