ALTAModel SFT — instruction-tuned Kinyarwanda language models from YaliLabs.

These details have not been verified by PyPI

Project links

Project description

ALTA Models — SFT

Instruction-tuned Kinyarwanda language models from YaliLabs

ALTA is a family of language models built Kinyarwanda-first — the tokenizer, training data, and inference are optimized for Kinyarwanda rather than treated as an afterthought to English. This package gives you a clean, dependency-light runtime for chatting with ALTA models in Python or from the command line.

Installation

pip install alta-models-sft

That's it. The package pulls in torch, transformers, huggingface_hub, and safetensors — nothing else by default.

For the optional FastAPI server (alta-sft serve):

pip install "alta-models-sft[serve]"

Quick start

from alta_models_sft import ALTAChat

chat = ALTAChat.from_pretrained("yalilabs/alta-base-sft")
print(chat.chat("Mwiriwe! Ushobora kumbwira amateka y'u Rwanda?"))

Or from the terminal:

alta-sft chat --model yalilabs/alta-base-sft --stream

That's the whole thing. Below is everything you'd want to do with it.

Available models

Model	Parameters	Context	Description
`yalilabs/alta-base-sft`	~110M	4,096	Base instruction-tuned model

See huggingface.co/yalilabs for the full list. In production, pin to a specific revision:

chat = ALTAChat.from_pretrained("yalilabs/alta-base-sft", revision="v1.0")

Inference cookbook

Everything below uses the same ALTAChat class. Copy-paste any block to try it.

1. Basic chat (single turn)

from alta_models_sft import ALTAChat

chat = ALTAChat.from_pretrained("yalilabs/alta-base-sft")
response = chat.chat("Sobanura ubumenyi bw'ikoranabuhanga.")
print(response)

2. Multi-turn conversation (with memory)

The model remembers prior turns. Just keep calling chat():

chat = ALTAChat.from_pretrained(
    "yalilabs/alta-base-sft",
    use_memory=True,
    max_history_turns=8,
)

chat.chat("Mwiriwe! Nitwa Schadrack.")
chat.chat("Witwa nde?")                # uses the previous turn as context
chat.chat("Wansubize mu magambo make.")

chat.reset()                           # clear history
chat.set_memory(False)                 # disable memory entirely

3. GPU + bfloat16 for speed

chat = ALTAChat.from_pretrained(
    "yalilabs/alta-base-sft",
    device="cuda",
    dtype="bfloat16",                  # "float32" | "bfloat16" | "float16"
)

4. Streaming output (token-by-token)

chat = ALTAChat.from_pretrained("yalilabs/alta-base-sft", device="cuda")

# Tokens print to stdout as they're generated.
# The full response is also returned at the end.
response = chat.chat(
    "Sobanura amateka y'u Rwanda mu magambo make.",
    stream=True,
)

5. Tuning the sampler

# More focused / factual
response = chat.chat(
    "Ni iki Kigali?",
    temperature=0.3, top_p=0.85, top_k=40,
)

# More creative
response = chat.chat(
    "Andika inkuru ngufi y'amateka.",
    temperature=0.8, top_p=0.95, top_k=50,
)

# Longer outputs
response = chat.chat(
    "Sobanura uburezi mu Rwanda.",
    max_new_tokens=1024,
    repetition_penalty=1.05,
)

Parameter	Default	What it does
`temperature`	`0.5`	Lower = focused, higher = creative
`top_p`	`0.85`	Nucleus sampling threshold (`1.0` disables)
`top_k`	`40`	Keep only top-k candidates (`0` disables)
`repetition_penalty`	`1.05`	Penalize repeated tokens (`1.0` disables)
`max_new_tokens`	`512`	Maximum tokens to generate
`stream`	`False`	Print tokens as they're generated

6. Loading from a local directory

from_pretrained accepts any local path — useful if you've downloaded weights manually:

# Relative path
chat = ALTAChat.from_pretrained("./my_local_model")

# Absolute path
chat = ALTAChat.from_pretrained("/opt/models/alta-base-sft")

# Home directory
chat = ALTAChat.from_pretrained("~/models/alta")

The same code works for both local paths and Hub repos — no branching required.

7. Private repos (authentication)

import os
os.environ["HF_TOKEN"] = "hf_xxxxxxxxxxxx"
chat = ALTAChat.from_pretrained("yalilabs/alta-private-model")

# Or pass the token directly
chat = ALTAChat.from_pretrained("yalilabs/alta-private-model", token="hf_...")

8. Batch inference (process many prompts)

ALTAChat is single-conversation. For independent prompts, reset between calls:

prompts = [
    "Mwiriwe!",
    "Bite, witwa nde?",
    "Sobanura izuba.",
    "Kuki amazi ari ingenzi?",
]

chat = ALTAChat.from_pretrained("yalilabs/alta-base-sft", device="cuda")
results = []
for p in prompts:
    chat.reset()                       # so prompts don't influence each other
    results.append(chat.chat(p, max_new_tokens=128))

for prompt, response in zip(prompts, results):
    print(f"Q: {prompt}\nA: {response}\n")

9. Custom system prompt

By default, the model uses a Kinyarwanda assistant persona. To override:

chat = ALTAChat.from_pretrained(
    "yalilabs/alta-base-sft",
    system_prompt="Uri umwarimu w'amateka. Subiza nk'umwarimu.",
)

10. Debugging: disable token masking

The model masks out non-Kinyarwanda Unicode (CJK, Arabic, etc.) by default. To see raw model output:

chat = ALTAChat.from_pretrained(
    "yalilabs/alta-base-sft",
    mask_non_kinyarwanda=False,        # not recommended for production
)

Command-line interface

The package installs an alta-sft command. Three subcommands cover most needs.

Interactive chat

alta-sft chat --model yalilabs/alta-base-sft --stream

In-session: /reset clears memory, /quit exits.

One-shot generation

alta-sft generate "Sobanura ubumenyi bw'ikoranabuhanga" \
    --model yalilabs/alta-base-sft \
    --temperature 0.5 \
    --max_new_tokens 256 \
    --stream

HTTP server (FastAPI)

pip install "alta-models-sft[serve]"
alta-sft serve --model yalilabs/alta-base-sft --host 0.0.0.0 --port 8000

# Health check
curl http://localhost:8000/health

# Chat
curl -X POST http://localhost:8000/chat \
  -H 'Content-Type: application/json' \
  -d '{"message": "Mwiriwe!", "temperature": 0.5, "max_new_tokens": 128}'

Interactive API docs are at http://localhost:8000/docs.

Common CLI flags

--model REPO_OR_PATH    Hub repo or local directory (required)
--revision REV          Pin to a Hub tag / branch / SHA
--device DEVICE         cpu | cuda | cuda:N
--dtype DTYPE           float32 | bfloat16 | float16
--temperature FLOAT     Sampling temperature
--top_p FLOAT           Nucleus sampling
--top_k INT             Top-k filtering
--max_new_tokens INT    Max tokens to generate
--no_memory             Disable multi-turn memory
--stream                Token-by-token output

Run alta-sft --help or alta-sft chat --help for the full list.

Production deployment

Docker

FROM python:3.11-slim
RUN pip install --no-cache-dir "alta-models-sft[serve]"
ENV ALTA_MODEL=yalilabs/alta-base-sft \
    ALTA_REVISION=v1.0 \
    ALTA_DEVICE=cpu \
    ALTA_DTYPE=float32
# Pre-download weights at build time → fast cold-start
RUN python -c "from alta_models_sft import ALTAChat; \
    ALTAChat.from_pretrained('${ALTA_MODEL}', revision='${ALTA_REVISION}')"
EXPOSE 8000
CMD ["uvicorn", "alta_models_sft.server:app", "--host", "0.0.0.0", "--port", "8000"]

Version pinning

The runtime and the model version independently. Pin both:

pip install "alta-models-sft==0.1.0"

chat = ALTAChat.from_pretrained("yalilabs/alta-base-sft", revision="v1.0")

Every published model carries a model_format_version. The runtime refuses to load incompatible formats with a clear error — so a user pinning alta-models-sft==0.1.0 can never accidentally load a checkpoint that needs a newer runtime.

Troubleshooting

Model produces non-Kinyarwanda characters (CJK / Arabic)

Token masking is on by default and should prevent this. Make sure you haven't passed mask_non_kinyarwanda=False or the --no_mask CLI flag.

"Could not load tokenizer"

Pass it explicitly:

chat = ALTAChat.from_pretrained(
    "yalilabs/alta-base-sft",
    tokenizer_name="yalilabs/alta-tokenizer",
)

ModelFormatError on load

Your installed alta-models-sft is older than the model's format. Upgrade:

pip install -U alta-models-sft

Or pin to a model revision compatible with your installed runtime.

Out of memory on GPU

Use bfloat16:

chat = ALTAChat.from_pretrained(
    "yalilabs/alta-base-sft",
    device="cuda", dtype="bfloat16",
)

Slow first generation

The first call always pays a one-time cost (CUDA kernel autotuning, tokenizer warm-up). Subsequent calls are much faster. The FastAPI server pre-warms on startup to avoid this on first request.

License

Apache 2.0 — free for commercial and non-commercial use.

Citation

@software{alta_models_sft_2026,
  author  = {YaliLabs},
  title   = {ALTA Models — SFT: Instruction-tuned Kinyarwanda Language Models},
  year    = {2026},
  url     = {https://pypi.org/project/alta-models-sft/},
  version = {0.1.0},
}

Built by YaliLabs for Kinyarwanda speakers worldwide

Website · Models on 🤗

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

1.1.1

May 29, 2026

1.1.0

May 29, 2026

1.0.0

May 29, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

alta_models_sft-1.1.1.tar.gz (20.3 kB view details)

Uploaded May 29, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

alta_models_sft-1.1.1-py3-none-any.whl (27.1 kB view details)

Uploaded May 29, 2026 Python 3

File details

Details for the file alta_models_sft-1.1.1.tar.gz.

File metadata

Download URL: alta_models_sft-1.1.1.tar.gz
Upload date: May 29, 2026
Size: 20.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for alta_models_sft-1.1.1.tar.gz
Algorithm	Hash digest
SHA256	`99b31f14ef7165c34792a961215776778e3ea9f6e507d6d8681a87e77c5ed38a`
MD5	`a681081b858c04f14e4fae3363b0a340`
BLAKE2b-256	`f7e47f407150fe2a7a50381d8bc9f6b6f44e4e26482d12f3d122dccf039fd0f0`

See more details on using hashes here.

File details

Details for the file alta_models_sft-1.1.1-py3-none-any.whl.

File metadata

Download URL: alta_models_sft-1.1.1-py3-none-any.whl
Upload date: May 29, 2026
Size: 27.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for alta_models_sft-1.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b1b08d51fa86700b00d841de3881f2c0a0628be5a247f7d75963c98d26c989d1`
MD5	`f9c1e28b608f0626ae398c8964f99cdd`
BLAKE2b-256	`e239a373155cec4b2cdad1944aff8378b20dd1c5091661e5bc3f4e6f22d5e082`

See more details on using hashes here.

alta-models-sft 1.1.1

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

ALTA Models — SFT

Installation

Quick start

Available models

Inference cookbook

1. Basic chat (single turn)

2. Multi-turn conversation (with memory)

3. GPU + bfloat16 for speed

4. Streaming output (token-by-token)

5. Tuning the sampler

6. Loading from a local directory

7. Private repos (authentication)

8. Batch inference (process many prompts)

9. Custom system prompt

10. Debugging: disable token masking

Command-line interface

Interactive chat

One-shot generation

HTTP server (FastAPI)

Common CLI flags

Production deployment

Docker

Version pinning

Troubleshooting

License

Citation

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes