Trillim
The fastest inference framework for running BitNet models on CPUs.
Quick Start
Installation
- Python 3.12+ required
- glibc 2.27+ required (if on Linux)
- Install with uv (recommended) or pip
Pick your platform for full instructions:
Note: The rest of this README shows bare trillim commands. If you're using uv, prefix each command with uv run (e.g. uv run trillim chat ...).
Quantize your own model
If you have a HuggingFace BitNet model with safetensors weights:
# Quantize model weights → qmodel.tensors + rope.cache
trillim quantize <path-to-model> --model
# Optionally extract a PEFT LoRA adapter → qmodel.lora
trillim quantize <path-to-model> --adapter <path-to-adapter>
Chat
Start an interactive conversation in your terminal:
trillim chat Trillim/BitNet-TRNQ
Multi-turn conversations are supported with automatic prompt caching for fast follow-ups. Use /new to start a fresh conversation, or q to quit.
See the Chat guide for details on LoRA adapters, sampling parameters, and performance tips.
Search-Augmented Chat
Trillim supports pluggable inference harnesses. For web-search-enabled models, use:
trillim chat Trillim/BitNet-Search-TRNQ --harness search
By default, search uses DuckDuckGo (ddgs). To use Brave:
export SEARCH_API_KEY=<your_api_key>
trillim chat Trillim/BitNet-Search-TRNQ --harness search --search-provider brave
The search harness emits status markers while it runs search and synthesis steps. See Chat for full behavior and troubleshooting.
API Server
Trillim includes an OpenAI-compatible API server:
# Start the server
trillim serve Trillim/BitNet-TRNQ
# With voice pipeline (speech-to-text + text-to-speech)
# Requires optional `voice` dependencies:
# docs/server.md -> "Voice Optional Dependencies"
trillim serve Trillim/BitNet-TRNQ --voice
Endpoints:
- POST /v1/chat/completions — chat completions (streaming supported)
- POST /v1/completions — text completions
- GET /v1/models — list loaded models
- POST /v1/models/load — hot-swap models, LoRA adapters, and harness/search settings at runtime
- POST /v1/audio/transcriptions — speech-to-text (with --voice)
- POST /v1/audio/speech — text-to-speech (with --voice)
- GET /v1/voices — list available TTS voices
- POST /v1/voices — register a custom voice from audio (see Voice Cloning Setup)
For server-side search harness, start normally and then set "harness": "search" (plus optional "search_provider") through POST /v1/models/load.
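For example, switching a running server to the search harness can be sketched with the Python standard library. The "harness" and "search_provider" fields come from the description above; the endpoint's response shape is not shown here:

```python
import json
import urllib.request

# Enable the search harness on a running server via POST /v1/models/load.
payload = {"harness": "search", "search_provider": "brave"}

req = urllib.request.Request(
    "http://localhost:8000/v1/models/load",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# Uncomment to send the request against a running server:
# with urllib.request.urlopen(req) as resp:
#     print(resp.status)
```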
Works with the OpenAI Python client out of the box:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
response = client.chat.completions.create(
model="BitNet-TRNQ",
messages=[{"role": "user", "content": "Hello!"}],
)
See the Server guide for full endpoint documentation, request/response schemas, the Python SDK, and voice pipeline usage.
LoRA Adapters
Trillim supports PEFT LoRA adapters as bf16 corrections on top of the ternary base model. The adapter lives in its own directory (separate from the base model) and must be quantized first:
# Quantize a PEFT adapter into Trillim's format
trillim quantize <path-to-base-model> --adapter <path-to-adapter>
# Chat with the base model + adapter
trillim chat Trillim/BitNet-TRNQ --lora <adapter-dir>
# Or pull a pre-quantized adapter and use it by ID
trillim pull Trillim/BitNet-GenZ-LoRA-TRNQ
trillim chat Trillim/BitNet-TRNQ --lora Trillim/BitNet-GenZ-LoRA-TRNQ
Adapters can also be hot-swapped at runtime via the API server's POST /v1/models/load endpoint. See the Server guide for details.
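A hot-swap request can be sketched the same way. The field names "model" and "lora" below are assumptions for illustration; check the Server guide for the exact request schema:

```python
import json
import urllib.request

# Swap in a LoRA adapter on a running server via POST /v1/models/load.
payload = {
    "model": "Trillim/BitNet-TRNQ",
    "lora": "Trillim/BitNet-GenZ-LoRA-TRNQ",
}

req = urllib.request.Request(
    "http://localhost:8000/v1/models/load",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# with urllib.request.urlopen(req) as resp:
#     print(resp.status)
```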
Runtime Quantization
Separately from the offline trillim quantize step (which converts model weights to ternary), Trillim can quantize specific layers at inference time to reduce memory usage. This is controlled with two flags available on both chat and serve:
- --lora-quant <type> — quantize LoRA adapter layers. Options: none, int8, q4_0, q5_0, q6_k, q8_0. Only applies when using --lora.
- --unembed-quant <type> — quantize the unembedding (output projection) layer. Options: int8, q4_0, q5_0, q6_k, q8_0.
# Quantize LoRA layers to int8 for lower memory
trillim chat Trillim/BitNet-TRNQ --lora <adapter-dir> --lora-quant int8
# Quantize the unembed layer to q4_0
trillim chat Trillim/BitNet-TRNQ --unembed-quant q4_0
# Both at once
trillim serve Trillim/BitNet-TRNQ --lora-quant q8_0 --unembed-quant q4_0
Lower quantization levels (e.g. q4_0) use less memory at a small quality cost. These options can also be set per-request when hot-swapping models via POST /v1/models/load. See the CLI reference for the full flag list.
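To get a feel for the memory trade-off, here is a rough back-of-the-envelope sketch. The bytes-per-weight figures assume the ggml-style block layouts these type names commonly refer to; whether Trillim's on-the-fly formats match those layouts exactly is an assumption:

```python
# Approximate bytes per weight, assuming common ggml-style block formats
# (e.g. q4_0 = 18 bytes per 32-weight block). Treat as rough estimates.
BYTES_PER_WEIGHT = {
    "none": 2.0,       # bf16 baseline
    "int8": 1.0,
    "q8_0": 34 / 32,   # 32 int8 weights + fp16 scale per block
    "q6_k": 210 / 256,
    "q5_0": 22 / 32,
    "q4_0": 18 / 32,
}

def lora_bytes(rank: int, d_in: int, d_out: int, quant: str) -> int:
    """Estimate memory for one LoRA pair (A: d_in x rank, B: rank x d_out)."""
    n_weights = rank * (d_in + d_out)
    return round(n_weights * BYTES_PER_WEIGHT[quant])

# A rank-16 adapter on a hypothetical 4096x4096 projection:
print(lora_bytes(16, 4096, 4096, "none"))  # bf16
print(lora_bytes(16, 4096, 4096, "q4_0"))  # roughly 3.6x smaller
```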
Voice Cloning Setup
The voice pipeline (--voice) includes 8 predefined voices that work out of the box: alba, marius, javert, jean, fantine, cosette, eponine, azelma.
To register custom voices (voice cloning via POST /v1/voices), you need to accept the PocketTTS model terms and authenticate with HuggingFace:
- Go to kyutai/pocket-tts on HuggingFace and accept the model's terms.
- Create a token on HuggingFace (under Access Tokens) with Read permissions.
- Log in locally so the token is available to download the voice cloning weights:
hf auth login
This only needs to be done once. After that, custom voice registration works automatically. If you skip this step, you'll get an error when trying to register a custom voice — predefined voices will still work fine.
Supported Architectures
- BitnetForCausalLM — BitNet with ternary weights and ReLU² activation
- LlamaForCausalLM — Llama-style with SiLU activation
Platform Support
| Platform | Status |
|---|---|
| x86_64 (AVX2) | Supported |
| ARM64 (NEON) | Supported |
Thread count is auto-detected as num_cores - 2. Override it with the --threads N CLI flag.
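The default can be reproduced in a couple of lines (a sketch; how Trillim handles machines where the core count cannot be detected is an assumption here):

```python
import os

def default_threads() -> int:
    # num_cores - 2, clamped to at least one thread.
    return max(1, (os.cpu_count() or 1) - 2)

print(default_threads())
```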
Documentation
- What is Trillim? — overview, motivation, and who it's for
- Install — macOS | Linux | Windows
- CLI Reference — all commands and flags
- Chat — interactive chat interface
- Server — API endpoints, Python SDK, and OpenAI client usage
License
The Trillim Python SDK source code is MIT-licensed. The C++ inference engine binaries (inference, trillim-quantize) bundled in the pip package are proprietary — you may use them as part of Trillim but may not reverse-engineer or redistribute them separately. See LICENSE for full terms.