Trillim

What is Trillim?

Trillim is the fastest inference framework for running BitNet models on CPUs.

Quick Start

Installation

  • Python 3.12+ required
  • glibc 2.27+ required (if on Linux)
  • Install with uv (recommended) or pip

Pick your platform for full instructions:

Note: The rest of this README shows bare trillim commands. If you're using uv, prefix each command with uv run (e.g. uv run trillim chat ...).

Quantize your own model

If you have a HuggingFace BitNet model with safetensors weights:

# Quantize model weights → qmodel.tensors + rope.cache
trillim quantize <path-to-model> --model

# Optionally extract a PEFT LoRA adapter → qmodel.lora
trillim quantize <path-to-model> --adapter <path-to-adapter>
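Conceptually, ternary quantization snaps each weight to {-1, 0, +1} with a single scale factor. The sketch below uses the absmean rounding scheme described in the BitNet b1.58 paper — it is illustrative only, and Trillim's actual quantizer (and its on-disk `qmodel.tensors` format) may differ:

```python
def ternary_quantize(weights, eps=1e-8):
    """Quantize a 2-D weight matrix to {-1, 0, +1} with one per-tensor
    scale (absmean rounding, as in the BitNet b1.58 paper)."""
    flat = [abs(w) for row in weights for w in row]
    scale = sum(flat) / len(flat) + eps      # per-tensor absmean scale
    quant = [[max(-1, min(1, round(w / scale))) for w in row]
             for row in weights]             # snap to ternary levels
    return quant, scale

def dequantize(quant, scale):
    return [[q * scale for q in row] for row in quant]

W = [[0.9, -0.05, -1.2], [0.4, 0.0, -0.6]]
Q, s = ternary_quantize(W)
# Every stored entry is now one of -1, 0, +1, plus a single float scale.
```

Storing one of three values per weight (plus one scale per tensor) is what makes BitNet models so compact and fast on CPU integer units.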

Chat

Start an interactive conversation in your terminal:

trillim chat Trillim/BitNet-TRNQ

Multi-turn conversations are supported with automatic prompt caching for fast follow-ups. Use /new to start a fresh conversation, or q to quit.

See the Chat guide for details on LoRA adapters, sampling parameters, and performance tips.

Search-Augmented Chat

Trillim supports pluggable inference harnesses. For web-search-enabled models, use:

trillim chat Trillim/BitNet-Search-TRNQ --harness search

By default, search uses DuckDuckGo (ddgs). To use Brave:

export SEARCH_API_KEY=<your_api_key>
trillim chat Trillim/BitNet-Search-TRNQ --harness search --search-provider brave

The search harness emits status markers while running its search and synthesis steps. See Chat for full behavior and troubleshooting.

API Server

Trillim includes an OpenAI-compatible API server:

# Start the server
trillim serve Trillim/BitNet-TRNQ

# With voice pipeline (speech-to-text + text-to-speech)
# Requires optional `voice` dependencies:
# docs/server.md -> "Voice Optional Dependencies"
trillim serve Trillim/BitNet-TRNQ --voice

Endpoints:

  • POST /v1/chat/completions — chat completions (streaming supported)
  • POST /v1/completions — text completions
  • GET /v1/models — list loaded models
  • POST /v1/models/load — hot-swap models, LoRA adapters, and harness/search settings at runtime
  • POST /v1/audio/transcriptions — speech-to-text (with --voice)
  • POST /v1/audio/speech — text-to-speech (with --voice)
  • GET /v1/voices — list available TTS voices
  • POST /v1/voices — register a custom voice from audio (see Voice Cloning Setup)

For server-side search harness, start normally and then set "harness": "search" (plus optional "search_provider") through POST /v1/models/load.
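The hot-swap request is a plain JSON POST. The sketch below builds such a body using only the field names mentioned above (`harness`, `search_provider`); the full schema of `/v1/models/load` is in the Server guide:

```python
import json

# "harness" and "search_provider" are the fields named above; the
# endpoint may accept additional fields -- see the Server guide.
payload = {
    "harness": "search",
    "search_provider": "brave",  # optional; DuckDuckGo (ddgs) is the default
}
body = json.dumps(payload)

# Sent with e.g. the stdlib:
#   urllib.request.Request("http://localhost:8000/v1/models/load",
#                          data=body.encode(), method="POST",
#                          headers={"Content-Type": "application/json"})
```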

Works with the OpenAI Python client out of the box:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
response = client.chat.completions.create(
    model="BitNet-TRNQ",
    messages=[{"role": "user", "content": "Hello!"}],
)
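With `stream=True`, the client yields incremental chunks; on the wire these arrive as server-sent-event `data:` lines ending in a `[DONE]` sentinel, per the OpenAI streaming format. A minimal parser for such lines (illustrative — the client normally does this for you):

```python
import json

def iter_sse_content(lines):
    """Yield text deltas from OpenAI-style SSE 'data:' lines."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue
        data = line[len("data:"):].strip()
        if data == "[DONE]":        # end-of-stream sentinel
            return
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"].get("content")
        if delta:
            yield delta

sample = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo!"}}]}',
    "data: [DONE]",
]
text = "".join(iter_sse_content(sample))  # -> "Hello!"
```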

See the Server guide for full endpoint documentation, request/response schemas, the Python SDK, and voice pipeline usage.

LoRA Adapters

Trillim supports PEFT LoRA adapters as bf16 corrections on top of the ternary base model. The adapter lives in its own directory (separate from the base model) and must be quantized first:

# Quantize a PEFT adapter into Trillim's format
trillim quantize <path-to-base-model> --adapter <path-to-adapter>

# Chat with the base model + adapter
trillim chat Trillim/BitNet-TRNQ --lora <adapter-dir>

# Or pull a pre-quantized adapter and use it by ID
trillim pull Trillim/BitNet-GenZ-LoRA-TRNQ
trillim chat Trillim/BitNet-TRNQ --lora Trillim/BitNet-GenZ-LoRA-TRNQ

Adapters can also be hot-swapped at runtime via the API server's POST /v1/models/load endpoint. See the Server guide for details.
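Conceptually, a LoRA adapter adds a low-rank bf16 correction to each adapted layer's output: y = Wx + (alpha/r) · B(Ax), where W is the (dequantized) ternary base weight and A, B are the small adapter matrices. A pure-Python sketch of the math — not Trillim's kernels:

```python
def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha=16, r=2):
    """Base output plus a rank-r correction, scaled by alpha / r
    (the standard PEFT LoRA scaling)."""
    base = matvec(W, x)            # frozen base layer
    corr = matvec(B, matvec(A, x)) # low-rank update: B (A x)
    s = alpha / r
    return [b + s * c for b, c in zip(base, corr)]
```

Because the correction is additive, swapping adapters never touches the quantized base weights — which is what makes runtime hot-swapping cheap.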

Runtime Quantization

Separately from the offline trillim quantize step (which converts model weights to ternary), Trillim can quantize specific layers at inference time to reduce memory usage. This is controlled with two flags available on both chat and serve:

  • --lora-quant <type> — quantize LoRA adapter layers. Options: none, int8, q4_0, q5_0, q6_k, q8_0. Only applies when using --lora.
  • --unembed-quant <type> — quantize the unembedding (output projection) layer. Options: int8, q4_0, q5_0, q6_k, q8_0.

# Quantize LoRA layers to int8 for lower memory
trillim chat Trillim/BitNet-TRNQ --lora <adapter-dir> --lora-quant int8

# Quantize the unembed layer to q4_0
trillim chat Trillim/BitNet-TRNQ --unembed-quant q4_0

# Both at once
trillim serve Trillim/BitNet-TRNQ --lora-quant q8_0 --unembed-quant q4_0

Lower quantization levels (e.g. q4_0) use less memory at a small quality cost. These options can also be set per-request when hot-swapping models via POST /v1/models/load. See the CLI reference for the full flag list.
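The q*_0/q*_k names follow ggml-style block formats: values are grouped into small blocks, each stored as low-bit integer codes plus one scale. A simplified per-block int8 scheme, roughly in the spirit of q8_0 (the real formats pack bits differently):

```python
def quantize_q8_block(values):
    """Symmetric int8 quantization of one block: int8 codes plus a
    single float scale (simplified; real q8_0 blocks hold 32 values)."""
    amax = max(abs(v) for v in values) or 1.0
    scale = amax / 127.0                       # map largest value to +/-127
    codes = [round(v / scale) for v in values]
    return codes, scale

def dequantize_q8_block(codes, scale):
    return [c * scale for c in codes]

codes, scale = quantize_q8_block([0.5, -1.0, 0.25, 0.0])
approx = dequantize_q8_block(codes, scale)     # close to the originals
```

One byte per value plus a shared scale is roughly a 4x saving over fp32, at the cost of a bounded rounding error per block — which is why lower-bit types like q4_0 save more memory but lose a little more quality.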

Voice Cloning Setup

The voice pipeline (--voice) includes 8 predefined voices that work out of the box: alba, marius, javert, jean, fantine, cosette, eponine, azelma.

To register custom voices (voice cloning via POST /v1/voices), you need to accept the PocketTTS model terms and authenticate with HuggingFace:

  1. Go to kyutai/pocket-tts on HuggingFace and accept the model's terms.
  2. Create a token on HuggingFace (under Access Tokens) with Read permissions.
  3. Log in locally so the token is available to download the voice cloning weights:
hf auth login

This only needs to be done once. After that, custom voice registration works automatically. If you skip this step, you'll get an error when trying to register a custom voice — predefined voices will still work fine.
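Registering a custom voice is then a single POST /v1/voices call carrying an audio sample. The body below is only a sketch of what such a request might look like — the field names (`name`, `audio`) are assumptions, not confirmed by the docs above; check the Server guide for the real schema:

```python
import base64
import json

sample_audio = b"\x00\x01\x02\x03"  # stand-in for a short reference clip

# Hypothetical request body for POST /v1/voices -- field names are
# illustrative only.
payload = {
    "name": "my-voice",                                       # assumed field
    "audio": base64.b64encode(sample_audio).decode("ascii"),  # assumed field
}
body = json.dumps(payload)
```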

Supported Architectures

  • BitnetForCausalLM — BitNet with ternary weights and ReLU² activation
  • LlamaForCausalLM — Llama-style with SiLU activation

Platform Support

Platform        Status
x86_64 (AVX2)   Supported
ARM64 (NEON)    Supported

Thread count is auto-detected as num_cores - 2. Override it with the --threads N CLI flag.
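The auto-detected default amounts to the following (a sketch; Trillim's own detection may handle edge cases differently):

```python
import os

def default_threads():
    """Default worker threads: all cores minus two, but at least one.
    Leaving two cores free keeps the system responsive during inference."""
    return max(1, (os.cpu_count() or 1) - 2)
```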

Documentation

License

The Trillim Python SDK source code is MIT-licensed. The C++ inference engine binaries (inference, trillim-quantize) bundled in the pip package are proprietary — you may use them as part of Trillim but may not reverse-engineer or redistribute them separately. See LICENSE for full terms.



