The fastest inference framework to run BitNet models on CPUs

Trillim

What is Trillim?

Quick Start

Installation

  • Python 3.12+ required
  • Install with uv (recommended) or pip

See the platform-specific installation guides in the documentation for full instructions.

Note: The rest of this README shows bare trillim commands. If you're using uv, prefix each command with uv run (e.g. uv run trillim chat ...).

Quantize your own model

If you have a HuggingFace BitNet model with safetensors weights:

# Quantize model weights → qmodel.tensors + rope.cache
trillim quantize <path-to-model> --model

# Optionally extract a PEFT LoRA adapter → qmodel.lora
trillim quantize <path-to-model> --adapter <path-to-adapter>

Chat

Start an interactive conversation in your terminal:

trillim chat Trillim/BitNet-TRNQ

Multi-turn conversations are supported with automatic prompt caching for fast follow-ups. Use /new to start a fresh conversation, or q to quit.

See the Chat guide for details on LoRA adapters, sampling parameters, and performance tips.

API Server

Trillim includes an OpenAI-compatible API server:

# Start the server
trillim serve Trillim/BitNet-TRNQ

# With voice pipeline (speech-to-text + text-to-speech)
trillim serve Trillim/BitNet-TRNQ --voice

Endpoints:

  • POST /v1/chat/completions — chat completions (streaming supported)
  • POST /v1/completions — text completions
  • GET /v1/models — list loaded models
  • POST /v1/models/load — hot-swap models and LoRA adapters at runtime
  • POST /v1/audio/transcriptions — speech-to-text (with --voice)
  • POST /v1/audio/speech — text-to-speech (with --voice)
  • GET /v1/voices — list available TTS voices
  • POST /v1/voices — register a custom voice from audio (see Voice Cloning Setup)

Works with the OpenAI Python client out of the box:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
response = client.chat.completions.create(
    model="BitNet-TRNQ",
    messages=[{"role": "user", "content": "Hello!"}],
)

See the Server guide for full endpoint documentation, request/response schemas, the Python SDK, and voice pipeline usage.
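Streaming works through the same client by passing stream=True. A minimal sketch (the stream_chat helper below is illustrative, not part of Trillim's SDK):

```python
def stream_chat(client, model: str, prompt: str) -> str:
    """Stream a chat completion, printing tokens as they arrive."""
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    parts = []
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)
            parts.append(delta)
    print()
    return "".join(parts)

# Against a running `trillim serve Trillim/BitNet-TRNQ`:
# from openai import OpenAI
# client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
# print(stream_chat(client, "BitNet-TRNQ", "Hello!"))
```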

LoRA Adapters

Trillim supports PEFT LoRA adapters as bf16 corrections on top of the ternary base model. The adapter lives in its own directory (separate from the base model) and must be quantized first:

# Quantize a PEFT adapter into Trillim's format
trillim quantize <path-to-base-model> --adapter <path-to-adapter>

# Chat with the base model + adapter
trillim chat Trillim/BitNet-TRNQ --lora <adapter-dir>

# Or pull a pre-quantized adapter and use it by ID
trillim pull Trillim/BitNet-GenZ-LoRA-TRNQ
trillim chat Trillim/BitNet-TRNQ --lora Trillim/BitNet-GenZ-LoRA-TRNQ

Adapters can also be hot-swapped at runtime via the API server's POST /v1/models/load endpoint. See the Server guide for details.
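As a sketch, a hot-swap request can be built like this; the body field names ("model", "lora") are illustrative assumptions, so check the Server guide for the actual schema:

```python
import json
import urllib.request

# Hypothetical request body -- field names are assumptions, not the
# documented schema.
payload = {
    "model": "Trillim/BitNet-TRNQ",
    "lora": "Trillim/BitNet-GenZ-LoRA-TRNQ",
}

req = urllib.request.Request(
    "http://localhost:8000/v1/models/load",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# with urllib.request.urlopen(req) as resp:  # requires a running server
#     print(resp.read().decode())
```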

Runtime Quantization

Separately from the offline trillim quantize step (which converts model weights to ternary), Trillim can quantize specific layers at inference time to reduce memory usage. This is controlled with two flags available on both chat and serve:

  • --lora-quant <type> — quantize LoRA adapter layers. Options: none, int8, q4_0, q5_0, q6_k, q8_0. Only applies when using --lora.
  • --unembed-quant <type> — quantize the unembedding (output projection) layer. Options: int8, q4_0, q5_0, q6_k, q8_0.

# Quantize LoRA layers to int8 for lower memory
trillim chat Trillim/BitNet-TRNQ --lora <adapter-dir> --lora-quant int8

# Quantize the unembed layer to q4_0
trillim chat Trillim/BitNet-TRNQ --unembed-quant q4_0

# Both at once
trillim serve Trillim/BitNet-TRNQ --lora-quant q8_0 --unembed-quant q4_0

Lower quantization levels (e.g. q4_0) use less memory at a small quality cost. These options can also be set per-request when hot-swapping models via POST /v1/models/load. See the CLI reference for the full flag list.
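For intuition on the memory trade-off: assuming these type names follow the llama.cpp/ggml block layouts they resemble (an assumption — Trillim's actual on-disk formats may differ), the effective bits per weight work out roughly as:

```python
# (elements per block, bytes per block), per the ggml layouts these
# names suggest -- an assumption, not Trillim's documented format.
BLOCK = {
    "int8": (1, 1),      # 1 byte per weight (per-tensor scale ignored)
    "q8_0": (32, 34),    # fp16 scale + 32 int8 quants
    "q6_k": (256, 210),  # 256-element super-block with sub-scales
    "q5_0": (32, 22),    # fp16 scale + 4 bytes high bits + 16 bytes nibbles
    "q4_0": (32, 18),    # fp16 scale + 16 bytes of 4-bit nibbles
}

def bits_per_weight(qtype: str) -> float:
    n, nbytes = BLOCK[qtype]
    return nbytes * 8 / n

for t in BLOCK:
    print(f"{t}: {bits_per_weight(t):.2f} bits/weight")
```

So q4_0 stores a weight in roughly 4.5 bits versus 8.5 for q8_0 — close to a 2x memory saving on the quantized layer.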

Voice Cloning Setup

The voice pipeline (--voice) includes 8 predefined voices that work out of the box: alba, marius, javert, jean, fantine, cosette, eponine, azelma.

To register custom voices (voice cloning via POST /v1/voices), you need to accept the PocketTTS model terms and authenticate with HuggingFace:

  1. Go to kyutai/pocket-tts on HuggingFace and accept the model's terms.
  2. Create a token on HuggingFace (under Access Tokens) with Read permissions.
  3. Log in locally so the token is available to download the voice cloning weights:

hf auth login

This only needs to be done once. After that, custom voice registration works automatically. If you skip this step, you'll get an error when trying to register a custom voice — predefined voices will still work fine.
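With the server running under --voice, any of the predefined voices can be used for synthesis. A hedged sketch of a /v1/audio/speech request — the body follows the OpenAI audio.speech schema, which this endpoint mirrors; check the Server guide for the fields Trillim actually accepts:

```python
import json
import urllib.request

# Illustrative request body (OpenAI-style "model"/"input"/"voice" fields).
body = {
    "model": "BitNet-TRNQ",
    "voice": "alba",                 # one of the predefined voices
    "input": "Hello from Trillim!",
}

req = urllib.request.Request(
    "http://localhost:8000/v1/audio/speech",
    data=json.dumps(body).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# with urllib.request.urlopen(req) as resp:  # requires a running server
#     open("hello.wav", "wb").write(resp.read())
```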

Supported Architectures

  • BitnetForCausalLM — BitNet with ternary weights and ReLU² activation
  • LlamaForCausalLM — Llama-style with SiLU activation

Platform Support

Platform         Status
x86_64 (AVX2)    Supported
ARM64 (NEON)     Supported

Thread count is auto-detected as num_cores - 2. Override it with the --threads N flag.
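The documented default amounts to this small heuristic (a sketch, not Trillim's actual code):

```python
import os

# All cores minus two, never dropping below one thread.
def default_threads() -> int:
    cores = os.cpu_count() or 1
    return max(1, cores - 2)

print(default_threads())
```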

Documentation

License

The Trillim Python SDK source code is MIT-licensed. The C++ inference engine binaries (inference, trillim-quantize) bundled in the pip package are proprietary — you may use them as part of Trillim but may not reverse-engineer or redistribute them separately. See LICENSE for full terms.
