The fastest inference framework to run BitNet models on CPUs

Trillim

Trillim is the platform for everything local AI. DarkNet is the CPU inference engine powering Trillim.

Install

  • Python 3.12+ required
  • Linux also requires glibc 2.27+
  • uv is the recommended installer

If you installed with uv, prefix the CLI examples below with uv run.

Common Workflows

Pull a Model

trillim list
trillim pull Trillim/BitNet-TRNQ

Chat in the Terminal

trillim chat Trillim/BitNet-TRNQ

trillim chat keeps multi-turn history and reuses the KV cache whenever the next turn simply appends to the exact token sequence of the cached prompt; any edit to earlier turns invalidates the cache from that point. Use /new to reset the conversation or q to quit.
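The cache-reuse rule can be pictured as a prefix check (an illustrative sketch, not Trillim's actual implementation): cached KV entries stay valid only up to the longest exact token prefix shared with the new prompt.

```python
def reusable_prefix_len(cached_tokens, new_tokens):
    """Return how many cached KV entries remain valid: the length of the
    longest exact prefix shared by the cached tokens and the new prompt."""
    n = 0
    for c, t in zip(cached_tokens, new_tokens):
        if c != t:
            break
        n += 1
    return n

# A new turn that appends to the old prompt reuses the whole cache:
assert reusable_prefix_len([1, 5, 9, 2], [1, 5, 9, 2, 7, 8]) == 4
# Editing an earlier turn invalidates the cache from the first mismatch:
assert reusable_prefix_len([1, 5, 9, 2], [1, 5, 3, 2, 7]) == 2
```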

Search-Augmented Chat

Use the search harness with a search-tuned model:

trillim chat Trillim/BitNet-Search-TRNQ --harness search

DuckDuckGo (ddgs) is the default provider. To use Brave:

export SEARCH_API_KEY=<your_api_key>
trillim chat Trillim/BitNet-Search-TRNQ --harness search --search-provider brave
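The provider rules above can be sketched as a small resolver (illustrative only; resolve_provider is not part of the Trillim CLI): ddgs needs no key, while brave requires SEARCH_API_KEY.

```python
import os

def resolve_provider(requested="ddgs"):
    """Resolve the search provider: ddgs is the default and needs no key;
    brave requires SEARCH_API_KEY to be set in the environment."""
    key = os.environ.get("SEARCH_API_KEY")
    if requested == "brave":
        if not key:
            raise RuntimeError("brave search requires SEARCH_API_KEY")
        return "brave", key
    return "ddgs", None

os.environ["SEARCH_API_KEY"] = "demo-key"  # placeholder key for illustration
assert resolve_provider() == ("ddgs", None)
assert resolve_provider("brave") == ("brave", "demo-key")
```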

Serve an OpenAI-Compatible API

Start the server:

trillim serve Trillim/BitNet-TRNQ

Main endpoints:

  • POST /v1/chat/completions
  • POST /v1/completions
  • GET /v1/models
  • POST /v1/models/load

Example with the OpenAI Python client:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
response = client.chat.completions.create(
    model="BitNet-TRNQ",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)

To switch a running server to the search harness, call POST /v1/models/load with "harness": "search" and optional "search_provider": "ddgs" | "brave".
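That harness switch can be composed with the standard library alone. A sketch, assuming the default localhost:8000 server address; the request is built but not sent here:

```python
import json
import urllib.request

# Build a POST /v1/models/load request that switches to the search harness.
payload = {
    "model": "Trillim/BitNet-Search-TRNQ",
    "harness": "search",
    "search_provider": "brave",  # or "ddgs"
}
req = urllib.request.Request(
    "http://localhost:8000/v1/models/load",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(req) would send it to a running server.
assert req.get_method() == "POST"
assert json.loads(req.data)["harness"] == "search"
```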

Quantize a Model or Adapter

If you have a HuggingFace model with safetensors weights (currently only BitNet models are supported):

# Quantize model weights -> qmodel.tensors + rope.cache
trillim quantize <path-to-model> --model

# Extract a PEFT LoRA adapter -> qmodel.lora
trillim quantize <path-to-model> --adapter <path-to-adapter>
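For intuition, BitNet-style ternary quantization maps each weight to {-1, 0, +1} with a per-tensor scale. The absmean scheme below is a generic illustration of that idea, not necessarily the exact TRNQ on-disk format:

```python
def ternary_quantize(weights, eps=1e-8):
    """Quantize weights to {-1, 0, +1} using an absmean scale, in the style
    of BitNet b1.58 ternarization. Returns (scale, quantized_values)."""
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    q = [max(-1, min(1, round(w / scale))) for w in weights]
    return scale, q

# Large-magnitude weights saturate to +/-1; small ones collapse to 0.
scale, q = ternary_quantize([0.9, -0.05, 0.4, -1.1])
assert q == [1, 0, 1, -1]
```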

Use a LoRA Adapter

# Quantize a PEFT adapter into Trillim's format
trillim quantize <path-to-base-model> --adapter <path-to-adapter>

# Run the base model with the adapter
trillim chat Trillim/BitNet-TRNQ --lora <adapter-dir>

# Or pull a pre-quantized adapter and use it by ID
trillim pull Trillim/BitNet-GenZ-LoRA-TRNQ
trillim chat Trillim/BitNet-TRNQ --lora Trillim/BitNet-GenZ-LoRA-TRNQ

The same adapter settings can be changed at runtime through POST /v1/models/load.

Runtime Quantization

Runtime quantization reduces memory use for selected layers during inference:

  • --lora-quant <type> for LoRA layers: none, bf16, int8, q4_0, q5_0, q6_k, q8_0
  • --unembed-quant <type> for the unembedding layer: int8, q4_0, q5_0, q6_k, q8_0

trillim chat Trillim/BitNet-TRNQ --lora <adapter-dir> --lora-quant int8
trillim chat Trillim/BitNet-TRNQ --unembed-quant q4_0
trillim serve Trillim/BitNet-TRNQ --lora-quant q8_0 --unembed-quant q4_0
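To see what runtime quantization trades off, here is a generic symmetric int8 quantize/dequantize round trip. This illustrates the idea only; it is not Trillim's kernel code:

```python
def quantize_int8(values):
    """Symmetric int8 quantization: one scale per tensor, codes in [-127, 127]."""
    scale = max(abs(v) for v in values) / 127 or 1.0
    return scale, [round(v / scale) for v in values]

def dequantize_int8(scale, q):
    """Recover approximate float values from int8 codes and the scale."""
    return [scale * v for v in q]

scale, q = quantize_int8([0.6, -1.0, 0.25])
restored = dequantize_int8(scale, q)
# All codes fit in int8, and reconstruction error stays within one step.
assert all(-127 <= v <= 127 for v in q)
assert all(abs(a - b) <= scale for a, b in zip([0.6, -1.0, 0.25], restored))
```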

Voice Support

Install the optional voice extra before using speech endpoints:

uv add "trillim[voice]"

Or with pip:

pip install "trillim[voice]"

Then start the server with:

trillim serve Trillim/BitNet-TRNQ --voice

Voice endpoints:

  • POST /v1/audio/transcriptions
  • POST /v1/audio/speech
  • GET /v1/voices
  • POST /v1/voices

Predefined voices are alba, marius, javert, jean, fantine, cosette, eponine, and azelma.

For custom voice registration through POST /v1/voices, accept the terms for kyutai/pocket-tts, create a HuggingFace token with Read access, and run:

hf auth login

Custom voice uploads through POST /v1/voices are limited to 8 MB per file.

That setup is only required once. Predefined voices work without it.
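Since POST /v1/voices rejects uploads over 8 MB, a client can pre-check file sizes locally. An illustrative helper, not part of the SDK:

```python
import os
import tempfile

MAX_VOICE_BYTES = 8 * 1024 * 1024  # 8 MB upload limit for POST /v1/voices

def voice_file_ok(path):
    """Return True if the file fits under the 8 MB voice-upload limit."""
    return os.path.getsize(path) <= MAX_VOICE_BYTES

# A small dummy audio file passes the check.
with tempfile.NamedTemporaryFile(delete=False, suffix=".wav") as f:
    f.write(b"\x00" * 1024)
    small = f.name
assert voice_file_ok(small)
os.remove(small)
```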

Performance Highlights

Benchmark takeaways for DarkNet on consumer CPUs:

  • Prefill throughput improvements are most visible when num_threads >= 4.
  • Decode throughput is broadly comparable to bitnet.cpp on average, while DarkNet reaches higher peaks.
  • Results are directional and depend on thermal behavior, boost policy, and memory bandwidth.
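Throughput figures like these reduce to tokens divided by wall-clock time. A minimal, generic helper (not the project's benchmark harness); in practice the elapsed time would come from time.perf_counter():

```python
def tokens_per_second(n_tokens, elapsed_seconds):
    """Throughput: generated (or prefilled) tokens per wall-clock second."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return n_tokens / elapsed_seconds

# 128 tokens decoded in 2 seconds -> 64 tok/s
assert tokens_per_second(128, 2.0) == 64.0
```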

Prefill example:

[Prefill benchmark chart]

Decode example:

[Decode benchmark chart]

Supported Architectures

  • BitnetForCausalLM for ternary BitNet models with ReLU² activation
  • LlamaForCausalLM for Llama-style models with SiLU activation

Platform Support

Platform       Status
x86_64 (AVX2)  Supported
ARM64 (NEON)   Supported

Thread count defaults to num_cores - 2. Override it with --threads N.
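The documented default can be sketched as follows (illustrative; clamping to a minimum of 1 thread is an assumption, and the actual CLI logic may differ):

```python
import os

def default_threads(cores=None):
    """Default worker threads: num_cores - 2, never fewer than 1.
    Overridable, as with the CLI's --threads N flag."""
    cores = cores if cores is not None else (os.cpu_count() or 1)
    return max(1, cores - 2)

assert default_threads(8) == 6
assert default_threads(2) == 1  # clamped on small machines
```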

License

The Trillim Python SDK source code is MIT-licensed. The C++ inference engine binaries (inference, trillim-quantize) bundled in the pip package are proprietary. You may use them as part of Trillim, but may not reverse-engineer or redistribute them separately. See LICENSE for the full terms.
