Fast local LLM inference with TTT (Test-Time Training) and LoRA — the model that learns while it runs

Project description

🧠 Bit-TTT-Engine


Fast local LLM inference that learns while it runs.

  • 🏎️ 47+ tok/s on RTX 4060 Ti (7B Q4_K_M)
  • 🧠 TTT (Test-Time Training) — adapts during inference
  • 🎨 LoRA — fine-tune with one flag
  • 📦 5 models — Llama-2/3, Gemma-2, Qwen2.5, Mistral
  • 🔌 OpenAI-compatible API — drop-in replacement

🚀 Quick Start

pip install bit-ttt-engine

import cortex_rust

# Load any GGUF model (auto-downloads from HuggingFace!)
model = cortex_rust.load("user/model-GGUF")

# Chat
response = model.chat([
    {"role": "user", "content": "Hello!"}
])
print(response)

# Stream
for token in model.chat_stream([
    {"role": "user", "content": "Tell me a story"}
]):
    print(token, end="", flush=True)

🖥️ CLI

# Interactive chat
bit-ttt chat model.gguf

# Generate text
bit-ttt generate model.gguf -p "Once upon a time"

# OpenAI-compatible API server
bit-ttt serve model.gguf --port 8000

# With LoRA + Q8 KV cache
bit-ttt chat model.gguf --lora adapter.bin --q8-cache
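
If you drive the engine from Python instead of the CLI, it is natural to expect matching loader options. The sketch below is an assumption, not documented API: the keyword names simply mirror the CLI flags and may differ in the real package.

# HYPOTHETICAL: keyword names assumed from the CLI flags --lora / --q8-cache;
# check the package documentation for the actual Python API.
import cortex_rust

model = cortex_rust.load("model.gguf", lora="adapter.bin", q8_cache=True)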

🧠 TTT — Test-Time Training

The model learns while it generates: instead of keeping all weights frozen, TTT updates part of the network during inference, so it adapts to the conversation as it runs.

model = cortex_rust.load("model.gguf")
model.enable_ttt(True)

# Each conversation makes the model smarter
response = model.chat([{"role": "user", "content": "My name is Alice"}])
# Next time, it remembers context better!
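
Conceptually, a TTT layer keeps small "fast weights" that take a self-supervised gradient step on each incoming token, so the layer adapts to the sequence as it is processed. The minimal PyTorch sketch below illustrates that mechanism only; it is not Bit-TTT-Engine's actual implementation.

import torch

dim = 16
lr = 0.1
w = torch.zeros(dim, dim, requires_grad=True)  # per-sequence fast weights

def ttt_step(x, w):
    # Inner-loop self-supervised objective: reconstruct x through the fast weights.
    pred = x @ w
    loss = ((pred - x) ** 2).mean()
    (grad,) = torch.autograd.grad(loss, w)
    # One gradient step: the layer "learns" from this token at test time.
    w = (w - lr * grad).detach().requires_grad_(True)
    return x @ w, w

tokens = torch.randn(8, dim)  # stand-in for a stream of token embeddings
for t in tokens:
    y, w = ttt_step(t.unsqueeze(0), w)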

⚡ Performance

Model                  Speed       VRAM
Llama-2 7B (Q4_K_M)    47.8 tok/s  ~5 GB
Llama-3 8B (Q4_K_M)    36.8 tok/s  ~6 GB
Mistral 7B (Q4_K_M)    40.8 tok/s  ~5 GB
Qwen2.5 1.5B (Q4_K_M)  70.4 tok/s  ~2 GB

With --q8-cache: 82% VRAM reduction for KV cache.

🔌 OpenAI-Compatible API

bit-ttt serve model.gguf --port 8000

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Hi!"}],
    stream=True,
)
for chunk in response:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
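
Any plain HTTP client works as well, since the wire format follows the OpenAI spec. A minimal sketch with requests; the response shape assumes full compatibility with the OpenAI chat completions schema.

import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",  # standard OpenAI route
    json={
        "model": "default",
        "messages": [{"role": "user", "content": "Hi!"}],
    },
)
print(resp.json()["choices"][0]["message"]["content"])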


💖 License

MIT License



Download files

Download the file for your platform.

Source Distribution

bit_ttt_engine-0.7.0.tar.gz (414.9 kB)

Uploaded: Source

Built Distribution


bit_ttt_engine-0.7.0-cp310-cp310-win_amd64.whl (5.4 MB)

Uploaded: CPython 3.10, Windows x86-64

File details

Details for the file bit_ttt_engine-0.7.0.tar.gz.

File metadata

  • Download URL: bit_ttt_engine-0.7.0.tar.gz
  • Upload date:
  • Size: 414.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/1.11.3

File hashes

Hashes for bit_ttt_engine-0.7.0.tar.gz:

  • SHA256: 3a9d49dab0b32130ad39fd7e0b9ad1ae8567e356a277143521e14f105a32c2f0
  • MD5: 50953c7d0f21198adf8f3a2d9552673f
  • BLAKE2b-256: d1e3002078d4a4229205893cbd3accb858a759ae73db567e1aa7eed8dd7291b7

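To check a download against these digests, here is a sketch using only the standard library (file name and expected hash taken from above):

import hashlib

def sha256sum(path, chunk_size=1 << 20):
    """Stream the file so large archives are not loaded into memory at once."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

expected = "3a9d49dab0b32130ad39fd7e0b9ad1ae8567e356a277143521e14f105a32c2f0"
assert sha256sum("bit_ttt_engine-0.7.0.tar.gz") == expected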

File details

Details for the file bit_ttt_engine-0.7.0-cp310-cp310-win_amd64.whl.

File hashes

Hashes for bit_ttt_engine-0.7.0-cp310-cp310-win_amd64.whl:

  • SHA256: 0a1724d04e58774427fb07df7573e1ecaff119c257349543128f1112d10c2795
  • MD5: 02c2d5ff1f6bfaa9a253260289668b38
  • BLAKE2b-256: 66b29733d670f2660713cecee8a6e3cad88b9041a67b417e5da811c24aafead0

