Fast local LLM inference with TTT (Test-Time Training) and LoRA — the model that learns while it runs

Project description

🧠 Bit-TTT-Engine

Fast local LLM inference that learns while it runs.

  • 🏎️ 47+ tok/s on RTX 4060 Ti (7B Q4_K_M)
  • 🧠 TTT (Test-Time Training) — adapts during inference (world's first!)
  • 🎨 LoRA — fine-tune with one flag
  • 📦 5 models — Llama-2/3, Gemma-2, Qwen2.5, Mistral
  • 🔌 OpenAI-compatible API — drop-in replacement

🚀 Quick Start

pip install bit-ttt-engine

import cortex_rust

# Load any GGUF model (auto-downloads from HuggingFace!)
model = cortex_rust.load("user/model-GGUF")

# Chat
response = model.chat([
    {"role": "user", "content": "Hello!"}
])
print(response)

# Stream
for token in model.chat_stream([
    {"role": "user", "content": "Tell me a story"}
]):
    print(token, end="", flush=True)

🖥️ CLI

# Interactive chat
bit-ttt chat model.gguf

# Generate text
bit-ttt generate model.gguf -p "Once upon a time"

# OpenAI-compatible API server
bit-ttt serve model.gguf --port 8000

# With LoRA + Q8 KV cache
bit-ttt chat model.gguf --lora adapter.bin --q8-cache
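For context on the --lora flag: a LoRA adapter keeps the base weights frozen and adds a small low-rank update on top, which is why a fine-tune can ship as a single small adapter file. A NumPy sketch of the underlying math (illustrative only; the function and shapes below are not the engine's API):

import numpy as np

def lora_forward(x, W, A, B, alpha=16.0, r=8):
    """y = x @ (W + (alpha / r) * A @ B)
    W: frozen base weight (d_in x d_out)
    A: d_in x r, B: r x d_out -- the small trained adapter matrices."""
    return x @ W + (alpha / r) * (x @ A) @ B

x = np.random.randn(1, 512)
W = np.random.randn(512, 512)
A = np.random.randn(512, 8) * 0.01
B = np.zeros((8, 512))   # B starts at zero, so an untrained adapter is a no-op
y = lora_forward(x, W, A, B)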

🧠 TTT — Test-Time Training

The model learns while it generates. No other local LLM does this.

model = cortex_rust.load("model.gguf")
model.enable_ttt(True)

# Each conversation makes the model smarter
response = model.chat([{"role": "user", "content": "My name is Alice"}])
# Next time, it remembers context better!
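Conceptually, test-time training gives a layer "fast weights" that take a self-supervised gradient step on every token it processes, so later tokens are handled by weights already shaped by earlier context. A toy PyTorch sketch of the idea (illustrative only, not bit-ttt-engine's Rust internals):

import torch

class FastWeightLayer(torch.nn.Module):
    """Toy TTT layer: W takes one gradient step per forward call."""
    def __init__(self, dim: int, lr: float = 1e-2):
        super().__init__()
        self.W = torch.nn.Parameter(0.02 * torch.randn(dim, dim))
        self.lr = lr

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Inner self-supervised objective: reconstruct the input.
        loss = torch.nn.functional.mse_loss(x @ self.W, x)
        (grad,) = torch.autograd.grad(loss, self.W)
        with torch.no_grad():
            self.W -= self.lr * grad   # the test-time update
        return x @ self.W              # output uses the updated weights

layer = FastWeightLayer(dim=64)
for _ in range(4):                     # simulate a few incoming tokens
    out = layer(torch.randn(1, 64))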

⚡ Performance

Model                   Speed       VRAM
Llama-2 7B (Q4_K_M)     47.8 tok/s  ~5 GB
Llama-3 8B (Q4_K_M)     36.8 tok/s  ~6 GB
Mistral 7B (Q4_K_M)     40.8 tok/s  ~5 GB
Qwen2.5 1.5B (Q4_K_M)   70.4 tok/s  ~2 GB

With --q8-cache: 82% VRAM reduction for KV cache.
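As a back-of-envelope check on what the KV cache costs, the estimator below assumes Llama-2 7B dimensions (32 layers, 32 KV heads, head dim 128). Element width alone only accounts for a 50% saving over FP16, so the quoted 82% presumably reflects the engine's own measurement:

def kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128,
                   seq_len=4096, bytes_per_elem=2):
    """Total KV-cache size: a K and a V tensor for every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

print(kv_cache_bytes() / 2**30)                  # FP16 cache: 2.0 GiB
print(kv_cache_bytes(bytes_per_elem=1) / 2**30)  # Q8 cache:   1.0 GiB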

🔌 OpenAI-Compatible API

bit-ttt serve model.gguf --port 8000

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Hi!"}],
    stream=True,
)
for chunk in response:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)

📖 Links

💖 License

MIT License

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

bit_ttt_engine-0.8.0.tar.gz (406.0 kB)

Uploaded: Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

bit_ttt_engine-0.8.0-cp310-cp310-win_amd64.whl (3.0 MB)

Uploaded: CPython 3.10, Windows x86-64

File details

Details for the file bit_ttt_engine-0.8.0.tar.gz.

File metadata

  • Download URL: bit_ttt_engine-0.8.0.tar.gz
  • Upload date:
  • Size: 406.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/1.11.3

File hashes

Hashes for bit_ttt_engine-0.8.0.tar.gz
Algorithm    Hash digest
SHA256       8a4e713dd403fff25e88843220a6366135b5ba72317826bd4bf5a05c6661cdb8
MD5          635e47593deec3fe4526973cd411cfa7
BLAKE2b-256  2e918bd2b566e42641d2956f05a816f6538803a1e278deef80e97c2e741dfde2

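To verify a download against the SHA256 digest above using only the standard library:

import hashlib

with open("bit_ttt_engine-0.8.0.tar.gz", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()
print(digest == "8a4e713dd403fff25e88843220a6366135b5ba72317826bd4bf5a05c6661cdb8")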

File details

Details for the file bit_ttt_engine-0.8.0-cp310-cp310-win_amd64.whl.

File metadata

File hashes

Hashes for bit_ttt_engine-0.8.0-cp310-cp310-win_amd64.whl
Algorithm    Hash digest
SHA256       7f3e75d71a9e6bdbb3f0f919d76a52813789c98e9d4aff2121ca47a483933dc1
MD5          eb6d253e2b263c57bc838a37434204b8
BLAKE2b-256  4f83f4657b91ce234d25c6b8805f6eb363cc4a7dddfde89dd5655402c98efc4d
