T-LEX Edge
Edge-Decoupled LLM Inference - Run 32B+ models on ANY device!
The Problem
Running large language models locally requires expensive GPU hardware:
| Model | VRAM required | Typical GPU cost |
|---|---|---|
| 7B | ~14GB | $500+ |
| 32B | ~24GB | $1000+ |
| 70B | ~40GB | $2000+ |
Most edge devices (laptops, IoT, drones, phones) can't run these models.
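The VRAM column follows roughly from parameter count times bytes per weight (FP16 for the 7B row; the larger rows assume quantized weights). A quick back-of-the-envelope check, where the bits-per-weight values are our assumptions, not official figures:

```python
def vram_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough weight-memory estimate: parameters x bytes per weight.
    Ignores KV cache and activations, which add several GB more."""
    return params_billion * bits_per_weight / 8

print(vram_gb(7, 16))    # ~14 GB at FP16
print(vram_gb(32, 6))    # ~24 GB at ~6-bit quantization
print(vram_gb(70, 4.5))  # ~39 GB at ~4.5-bit quantization
```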
The Solution: Split-Brain Architecture
T-LEX separates inference (GPU server) from decoding (edge device):
```
┌─────────────────────────┐          ┌─────────────────────────┐
│       GPU Server        │          │       Edge Device       │
│   ┌─────────────────┐   │  tokens  │   ┌─────────────────┐   │
│   │ Ollama/OomLlama │───┼─────────►│   │  T-LEX Decoder  │   │
│   │    32B model    │   │          │   │ vocab.db (6MB)  │   │
│   │    24GB VRAM    │   │          │   │     NO GPU!     │   │
│   └─────────────────┘   │          │   └─────────────────┘   │
└─────────────────────────┘          └─────────────────────────┘
```
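On the edge side, decoding reduces to a plain dictionary lookup. Here is a minimal sketch of that step, assuming vocab.db holds a single (id, token) table; the table name and schema are illustrative assumptions, not the package's actual on-disk format:

```python
import sqlite3

# Connect to the tiny vocabulary database - the only artifact the edge needs.
conn = sqlite3.connect("qwen_vocab.db")

def decode(token_ids: list[int]) -> str:
    """Map token IDs streamed from the GPU server back to text."""
    placeholders = ",".join("?" * len(token_ids))
    rows = conn.execute(
        f"SELECT id, token FROM vocab WHERE id IN ({placeholders})",
        token_ids,
    ).fetchall()
    id_to_token = dict(rows)
    # Reassemble in stream order (the SQL IN clause returns rows unordered).
    return "".join(id_to_token[i] for i in token_ids)

# IDs as they might arrive from the server (values are placeholders)
print(decode([40, 2776, 264]))
```

Because each step is a local SQLite lookup rather than a model forward pass, decode throughput can exceed generation throughput by orders of magnitude, which is what the decode-speed row in the table below reflects.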
Performance
| Metric | Local 32B | T-LEX (Remote 32B + Edge Decode) |
|---|---|---|
| GPU Required (Edge) | 24GB VRAM | None! |
| Edge Storage | 20GB+ | 6 MB |
| Generation Speed | 2 tok/s | 16 tok/s |
| Decode Speed | N/A | 45,000 tok/s |
Result: 8x faster generation with zero GPU on the edge device!
Installation
```bash
# Basic installation
pip install tlex-edge

# With server support (FastAPI)
pip install "tlex-edge[server]"

# With RAG support (ChromaDB)
pip install "tlex-edge[rag]"

# Full installation
pip install "tlex-edge[full]"
```
Quick Start
Python API
```python
from tlex import TLexClient

# Connect to the remote GPU server
client = TLexClient("http://gpu-server:11434")

# Generate with a 32B model - no local GPU needed!
result = client.generate(
    "Explain quantum computing",
    model="humotica-32b",
    max_tokens=100,
)
print(result.text)
print(f"Speed: {result.tokens_per_second:.1f} tok/s")

# Streaming output
for chunk in client.stream("Tell me a story"):
    print(chunk, end="", flush=True)
```
Command Line
```bash
# Generate text
tlex generate "What is AI?" --model qwen2.5:7b --server http://gpu-server:11434

# Interactive chat
tlex chat --model humotica-32b

# List available models
tlex models

# Benchmark decoder
tlex benchmark --vocab qwen_vocab.db
```
Docker
```bash
# Build
docker build -t tlex-edge .

# Run
docker run -it tlex-edge generate "Hello!" --model qwen2.5:7b

# With docker-compose
docker-compose up -d
docker-compose exec tlex chat
```
Building Vocabulary Database
The vocab database is all an edge device needs to decode tokens:
```bash
# From command line
tlex vocab Qwen/Qwen2.5-7B-Instruct --output qwen_vocab.db
```

```python
# From Python
from tlex import build_vocab_db

build_vocab_db("Qwen/Qwen2.5-7B-Instruct", "qwen_vocab.db")
```
Size comparison:
- Qwen 7B model: ~14 GB
- qwen_vocab.db: ~6 MB (2300x smaller!)
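The packaged build_vocab_db handles this for you; purely for intuition, a database like the decode sketch above could be produced from a Hugging Face tokenizer along these lines (again, the table name and schema are assumptions, not T-LEX's actual format):

```python
import sqlite3
from transformers import AutoTokenizer

def build_vocab_db_sketch(model_id: str, out_path: str) -> None:
    """Illustrative only: dump a tokenizer's vocabulary into SQLite."""
    tok = AutoTokenizer.from_pretrained(model_id)
    conn = sqlite3.connect(out_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS vocab (id INTEGER PRIMARY KEY, token TEXT)"
    )
    # convert_ids_to_tokens returns the raw vocabulary string for each ID
    rows = [(i, tok.convert_ids_to_tokens(i)) for i in range(tok.vocab_size)]
    conn.executemany("INSERT OR REPLACE INTO vocab VALUES (?, ?)", rows)
    conn.commit()
    conn.close()

build_vocab_db_sketch("Qwen/Qwen2.5-7B-Instruct", "qwen_vocab_sketch.db")
```

The tokenizer download is a one-time cost on a machine with internet access; only the resulting few-megabyte database ships to the edge device.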
Server Setup
T-LEX works with any Ollama-compatible server:
```bash
# On your GPU server (P520, etc.)
ollama serve

# Pull models
ollama pull qwen2.5:7b
ollama pull qwen2.5:32b
```
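Before pointing clients at it, you can sanity-check the server from any edge device with Ollama's standard REST API (no T-LEX needed for this step):

```python
import requests

# /api/tags is Ollama's built-in endpoint for listing pulled models.
resp = requests.get("http://gpu-server:11434/api/tags", timeout=5)
resp.raise_for_status()
for model in resp.json().get("models", []):
    print(model["name"])
```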
Architecture
```
            ┌─────────────────────────────────────────────────┐
            │                   GPU SERVER                    │
            │   ┌─────────────────────────────────────────┐   │
            │   │            Ollama / OomLlama            │   │
            │   │  - Qwen 7B/32B/72B                      │   │
            │   │  - LLaMA 70B                            │   │
            │   │  - Any GGUF model                       │   │
            │   └─────────────────────────────────────────┘   │
            │                        │                        │
            │                        │ HTTP Stream            │
            │                        │ (token chunks)         │
            └────────────────────────┼────────────────────────┘
                                     │
                    ┌────────────────┴────────────────┐
                    │             NETWORK             │
                    │    (LAN / Internet / I-Poll)    │
                    └────────────────┬────────────────┘
                                     │
┌────────────────────────────────────┴────────────────────────────────────┐
│                              EDGE DEVICES                               │
│                                                                         │
│ ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  │
│ │    Laptop    │  │  Raspberry   │  │    Phone     │  │    Drone     │  │
│ │              │  │      Pi      │  │              │  │              │  │
│ │ vocab.db 6MB │  │ vocab.db 6MB │  │ vocab.db 6MB │  │ vocab.db 6MB │  │
│ │   NO GPU!    │  │   NO GPU!    │  │   NO GPU!    │  │   NO GPU!    │  │
│ └──────────────┘  └──────────────┘  └──────────────┘  └──────────────┘  │
└─────────────────────────────────────────────────────────────────────────┘
```
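The "HTTP Stream (token chunks)" hop in the diagram uses Ollama's newline-delimited JSON streaming. A bare-bones consumer, independent of TLexClient, looks roughly like this (T-LEX layers its vocab.db decode step on top of the same transport):

```python
import json
import requests

# Ollama streams one JSON object per line until it sends "done": true.
with requests.post(
    "http://gpu-server:11434/api/generate",
    json={"model": "qwen2.5:7b", "prompt": "What is AI?", "stream": True},
    stream=True,
    timeout=60,
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        print(chunk.get("response", ""), end="", flush=True)
        if chunk.get("done"):
            break
```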
Use Cases
- IoT Devices: Smart home with AI, no cloud dependency
- Drones: On-board AI decisions, low latency
- Mobile Apps: Full LLM power without draining battery
- Air-gapped Networks: Self-hosted inference + edge decode
- Cost Reduction: One GPU server, unlimited edge clients
Part of HumoticaOS
T-LEX is part of the HumoticaOS ecosystem:
- TIBET: Trust & provenance for AI actions
- I-Poll: AI-to-AI messaging
- OomLlama: Native Rust inference engine
One love, one fAmIly! 🦙❤️
License
MIT License - see LICENSE for details.
Contributing
Contributions welcome! Please read CONTRIBUTING.md first.