Ultra-fast local LLM inference with zero-config hardware-optimized speculative decoding.
hexonit-llm
Can I Run This Model?
Check before downloading whether your hardware supports a model:
from hexonit_llm import UltraInference
# Static check: no model loading required
advice = UltraInference.check("meta-llama/Meta-Llama-3-70B-Instruct")
print(advice)
# ✅ Can run | Recommended: Q4_K_M | Est. VRAM: 38.5GB / 80.0GB available (52% headroom)
# 70B parameter model at Q4_K_M uses ~38.5GB including KV cache overhead.
# Or if you don't have enough VRAM:
# ❌ Cannot run | Need 38.5GB, have 8.0GB (deficit: 30.5GB)
# 💡 Try instead: meta-llama/Meta-Llama-3-8B-Instruct (8B) fits at Q4_K_M
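The headroom and deficit figures in the sample output follow from simple arithmetic, which the sketch below reproduces. The function names `headroom_pct`, `deficit_gb`, and `estimate_vram_gb` are hypothetical and not part of the hexonit-llm API, and the size estimate uses an assumed bits-per-weight figure and a flat overhead allowance rather than the library's actual formula.

```python
def headroom_pct(needed_gb: float, available_gb: float) -> float:
    """Share of available VRAM left free after loading, as a percentage."""
    return (available_gb - needed_gb) / available_gb * 100.0


def deficit_gb(needed_gb: float, available_gb: float) -> float:
    """How much VRAM is missing; 0.0 when the model fits."""
    return max(0.0, needed_gb - available_gb)


def estimate_vram_gb(params_billion: float, bits_per_weight: float,
                     overhead_gb: float = 0.0) -> float:
    """Rough estimate (assumed, not the library's formula): weights plus a flat allowance."""
    # 1e9 parameters * bits per weight / 8 bits per byte gives gigabytes of weights.
    weights_gb = params_billion * bits_per_weight / 8.0
    return weights_gb + overhead_gb


print(round(headroom_pct(38.5, 80.0)))  # 52
print(deficit_gb(38.5, 8.0))            # 30.5
```

The real advisor presumably also accounts for context length and KV cache size, so treat `estimate_vram_gb` as a lower-bound illustration only.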
Philosophy
"One import. That's all."
hexonit-llm is an intelligent orchestrator that:
- Inspects your hardware: OS, VRAM, system RAM, CPU
- Selects the fastest engine: vLLM (Linux, ≥16GB VRAM) or llama.cpp (Windows/macOS/Linux)
- Enables speculative decoding: automatically downloads the matching draft model
- Delivers maximum tokens/sec: hardcoded, battle-tested optimisation presets
All with zero configuration.
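The hardware inspection step can be approximated with the standard library. This is a minimal sketch, not the code in `hardware_detector.py`: the function names are hypothetical, VRAM is read by shelling out to `nvidia-smi` (so only NVIDIA GPUs are covered), and system RAM uses POSIX `sysconf`, which is unavailable on Windows.

```python
import os
import platform
import shutil
import subprocess
from typing import Optional


def detect_vram_gb() -> Optional[float]:
    """Total memory of the first NVIDIA GPU in GiB, or None if it cannot be read."""
    if shutil.which("nvidia-smi") is None:
        return None
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True, timeout=10,
        ).stdout
        return float(out.splitlines()[0]) / 1024.0  # nvidia-smi reports MiB
    except (subprocess.SubprocessError, OSError, ValueError, IndexError):
        return None


def detect_ram_gb() -> Optional[float]:
    """Total system RAM in GiB via POSIX sysconf, or None where unsupported."""
    try:
        return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1024**3
    except (AttributeError, ValueError, OSError):
        return None


def detect_hardware() -> dict:
    """Collect the four facts the orchestrator is described as inspecting."""
    return {
        "os": platform.system(),      # e.g. "Linux", "Windows", "Darwin"
        "cpu_count": os.cpu_count(),
        "ram_gb": detect_ram_gb(),
        "vram_gb": detect_vram_gb(),
    }


print(detect_hardware())
```

Returning `None` instead of raising keeps the caller free to fall back to a CPU-friendly engine when a value is unknown.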
Quick Start
Installation
pip install hexonit-llm # core dependencies only
pip install "hexonit-llm[vllm]" # + vLLM (Linux only)
pip install "hexonit-llm[llamacpp]" # + llama.cpp (Windows/macOS/Linux)
pip install "hexonit-llm[cloud]" # + httpx for cloud draft
Usage
from hexonit_llm import UltraInference
# That's it. One line.
pipe = UltraInference("meta-llama/Meta-Llama-3-70B-Instruct")
# Generate text
response = pipe.generate("What is the meaning of life?")
print(response)
# Batch generation
responses = pipe.generate_batch([
"Tell me a joke",
"What is 2+2?",
])
# Chat interface
reply = pipe.chat([
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the capital of France?"},
])
Check what's running
pipe = UltraInference("meta-llama/Meta-Llama-3-70B-Instruct")
print(pipe.engine_name) # "vllm" or "llamacpp"
print(pipe.draft_model) # "meta-llama/Llama-3.2-3B-Instruct"
print(pipe.hardware_info)
⚡ Benchmarks
Run your own benchmark:
pipe = UltraInference("meta-llama/Meta-Llama-3-8B-Instruct")
stats = pipe.benchmark(runs=10)
# 🔥 Benchmarking llamacpp with 10 runs...
# Run 1/10: 47.3 tok/s
# ...
# Results: 45.8 tok/s average (llamacpp)
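To show what a tokens-per-second figure like the one above measures, here is a minimal timing sketch. It is not the implementation of `pipe.benchmark`: the `benchmark` helper and the `fake_generate` stand-in are hypothetical, and the "engine" simply sleeps and returns dummy tokens so the example runs without a model.

```python
import time
from statistics import mean
from typing import Callable, List


def benchmark(generate: Callable[[str], List[str]], prompt: str, runs: int = 10) -> float:
    """Return the mean of per-run tokens/sec over `runs` calls to `generate`."""
    rates = []
    for i in range(runs):
        start = time.perf_counter()
        tokens = generate(prompt)
        elapsed = time.perf_counter() - start
        rates.append(len(tokens) / elapsed)
        print(f"Run {i + 1}/{runs}: {rates[-1]:.1f} tok/s")
    return mean(rates)


def fake_generate(prompt: str) -> List[str]:
    """Stand-in for a real engine: sleeps briefly, then returns 50 dummy tokens."""
    time.sleep(0.01)
    return ["tok"] * 50


avg = benchmark(fake_generate, "Hello", runs=3)
print(f"Results: {avg:.1f} tok/s average")
```

Averaging per-run rates weights every run equally; dividing total tokens by total time is an equally reasonable choice and gives less weight to unusually fast short runs.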
Community benchmark results welcome! Open a PR to add yours to docs/benchmarks.md.
Supported Model Families
| Family | Target Model | Auto-selected Draft |
|---|---|---|
| Meta LLaMA 3 | Meta-Llama-3-70B-Instruct | Llama-3.2-3B-Instruct |
| Meta LLaMA 3 | Meta-Llama-3-8B-Instruct | Llama-3.2-1B-Instruct |
| Qwen 2.5 | Qwen2.5-72B-Instruct | Qwen2.5-1.5B-Instruct |
| Mixtral | Mixtral-8x22B-Instruct | Ministral-8B-Instruct |
| Gemma 2 | gemma-2-27b-it | gemma-2-2b-it |
| DeepSeek | DeepSeek-V2.5 | deepseek-llm-7b-chat |
| Phi-3 | Phi-3-medium-4k-instruct | Phi-3-mini-4k-instruct |
| … and many more | See model_mappings.py | |
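A target-to-draft lookup of this kind can be sketched as a plain dictionary. The pairs below are copied from the table; the names `DRAFT_FOR` and `pick_draft` are hypothetical, and the real structure of `model_mappings.py` may differ (for example, it may store full Hugging Face ids with organisation prefixes).

```python
from typing import Optional

# Target -> draft pairs copied from the table (organisation prefixes omitted).
DRAFT_FOR = {
    "Meta-Llama-3-70B-Instruct": "Llama-3.2-3B-Instruct",
    "Meta-Llama-3-8B-Instruct": "Llama-3.2-1B-Instruct",
    "Qwen2.5-72B-Instruct": "Qwen2.5-1.5B-Instruct",
    "Mixtral-8x22B-Instruct": "Ministral-8B-Instruct",
    "gemma-2-27b-it": "gemma-2-2b-it",
    "DeepSeek-V2.5": "deepseek-llm-7b-chat",
    "Phi-3-medium-4k-instruct": "Phi-3-mini-4k-instruct",
}


def pick_draft(model_id: str) -> Optional[str]:
    """Return the draft model name for a model id, or None if no mapping exists."""
    short_name = model_id.split("/")[-1]  # drop an optional "org/" prefix
    return DRAFT_FOR.get(short_name)


print(pick_draft("meta-llama/Meta-Llama-3-70B-Instruct"))  # Llama-3.2-3B-Instruct
print(pick_draft("some-org/unknown-model"))                # None
```

Returning `None` for an unmapped model matches the stated behaviour that speculative decoding is only enabled when a matching draft exists.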
Architecture
hexonit_llm/
├── __init__.py # UltraInference: the public API
├── orchestrator.py # The brain: hardware routing + engine factory
├── engines/
│   ├── base.py # Abstract base engine
│   ├── vllm_engine.py # vLLM backend (PagedAttention, FlashAttention-2)
│   └── llamacpp_engine.py # llama.cpp backend (GGUF offloading)
├── config/
│   └── model_mappings.py # 30+ target→draft model mappings
└── utils/
    ├── hardware_detector.py # OS, VRAM, RAM detection
    ├── model_mapper.py # HF Hub download & caching
    └── quantization_advisor.py # Pre-download VRAM analysis
Routing Logic
UltraInference(model)
│
├── OS = Linux & VRAM ≥ 16GB ──> vLLM (FlashAttention-2, PagedAttention)
│
└── OS = Windows / macOS
    or VRAM < 16GB ──> llama.cpp (GGUF, GPU offloading)
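The routing rule in the diagram reduces to a single condition. The sketch below is an illustration only: `select_engine` is a hypothetical name, and the OS string is assumed to be what Python's `platform.system()` reports. The return values reuse the `"vllm"` and `"llamacpp"` names shown for `pipe.engine_name`.

```python
def select_engine(os_name: str, vram_gb: float) -> str:
    """Mirror the routing diagram: vLLM on Linux with at least 16 GB VRAM, else llama.cpp."""
    if os_name == "Linux" and vram_gb >= 16.0:
        return "vllm"
    return "llamacpp"


print(select_engine("Linux", 80.0))    # vllm
print(select_engine("Linux", 8.0))     # llamacpp
print(select_engine("Windows", 24.0))  # llamacpp
print(select_engine("Darwin", 0.0))    # llamacpp
```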
Speculative decoding is always enabled when a matching draft model exists.
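For readers new to the technique, the toy sketch below shows the greedy form of speculative decoding: a cheap draft model proposes several tokens, and the target model keeps the longest prefix it agrees with. Everything here is hypothetical and for illustration; the "models" are plain functions over integer tokens, and production engines such as vLLM and llama.cpp verify all proposals in one batched pass and typically use probabilistic acceptance rather than exact matching.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # maps a token prefix to the next token (greedy)


def speculative_step(target: NextToken, draft: NextToken,
                     prefix: List[int], k: int = 4) -> List[int]:
    """One round of greedy speculative decoding; returns the newly accepted tokens."""
    # 1. The draft model proposes k tokens autoregressively.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. The target checks each proposal in order. This toy version calls the
    #    target once per position instead of scoring all positions at once.
    accepted: List[int] = []
    for tok in proposed:
        expected = target(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # first mismatch: keep the target's token, stop
            return accepted
        accepted.append(tok)

    # 3. All k proposals matched, so the target contributes one extra token.
    accepted.append(target(prefix + accepted))
    return accepted


def generate(target: NextToken, draft: NextToken,
             prompt: List[int], max_new: int) -> List[int]:
    """Repeat speculative steps until max_new tokens have been produced."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        out.extend(speculative_step(target, draft, out))
    return out[: len(prompt) + max_new]


# Toy models: the target counts upward modulo 10; the draft agrees except after a 6.
def target_model(seq: List[int]) -> int:
    return (seq[-1] + 1) % 10


def draft_model(seq: List[int]) -> int:
    return 0 if seq[-1] == 6 else (seq[-1] + 1) % 10


print(generate(target_model, draft_model, [0], max_new=12))
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2]
```

The output is identical to what the target model would produce on its own; the draft only changes how many target calls are needed, which is why a well-matched draft model improves speed without changing results under greedy decoding.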
Compared to Alternatives
| Feature | hexonit-llm | Ollama | vLLM direct | llama.cpp direct |
|---|---|---|---|---|
| Zero config | โ | โ | โ | โ |
| Auto engine selection | โ | โ | โ | โ |
| Speculative decoding auto | โ | โ | Manual | โ |
| Pre-download VRAM check | โ | โ | โ | โ |
| Python-native API | โ | Via REST | โ | Via binding |
| Windows support | โ | โ | โ | โ |
| Benchmark built-in | โ | โ | โ | โ |
Performance
The engines ship with hardcoded, max-throughput presets:
| Setting | vLLM | llama.cpp |
|---|---|---|
| GPU Memory Utilisation | 95% | All layers (-1) |
| Batch Size | 256 sequences | 2048 tokens |
| Flash Attention | โ v2 | โ |
| Prefix Caching | โ | N/A |
| CUDA Graphs | โ | N/A |
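The numeric presets in the table correspond to keyword arguments that the underlying libraries accept: `gpu_memory_utilization` and `max_num_seqs` on vLLM's `LLM`, and `n_gpu_layers` and `n_batch` on llama-cpp-python's `Llama`. The sketch below stores them as plain dictionaries so it runs without either library installed. Whether hexonit-llm passes exactly these arguments is an assumption, and `preset_for` is a hypothetical helper.

```python
# Hypothetical mapping of the table's numeric presets onto engine keyword arguments.
VLLM_PRESET = {
    "gpu_memory_utilization": 0.95,  # "GPU Memory Utilisation: 95%"
    "max_num_seqs": 256,             # "Batch Size: 256 sequences"
}

LLAMACPP_PRESET = {
    "n_gpu_layers": -1,  # "All layers (-1)": offload every layer to the GPU
    "n_batch": 2048,     # "Batch Size: 2048 tokens"
}


def preset_for(engine_name: str) -> dict:
    """Return a copy of the preset for "vllm" or "llamacpp"."""
    presets = {"vllm": VLLM_PRESET, "llamacpp": LLAMACPP_PRESET}
    return dict(presets[engine_name])


print(preset_for("vllm"))
print(preset_for("llamacpp"))
```

With the libraries installed, such a dictionary could be unpacked into the constructor, for example `LLM(model=..., **preset_for("vllm"))`.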
License
MIT © 2026 Hexonithy Studios
Contributing
PRs welcome! Please ensure your code passes our checks:
pip install -e ".[dev]"
ruff check .
mypy hexonit_llm
pytest tests/
File details
Details for the file hexonit_llm-0.1.0.tar.gz.
File metadata
- Download URL: hexonit_llm-0.1.0.tar.gz
- Upload date:
- Size: 23.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 24c1f189df07fabd235e4ccbf616f79e577db0a65bab4531b80a8860c092e0f0 |
| MD5 | 47ed041130b83c11d68c1baa326e2e97 |
| BLAKE2b-256 | 869e4503eb57194c55976cb4020e164f148bd9a3191959a99a1cadcda4f02097 |
File details
Details for the file hexonit_llm-0.1.0-py3-none-any.whl.
File metadata
- Download URL: hexonit_llm-0.1.0-py3-none-any.whl
- Upload date:
- Size: 27.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | aba249aa09b781c19c7fc226374c61e85c4e5d724b891d34d498fe6148a8e0d5 |
| MD5 | d11bbfa6d66f76482cfb34692720ab84 |
| BLAKE2b-256 | d20b5177e7add767343b5ee91c4ac841fe924118323367e9131119d935e873f0 |