Ultra-fast local LLM inference with zero-config hardware-optimized speculative decoding.
hexonit-llm 🚀
Philosophy
"One import. That's all."
hexonit-llm is an intelligent orchestrator that:
- Inspects your hardware — OS, VRAM, system RAM, CPU
- Selects the fastest engine — vLLM (Linux, ≥16GB VRAM) or llama.cpp (Windows/macOS/Linux)
- Enables speculative decoding — automatically downloads the matching draft model
- Delivers maximum tokens/sec — hardcoded, battle-tested optimisation presets
All with zero configuration.
Quick Start
Installation
pip install hexonit-llm # core dependencies only
pip install "hexonit-llm[vllm]" # + vLLM (Linux only)
pip install "hexonit-llm[llamacpp]" # + llama.cpp (Windows/macOS/Linux)
Usage
from hexonit_llm import UltraInference
# That's it. One line.
pipe = UltraInference("meta-llama/Meta-Llama-3-70B-Instruct")
# Generate text
response = pipe.generate("What is the meaning of life?")
print(response)
# Batch generation
responses = pipe.generate_batch([
"Tell me a joke",
"What is 2+2?",
])
# Chat interface
reply = pipe.chat([
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the capital of France?"},
])
Check what's running
pipe = UltraInference("meta-llama/Meta-Llama-3-70B-Instruct")
print(pipe.engine_name) # "vllm" or "llamacpp"
print(pipe.draft_model) # "meta-llama/Llama-3.2-3B-Instruct"
print(pipe.hardware_info)
Supported Model Families
| Family | Target Model | Auto-selected Draft |
|---|---|---|
| Meta LLaMA 3 | Meta-Llama-3-70B-Instruct | Llama-3.2-3B-Instruct |
| Meta LLaMA 3 | Meta-Llama-3-8B-Instruct | Llama-3.2-1B-Instruct |
| Qwen 2.5 | Qwen2.5-72B-Instruct | Qwen2.5-1.5B-Instruct |
| Mixtral | Mixtral-8x22B-Instruct | Ministral-8B-Instruct |
| Gemma 2 | gemma-2-27b-it | gemma-2-2b-it |
| DeepSeek | DeepSeek-V2.5 | deepseek-llm-7b-chat |
| Phi-3 | Phi-3-medium-4k-instruct | Phi-3-mini-4k-instruct |
| … and many more | See model_mappings.py | |
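The table above amounts to a lookup from target model to draft model. A minimal sketch of such a lookup follows. The `DRAFT_MODELS` dict and the `find_draft` helper are hypothetical names used for illustration, not the actual contents of `model_mappings.py`, and only the pairs listed in the table are included.

```python
from typing import Optional

# Hypothetical mapping built from the table above (short model names only).
DRAFT_MODELS = {
    "Meta-Llama-3-70B-Instruct": "Llama-3.2-3B-Instruct",
    "Meta-Llama-3-8B-Instruct": "Llama-3.2-1B-Instruct",
    "Qwen2.5-72B-Instruct": "Qwen2.5-1.5B-Instruct",
    "Mixtral-8x22B-Instruct": "Ministral-8B-Instruct",
    "gemma-2-27b-it": "gemma-2-2b-it",
    "DeepSeek-V2.5": "deepseek-llm-7b-chat",
    "Phi-3-medium-4k-instruct": "Phi-3-mini-4k-instruct",
}


def find_draft(model_id: str) -> Optional[str]:
    """Return the draft model for a target, or None if no pair is known."""
    # Drop an optional "org/" prefix so Hub-style IDs also match.
    short_name = model_id.split("/")[-1]
    return DRAFT_MODELS.get(short_name)


print(find_draft("meta-llama/Meta-Llama-3-70B-Instruct"))
print(find_draft("some-org/unknown-model"))
```

Returning `None` for an unknown target lets the caller fall back to plain decoding instead of failing.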
Architecture
hexonit_llm/
├── __init__.py # UltraInference – the public API
├── orchestrator.py # The brain: hardware routing + engine factory
├── engines/
│ ├── vllm_engine.py # vLLM backend (PagedAttention, FlashAttention-2)
│ └── llamacpp_engine.py # llama.cpp backend (GGUF offloading)
├── config/
│ └── model_mappings.py # 30+ target→draft model mappings
└── utils/
├── hardware_detector.py # OS, VRAM, RAM detection
└── model_mapper.py # HF Hub download & caching
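As an illustration of what a module like `hardware_detector.py` might collect, here is a minimal standard-library sketch. The function names `detect_vram_gb` and `detect_hardware` are hypothetical, not the package's API. It assumes an NVIDIA GPU queried through the `nvidia-smi` CLI and reports 0.0 when that tool is absent. System RAM is left out because the standard library has no portable call for it.

```python
import os
import platform
import shutil
import subprocess


def detect_vram_gb() -> float:
    """Return total memory of the first NVIDIA GPU in GiB, or 0.0 if unknown."""
    if shutil.which("nvidia-smi") is None:
        return 0.0
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.total",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True, timeout=10,
        ).stdout
        # nvidia-smi prints MiB, one line per GPU; use the first GPU.
        return float(out.strip().splitlines()[0]) / 1024
    except (subprocess.SubprocessError, OSError, ValueError, IndexError):
        return 0.0


def detect_hardware() -> dict:
    """Collect the facts an engine router would need."""
    return {
        "os": platform.system(),      # "Linux", "Windows" or "Darwin" (macOS)
        "cpu_count": os.cpu_count(),  # logical CPUs, may be None
        "vram_gb": detect_vram_gb(),
    }


print(detect_hardware())
```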
Routing Logic
UltraInference(model)
│
├── OS = Linux & VRAM ≥ 16GB ──> vLLM (FlashAttention-2, PagedAttention)
│
└── OS = Windows / macOS
or VRAM < 16GB ──> llama.cpp (GGUF, GPU offloading)
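The diagram above reduces to a single condition. A minimal sketch, assuming the OS name comes from `platform.system()` and VRAM is given in GB; `select_engine` is a hypothetical helper, not the code in `orchestrator.py`:

```python
def select_engine(os_name: str, vram_gb: float) -> str:
    """Pick a backend name following the routing diagram above."""
    # vLLM only on Linux with at least 16 GB of VRAM.
    if os_name == "Linux" and vram_gb >= 16:
        return "vllm"
    # Everything else (Windows, macOS, or low VRAM) goes to llama.cpp.
    return "llamacpp"


print(select_engine("Linux", 24))
print(select_engine("Linux", 8))
print(select_engine("Darwin", 64))
```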
Speculative decoding is always enabled when a matching draft model exists.
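For readers new to the technique, the toy sketch below shows the greedy form of speculative decoding: a cheap draft model proposes a few tokens, the target model checks them, and the longest agreeing prefix is accepted. It is not how vLLM or llama.cpp implement it (real engines verify all proposals in one batched forward pass and use a sampling-based acceptance rule). The "models" here are plain functions and every name is hypothetical.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # maps a token sequence to the next token


def greedy_decode(model: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: call the model once per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], n_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding; output matches greedy_decode(target, ...)."""
    tokens = list(prompt)
    goal = len(prompt) + n_new
    while len(tokens) < goal:
        # 1. The draft model proposes up to k tokens.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(tokens))):
            proposal.append(draft(tokens + proposal))
        # 2. The target model checks each proposed position in order.
        for i, guess in enumerate(proposal):
            expected = target(tokens + proposal[:i])
            if guess != expected:
                # First mismatch: keep the agreed prefix plus the target's token.
                tokens.extend(proposal[:i])
                tokens.append(expected)
                break
        else:
            tokens.extend(proposal)  # every proposed token was accepted
    return tokens


def target(tokens: List[int]) -> int:
    return (tokens[-1] * 3 + 1) % 7


def draft(tokens: List[int]) -> int:
    # Agrees with the target except after token 5, to force some rejections.
    return 0 if tokens[-1] == 5 else target(tokens)


print(speculative_decode(target, draft, [1], 10) == greedy_decode(target, [1], 10))
```

The speed-up in practice comes from the target model scoring several draft tokens at once, while the output stays identical to what the target alone would produce.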
Performance
The engines ship with hardcoded, max-throughput presets:
| Setting | vLLM | llama.cpp |
|---|---|---|
| GPU Memory Utilisation / Layer Offload | 95% | All layers offloaded (-1) |
| Batch Size | 256 sequences | 2048 tokens |
| Flash Attention | ✅ v2 | ✅ |
| Prefix Caching | ✅ | N/A |
| CUDA Graphs | ✅ | N/A |
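One way to picture these presets is as keyword dictionaries handed to each backend. The sketch below is an assumption about how the table could map onto engine options, not the package's actual code. The key names follow the constructor arguments of vLLM's `LLM` and llama-cpp-python's `Llama`; `preset_for` is a hypothetical helper. Flash Attention for vLLM is omitted because it is not selected through a constructor argument of this kind.

```python
# Assumed mapping of the table rows onto backend keyword arguments.
VLLM_PRESET = {
    "gpu_memory_utilization": 0.95,  # 95% of VRAM
    "max_num_seqs": 256,             # batch size in sequences
    "enable_prefix_caching": True,
    "enforce_eager": False,          # False keeps CUDA graphs enabled
}

LLAMACPP_PRESET = {
    "n_gpu_layers": -1,  # -1 offloads all layers to the GPU
    "n_batch": 2048,     # batch size in tokens
    "flash_attn": True,
}


def preset_for(engine_name: str) -> dict:
    """Return a copy of the preset for "vllm" or "llamacpp"."""
    presets = {"vllm": VLLM_PRESET, "llamacpp": LLAMACPP_PRESET}
    if engine_name not in presets:
        raise ValueError(f"unknown engine: {engine_name!r}")
    # Copy so callers cannot mutate the shared defaults.
    return dict(presets[engine_name])


print(preset_for("vllm"))
print(preset_for("llamacpp"))
```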
License
MIT © 2025 Hexonithy Studios
Contributing
PRs welcome! Please ensure your code passes our checks:
pip install -e ".[dev]"
ruff check .
mypy hexonit_llm
pytest tests/