Ultra-fast local LLM inference with zero-config hardware-optimized speculative decoding.

hexonit-llm 🚀

Ultra-fast local LLM inference — zero config, one import, maximum tokens/sec.

🔍 Can I Run This Model?

Before downloading, check whether your hardware can run a model:

from hexonit_llm import UltraInference

# Static check — no model loading required
advice = UltraInference.check("meta-llama/Meta-Llama-3-70B-Instruct")
print(advice)
# ✅ Can run | Recommended: Q4_K_M | Est. VRAM: 38.5GB / 80.0GB available (52% headroom)
#    70B parameter model at Q4_K_M uses ~38.5GB including KV cache overhead.

# Or if you don't have enough VRAM:
# ❌ Cannot run | Need 38.5GB, have 8.0GB (deficit: 30.5GB)
#    💡 Try instead: meta-llama/Meta-Llama-3-8B-Instruct (8B) fits at Q4_K_M
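
The headroom and deficit figures in the sample output follow from simple arithmetic on the needed and available VRAM. The sketch below reproduces those two numbers; `vram_verdict` is a hypothetical helper written for illustration, not part of the hexonit-llm API, and it takes the 38.5GB estimate as given rather than deriving it.

```python
def vram_verdict(needed_gb: float, available_gb: float) -> str:
    """Summarise whether a model fits, using the arithmetic of the sample output."""
    if needed_gb <= available_gb:
        # Headroom is the unused share of the available VRAM.
        headroom_pct = (available_gb - needed_gb) / available_gb * 100
        return (
            f"Can run | Est. VRAM: {needed_gb:.1f}GB / {available_gb:.1f}GB "
            f"available ({headroom_pct:.0f}% headroom)"
        )
    # Deficit is how much more VRAM the model would need.
    deficit_gb = needed_gb - available_gb
    return (
        f"Cannot run | Need {needed_gb:.1f}GB, have {available_gb:.1f}GB "
        f"(deficit: {deficit_gb:.1f}GB)"
    )


print(vram_verdict(38.5, 80.0))
print(vram_verdict(38.5, 8.0))
```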

Philosophy

"One import. That's all."

hexonit-llm is an intelligent orchestrator that:

Inspects your hardware — OS, VRAM, system RAM, CPU
Selects the fastest engine — vLLM (Linux, ≥16GB VRAM) or llama.cpp (Windows/macOS/Linux)
Enables speculative decoding — automatically downloads the matching draft model
Delivers maximum tokens/sec — hardcoded, battle-tested optimisation presets

All with zero configuration.
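
As a rough idea of what the hardware inspection step might involve, here is a minimal sketch using only the standard library plus the `nvidia-smi` command-line tool when it is installed. `detect_hardware` is a hypothetical name, the sketch covers NVIDIA GPUs only, and it leaves out system RAM; the real detector in hexonit-llm may work differently.

```python
import os
import platform
import shutil
import subprocess


def detect_hardware() -> dict:
    """Collect OS, CPU count and, if nvidia-smi is present, VRAM of the first GPU."""
    info = {
        "os": platform.system(),   # "Linux", "Windows" or "Darwin"
        "cpu_count": os.cpu_count(),
        "vram_gb": None,           # stays None when no NVIDIA GPU is visible
    }
    if shutil.which("nvidia-smi"):
        try:
            out = subprocess.run(
                ["nvidia-smi", "--query-gpu=memory.total",
                 "--format=csv,noheader,nounits"],
                capture_output=True, text=True, check=True, timeout=10,
            ).stdout
            # nvidia-smi reports MiB, one line per GPU; take the first GPU.
            info["vram_gb"] = round(int(out.splitlines()[0].strip()) / 1024, 1)
        except (subprocess.SubprocessError, OSError, ValueError, IndexError):
            pass  # leave vram_gb as None when the query fails
    return info


print(detect_hardware())
```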

Quick Start

Installation

pip install hexonit-llm               # core dependencies only
pip install "hexonit-llm[vllm]"       # + vLLM (Linux only)
pip install "hexonit-llm[llamacpp]"   # + llama.cpp (Windows/macOS/Linux)
pip install "hexonit-llm[cloud]"      # + httpx for cloud draft
# Quoting the extras keeps shells such as zsh from expanding the brackets.
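
After installing, it can be useful to confirm which optional backend is importable. The sketch below assumes the extras pull in packages that import as `vllm` and `llama_cpp` (the usual module names for vLLM and llama-cpp-python); that mapping is an assumption, not something this README states.

```python
from importlib.util import find_spec


def available_backends() -> dict:
    """Report which optional backend modules can be imported in this environment."""
    # Assumed module names: "vllm" for vLLM, "llama_cpp" for llama-cpp-python.
    return {
        "vllm": find_spec("vllm") is not None,
        "llamacpp": find_spec("llama_cpp") is not None,
    }


print(available_backends())
```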

Usage

from hexonit_llm import UltraInference

# That's it. One line.
pipe = UltraInference("meta-llama/Meta-Llama-3-70B-Instruct")

# Generate text
response = pipe.generate("What is the meaning of life?")
print(response)

# Batch generation
responses = pipe.generate_batch([
    "Tell me a joke",
    "What is 2+2?",
])

# Chat interface
reply = pipe.chat([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
])

Check what's running

pipe = UltraInference("meta-llama/Meta-Llama-3-70B-Instruct")
print(pipe.engine_name)     # "vllm" or "llamacpp"
print(pipe.draft_model)     # "meta-llama/Llama-3.2-3B-Instruct"
print(pipe.hardware_info)

⚡ Benchmarks

Run your own benchmark:

pipe = UltraInference("meta-llama/Meta-Llama-3-8B-Instruct")
stats = pipe.benchmark(runs=10)
# 🔥 Benchmarking llamacpp with 10 runs...
#   Run 1/10: 47.3 tok/s
#   ...
# 📊 Results: 45.8 tok/s average (llamacpp)
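
For readers curious how a tokens-per-second average like the one above can be computed, here is a minimal sketch. It is not the implementation behind `pipe.benchmark`: `measure_tok_per_sec` and `fake_generate` are hypothetical names, and whitespace-separated words stand in for real tokenizer tokens.

```python
import time
from statistics import mean
from typing import Callable


def measure_tok_per_sec(generate: Callable[[str], str], prompt: str, runs: int = 10) -> dict:
    """Time `generate` over several runs and report tokens per second."""
    per_run = []
    for _ in range(runs):
        start = time.perf_counter()
        text = generate(prompt)
        elapsed = time.perf_counter() - start
        # Assumption: whitespace-separated words approximate token count.
        tokens = len(text.split())
        per_run.append(tokens / elapsed if elapsed > 0 else 0.0)
    return {"runs": per_run, "average": mean(per_run)}


def fake_generate(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs without a model."""
    time.sleep(0.01)
    return "word " * 50


stats = measure_tok_per_sec(fake_generate, "Tell me a joke", runs=3)
print(f"{stats['average']:.1f} tok/s average")
```

With a real pipeline, `pipe.generate` could be passed in place of `fake_generate`, assuming it returns the generated text as a string.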

Community benchmark results welcome! Open a PR to add yours to docs/benchmarks.md.

Supported Model Families

Family	Target Model	Auto-selected Draft
Meta LLaMA 3	`Meta-Llama-3-70B-Instruct`	`Llama-3.2-3B-Instruct`
Meta LLaMA 3	`Meta-Llama-3-8B-Instruct`	`Llama-3.2-1B-Instruct`
Qwen 2.5	`Qwen2.5-72B-Instruct`	`Qwen2.5-1.5B-Instruct`
Mixtral	`Mixtral-8x22B-Instruct`	`Ministral-8B-Instruct`
Gemma 2	`gemma-2-27b-it`	`gemma-2-2b-it`
DeepSeek	`DeepSeek-V2.5`	`deepseek-llm-7b-chat`
Phi-3	`Phi-3-medium-4k-instruct`	`Phi-3-mini-4k-instruct`
… and many more	See model_mappings.py
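
The table above amounts to a lookup from target model to draft model. A minimal sketch of such a lookup follows; `DRAFT_FOR` and `pick_draft` are hypothetical names, the entries are copied from the table, and the actual structure of model_mappings.py may differ.

```python
from typing import Optional

# Target -> draft pairs copied from the table (repository names without org prefix).
DRAFT_FOR = {
    "Meta-Llama-3-70B-Instruct": "Llama-3.2-3B-Instruct",
    "Meta-Llama-3-8B-Instruct": "Llama-3.2-1B-Instruct",
    "Qwen2.5-72B-Instruct": "Qwen2.5-1.5B-Instruct",
    "Mixtral-8x22B-Instruct": "Ministral-8B-Instruct",
    "gemma-2-27b-it": "gemma-2-2b-it",
    "DeepSeek-V2.5": "deepseek-llm-7b-chat",
    "Phi-3-medium-4k-instruct": "Phi-3-mini-4k-instruct",
}


def pick_draft(model_id: str) -> Optional[str]:
    """Return the draft model for `model_id`, or None when no mapping exists."""
    name = model_id.split("/")[-1]  # drop an optional "org/" prefix
    return DRAFT_FOR.get(name)


print(pick_draft("meta-llama/Meta-Llama-3-70B-Instruct"))
print(pick_draft("some-org/unknown-model"))
```

Returning `None` for an unmapped model lets the caller fall back to plain decoding without a draft model.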

Architecture

hexonit_llm/
├── __init__.py              # UltraInference – the public API
├── orchestrator.py          # The brain: hardware routing + engine factory
├── engines/
│   ├── base.py              # Abstract base engine
│   ├── vllm_engine.py       # vLLM backend (PagedAttention, FlashAttention-2)
│   └── llamacpp_engine.py   # llama.cpp backend (GGUF offloading)
├── config/
│   └── model_mappings.py    # 30+ target→draft model mappings
└── utils/
    ├── hardware_detector.py # OS, VRAM, RAM detection
    ├── model_mapper.py      # HF Hub download & caching
    └── quantization_advisor.py  # Pre-download VRAM analysis
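
To illustrate the role of an abstract base engine, here is a minimal sketch of what such an interface could look like. The class and its defaults are assumptions made for illustration; only the method names `generate`, `generate_batch` and `chat` come from the Usage section, and the contents of engines/base.py are not shown in this README.

```python
from abc import ABC, abstractmethod


class BaseEngine(ABC):
    """Hypothetical engine interface; method names mirror the Usage section."""

    @abstractmethod
    def generate(self, prompt: str) -> str:
        """Return the completion for a single prompt."""

    def generate_batch(self, prompts: list[str]) -> list[str]:
        # Default: loop over generate(); a real backend would batch natively.
        return [self.generate(p) for p in prompts]

    def chat(self, messages: list[dict]) -> str:
        # Naive flattening of role/content pairs; real engines apply a chat template.
        prompt = "\n".join(f"{m['role']}: {m['content']}" for m in messages)
        return self.generate(prompt)


class EchoEngine(BaseEngine):
    """Toy backend used only to show that the interface is usable."""

    def generate(self, prompt: str) -> str:
        return prompt.upper()


engine = EchoEngine()
print(engine.generate_batch(["hi", "there"]))
```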

Routing Logic

UltraInference(model)
    │
    ├── OS = Linux & VRAM ≥ 16GB  ──>  vLLM  (FlashAttention-2, PagedAttention)
    │
    └── OS = Windows / macOS
        or VRAM < 16GB           ──>  llama.cpp  (GGUF, GPU offloading)

Speculative decoding is enabled whenever a matching draft model exists.
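
The routing rule in the diagram can be sketched as a single function. `select_engine` is a hypothetical name; the OS strings are those returned by Python's `platform.system()`, and the real orchestrator may consider more inputs than these two.

```python
import platform


def select_engine(os_name: str, vram_gb: float) -> str:
    """Apply the documented rule: vLLM only on Linux with at least 16 GB of VRAM."""
    if os_name == "Linux" and vram_gb >= 16:
        return "vllm"
    return "llamacpp"


print(select_engine("Linux", 24.0))    # vllm
print(select_engine("Linux", 8.0))     # llamacpp
print(select_engine("Windows", 80.0))  # llamacpp
print(select_engine(platform.system(), 0.0))
```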

🆚 Compared to Alternatives

Feature	hexonit-llm	Ollama	vLLM direct	llama.cpp direct
Zero config	✅	✅	❌	❌
Auto engine selection	✅	❌	❌	❌
Speculative decoding auto	✅	❌	Manual	Manual
Pre-download VRAM check	✅	❌	❌	❌
Python-native API	✅	Via REST	✅	Via binding
Windows support	✅	✅	❌	✅
Benchmark built-in	✅	❌	❌	❌

Performance

The engines ship with hardcoded, max-throughput presets:

Setting	vLLM	llama.cpp
GPU memory / layer offload	95% utilisation	All layers offloaded (-1)
Batch Size	256 sequences	2048 tokens
Flash Attention	✅ v2	✅
Prefix Caching	✅	N/A
CUDA Graphs	✅	N/A
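
Expressed as keyword arguments, the presets in the table might look like the sketch below. The key names are assumptions: they follow the constructor arguments of vLLM's `LLM` class and llama-cpp-python's `Llama` class, and this README does not say which names hexonit-llm uses internally. Flash attention for vLLM is left out because it is not set through a constructor keyword here.

```python
# Assumed keyword names, modelled on vLLM's LLM(...) and llama-cpp-python's Llama(...).
VLLM_PRESET = {
    "gpu_memory_utilization": 0.95,  # 95% of GPU memory
    "max_num_seqs": 256,             # batch size: 256 sequences
    "enable_prefix_caching": True,   # prefix caching on
    "enforce_eager": False,          # False keeps CUDA graphs enabled
}

LLAMACPP_PRESET = {
    "n_gpu_layers": -1,  # offload all layers to the GPU
    "n_batch": 2048,     # batch size: 2048 tokens
    "flash_attn": True,  # flash attention on
}


def preset_for(engine_name: str) -> dict:
    """Return a copy of the preset for "vllm" or "llamacpp"."""
    presets = {"vllm": VLLM_PRESET, "llamacpp": LLAMACPP_PRESET}
    return dict(presets[engine_name])


print(preset_for("vllm"))
```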

License

Contributing

PRs welcome! Please ensure your code passes our checks:

pip install -e ".[dev]"
ruff check .
mypy hexonit_llm
pytest tests/

Release history

0.1.0 (this version): May 26, 2026
0.0.2: May 26, 2026
0.0.1: May 26, 2026

Download files

Source Distribution

hexonit_llm-0.1.0.tar.gz (23.6 kB)

Uploaded May 26, 2026 Source

Built Distribution

hexonit_llm-0.1.0-py3-none-any.whl (27.6 kB)

Uploaded May 26, 2026 Python 3

File details

Details for the file hexonit_llm-0.1.0.tar.gz.

File metadata

Download URL: hexonit_llm-0.1.0.tar.gz
Upload date: May 26, 2026
Size: 23.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.0

File hashes

Hashes for hexonit_llm-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`24c1f189df07fabd235e4ccbf616f79e577db0a65bab4531b80a8860c092e0f0`
MD5	`47ed041130b83c11d68c1baa326e2e97`
BLAKE2b-256	`869e4503eb57194c55976cb4020e164f148bd9a3191959a99a1cadcda4f02097`

File details

Details for the file hexonit_llm-0.1.0-py3-none-any.whl.

File metadata

Download URL: hexonit_llm-0.1.0-py3-none-any.whl
Upload date: May 26, 2026
Size: 27.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.0

File hashes

Hashes for hexonit_llm-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`aba249aa09b781c19c7fc226374c61e85c4e5d724b891d34d498fe6148a8e0d5`
MD5	`d11bbfa6d66f76482cfb34692720ab84`
BLAKE2b-256	`d20b5177e7add767343b5ee91c4ac841fe924118323367e9131119d935e873f0`
