Ultra-fast local LLM inference with zero-config hardware-optimized speculative decoding.

hexonit-llm 🚀

Ultra-fast local LLM inference — zero config, one import, maximum tokens/sec.

Philosophy

"One import. That's all."

hexonit-llm is an intelligent orchestrator that:

Inspects your hardware: OS, VRAM, system RAM, CPU
Selects the fastest available engine: vLLM (Linux with ≥16GB VRAM) or llama.cpp (Windows, macOS, or Linux with less than 16GB VRAM)
Enables speculative decoding: automatically downloads the matching draft model when one is mapped for the target model
Applies hardcoded, throughput-oriented presets for the selected engine (listed under Performance)

All with zero configuration.
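
Speculative decoding, named in the list above, can be illustrated with a toy greedy version. This is only a sketch of the general technique, not hexonit-llm's code (vLLM and llama.cpp each ship their own implementations). The names `speculative_decode`, `target`, `draft` and `k` are hypothetical, and the "models" here are plain functions that map a token list to the next token.

```python
from typing import Callable, List

# Hypothetical stand-in for a model: takes the context, returns the greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_decode(
    target: NextToken,
    draft: NextToken,
    prompt: List[int],
    max_new: int,
    k: int = 4,
) -> List[int]:
    """Toy greedy speculative decoding: the draft proposes, the target verifies."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        # 1. The small draft model proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))

        # 2. The target checks each proposed token. A real engine scores all
        #    k positions in one batched forward pass; here we call it per token.
        accepted: List[int] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                # First mismatch: keep the target's token and discard the rest.
                accepted.append(expected)
                break
        else:
            # Every proposal matched, so the target adds one extra token.
            accepted.append(target(out + accepted))

        out.extend(accepted)
    return out[: len(prompt) + max_new]
```

Every token that is kept is one the target itself would have chosen, so in this greedy setting the result matches decoding with the target alone. The speed-up in real engines comes from verifying several draft tokens per target forward pass.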

Quick Start

Installation

pip install hexonit-llm               # core dependencies only
pip install "hexonit-llm[vllm]"       # + vLLM (Linux only)
pip install "hexonit-llm[llamacpp]"   # + llama.cpp (Windows/macOS/Linux)

Usage

from hexonit_llm import UltraInference

# That's it. One line.
pipe = UltraInference("meta-llama/Meta-Llama-3-70B-Instruct")

# Generate text
response = pipe.generate("What is the meaning of life?")
print(response)

# Batch generation
responses = pipe.generate_batch([
    "Tell me a joke",
    "What is 2+2?",
])

# Chat interface
reply = pipe.chat([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
])

Check what's running

pipe = UltraInference("meta-llama/Meta-Llama-3-70B-Instruct")
print(pipe.engine_name)     # "vllm" or "llamacpp"
print(pipe.draft_model)     # "meta-llama/Llama-3.2-3B-Instruct"
print(pipe.hardware_info)

Supported Model Families

Family	Target Model	Auto-selected Draft
Meta LLaMA 3	`Meta-Llama-3-70B-Instruct`	`Llama-3.2-3B-Instruct`
Meta LLaMA 3	`Meta-Llama-3-8B-Instruct`	`Llama-3.2-1B-Instruct`
Qwen 2.5	`Qwen2.5-72B-Instruct`	`Qwen2.5-1.5B-Instruct`
Mixtral	`Mixtral-8x22B-Instruct`	`Ministral-8B-Instruct`
Gemma 2	`gemma-2-27b-it`	`gemma-2-2b-it`
DeepSeek	`DeepSeek-V2.5`	`deepseek-llm-7b-chat`
Phi-3	`Phi-3-medium-4k-instruct`	`Phi-3-mini-4k-instruct`
… and many more	See `config/model_mappings.py` (30+ target→draft mappings)
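
A target→draft lookup over the rows above might look like the following. This is a minimal sketch that assumes the mapping is keyed by the short model name; `DRAFT_FOR_TARGET` and `find_draft` are hypothetical names, and the actual structure of `model_mappings.py` may differ.

```python
from typing import Dict, Optional

# Rows copied from the table above; the dict name and layout are assumptions.
DRAFT_FOR_TARGET: Dict[str, str] = {
    "Meta-Llama-3-70B-Instruct": "Llama-3.2-3B-Instruct",
    "Meta-Llama-3-8B-Instruct": "Llama-3.2-1B-Instruct",
    "Qwen2.5-72B-Instruct": "Qwen2.5-1.5B-Instruct",
    "Mixtral-8x22B-Instruct": "Ministral-8B-Instruct",
    "gemma-2-27b-it": "gemma-2-2b-it",
    "DeepSeek-V2.5": "deepseek-llm-7b-chat",
    "Phi-3-medium-4k-instruct": "Phi-3-mini-4k-instruct",
}


def find_draft(target: str) -> Optional[str]:
    """Return the draft model for a target, or None if no mapping exists."""
    # Accept both "org/name" repo IDs and bare model names.
    short_name = target.rsplit("/", 1)[-1]
    return DRAFT_FOR_TARGET.get(short_name)


print(find_draft("meta-llama/Meta-Llama-3-70B-Instruct"))
print(find_draft("some-org/unmapped-model"))
```

Returning `None` for an unmapped model lets the caller fall back to ordinary decoding instead of failing.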

Architecture

hexonit_llm/
├── __init__.py              # UltraInference – the public API
├── orchestrator.py          # The brain: hardware routing + engine factory
├── engines/
│   ├── vllm_engine.py       # vLLM backend (PagedAttention, FlashAttention-2)
│   └── llamacpp_engine.py   # llama.cpp backend (GGUF offloading)
├── config/
│   └── model_mappings.py    # 30+ target→draft model mappings
└── utils/
    ├── hardware_detector.py # OS, VRAM, RAM detection
    └── model_mapper.py      # HF Hub download & caching

Routing Logic

UltraInference(model)
    │
    ├── OS = Linux & VRAM ≥ 16GB  ──>  vLLM  (FlashAttention-2, PagedAttention)
    │
    └── OS = Windows / macOS
        or VRAM < 16GB           ──>  llama.cpp  (GGUF, GPU offloading)

Speculative decoding is always enabled when a matching draft model exists.
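
The routing rules above reduce to a small decision function. The following is a sketch under the stated thresholds only; `Route`, `select_route` and `VLLM_MIN_VRAM_GB` are hypothetical names, and the OS string is assumed to follow Python's `platform.system()` convention ("Linux", "Windows", "Darwin").

```python
from typing import NamedTuple, Optional

VLLM_MIN_VRAM_GB = 16  # threshold taken from the routing diagram


class Route(NamedTuple):
    engine: str        # "vllm" or "llamacpp"
    speculative: bool  # True when a draft model is available


def select_route(os_name: str, vram_gb: float, draft_model: Optional[str]) -> Route:
    """Pick an engine from OS and VRAM; enable speculation if a draft exists."""
    if os_name == "Linux" and vram_gb >= VLLM_MIN_VRAM_GB:
        engine = "vllm"
    else:
        # Windows, macOS, or Linux with less than 16 GB of VRAM.
        engine = "llamacpp"
    return Route(engine, draft_model is not None)


print(select_route("Linux", 24.0, "meta-llama/Llama-3.2-3B-Instruct"))
print(select_route("Darwin", 64.0, None))
```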

Performance

The engines ship with hardcoded, max-throughput presets:

Setting	vLLM	llama.cpp
GPU Memory / Offload	95% utilisation	All layers offloaded (-1)
Batch Size	256 sequences	2048 tokens
Flash Attention	✅ v2	✅
Prefix Caching	✅	N/A
CUDA Graphs	✅	N/A
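
For orientation, the table rows can be written as keyword arguments of the two backends. The parameter names below exist in vLLM's engine arguments and in llama-cpp-python's `Llama` constructor, but whether hexonit-llm passes exactly these is an assumption; `VLLM_PRESET`, `LLAMACPP_PRESET` and `preset_for` are hypothetical names.

```python
from typing import Any, Dict

# Assumed vLLM equivalents of the table rows.
VLLM_PRESET: Dict[str, Any] = {
    "gpu_memory_utilization": 0.95,  # GPU memory utilisation: 95%
    "max_num_seqs": 256,             # batch size: 256 sequences
    "enable_prefix_caching": True,   # prefix caching on
    "enforce_eager": False,          # False leaves CUDA graph capture on
}

# Assumed llama-cpp-python equivalents of the table rows.
LLAMACPP_PRESET: Dict[str, Any] = {
    "n_gpu_layers": -1,  # offload all layers to the GPU
    "n_batch": 2048,     # batch size: 2048 tokens
    "flash_attn": True,  # flash attention on
}


def preset_for(engine_name: str) -> Dict[str, Any]:
    """Return a copy of the preset for "vllm" or "llamacpp"."""
    presets = {"vllm": VLLM_PRESET, "llamacpp": LLAMACPP_PRESET}
    return dict(presets[engine_name])


print(preset_for("vllm"))
```

Returning a copy keeps callers from mutating the shared preset by accident.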

License

Contributing

PRs welcome! Please ensure your code passes our checks:

pip install -e ".[dev]"
ruff check .
mypy hexonit_llm
pytest tests/

Release history

Version	Released
0.1.0	May 26, 2026
0.0.2	May 26, 2026 (this version)
0.0.1	May 26, 2026

Download files

Source Distribution

hexonit_llm-0.0.2.tar.gz (19.1 kB view details)

Uploaded May 26, 2026 Source

Built Distribution

hexonit_llm-0.0.2-py3-none-any.whl (22.2 kB view details)

Uploaded May 26, 2026 Python 3

File details

Details for the file hexonit_llm-0.0.2.tar.gz.

File metadata

Download URL: hexonit_llm-0.0.2.tar.gz
Upload date: May 26, 2026
Size: 19.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.0

File hashes

Hashes for hexonit_llm-0.0.2.tar.gz
Algorithm	Hash digest
SHA256	`d59f70a95dd167d287298df5ab60c6ce2b9aa6634415095afae534bf4cfaba66`
MD5	`08ea233a101e5156b58e408ea7b477fd`
BLAKE2b-256	`a9cff7ed4d78eba01fb1e6acbfec932a9c8444346141b4fde7936cc7d2dc4490`

File details

Details for the file hexonit_llm-0.0.2-py3-none-any.whl.

File metadata

Download URL: hexonit_llm-0.0.2-py3-none-any.whl
Upload date: May 26, 2026
Size: 22.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.0

File hashes

Hashes for hexonit_llm-0.0.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`0abd50b435471a384cfcc3d1a2359488e94e6ae17f967b2bab0ee128b1503c31`
MD5	`9babb47f5fa00ac2a968e75f683b40cc`
BLAKE2b-256	`a92f55a3372d20abfc36ea329352700cec4785fd8869e2b51b34e82ed8f94592`
