Ultra-fast local LLM inference with zero-config hardware-optimized speculative decoding.

hexonit-llm 🚀

Ultra-fast local LLM inference: zero config, one import, maximum tokens/sec.

Python 3.10+ | License: MIT


🔍 Can I Run This Model?

Check whether your hardware can run a model before you download it:

from hexonit_llm import UltraInference

# Static check: no model loading required
advice = UltraInference.check("meta-llama/Meta-Llama-3-70B-Instruct")
print(advice)
# ✅ Can run | Recommended: Q4_K_M | Est. VRAM: 38.5GB / 80.0GB available (52% headroom)
#    70B parameter model at Q4_K_M uses ~38.5GB including KV cache overhead.

# Or if you don't have enough VRAM:
# ❌ Cannot run | Need 38.5GB, have 8.0GB (deficit: 30.5GB)
#    💡 Try instead: meta-llama/Meta-Llama-3-8B-Instruct (8B) fits at Q4_K_M
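
To illustrate the arithmetic behind a check like this, here is a minimal sketch that estimates memory as parameter count times bits per weight, plus a fractional overhead for KV cache and runtime buffers. The function names, the 4 bits-per-weight figure, and the 10% overhead are assumptions chosen for illustration (they happen to reproduce the 38.5GB figure above); the real logic in quantization_advisor.py may differ.

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: float = 4.0,
                     overhead: float = 0.10) -> float:
    """Rough memory estimate in GB: quantized weights plus a fractional overhead.

    Both defaults are illustrative assumptions, not values taken from hexonit-llm.
    """
    # 1e9 parameters * bits per weight / 8 bits per byte = gigabytes of weights
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb * (1 + overhead)


def can_run(params_billion: float, available_gb: float) -> tuple[bool, float]:
    """Return (fits, margin_gb); the margin is negative when there is a deficit."""
    needed = estimate_vram_gb(params_billion)
    return needed <= available_gb, available_gb - needed


# A 70B model on an 80GB card, then on an 8GB card
for vram in (80.0, 8.0):
    fits, margin = can_run(70, vram)
    print(f"{vram:.0f}GB available: fits={fits}, margin={margin:.1f}GB")
```

A real advisor would also account for context length, since KV cache size grows with the number of tokens held in memory.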

Philosophy

"One import. That's all."

hexonit-llm is an intelligent orchestrator that:

  1. Inspects your hardware: OS, VRAM, system RAM, CPU
  2. Selects the fastest engine: vLLM (Linux, ≥16GB VRAM) or llama.cpp (Windows/macOS/Linux)
  3. Enables speculative decoding: automatically downloads the matching draft model
  4. Delivers maximum tokens/sec: hardcoded, battle-tested optimisation presets

All with zero configuration.
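
The hardware inspection step could look roughly like the sketch below. It uses only the standard library and, as an assumption, reads VRAM by calling the nvidia-smi command-line tool when it is on the PATH. The function name detect_hardware and the returned dictionary layout are hypothetical; hardware_detector.py may work differently and may cover non-NVIDIA GPUs.

```python
import os
import platform
import shutil
import subprocess


def detect_hardware() -> dict:
    """Collect OS, CPU count, and (if nvidia-smi is available) the largest GPU's VRAM."""
    info = {
        "os": platform.system(),        # "Linux", "Windows", or "Darwin"
        "cpu_count": os.cpu_count(),
        "vram_gb": 0.0,                 # stays 0.0 when no NVIDIA GPU is found
    }
    smi = shutil.which("nvidia-smi")
    if smi:
        try:
            out = subprocess.run(
                [smi, "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
                capture_output=True, text=True, timeout=10, check=True,
            ).stdout
            # One line per GPU, each a total memory figure in MiB
            mib = [float(line) for line in out.splitlines() if line.strip()]
            if mib:
                info["vram_gb"] = max(mib) / 1024
        except (OSError, subprocess.SubprocessError, ValueError):
            pass  # treat any failure as "no usable GPU detected"
    return info


print(detect_hardware())
```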


Quick Start

Installation

pip install hexonit-llm              # core dependencies only
pip install "hexonit-llm[vllm]"      # + vLLM (Linux only)
pip install "hexonit-llm[llamacpp]"  # + llama.cpp (Windows/macOS/Linux)
pip install "hexonit-llm[cloud]"     # + httpx for cloud draft
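
To confirm which optional backends ended up in your environment, a small check like the following may help. The mapping from extras to importable module names is an assumption (vllm for vLLM, llama_cpp for the llama-cpp-python binding, httpx for the cloud extra); installed_extras is a hypothetical helper, not part of hexonit-llm.

```python
from importlib.util import find_spec

# Assumed import names for each optional extra
BACKEND_MODULES = {
    "vllm": "vllm",
    "llamacpp": "llama_cpp",
    "cloud": "httpx",
}


def installed_extras() -> dict[str, bool]:
    """Report which optional backends can be imported, without importing them."""
    return {
        extra: find_spec(module) is not None
        for extra, module in BACKEND_MODULES.items()
    }


print(installed_extras())
```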

Usage

from hexonit_llm import UltraInference

# That's it. One line.
pipe = UltraInference("meta-llama/Meta-Llama-3-70B-Instruct")

# Generate text
response = pipe.generate("What is the meaning of life?")
print(response)

# Batch generation
responses = pipe.generate_batch([
    "Tell me a joke",
    "What is 2+2?",
])

# Chat interface
reply = pipe.chat([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
])

Check what's running

pipe = UltraInference("meta-llama/Meta-Llama-3-70B-Instruct")
print(pipe.engine_name)     # "vllm" or "llamacpp"
print(pipe.draft_model)     # "meta-llama/Llama-3.2-3B-Instruct"
print(pipe.hardware_info)

⚡ Benchmarks

Run your own benchmark:

pipe = UltraInference("meta-llama/Meta-Llama-3-8B-Instruct")
stats = pipe.benchmark(runs=10)
# 🔥 Benchmarking llamacpp with 10 runs...
#   Run 1/10: 47.3 tok/s
#   ...
# 📊 Results: 45.8 tok/s average (llamacpp)

Community benchmark results welcome! Open a PR to add yours to docs/benchmarks.md.
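
For readers curious how a tokens-per-second figure like this is typically derived, here is a minimal sketch: time each generation call, divide the number of produced tokens by the elapsed time, and average across runs. The benchmark function below, its signature, and the stand-in generator are all hypothetical and are not the implementation behind pipe.benchmark().

```python
import time
from statistics import mean


def benchmark(generate, prompt, runs=3, clock=time.perf_counter):
    """Time `generate(prompt)` several times and report tokens per second.

    `generate` is any callable returning a sequence of tokens.
    """
    rates = []
    for _ in range(runs):
        start = clock()
        tokens = generate(prompt)
        elapsed = clock() - start
        # Guard against a zero interval on very coarse clocks
        rates.append(len(tokens) / elapsed if elapsed > 0 else float("inf"))
    return {"runs": rates, "mean_tok_s": mean(rates)}


def fake_generate(prompt):
    """Stand-in for a model call: sleeps briefly, then returns 50 dummy tokens."""
    time.sleep(0.01)
    return ["tok"] * 50


stats = benchmark(fake_generate, "Tell me a joke", runs=3)
print(f"{stats['mean_tok_s']:.1f} tok/s average")
```

Passing the clock as a parameter keeps the helper easy to test with a deterministic time source.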


Supported Model Families

| Family | Target Model | Auto-selected Draft |
|---|---|---|
| Meta LLaMA 3 | Meta-Llama-3-70B-Instruct | Llama-3.2-3B-Instruct |
| Meta LLaMA 3 | Meta-Llama-3-8B-Instruct | Llama-3.2-1B-Instruct |
| Qwen 2.5 | Qwen2.5-72B-Instruct | Qwen2.5-1.5B-Instruct |
| Mixtral | Mixtral-8x22B-Instruct | Ministral-8B-Instruct |
| Gemma 2 | gemma-2-27b-it | gemma-2-2b-it |
| DeepSeek | DeepSeek-V2.5 | deepseek-llm-7b-chat |
| Phi-3 | Phi-3-medium-4k-instruct | Phi-3-mini-4k-instruct |
| … and many more | See model_mappings.py | |

Architecture

hexonit_llm/
├── __init__.py              # UltraInference – the public API
├── orchestrator.py          # The brain: hardware routing + engine factory
├── engines/
│   ├── base.py              # Abstract base engine
│   ├── vllm_engine.py       # vLLM backend (PagedAttention, FlashAttention-2)
│   └── llamacpp_engine.py   # llama.cpp backend (GGUF offloading)
├── config/
│   └── model_mappings.py    # 30+ target→draft model mappings
└── utils/
    ├── hardware_detector.py # OS, VRAM, RAM detection
    ├── model_mapper.py      # HF Hub download & caching
    └── quantization_advisor.py  # Pre-download VRAM analysis

Routing Logic

UltraInference(model)
    │
    ├── OS = Linux & VRAM ≥ 16GB  ──>  vLLM  (FlashAttention-2, PagedAttention)
    │
    └── OS = Windows / macOS
        or VRAM < 16GB           ──>  llama.cpp  (GGUF, GPU offloading)

Speculative decoding is enabled whenever a matching draft model exists.
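
The routing rule above reduces to a single condition. Here is a minimal sketch of it; select_engine is a hypothetical name, and the returned strings follow the engine_name values shown earlier ("vllm" or "llamacpp"). The real orchestrator may apply further checks, such as whether the backend is installed.

```python
def select_engine(os_name: str, vram_gb: float) -> str:
    """Pick a backend from the OS name and available VRAM in GB."""
    # vLLM only on Linux with at least 16GB of VRAM
    if os_name == "Linux" and vram_gb >= 16:
        return "vllm"
    # Everything else: Windows, macOS, or low-VRAM Linux
    return "llamacpp"


for os_name, vram in [("Linux", 24.0), ("Linux", 8.0), ("Windows", 24.0), ("Darwin", 0.0)]:
    print(os_name, vram, "->", select_engine(os_name, vram))
```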


🆚 Compared to Alternatives

| Feature | hexonit-llm | Ollama | vLLM direct | llama.cpp direct |
|---|---|---|---|---|
| Zero config | ✅ | ✅ | ❌ | ❌ |
| Auto engine selection | ✅ | ❌ | ❌ | ❌ |
| Speculative decoding auto | ✅ | ❌ | Manual | ❌ |
| Pre-download VRAM check | ✅ | ❌ | ❌ | ❌ |
| Python-native API | ✅ | Via REST | ✅ | Via binding |
| Windows support | ✅ | ✅ | ❌ | ✅ |
| Benchmark built-in | ✅ | ❌ | ❌ | ❌ |

Performance

The engines ship with hardcoded, max-throughput presets:

| Setting | vLLM | llama.cpp |
|---|---|---|
| GPU Memory Utilisation | 95% | All layers (-1) |
| Batch Size | 256 sequences | 2048 tokens |
| Flash Attention | ✅ v2 | ✅ |
| Prefix Caching | ✅ | N/A |
| CUDA Graphs | ✅ | N/A |
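
For orientation, the presets in the table might translate into engine keyword arguments along these lines. This mapping is an assumption: the keys are real options of vllm.LLM and llama_cpp.Llama, but which ones hexonit-llm actually sets, and how, is not stated in this README. The Flash Attention backend in vLLM is not selected by a constructor argument, so it is left out of the sketch.

```python
# Assumed mapping of the table rows to vLLM constructor options
VLLM_PRESET = {
    "gpu_memory_utilization": 0.95,   # GPU Memory Utilisation: 95%
    "max_num_seqs": 256,              # Batch Size: 256 sequences
    "enable_prefix_caching": True,    # Prefix Caching
    "enforce_eager": False,           # False leaves CUDA graph capture enabled
}

# Assumed mapping of the table rows to llama-cpp-python options
LLAMACPP_PRESET = {
    "n_gpu_layers": -1,               # -1 offloads all layers to the GPU
    "n_batch": 2048,                  # Batch Size: 2048 tokens
    "flash_attn": True,               # Flash Attention
}

for name, preset in (("vllm", VLLM_PRESET), ("llamacpp", LLAMACPP_PRESET)):
    print(name, preset)
```

With the corresponding backend installed, such a dictionary would be unpacked into the constructor, for example `LLM(model=..., **VLLM_PRESET)`.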

License

MIT © 2026 Hexonithy Studios


Contributing

PRs welcome! Please ensure your code passes our checks:

pip install -e ".[dev]"
ruff check .
mypy hexonit_llm
pytest tests/
