
Ultra-fast local LLM inference with zero-config hardware-optimized speculative decoding.


hexonit-llm 🚀

Ultra-fast local LLM inference — zero config, one import, maximum tokens/sec.

Python 3.10+ | License: MIT | Code style: black


Philosophy

"One import. That's all."

hexonit-llm is an intelligent orchestrator that:

  1. Inspects your hardware: OS, VRAM, system RAM, CPU
  2. Selects the fastest available engine: vLLM on Linux with ≥ 16 GB of VRAM, llama.cpp otherwise (Windows, macOS, or Linux with less VRAM)
  3. Enables speculative decoding: automatically downloads the matching draft model when one is mapped
  4. Delivers maximum tokens/sec: hardcoded optimisation presets tuned for throughput

All with zero configuration.
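
The hardware inspection step could look roughly like the sketch below. This is not hexonit-llm's actual detector: `HardwareInfo`, `parse_vram_gb` and `detect_hardware` are hypothetical names, and VRAM is read by shelling out to `nvidia-smi`, so only NVIDIA GPUs are covered. System RAM is left out because the standard library has no portable call for it.

```python
import os
import platform
import shutil
import subprocess
from dataclasses import dataclass


@dataclass(frozen=True)
class HardwareInfo:  # hypothetical container, not the library's own type
    os_name: str     # "Linux", "Windows" or "Darwin" (macOS)
    cpu_count: int
    vram_gb: float   # largest single NVIDIA GPU in GiB; 0.0 if none was found


def parse_vram_gb(nvidia_smi_output: str) -> float:
    """Convert nvidia-smi output (one MiB value per line) to GiB."""
    values = [float(line) for line in nvidia_smi_output.splitlines() if line.strip()]
    return max(values, default=0.0) / 1024


def detect_hardware() -> HardwareInfo:
    vram_gb = 0.0
    if shutil.which("nvidia-smi"):
        try:
            out = subprocess.run(
                ["nvidia-smi", "--query-gpu=memory.total",
                 "--format=csv,noheader,nounits"],
                capture_output=True, text=True, check=True, timeout=10,
            ).stdout
            vram_gb = parse_vram_gb(out)
        except (subprocess.SubprocessError, OSError, ValueError):
            pass  # treat any failure as "no usable GPU"
    return HardwareInfo(platform.system(), os.cpu_count() or 1, vram_gb)
```

Keeping the parsing in its own function means the VRAM logic can be checked on a machine without a GPU.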


Quick Start

Installation

pip install hexonit-llm              # core dependencies only
pip install "hexonit-llm[vllm]"      # + vLLM (Linux only)
pip install "hexonit-llm[llamacpp]"  # + llama.cpp (Windows/macOS/Linux)

Usage

from hexonit_llm import UltraInference

# That's it. One line.
pipe = UltraInference("meta-llama/Meta-Llama-3-70B-Instruct")

# Generate text
response = pipe.generate("What is the meaning of life?")
print(response)

# Batch generation
responses = pipe.generate_batch([
    "Tell me a joke",
    "What is 2+2?",
])

# Chat interface
reply = pipe.chat([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
])

Check what's running

pipe = UltraInference("meta-llama/Meta-Llama-3-70B-Instruct")
print(pipe.engine_name)     # "vllm" or "llamacpp"
print(pipe.draft_model)     # "meta-llama/Llama-3.2-3B-Instruct"
print(pipe.hardware_info)

Supported Model Families

Family          Target Model               Auto-selected Draft
Meta LLaMA 3    Meta-Llama-3-70B-Instruct  Llama-3.2-3B-Instruct
Meta LLaMA 3    Meta-Llama-3-8B-Instruct   Llama-3.2-1B-Instruct
Qwen 2.5        Qwen2.5-72B-Instruct       Qwen2.5-1.5B-Instruct
Mixtral         Mixtral-8x22B-Instruct     Ministral-8B-Instruct
Gemma 2         gemma-2-27b-it             gemma-2-2b-it
DeepSeek        DeepSeek-V2.5              deepseek-llm-7b-chat
Phi-3           Phi-3-medium-4k-instruct   Phi-3-mini-4k-instruct
… and many more See model_mappings.py
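
A target-to-draft lookup of this kind can be as small as a dictionary. The sketch below is an illustration only: `DRAFT_FOR_TARGET` and `find_draft_model` are hypothetical names, not the contents of model_mappings.py. The pairs are copied from the table above, and matching on the part after the last "/" is an assumption about how repository IDs might be normalised.

```python
from typing import Optional

# Hypothetical excerpt; the pairs come from the table in this README.
DRAFT_FOR_TARGET = {
    "Meta-Llama-3-70B-Instruct": "Llama-3.2-3B-Instruct",
    "Meta-Llama-3-8B-Instruct": "Llama-3.2-1B-Instruct",
    "Qwen2.5-72B-Instruct": "Qwen2.5-1.5B-Instruct",
    "gemma-2-27b-it": "gemma-2-2b-it",
}


def find_draft_model(target: str) -> Optional[str]:
    """Return the mapped draft model, or None when no mapping exists."""
    name = target.rsplit("/", 1)[-1]  # drop an "org/" prefix if present
    return DRAFT_FOR_TARGET.get(name)
```

Returning `None` for an unmapped target gives the caller a simple signal to run without speculative decoding.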

Architecture

hexonit_llm/
├── __init__.py              # UltraInference – the public API
├── orchestrator.py          # The brain: hardware routing + engine factory
├── engines/
│   ├── vllm_engine.py       # vLLM backend (PagedAttention, FlashAttention-2)
│   └── llamacpp_engine.py   # llama.cpp backend (GGUF offloading)
├── config/
│   └── model_mappings.py    # 30+ target→draft model mappings
└── utils/
    ├── hardware_detector.py # OS, VRAM, RAM detection
    └── model_mapper.py      # HF Hub download & caching

Routing Logic

UltraInference(model)
    │
    ├── OS = Linux & VRAM ≥ 16GB  ──>  vLLM  (FlashAttention-2, PagedAttention)
    │
    └── OS = Windows / macOS
        or VRAM < 16GB           ──>  llama.cpp  (GGUF, GPU offloading)
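
Expressed as code, the rule in the diagram comes down to a single condition. `select_engine` is a hypothetical helper written for this README, not the function in orchestrator.py. The return values reuse the strings that `pipe.engine_name` reports, and `"Linux"` is what Python's `platform.system()` returns on Linux.

```python
def select_engine(os_name: str, vram_gb: float) -> str:
    """Pick a backend from the OS name and the available VRAM in GB."""
    if os_name == "Linux" and vram_gb >= 16:
        return "vllm"
    # Windows, macOS, or any machine with less than 16 GB of VRAM
    return "llamacpp"
```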

Speculative decoding is always enabled when a matching draft model exists.
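
For readers new to the technique, here is a toy sketch of one round of greedy speculative decoding. It is not how vLLM or llama.cpp implement it: the two "models" are plain Python functions, all names are made up for the example, and real engines verify the draft tokens in one batched forward pass and may use a sampling-based acceptance rule instead of exact matching.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # maps a token prefix to the next token


def speculative_step(target: NextToken, draft: NextToken,
                     tokens: List[int], k: int = 4) -> List[int]:
    """Run one draft-then-verify round and return the extended sequence."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposal: List[int] = []
    for _ in range(k):
        proposal.append(draft(tokens + proposal))

    # 2. The target checks each proposed position in order.
    accepted: List[int] = []
    for guess in proposal:
        expected = target(tokens + accepted)
        if guess != expected:
            accepted.append(expected)  # replace the first wrong guess and stop
            return tokens + accepted
        accepted.append(guess)

    # 3. Every guess matched, so the target contributes one extra token.
    accepted.append(target(tokens + accepted))
    return tokens + accepted


def target_model(prefix: List[int]) -> int:
    return (prefix[-1] + 1) % 10  # toy "large" model: counts upwards


def draft_model(prefix: List[int]) -> int:
    # Toy "small" model: agrees with the target except right after a 3.
    return 0 if prefix[-1] == 3 else (prefix[-1] + 1) % 10


print(speculative_step(target_model, draft_model, [0]))  # [0, 1, 2, 3, 4]
```

The output is the same as running the target model alone. The speed-up comes from accepting several draft tokens per target verification whenever the two models agree.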


Performance

The engines ship with hardcoded, max-throughput presets:

Setting                  vLLM            llama.cpp
GPU Memory Utilisation   95%             All layers (-1)
Batch Size               256 sequences   2048 tokens
Flash Attention          ✅ v2
Prefix Caching                           N/A
CUDA Graphs                              N/A
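
As a rough illustration, the first two rows of the table could translate into engine keyword arguments as follows. The mapping is an assumption, not the package's actual preset code. `gpu_memory_utilization` and `max_num_seqs` are vLLM engine arguments, `n_gpu_layers` and `n_batch` are `llama_cpp.Llama` arguments, and `preset_for` is a hypothetical helper.

```python
# Assumed mapping from the table rows to engine keyword arguments.
VLLM_PRESET = {
    "gpu_memory_utilization": 0.95,  # GPU Memory Utilisation: 95%
    "max_num_seqs": 256,             # Batch Size: 256 sequences
}

LLAMACPP_PRESET = {
    "n_gpu_layers": -1,  # All layers (-1): offload every layer to the GPU
    "n_batch": 2048,     # Batch Size: 2048 tokens
}


def preset_for(engine_name: str) -> dict:
    """Return a copy of the preset for "vllm" or "llamacpp"."""
    presets = {"vllm": VLLM_PRESET, "llamacpp": LLAMACPP_PRESET}
    return dict(presets[engine_name])  # copy so callers cannot mutate the preset
```

An engine wrapper could then pass the result on as `**preset_for(engine_name)` when constructing the backend.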

License

MIT © 2025 Hexonithy Studios


Contributing

PRs welcome! Please ensure your code passes our checks:

pip install -e ".[dev]"
ruff check .
mypy hexonit_llm
pytest tests/
