Ultra-fast local LLM inference with zero-config hardware-optimized speculative decoding.

hexonit-llm 🚀

Ultra-fast local LLM inference — zero config, one import, maximum tokens/sec.

Python 3.10+ | License: MIT | Code style: black


Philosophy

"One import. That's all."

hexonit-llm is an intelligent orchestrator that:

  1. Inspects your hardware — OS, VRAM, system RAM, CPU
  2. Selects the fastest engine — vLLM (Linux, ≥16GB VRAM) or llama.cpp (Windows/macOS/Linux)
  3. Enables speculative decoding — automatically downloads the matching draft model
  4. Delivers maximum tokens/sec — hardcoded, battle-tested optimisation presets

All with zero configuration.


Quick Start

Installation

pip install hexonit-llm              # core dependencies only
pip install "hexonit-llm[vllm]"      # + vLLM (Linux only)
pip install "hexonit-llm[llamacpp]"  # + llama.cpp (Windows/macOS/Linux)
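After installing, it can be useful to check which backend packages are actually importable. The snippet below is a small sketch, not part of hexonit-llm. It assumes the extras pull in packages importable as `vllm` and `llama_cpp`; the helper name `available_backends` is made up for illustration.

```python
from importlib.util import find_spec

# Assumption: these are the import names installed by the two extras.
BACKEND_MODULES = {"vllm": "vllm", "llamacpp": "llama_cpp"}


def available_backends() -> dict[str, bool]:
    """Report which backend modules can be found, without importing them."""
    return {
        name: find_spec(module) is not None
        for name, module in BACKEND_MODULES.items()
    }


if __name__ == "__main__":
    for name, found in available_backends().items():
        print(f"{name}: {'installed' if found else 'missing'}")
```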

Usage

from hexonit_llm import UltraInference

# That's it. One line.
pipe = UltraInference("meta-llama/Meta-Llama-3-70B-Instruct")

# Generate text
response = pipe.generate("What is the meaning of life?")
print(response)

# Batch generation
responses = pipe.generate_batch([
    "Tell me a joke",
    "What is 2+2?",
])

# Chat interface
reply = pipe.chat([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
])

Check what's running

pipe = UltraInference("meta-llama/Meta-Llama-3-70B-Instruct")
print(pipe.engine_name)     # "vllm" or "llamacpp"
print(pipe.draft_model)     # "meta-llama/Llama-3.2-3B-Instruct"
print(pipe.hardware_info)

Supported Model Families

Family        Target Model               Auto-selected Draft
Meta LLaMA 3  Meta-Llama-3-70B-Instruct  Llama-3.2-3B-Instruct
Meta LLaMA 3  Meta-Llama-3-8B-Instruct   Llama-3.2-1B-Instruct
Qwen 2.5      Qwen2.5-72B-Instruct       Qwen2.5-1.5B-Instruct
Mixtral       Mixtral-8x22B-Instruct     Ministral-8B-Instruct
Gemma 2       gemma-2-27b-it             gemma-2-2b-it
DeepSeek      DeepSeek-V2.5              deepseek-llm-7b-chat
Phi-3         Phi-3-medium-4k-instruct   Phi-3-mini-4k-instruct

… and many more. See model_mappings.py

Architecture

hexonit_llm/
├── __init__.py              # UltraInference – the public API
├── orchestrator.py          # The brain: hardware routing + engine factory
├── engines/
│   ├── vllm_engine.py       # vLLM backend (PagedAttention, FlashAttention-2)
│   └── llamacpp_engine.py   # llama.cpp backend (GGUF offloading)
├── config/
│   └── model_mappings.py    # 30+ target→draft model mappings
└── utils/
    ├── hardware_detector.py # OS, VRAM, RAM detection
    └── model_mapper.py      # HF Hub download & caching

Routing Logic

UltraInference(model)
    │
    ├── OS = Linux & VRAM ≥ 16GB  ──>  vLLM  (FlashAttention-2, PagedAttention)
    │
    └── OS = Windows / macOS
        or VRAM < 16GB           ──>  llama.cpp  (GGUF, GPU offloading)
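The decision in the diagram above boils down to a single condition. A minimal sketch, assuming the OS name comes from `platform.system()` and VRAM is given in GB; `select_engine` is a hypothetical name, and the return values mirror the `engine_name` strings shown earlier:

```python
def select_engine(os_name: str, vram_gb: float) -> str:
    """Pick a backend following the routing rule in the diagram."""
    if os_name == "Linux" and vram_gb >= 16:
        return "vllm"
    # Windows, macOS, or any machine with less than 16 GB of VRAM
    return "llamacpp"


if __name__ == "__main__":
    for os_name, vram in [("Linux", 24), ("Linux", 8), ("Windows", 24), ("Darwin", 0)]:
        print(f"{os_name:8} {vram:>3} GB -> {select_engine(os_name, vram)}")
```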

Speculative decoding is enabled whenever a matching draft model exists.
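For readers new to the technique, here is a toy illustration of greedy speculative decoding. It is not how vLLM or llama.cpp implement it: both "models" are stand-in functions over integer tokens, and the target is called once per token here, whereas real engines verify all drafted tokens in a single forward pass and use a probabilistic acceptance rule when sampling. The point it shows is that the output is identical to decoding with the target alone.

```python
from typing import Callable, List

Model = Callable[[List[int]], int]  # token sequence -> next token (greedy)


def target_model(tokens: List[int]) -> int:
    """Stand-in for the large model: counts upward modulo 10."""
    return (tokens[-1] + 1) % 10


def draft_model(tokens: List[int]) -> int:
    """Stand-in for the small model: agrees except when the answer is 5."""
    nxt = (tokens[-1] + 1) % 10
    return 0 if nxt == 5 else nxt


def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       max_new_tokens: int, k: int = 4) -> List[int]:
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    while len(tokens) < goal:
        # 1. The draft model cheaply proposes k tokens.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # 2. The target checks each proposal; its own token is always kept,
        #    so a mismatch is corrected and ends this round.
        for guess in proposal:
            expected = target(tokens)
            tokens.append(expected)
            if expected != guess or len(tokens) >= goal:
                break
        else:
            # 3. Every guess was accepted: the target adds one bonus token.
            tokens.append(target(tokens))
    return tokens


if __name__ == "__main__":
    print(speculative_decode(target_model, draft_model, [1], 8))
```

The speed-up in practice comes from step 2 being one batched target pass over all k guesses, so each accepted guess saves a full sequential decoding step.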


Performance

The engines ship with hardcoded, max-throughput presets:

Setting                 vLLM           llama.cpp
GPU Memory Utilisation  95%            All layers (-1)
Batch Size              256 sequences  2048 tokens
Flash Attention         ✅ v2
Prefix Caching                         N/A
CUDA Graphs                            N/A
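As a rough picture of what such presets could look like in code, the sketch below expresses the first two table rows as keyword arguments. The mapping is an assumption: `gpu_memory_utilization` and `max_num_seqs` are vLLM engine arguments, and `n_gpu_layers` and `n_batch` are llama-cpp-python `Llama` arguments, but whether hexonit-llm uses exactly these keys is not confirmed by this README. `preset_for` is a hypothetical helper.

```python
# Assumed translation of the table rows into backend keyword arguments.
VLLM_PRESET = {
    "gpu_memory_utilization": 0.95,  # "95%"
    "max_num_seqs": 256,             # "256 sequences"
}

LLAMACPP_PRESET = {
    "n_gpu_layers": -1,              # "All layers (-1)"
    "n_batch": 2048,                 # "2048 tokens"
}

_PRESETS = {"vllm": VLLM_PRESET, "llamacpp": LLAMACPP_PRESET}


def preset_for(engine: str) -> dict:
    """Return a copy of the preset for an engine name, or raise ValueError."""
    try:
        return dict(_PRESETS[engine])
    except KeyError:
        raise ValueError(f"unknown engine: {engine!r}") from None


if __name__ == "__main__":
    print(preset_for("vllm"))
    print(preset_for("llamacpp"))
```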

License

MIT © 2025 Hexonithy Studios


Contributing

PRs welcome! Please ensure your code passes our checks:

pip install -e ".[dev]"
ruff check .
mypy hexonit_llm
pytest tests/
