A powerful llama-cpp-python based LLM serving tool similar to Ollama

Inferno

A powerful llama-cpp-python based LLM serving tool

Run local LLMs with an OpenAI-compatible API, interactive CLI, and seamless Hugging Face integration.

[!NOTE] Inferno sets the context length to 4096 tokens by default. You can change it with the /set context command in chat mode.

📚 Documentation

📖 Read the Full Documentation

Comprehensive guides, API references, and examples available at deepwiki.com/HelpingAI/inferno

✨ Overview

Inferno is a tool for running Large Language Models (LLMs) locally on your machine. It offers a workflow similar to Ollama and adds Hugging Face downloads, quantization tools, and RAM requirement estimates. Inferno makes it easy to download, manage, and use GGUF models from Hugging Face through a command-line interface and APIs compatible with OpenAI and Ollama clients.

🚀 Key Features

🤗 Hugging Face Integration: Download models directly with interactive file selection and repository browsing
🔄 Flexible Model Specification: Support for repo_id:filename format for direct file targeting
🔌 OpenAI & Ollama Compatible APIs: Use with any client that supports these APIs
🐍 Native Python Client: Built-in OpenAI-compatible Python client for seamless integration
💬 Interactive CLI: Powerful command-line interface for model management and chat
⚡ Streaming Support: Real-time streaming responses for chat and completions
🖥️ GPU Acceleration: Utilize GPU for faster inference when available
📏 Context Window Control: Adjust context size for different models and use cases
🧠 Model Management: Copy, show details, and list running models
📊 Embeddings Support: Generate embeddings from models
⚠️ RAM Requirement Warnings: Automatic warnings about RAM requirements for different model sizes
🔍 Max Context Detection: Automatically detects and displays maximum context length from GGUF files
📈 Quantization Tools: Convert models between different quantization levels with visual comparison and importance matrix support
🔄 Keep-Alive Control: Configure model unloading behavior with keep-alive settings
🛠️ Advanced Configuration: Set custom parameters like threads, batch size, and RoPE settings

⚙️ Installation

Install Inferno directly from source:

# Clone the repository
git clone https://github.com/HelpingAI/inferno.git
cd inferno

# Install in development mode
pip install -e .

# Or install with the optional development extras
pip install -e ".[dev]"

🖥️ Command Line Interface

Inferno provides a powerful command-line interface for managing and using LLMs:

# Show available commands
inferno --help

# Using as a Python module
python -m inferno --help

Command	Description
`inferno pull <model>`	Download a model from Hugging Face
`inferno list`	List downloaded models with RAM requirements
`inferno serve <model>`	Start a model server with OpenAI & Ollama compatible APIs
`inferno run <model>`	Chat with a model interactively
`inferno remove <model>`	Remove a downloaded model
`inferno copy <source> <dest>`	Copy a model to a new name
`inferno show <model>`	Show detailed model information
`inferno ps`	List running models
`inferno quantize <model> <output>`	Quantize a model to a different format
`inferno compare <models...>`	Compare multiple models (size, metrics)
`inferno estimate <model>`	Show RAM usage estimates for quantization
`inferno version`	Show version information

📋 Usage Guide

Download a Model

# Download a model from Hugging Face (interactive file selection)
inferno pull Abhaykoul/HAI3-raw-Q4_K_M-GGUF

# Download a specific file using repo_id:filename format
inferno pull Abhaykoul/HAI3-raw-Q4_K_M-GGUF:hai3-raw-q4_k_m.gguf

When downloading models, Inferno will:

Show available GGUF files in the repository
Display file sizes and RAM requirements
Show maximum context length for each model
Provide a comparison of RAM usage by quantization type
Warn if your system has insufficient RAM
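
The repo_id:filename format used by inferno pull can be split with a few lines of Python. This is only a sketch of the idea, not Inferno's own parser: the function name parse_model_spec is hypothetical, and it assumes the first colon separates the repository from the file name.

```python
from typing import Optional, Tuple


def parse_model_spec(spec: str) -> Tuple[str, Optional[str]]:
    """Split 'repo_id:filename' into its parts (hypothetical helper).

    The filename is None when the spec names only a repository.
    """
    repo_id, sep, filename = spec.partition(":")
    if not repo_id:
        raise ValueError(f"missing repo_id in model spec: {spec!r}")
    # An absent or empty filename means the file still has to be chosen.
    return repo_id, (filename if sep and filename else None)


print(parse_model_spec("Abhaykoul/HAI3-raw-Q4_K_M-GGUF"))
print(parse_model_spec("Abhaykoul/HAI3-raw-Q4_K_M-GGUF:hai3-raw-q4_k_m.gguf"))
```

A spec without a file name maps to the interactive file selection described above; a spec with one targets that file directly.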

Model Quantization

Inferno provides an interactive quantization interface:

# Quantize a HuggingFace model (interactive)
inferno quantize hf:Qwen/Qwen3-0.6B

# The command will:
# 1. Show available methods with RAM estimates
# 2. Let you select the preferred method
# 3. Download and convert the model
# 4. Save in the inferno models directory

Interactive Quantization UI

When you run the quantize command, you'll see:

A table of available methods showing:
- Method name (e.g., q4_k_m)
- Bits per parameter
- RAM multiplier (e.g., 1.40× model size)
- Description and use case

RAM usage examples for a 3GB GGUF model file:
- q2_k: ~3.45GB RAM (3GB × 1.15)
- q4_k_m: ~4.20GB RAM (3GB × 1.40)
- q8_0: ~6.00GB RAM (3GB × 2.00)
- f16: ~8.40GB RAM (3GB × 2.80)

Available Methods

Method	Bits/Param	RAM Usage	Best For
q2_k	~2.5 bits	1.15× size	Minimum RAM usage, lower quality
q3_k_m	~3.5 bits	1.28× size	Good balance of RAM/quality
q4_k_m	~4.5 bits	1.40× size	Best general-purpose choice
q5_k_m	~5.5 bits	1.65× size	Better quality, more RAM
q6_k	~6.5 bits	1.80× size	High quality, high RAM
q8_0	~8.5 bits	2.00× size	Very high quality
f16	16.0 bits	2.80× size	Maximum quality, highest RAM
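
The multipliers in the table can be applied with a small lookup. The sketch below copies the values from the table; the names RAM_MULTIPLIERS and estimate_ram_gb are illustrative and are not part of Inferno's API.

```python
# RAM multipliers copied from the "Available Methods" table.
RAM_MULTIPLIERS = {
    "q2_k": 1.15,
    "q3_k_m": 1.28,
    "q4_k_m": 1.40,
    "q5_k_m": 1.65,
    "q6_k": 1.80,
    "q8_0": 2.00,
    "f16": 2.80,
}


def estimate_ram_gb(file_size_gb: float, method: str) -> float:
    """Estimate RAM as model file size times the method's multiplier."""
    try:
        multiplier = RAM_MULTIPLIERS[method.lower()]
    except KeyError:
        raise ValueError(f"unknown quantization method: {method!r}") from None
    return file_size_gb * multiplier


# Reproduce the 3GB examples from the interactive UI description.
for method in ("q2_k", "q4_k_m", "q8_0", "f16"):
    print(f"{method}: ~{estimate_ram_gb(3.0, method):.2f}GB RAM")
```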

RAM Usage Calculation (Recommended)

RAM usage consists of two parts: base model RAM and context RAM.

[!NOTE] The following RAM usage estimates are based on how the Hugging Face transformers library loads models in FP16 (float16) precision. We use the FP16 model file size as the baseline for these calculations in this README. Actual RAM usage may vary depending on backend and quantization.

Base Model RAM:
The FP16 model file size in GB is roughly 2× the parameter count in billions (2 bytes per parameter):
- 1B parameters ≈ 2GB (FP16) file size
- 3B parameters ≈ 6GB (FP16) file size
- 7B parameters ≈ 14GB (FP16) file size

Multiply the model file size by the quantization factor to get the estimated RAM:
- 2GB (GGUF model) × 1.40 (q4_k_m) ≈ 2.8GB estimated RAM
- 6GB (GGUF model) × 1.40 (q4_k_m) ≈ 8.4GB estimated RAM

Context RAM:
Additional RAM is needed for the context window (per billion parameters):
- 4K context ≈ +0.2GB RAM
- 8K context ≈ +0.4GB RAM
- 16K context ≈ +0.8GB RAM
- 32K context ≈ +1.6GB RAM
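
Putting the two parts together, a rough total can be computed as below. This is a sketch under stated assumptions: the function names are hypothetical, "4K" is taken to mean 4096 tokens, and the context overhead is scaled linearly between the listed sizes (the four listed values happen to follow that line).

```python
def fp16_size_gb(params_billion: float) -> float:
    """FP16 uses 2 bytes per parameter, so about 2GB per billion parameters."""
    return 2.0 * params_billion


def context_ram_gb(n_ctx: int, params_billion: float) -> float:
    """About 0.2GB per 4096 tokens of context, per billion parameters."""
    return 0.2 * (n_ctx / 4096) * params_billion


def total_ram_gb(params_billion: float, multiplier: float, n_ctx: int = 4096) -> float:
    """Base model RAM (file size times multiplier) plus context RAM."""
    base = fp16_size_gb(params_billion) * multiplier
    return base + context_ram_gb(n_ctx, params_billion)


# 3B parameters with the q4_k_m multiplier (1.40) at the default 4096 context.
print(f"~{total_ram_gb(3, 1.40):.1f}GB RAM")
```

Treat the result as a planning figure only; as the note above says, actual usage depends on the backend and quantization.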

[!NOTE] You can run models on systems with less RAM than recommended, but expect slower performance and possible swapping to disk.

Importance Matrix Quantization

For better quality at the same size, use importance matrix quantization:

Method	Description
iq3_m	3-bit importance-weighted
iq4_nl	4-bit non-linear (best accuracy)
iq4_xs	4-bit extra small size

List Downloaded Models

inferno list

The list command shows:

Model names and repositories
File sizes and quantization types
RAM requirements (color-coded based on your system's RAM)
Download dates
Quantization comparison table

Start the Server

# Start the server with a downloaded model
inferno serve HAI3-raw-Q4_K_M-GGUF

# Start the server with a model from Hugging Face (downloads if needed)
inferno serve Abhaykoul/HAI3-raw-Q4_K_M-GGUF

# Specify host and port
inferno serve HAI3-raw-Q4_K_M-GGUF --host 0.0.0.0 --port 8080

The server provides:

OpenAI-compatible API endpoints (/v1/...)
Ollama-compatible API endpoints (/api/...)
Support for chat completions, text completions, and embeddings
Streaming responses
Automatic model loading and unloading

Chat with a Model

inferno run HAI3-raw-Q4_K_M-GGUF

Available Chat Commands

Command	Description
`/help` or `/?`	Show available commands
`/bye`	Exit the chat
`/set system <prompt>`	Set the system prompt (use quotes for multi-word prompts)
`/set context <size>`	Set context window size (default: 4096)
`/clear` or `/cls`	Clear the terminal screen
`/reset`	Reset all settings

🔌 API Usage

Inferno provides both OpenAI-compatible and Ollama-compatible APIs. You can use it with any client that supports either API.

OpenAI API Endpoints

/v1/models - List available models
/v1/chat/completions - Create chat completions
/v1/completions - Create text completions
/v1/embeddings - Generate embeddings
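
For clients without the OpenAI package, the /v1/completions endpoint can be reached with the standard library alone. The sketch below assumes the request and response bodies follow the OpenAI completions format (a prompt field in, choices[0].text out); the helper names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # Default Inferno API URL used in this README


def build_completion_request(model: str, prompt: str, max_tokens: int = 64) -> urllib.request.Request:
    """Build a POST request for /v1/completions (assumed OpenAI-style payload)."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{BASE_URL}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(model: str, prompt: str) -> str:
    """Send the request and return the text of the first completion."""
    request = build_completion_request(model, prompt)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With a server running, complete("HAI3-raw-Q4_K_M-GGUF", "Once upon a time") would return the generated text.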

Ollama API Endpoints

/api/chat - Create chat completions
/api/generate - Create text completions
/api/embed - Generate embeddings
/api/tags - List available models
/api/show - Show model details
/api/copy - Copy a model
/api/delete - Delete a model
/api/pull - Pull a model

Python Example (OpenAI API)

from openai import OpenAI  # Requires openai>=1.0

# Configure the client
client = OpenAI(
    api_key="dummy",  # Not used by Inferno, but the client requires a value
    base_url="http://localhost:8000/v1",  # Default Inferno API URL
)

# Chat completion
response = client.chat.completions.create(
    model="HAI3-raw-Q4_K_M-GGUF",  # Use the model name
    messages=[
        {"role": "user", "content": "Hello, how are you?"}
    ]
)

print(response.choices[0].message.content)

# Streaming chat completion
stream = client.chat.completions.create(
    model="HAI3-raw-Q4_K_M-GGUF",
    messages=[
        {"role": "user", "content": "Tell me a joke"}
    ],
    stream=True
)

for chunk in stream:
    # Skip chunks that carry no text (for example, the final chunk)
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

🧩 Integration with Applications

Inferno can be easily integrated with various applications that support the OpenAI API format:

# Example with LangChain (requires the langchain-openai package)
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage

# Configure to use local Inferno server with OpenAI API
chat = ChatOpenAI(
    model="HAI3-raw-Q4_K_M-GGUF",
    api_key="dummy",  # Not used by Inferno, but the client requires a value
    base_url="http://localhost:8000/v1",
    streaming=True
)

# Use the model
response = chat.invoke([HumanMessage(content="Explain quantum computing in simple terms")])
print(response.content)

Ollama API Example

import requests

# Chat completion with Ollama API
response = requests.post(
    "http://localhost:8000/api/chat",
    json={
        "model": "HAI3-raw-Q4_K_M-GGUF",
        "messages": [
            {"role": "user", "content": "Hello, how are you?"}
        ],
        # Ask for a single JSON reply; Ollama-style servers may stream by default
        "stream": False
    }
)
response.raise_for_status()

print(response.json()["message"]["content"])

# Generate embeddings
response = requests.post(
    "http://localhost:8000/api/embed",
    json={
        "model": "HAI3-raw-Q4_K_M-GGUF",
        "input": "Hello, world!"
    }
)
response.raise_for_status()

print(response.json()["embeddings"])
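
Streaming replies from /api/chat can be reassembled line by line. The sketch below assumes Inferno mirrors Ollama's streaming format, where each line is a JSON object with a message.content fragment and a done flag; that format is an assumption here, and join_stream is a hypothetical helper.

```python
import json
from typing import Iterable


def join_stream(lines: Iterable[bytes]) -> str:
    """Concatenate message fragments from newline-delimited JSON chunks."""
    parts = []
    for raw in lines:
        if not raw.strip():
            continue  # Ignore blank lines between chunks
        chunk = json.loads(raw)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break  # The server marks the final chunk with done=true
    return "".join(parts)


# Simulated stream so the example runs without a server.
sample = [
    b'{"message": {"role": "assistant", "content": "Hel"}, "done": false}',
    b'{"message": {"role": "assistant", "content": "lo"}, "done": false}',
    b'{"message": {"role": "assistant", "content": ""}, "done": true}',
]
print(join_stream(sample))
```

Against a live server, the same function could consume response.iter_lines() from a requests.post call made with stream=True and "stream": True in the JSON body.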

🐍 Native Python Client

Inferno includes a built-in Python client that provides a drop-in replacement for the OpenAI Python client. This allows you to use Inferno with existing code that uses the OpenAI client without any modifications.

Using the Native Client

from inferno.client import InfernoClient

# Initialize the client
client = InfernoClient(
    api_key="dummy",  # Not used by Inferno but kept for OpenAI compatibility
    api_base="http://localhost:8000/v1",  # Default Inferno API URL
)

# Chat completions
response = client.chat.create(
    model="HAI3-raw-Q4_K_M-GGUF",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello, how are you?"}
    ],
    max_tokens=100,
    temperature=0.7,
)

print(response["choices"][0]["message"]["content"])

# Streaming chat completions
stream = client.chat.create(
    model="HAI3-raw-Q4_K_M-GGUF",
    messages=[
        {"role": "user", "content": "Tell me a joke"}
    ],
    max_tokens=100,
    temperature=0.7,
    stream=True,
)

for chunk in stream:
    if "choices" in chunk and len(chunk["choices"]) > 0:
        if "delta" in chunk["choices"][0] and "content" in chunk["choices"][0]["delta"]:
            content = chunk["choices"][0]["delta"]["content"]
            print(content, end="", flush=True)

# Embeddings
response = client.embeddings.create(
    model="HAI3-raw-Q4_K_M-GGUF",
    input="Hello, world!",
)

print(response["data"][0]["embedding"])

# List models
models = client.models.list()
for model in models["data"]:
    print(model["id"])

Client Features

OpenAI Compatibility: Drop-in replacement for the OpenAI Python client
Streaming Support: Stream responses for chat completions and text completions
Embeddings: Generate embeddings from text
Model Management: List and retrieve available models
Error Handling: Comprehensive error handling with retries
Configuration Options: Customize timeout, retries, and headers

For more details, see the Python Client README.

📦 Requirements

Software Requirements

Python 3.9+
llama-cpp-python
FastAPI
Uvicorn
Rich (for terminal UI)
Typer (for CLI)
Hugging Face Hub
Pydantic
Requests

Hardware Requirements

1B models: around 2 GB of RAM
3B models: around 4 GB of RAM
7B models: at least 8 GB of RAM
13B models: 16 GB of RAM recommended
33B models: 32 GB of RAM required
GPU acceleration is recommended for better performance

Quantization Types and RAM Usage

Quantization	Bits/Param	RAM Multiplier	Description
Q2_K	~2.5	1.15×	2-bit quantization (lowest quality, smallest size)
Q3_K_M	~3.5	1.28×	3-bit quantization (medium)
Q4_K_M	~4.5	1.40×	4-bit quantization (balanced quality/size)
Q5_K_M	~5.5	1.65×	5-bit quantization (better quality)
Q6_K	~6.5	1.80×	6-bit quantization (high quality)
Q8_0	~8.5	2.00×	8-bit quantization (very high quality)
F16	16.0	2.80×	16-bit float (highest quality, largest size)

🔧 Advanced Configuration

Inferno allows you to configure various aspects of model loading and inference:

GPU Acceleration

# Set number of layers to offload to GPU
inferno serve HAI3-raw-Q4_K_M-GGUF --n_gpu_layers 32

Context Length

# Set custom context length
inferno serve HAI3-raw-Q4_K_M-GGUF --n_ctx 8192

Threading

# Set number of threads for inference
inferno serve HAI3-raw-Q4_K_M-GGUF --n_threads 8

Memory Options

# Use mlock to keep model in memory
inferno serve HAI3-raw-Q4_K_M-GGUF --use_mlock
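
When the server is launched from a script, the flags shown above can be assembled programmatically. This sketch uses only flags that appear in this README; build_serve_command is a hypothetical helper, and combining several flags in one invocation is assumed to work.

```python
import shlex
from typing import List, Optional


def build_serve_command(
    model: str,
    host: Optional[str] = None,
    port: Optional[int] = None,
    n_gpu_layers: Optional[int] = None,
    n_ctx: Optional[int] = None,
    n_threads: Optional[int] = None,
    use_mlock: bool = False,
) -> List[str]:
    """Build an argument list for `inferno serve`, skipping unset options."""
    cmd = ["inferno", "serve", model]
    options = {
        "--host": host,
        "--port": port,
        "--n_gpu_layers": n_gpu_layers,
        "--n_ctx": n_ctx,
        "--n_threads": n_threads,
    }
    for flag, value in options.items():
        if value is not None:
            cmd += [flag, str(value)]
    if use_mlock:
        cmd.append("--use_mlock")  # Boolean flag, takes no value
    return cmd


cmd = build_serve_command("HAI3-raw-Q4_K_M-GGUF", port=8080, n_gpu_layers=32, n_ctx=8192)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) to start the server.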

🤝 Contributing

Contributions are welcome! If you'd like to contribute to Inferno, please follow these steps:

Fork the repository
Create a new branch for your feature or bug fix
Make your changes and commit them with descriptive messages
Push your branch to your forked repository
Submit a pull request to the main repository

📄 License

This project is licensed under the HelpingAI Open Source License - a custom license that promotes open innovation and collaboration while ensuring responsible and ethical use of AI technology.

Made with ❤️ by HelpingAI

Release history

Version	Release date
0.1.7	Jan 5, 2026
0.1.6	May 10, 2025
0.1.5	May 6, 2025
0.1.4	May 5, 2025
0.1.3	May 5, 2025
0.1.2 (this version)	May 4, 2025
0.1.0	May 2, 2025

Download files

Source Distribution

inferno_llm-0.1.2.tar.gz (73.0 kB), uploaded May 4, 2025 (Source)

Built Distribution

inferno_llm-0.1.2-py3-none-any.whl (75.2 kB), uploaded May 4, 2025 (Python 3)

File details

Details for the file inferno_llm-0.1.2.tar.gz.

File metadata

Download URL: inferno_llm-0.1.2.tar.gz
Upload date: May 4, 2025
Size: 73.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.11.9

File hashes

Hashes for inferno_llm-0.1.2.tar.gz
Algorithm	Hash digest
SHA256	`04865a2f780028f0149aece1473b06d3ac5bf907cf8672095691ce05eee6d75f`
MD5	`6fcc4e93a3f2fe01cd370ce2e45a2101`
BLAKE2b-256	`627c849f699370be32e3533306d35722377ea735728d6d7fe5304862cf66fbc9`

File details

Details for the file inferno_llm-0.1.2-py3-none-any.whl.

File metadata

Download URL: inferno_llm-0.1.2-py3-none-any.whl
Upload date: May 4, 2025
Size: 75.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.11.9

File hashes

Hashes for inferno_llm-0.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`31700d08480a9366d492ef8216dab0261a6cdfd65f5d850d4abe6fd82a38a18c`
MD5	`b099b0e585c41b9bed782be7d7d52473`
BLAKE2b-256	`0aea7bcafdea316d14980a6d39c3990ad314ba0d84ca2fcccdeefdbc67122dc5`
