A powerful llama-cpp-python based LLM serving tool similar to Ollama
Inferno
A powerful llama-cpp-python based LLM serving tool
Run local LLMs with an OpenAI-compatible API, interactive CLI, and seamless Hugging Face integration.
[!NOTE] Inferno sets the context length to 4096 tokens by default. You can adjust this with the /set context command in chat mode.
📚 Documentation
📖 Read the Full Documentation
Comprehensive guides, API references, and examples available at deepwiki.com/HelpingAI/inferno
✨ Overview
Inferno is a powerful tool for running Large Language Models (LLMs) locally on your machine. It provides an experience similar to Ollama but with enhanced features and flexibility. Inferno makes it easy to download, manage, and use GGUF models from Hugging Face with an intuitive command-line interface and API compatibility with popular tools.
🚀 Key Features
- 🤗 Hugging Face Integration: Download models directly with interactive file selection and repository browsing
- 🔄 Flexible Model Specification: Support for the repo_id:filename format for direct file targeting
- 🔌 OpenAI & Ollama Compatible APIs: Use with any client that supports these APIs
- 🐍 Native Python Client: Built-in OpenAI-compatible Python client for seamless integration
- 💬 Interactive CLI: Powerful command-line interface for model management and chat
- ⚡ Streaming Support: Real-time streaming responses for chat and completions
- 🖥️ GPU Acceleration: Utilize GPU for faster inference when available
- 📏 Context Window Control: Adjust context size for different models and use cases
- 🧠 Model Management: Copy, show details, and list running models
- 📊 Embeddings Support: Generate embeddings from models
- ⚠️ RAM Requirement Warnings: Automatic warnings about RAM requirements for different model sizes
- 🔍 Max Context Detection: Automatically detects and displays maximum context length from GGUF files
- 📈 Quantization Tools: Convert models between different quantization levels with visual comparison and importance matrix support
- 🔄 Keep-Alive Control: Configure model unloading behavior with keep-alive settings
- 🛠️ Advanced Configuration: Set custom parameters like threads, batch size, and RoPE settings
⚙️ Installation
Install Inferno directly from source:
# Clone the repository
git clone https://github.com/HelpingAI/inferno.git
cd inferno
# Install in development mode
pip install -e .
# Or install with all dependencies
pip install -e ".[dev]"
🖥️ Command Line Interface
Inferno provides a powerful command-line interface for managing and using LLMs:
# Show available commands
inferno --help
# Using as a Python module
python -m inferno --help
| Command | Description |
|---|---|
| inferno pull <model> | Download a model from Hugging Face |
| inferno list | List downloaded models with RAM requirements |
| inferno serve <model> | Start a model server with OpenAI & Ollama compatible APIs |
| inferno run <model> | Chat with a model interactively |
| inferno remove <model> | Remove a downloaded model |
| inferno copy <source> <dest> | Copy a model to a new name |
| inferno show <model> | Show detailed model information |
| inferno ps | List running models |
| inferno quantize <model> <output> | Quantize a model to a different format |
| inferno compare <models...> | Compare multiple models (size, metrics) |
| inferno estimate <model> | Show RAM usage estimates for quantization |
| inferno version | Show version information |
📋 Usage Guide
Download a Model
# Download a model from Hugging Face (interactive file selection)
inferno pull Abhaykoul/HAI3-raw-Q4_K_M-GGUF
# Download a specific file using repo_id:filename format
inferno pull Abhaykoul/HAI3-raw-Q4_K_M-GGUF:hai3-raw-q4_k_m.gguf
When downloading models, Inferno will:
- Show available GGUF files in the repository
- Display file sizes and RAM requirements
- Show maximum context length for each model
- Provide a comparison of RAM usage by quantization type
- Warn if your system has insufficient RAM
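The repo_id:filename format shown above can be split with a few lines of Python. This is a minimal sketch, not Inferno's own parser: `ModelSpec` and `parse_model_spec` are hypothetical names introduced here for illustration, and the sketch does not handle the `hf:` prefix that the quantize command accepts.

```python
from typing import NamedTuple, Optional


class ModelSpec(NamedTuple):
    """Parsed model specification (hypothetical type, not part of Inferno)."""
    repo_id: str
    filename: Optional[str]


def parse_model_spec(spec: str) -> ModelSpec:
    """Split 'repo_id' or 'repo_id:filename' into its parts."""
    repo_id, sep, filename = spec.partition(":")
    # Reject an empty repo id, or a trailing colon with no filename after it.
    if not repo_id or (sep and not filename):
        raise ValueError(f"invalid model spec: {spec!r}")
    return ModelSpec(repo_id, filename or None)


print(parse_model_spec("Abhaykoul/HAI3-raw-Q4_K_M-GGUF"))
print(parse_model_spec("Abhaykoul/HAI3-raw-Q4_K_M-GGUF:hai3-raw-q4_k_m.gguf"))
```

When the filename part is `None`, a caller would fall back to interactive file selection, as the first pull example does.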
Model Quantization
Inferno provides an interactive quantization interface:
# Quantize a HuggingFace model (interactive)
inferno quantize hf:Qwen/Qwen3-0.6B
# The command will:
# 1. Show available methods with RAM estimates
# 2. Let you select the preferred method
# 3. Download and convert the model
# 4. Save in the inferno models directory
Interactive Quantization UI
When you run the quantize command, you'll see:
- A table of available methods showing:
  - Method name (e.g., q4_k_m)
  - Bits per parameter
  - RAM multiplier (e.g., 1.40× model size)
  - Description and use case
- RAM usage examples for a 3GB GGUF model file:
  - q2_k: ~3.45GB RAM (3GB × 1.15)
  - q4_k_m: ~4.20GB RAM (3GB × 1.40)
  - q8_0: ~6.00GB RAM (3GB × 2.00)
  - f16: ~8.40GB RAM (3GB × 2.80)
Available Methods
| Method | Bits/Param | RAM Usage | Best For |
|---|---|---|---|
| q2_k | ~2.5 bits | 1.15× size | Minimum RAM usage, lower quality |
| q3_k_m | ~3.5 bits | 1.28× size | Good balance of RAM/quality |
| q4_k_m | ~4.5 bits | 1.40× size | Best general-purpose choice |
| q5_k_m | ~5.5 bits | 1.65× size | Better quality, more RAM |
| q6_k | ~6.5 bits | 1.80× size | High quality, high RAM |
| q8_0 | ~8.5 bits | 2.00× size | Very high quality |
| f16 | 16.0 bits | 2.80× size | Maximum quality, highest RAM |
RAM Usage Calculation (Recommended)
[!NOTE] The following RAM usage estimates are based on how the Hugging Face transformers library loads models in FP16 (float16) precision. We use the FP16 model file size as the baseline for these calculations in this README. Actual RAM usage may vary depending on backend and quantization.
RAM usage consists of two parts:
- Base Model RAM: The FP16 model file size in GB is roughly 2× the number of parameters in billions:
  - 1B parameters ≈ 2GB (FP16) file size
  - 3B parameters ≈ 6GB (FP16) file size
  - 7B parameters ≈ 14GB (FP16) file size

  Multiply by the quantization factor for estimated RAM:
  - 2GB (gguf model) × 1.40 (q4_k_m) ≈ 2.8GB estimated RAM
  - 6GB (gguf model) × 1.40 (q4_k_m) ≈ 8.4GB estimated RAM
- Context RAM: Additional RAM is needed for the context window (per billion parameters):
  - 4K context ≈ +0.2GB RAM
  - 8K context ≈ +0.4GB RAM
  - 16K context ≈ +0.8GB RAM
  - 32K context ≈ +1.6GB RAM
[!NOTE] You can run models on systems with less RAM than recommended, but expect slower performance and possible swapping to disk.
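The two-part rule of thumb above (base model RAM plus context RAM) can be written out as a small calculation. This is only a sketch of the arithmetic in this README, not the code behind `inferno estimate`; the function name `estimate_ram_gb` and the constant names are assumptions made for illustration.

```python
# RAM multipliers copied from the "Available Methods" table in this README.
RAM_MULTIPLIER = {
    "q2_k": 1.15, "q3_k_m": 1.28, "q4_k_m": 1.40, "q5_k_m": 1.65,
    "q6_k": 1.80, "q8_0": 2.00, "f16": 2.80,
}

FP16_GB_PER_BILLION = 2.0             # 1B parameters ≈ 2GB FP16 file size
CONTEXT_GB_PER_1K_PER_BILLION = 0.05  # 4K context ≈ +0.2GB per billion parameters


def estimate_ram_gb(params_billion: float, method: str = "q4_k_m",
                    n_ctx: int = 4096) -> float:
    """Estimate RAM in GB using the README's rule of thumb (hypothetical helper)."""
    base = FP16_GB_PER_BILLION * params_billion * RAM_MULTIPLIER[method]
    context = CONTEXT_GB_PER_1K_PER_BILLION * (n_ctx / 1024) * params_billion
    return base + context


# 1B model, q4_k_m, default 4096-token context: 2.8GB base + 0.2GB context.
print(f"{estimate_ram_gb(1):.1f} GB")  # prints "3.0 GB"
```

The context term scales linearly with both the context length and the parameter count, which matches the 4K to 32K figures listed above.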
Importance Matrix Quantization
For better quality at the same size, use importance matrix quantization:
| Method | Description |
|---|---|
| iq3_m | 3-bit importance-weighted |
| iq4_nl | 4-bit non-linear (best accuracy) |
| iq4_xs | 4-bit extra small size |
List Downloaded Models
inferno list
The list command shows:
- Model names and repositories
- File sizes and quantization types
- RAM requirements (color-coded based on your system's RAM)
- Download dates
- Quantization comparison table
Start the Server
# Start the server with a downloaded model
inferno serve HAI3-raw-Q4_K_M-GGUF
# Start the server with a model from Hugging Face (downloads if needed)
inferno serve Abhaykoul/HAI3-raw-Q4_K_M-GGUF
# Specify host and port
inferno serve HAI3-raw-Q4_K_M-GGUF --host 0.0.0.0 --port 8080
The server provides:
- OpenAI-compatible API endpoints (/v1/...)
- Ollama-compatible API endpoints (/api/...)
- Support for chat completions, text completions, and embeddings
- Streaming responses
- Automatic model loading and unloading
Chat with a Model
inferno run HAI3-raw-Q4_K_M-GGUF
Available Chat Commands
| Command | Description |
|---|---|
| /help or /? | Show available commands |
| /bye | Exit the chat |
| /set system <prompt> | Set the system prompt (use quotes for multi-word prompts) |
| /set context <size> | Set context window size (default: 4096) |
| /clear or /cls | Clear the terminal screen |
| /reset | Reset all settings |
🔌 API Usage
Inferno provides both OpenAI-compatible and Ollama-compatible APIs. You can use it with any client that supports either API.
OpenAI API Endpoints
- /v1/models - List available models
- /v1/chat/completions - Create chat completions
- /v1/completions - Create text completions
- /v1/embeddings - Generate embeddings
Ollama API Endpoints
- /api/chat - Create chat completions
- /api/generate - Create text completions
- /api/embed - Generate embeddings
- /api/tags - List available models
- /api/show - Show model details
- /api/copy - Copy a model
- /api/delete - Delete a model
- /api/pull - Pull a model
Python Example (OpenAI API)
import openai
# Note: this example uses the pre-1.0 openai package interface;
# openai.ChatCompletion was removed in openai>=1.0.
# Configure the client
openai.api_key = "dummy"  # Not used but required
openai.api_base = "http://localhost:8000/v1"  # Default Inferno API URL
# Chat completion
response = openai.ChatCompletion.create(
    model="HAI3-raw-Q4_K_M-GGUF",  # Use the model name
    messages=[
        {"role": "user", "content": "Hello, how are you?"}
    ]
)
print(response.choices[0].message.content)
# Streaming chat completion
for chunk in openai.ChatCompletion.create(
    model="HAI3-raw-Q4_K_M-GGUF",
    messages=[
        {"role": "user", "content": "Tell me a joke"}
    ],
    stream=True
):
    # Some chunks carry only a role or an empty delta, so check before printing
    if hasattr(chunk.choices[0], "delta") and hasattr(chunk.choices[0].delta, "content"):
        print(chunk.choices[0].delta.content, end="", flush=True)
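For clients that do not use the openai package, a streamed chat completion can be read directly. OpenAI-compatible servers conventionally stream server-sent events: lines of the form `data: {json}`, ending with `data: [DONE]`. The sketch below assumes Inferno follows that convention, which this README does not state explicitly; `iter_stream_content` is a hypothetical helper, and the sample lines are made up for illustration.

```python
import json
from typing import Iterable, Iterator


def iter_stream_content(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style streaming lines (hypothetical helper)."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            content = (choice.get("delta") or {}).get("content")
            if content:
                yield content


# Made-up sample of what a streamed response might look like on the wire.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    'data: {"choices": [{"delta": {"content": ", world"}}]}',
    "data: [DONE]",
]
print("".join(iter_stream_content(sample)))  # prints "Hello, world"
```

Against a running server, the lines could come from `requests.post(url, json=payload, stream=True).iter_lines(decode_unicode=True)`, with `"stream": True` set in the request payload.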
🧩 Integration with Applications
Inferno can be easily integrated with various applications that support the OpenAI API format:
# Example with LangChain
# Note: these import paths come from older LangChain releases; newer releases
# provide ChatOpenAI in the langchain_openai package.
from langchain.chat_models import ChatOpenAI
from langchain.schema import HumanMessage
# Configure to use local Inferno server with OpenAI API
chat = ChatOpenAI(
    model_name="HAI3-raw-Q4_K_M-GGUF",
    openai_api_key="dummy",
    openai_api_base="http://localhost:8000/v1",
    streaming=True
)
# Use the model
response = chat([HumanMessage(content="Explain quantum computing in simple terms")])
print(response.content)
Ollama API Example
import requests
# Chat completion with Ollama API
response = requests.post(
    "http://localhost:8000/api/chat",
    json={
        "model": "HAI3-raw-Q4_K_M-GGUF",
        "messages": [
            {"role": "user", "content": "Hello, how are you?"}
        ]
    }
)
print(response.json()["message"]["content"])
# Generate embeddings
response = requests.post(
    "http://localhost:8000/api/embed",
    json={
        "model": "HAI3-raw-Q4_K_M-GGUF",
        "input": "Hello, world!"
    }
)
print(response.json()["embeddings"])
🐍 Native Python Client
Inferno includes a built-in Python client that provides a drop-in replacement for the OpenAI Python client. This allows you to use Inferno with existing code that uses the OpenAI client without any modifications.
Using the Native Client
from inferno.client import InfernoClient
# Initialize the client
client = InfernoClient(
    api_key="dummy",  # Not used by Inferno but kept for OpenAI compatibility
    api_base="http://localhost:8000/v1",  # Default Inferno API URL
)
# Chat completions
response = client.chat.create(
    model="HAI3-raw-Q4_K_M-GGUF",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello, how are you?"}
    ],
    max_tokens=100,
    temperature=0.7,
)
print(response["choices"][0]["message"]["content"])
# Streaming chat completions
stream = client.chat.create(
    model="HAI3-raw-Q4_K_M-GGUF",
    messages=[
        {"role": "user", "content": "Tell me a joke"}
    ],
    max_tokens=100,
    temperature=0.7,
    stream=True,
)
for chunk in stream:
    if "choices" in chunk and len(chunk["choices"]) > 0:
        if "delta" in chunk["choices"][0] and "content" in chunk["choices"][0]["delta"]:
            content = chunk["choices"][0]["delta"]["content"]
            print(content, end="", flush=True)
# Embeddings
response = client.embeddings.create(
    model="HAI3-raw-Q4_K_M-GGUF",
    input="Hello, world!",
)
print(response["data"][0]["embedding"])
# List models
models = client.models.list()
for model in models["data"]:
    print(model["id"])
Client Features
- OpenAI Compatibility: Drop-in replacement for the OpenAI Python client
- Streaming Support: Stream responses for chat completions and text completions
- Embeddings: Generate embeddings from text
- Model Management: List and retrieve available models
- Error Handling: Comprehensive error handling with retries
- Configuration Options: Customize timeout, retries, and headers
For more details, see the Python Client README.
📦 Requirements
Software Requirements
- Python 3.9+
- llama-cpp-python
- FastAPI
- Uvicorn
- Rich (for terminal UI)
- Typer (for CLI)
- Hugging Face Hub
- Pydantic
- Requests
Hardware Requirements
- Around 2 GB of RAM is needed for 1B models
- Around 4 GB of RAM is needed for 3B models
- You should have at least 8 GB of RAM available to run 7B models
- 16 GB of RAM is recommended for 13B models
- 32 GB of RAM is required for 33B models
- GPU acceleration is recommended for better performance
Quantization Types and RAM Usage
| Quantization | Bits/Param | RAM Multiplier | Description |
|---|---|---|---|
| Q2_K | ~2.5 | 1.15× | 2-bit quantization (lowest quality, smallest size) |
| Q3_K_M | ~3.5 | 1.28× | 3-bit quantization (medium) |
| Q4_K_M | ~4.5 | 1.40× | 4-bit quantization (balanced quality/size) |
| Q5_K_M | ~5.5 | 1.65× | 5-bit quantization (better quality) |
| Q6_K | ~6.5 | 1.80× | 6-bit quantization (high quality) |
| Q8_0 | ~8.5 | 2.00× | 8-bit quantization (very high quality) |
| F16 | 16.0 | 2.80× | 16-bit float (highest quality, largest size) |
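One way to use the multipliers in this table is to pick the highest-quality quantization whose estimated RAM fits a given budget. The sketch below is an illustration under that assumption, not a feature of Inferno: `best_method_for` is a hypothetical helper, and `model_gb` is the baseline model size that the multipliers are applied to.

```python
from typing import Optional

# RAM multipliers from the table above, ordered from lowest to highest quality.
RAM_MULTIPLIER = {
    "q2_k": 1.15, "q3_k_m": 1.28, "q4_k_m": 1.40, "q5_k_m": 1.65,
    "q6_k": 1.80, "q8_0": 2.00, "f16": 2.80,
}


def best_method_for(model_gb: float, available_ram_gb: float) -> Optional[str]:
    """Return the highest-quality method that fits in RAM, or None if none fits."""
    fitting = [
        method for method, multiplier in RAM_MULTIPLIER.items()
        if model_gb * multiplier <= available_ram_gb
    ]
    # Dicts preserve insertion order, so the last fitting entry is the best one.
    return fitting[-1] if fitting else None


# A 3GB model with 5GB of free RAM: q5_k_m needs ~4.95GB, q6_k needs ~5.4GB.
print(best_method_for(3.0, 5.0))  # prints "q5_k_m"
```

The estimate ignores context RAM, so leaving some headroom below the actual free RAM is advisable.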
🔧 Advanced Configuration
Inferno allows you to configure various aspects of model loading and inference:
GPU Acceleration
# Set number of layers to offload to GPU
inferno serve HAI3-raw-Q4_K_M-GGUF --n_gpu_layers 32
Context Length
# Set custom context length
inferno serve HAI3-raw-Q4_K_M-GGUF --n_ctx 8192
Threading
# Set number of threads for inference
inferno serve HAI3-raw-Q4_K_M-GGUF --n_threads 8
Memory Options
# Use mlock to keep model in memory
inferno serve HAI3-raw-Q4_K_M-GGUF --use_mlock
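When launching the server from a script, the options above can be assembled programmatically. This is a minimal sketch: `build_serve_command` is a hypothetical helper, every flag it emits is taken from this README, and the assumption that the flags can be combined in a single invocation is not confirmed here.

```python
import shlex
from typing import List, Optional


def build_serve_command(model: str, *, host: Optional[str] = None,
                        port: Optional[int] = None,
                        n_gpu_layers: Optional[int] = None,
                        n_ctx: Optional[int] = None,
                        n_threads: Optional[int] = None,
                        use_mlock: bool = False) -> List[str]:
    """Build an 'inferno serve' argument list (hypothetical helper)."""
    cmd = ["inferno", "serve", model]
    options = {
        "--host": host,
        "--port": port,
        "--n_gpu_layers": n_gpu_layers,
        "--n_ctx": n_ctx,
        "--n_threads": n_threads,
    }
    for flag, value in options.items():
        if value is not None:  # only emit options that were explicitly set
            cmd += [flag, str(value)]
    if use_mlock:
        cmd.append("--use_mlock")
    return cmd


cmd = build_serve_command("HAI3-raw-Q4_K_M-GGUF", port=8080,
                          n_gpu_layers=32, n_ctx=8192)
print(shlex.join(cmd))  # shell-safe rendering of the argument list
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` to start the server without going through a shell.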
🤝 Contributing
Contributions are welcome! If you'd like to contribute to Inferno, please follow these steps:
- Fork the repository
- Create a new branch for your feature or bug fix
- Make your changes and commit them with descriptive messages
- Push your branch to your forked repository
- Submit a pull request to the main repository
📄 License
This project is licensed under the HelpingAI Open Source License - a custom license that promotes open innovation and collaboration while ensuring responsible and ethical use of AI technology.
Made with ❤️ by HelpingAI