Inferno

A powerful llama-cpp-python based LLM serving tool

Run local LLMs with an OpenAI-compatible API, interactive CLI, and seamless Hugging Face integration.

Note: Inferno sets the context length to 4096 tokens by default. You can adjust this with the /set context command in chat mode.

✨ Overview

Inferno runs Large Language Models (LLMs) locally on your machine. It offers an experience similar to Ollama and adds interactive Hugging Face downloads, RAM requirement warnings, and both OpenAI-compatible and Ollama-compatible APIs. You can download, manage, and use GGUF models from Hugging Face through a command-line interface, and connect existing tools through the API.

🚀 Key Features

  • 🤗 Hugging Face Integration: Download models directly with interactive file selection and repository browsing
  • 🔄 Flexible Model Specification: Support for repo_id:filename format for direct file targeting
  • 🔌 OpenAI & Ollama Compatible APIs: Use with any client that supports these APIs
  • 🐍 Native Python Client: Built-in OpenAI-compatible Python client for seamless integration
  • 💬 Interactive CLI: Powerful command-line interface for model management and chat
  • ⚡ Streaming Support: Real-time streaming responses for chat and completions
  • 🖥️ GPU Acceleration: Utilize GPU for faster inference when available
  • 📏 Context Window Control: Adjust context size for different models and use cases
  • 🧠 Model Management: Copy, show details, and list running models
  • 📊 Embeddings Support: Generate embeddings from models
  • ⚠️ RAM Requirement Warnings: Automatic warnings about RAM requirements for different model sizes
  • 🔍 Max Context Detection: Automatically detects and displays maximum context length from GGUF files
  • 📈 Quantization Comparison: View RAM usage by different quantization types
  • 🔄 Keep-Alive Control: Configure model unloading behavior with keep-alive settings
  • 🛠️ Advanced Configuration: Set custom parameters like threads, batch size, and RoPE settings

⚙️ Installation

Install Inferno directly from source:

# Clone the repository
git clone https://github.com/HelpingAI/inferno.git
cd inferno

# Install in development mode
pip install -e .

# Or install with the optional development dependencies (the "dev" extra)
pip install -e ".[dev]"

🖥️ Command Line Interface

Inferno provides a powerful command-line interface for managing and using LLMs:

# Show available commands
inferno --help

# Using as a Python module
python -m inferno --help

| Command | Description |
|---|---|
| inferno pull <model> | Download a model from Hugging Face |
| inferno list | List downloaded models with RAM requirements |
| inferno serve <model> | Start a model server with OpenAI & Ollama compatible APIs |
| inferno run <model> | Chat with a model interactively |
| inferno remove <model> | Remove a downloaded model |
| inferno copy <source> <dest> | Copy a model to a new name |
| inferno show <model> | Show detailed model information |
| inferno ps | List running models |
| inferno version | Show version information |

📋 Usage Guide

Download a Model

# Download a model from Hugging Face (interactive file selection)
inferno pull Abhaykoul/HAI3-raw-Q4_K_M-GGUF

# Download a specific file using repo_id:filename format
inferno pull Abhaykoul/HAI3-raw-Q4_K_M-GGUF:hai3-raw-q4_k_m.gguf

When downloading models, Inferno will:

  • Show available GGUF files in the repository
  • Display file sizes and RAM requirements
  • Show maximum context length for each model
  • Provide a comparison of RAM usage by quantization type
  • Warn if your system has insufficient RAM
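
To make the repo_id:filename format concrete, here is a minimal sketch of how such a model specification could be split into its two parts. The names `ModelSpec` and `parse_model_spec` are hypothetical and chosen for illustration; this is not Inferno's own parsing code.

```python
from typing import NamedTuple, Optional


class ModelSpec(NamedTuple):
    repo_id: str
    filename: Optional[str]  # None means "let the user pick a file interactively"


def parse_model_spec(spec: str) -> ModelSpec:
    """Split 'owner/repo:file.gguf' into its repository and file parts."""
    repo_id, sep, filename = spec.partition(":")
    if not repo_id:
        raise ValueError(f"missing repository id in {spec!r}")
    if sep and not filename:
        raise ValueError(f"missing filename after ':' in {spec!r}")
    return ModelSpec(repo_id, filename or None)


print(parse_model_spec("Abhaykoul/HAI3-raw-Q4_K_M-GGUF:hai3-raw-q4_k_m.gguf"))
```

A spec without a colon yields `filename=None`, which corresponds to the interactive file selection shown in the first pull example.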

List Downloaded Models

inferno list

The list command shows:

  • Model names and repositories
  • File sizes and quantization types
  • RAM requirements (color-coded based on your system's RAM)
  • Download dates
  • Quantization comparison table

Start the Server

# Start the server with a downloaded model
inferno serve HAI3-raw-Q4_K_M-GGUF

# Start the server with a model from Hugging Face (downloads if needed)
inferno serve Abhaykoul/HAI3-raw-Q4_K_M-GGUF

# Specify host and port
inferno serve HAI3-raw-Q4_K_M-GGUF --host 0.0.0.0 --port 8080

The server provides:

  • OpenAI-compatible API endpoints (/v1/...)
  • Ollama-compatible API endpoints (/api/...)
  • Support for chat completions, text completions, and embeddings
  • Streaming responses
  • Automatic model loading and unloading

Chat with a Model

inferno run HAI3-raw-Q4_K_M-GGUF

Available Chat Commands

| Command | Description |
|---|---|
| /help or /? | Show available commands |
| /bye | Exit the chat |
| /set system <prompt> | Set the system prompt (use quotes for multi-word prompts) |
| /set context <size> | Set context window size (default: 4096) |
| /clear or /cls | Clear the terminal screen |
| /reset | Reset all settings |
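
The quoting rule for /set system can be illustrated with a short sketch. It assumes shell-style quoting and uses the standard library's `shlex` module; the function name `parse_chat_command` is hypothetical, and Inferno's actual command parser may behave differently.

```python
import shlex


def parse_chat_command(line: str):
    """Return (command, args) for a slash command, or None for ordinary chat text."""
    if not line.startswith("/"):
        return None
    # shlex honors quotes, so a quoted multi-word prompt stays a single argument.
    # An unbalanced quote raises ValueError.
    parts = shlex.split(line)
    return parts[0], parts[1:]


print(parse_chat_command('/set system "You are a helpful assistant."'))
print(parse_chat_command("/set context 8192"))
```

Without the quotes, the system prompt would be split into one argument per word, which is why the table recommends quoting multi-word prompts.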

🔌 API Usage

Inferno provides both OpenAI-compatible and Ollama-compatible APIs. You can use it with any client that supports either API.

OpenAI API Endpoints

  • /v1/models - List available models
  • /v1/chat/completions - Create chat completions
  • /v1/completions - Create text completions
  • /v1/embeddings - Generate embeddings
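
If you prefer not to install a client library, the chat endpoint can be called with the standard library alone. The sketch below builds the request and extracts the reply text; it assumes the server listens on http://localhost:8000 (the address used in the examples in this README) and returns the usual OpenAI response shape. The helper names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumed address, as used elsewhere in this README


def build_chat_request(model: str, prompt: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request for /v1/chat/completions."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def first_reply(response_body: dict) -> str:
    """Extract the assistant text from an OpenAI-style chat completion response."""
    return response_body["choices"][0]["message"]["content"]


# With a server running, send the request like this:
#   req = build_chat_request("HAI3-raw-Q4_K_M-GGUF", "Hello, how are you?")
#   with urllib.request.urlopen(req) as resp:
#       print(first_reply(json.load(resp)))
```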

Ollama API Endpoints

  • /api/chat - Create chat completions
  • /api/generate - Create text completions
  • /api/embed - Generate embeddings
  • /api/tags - List available models
  • /api/show - Show model details
  • /api/copy - Copy a model
  • /api/delete - Delete a model
  • /api/pull - Pull a model
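
Ollama's own /api/chat endpoint streams its reply as newline-delimited JSON objects, each carrying a fragment under message.content and a final object with "done": true. Assuming Inferno mirrors that format (this README does not state it), the stream could be consumed roughly as follows; `iter_chat_text` is a hypothetical helper.

```python
import json
from typing import Iterable, Iterator


def iter_chat_text(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from an Ollama-style stream of newline-delimited JSON."""
    for raw in lines:
        if not raw.strip():
            continue  # skip blank keep-alive lines
        event = json.loads(raw)
        text = event.get("message", {}).get("content", "")
        if text:
            yield text
        if event.get("done"):
            break  # the final object marks the end of the reply


sample = [
    b'{"message": {"role": "assistant", "content": "Hel"}, "done": false}',
    b'{"message": {"role": "assistant", "content": "lo"}, "done": false}',
    b'{"message": {"role": "assistant", "content": ""}, "done": true}',
]
print("".join(iter_chat_text(sample)))
```

With the requests library, the `lines` argument would typically be `response.iter_lines()` from a call made with `stream=True`.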

Python Example (OpenAI API)

from openai import OpenAI

# Configure the client (requires openai>=1.0)
client = OpenAI(
    api_key="dummy",  # Not used but required
    base_url="http://localhost:8000/v1",  # Default Inferno API URL
)

# Chat completion
response = client.chat.completions.create(
    model="HAI3-raw-Q4_K_M-GGUF",  # Use the model name
    messages=[
        {"role": "user", "content": "Hello, how are you?"}
    ],
)

print(response.choices[0].message.content)

# Streaming chat completion
stream = client.chat.completions.create(
    model="HAI3-raw-Q4_K_M-GGUF",
    messages=[
        {"role": "user", "content": "Tell me a joke"}
    ],
    stream=True,
)

for chunk in stream:
    # Some chunks carry no text (content is None), so skip those
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

🧩 Integration with Applications

Inferno can be easily integrated with various applications that support the OpenAI API format:

# Example with LangChain (requires the langchain-openai package)
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage

# Configure to use local Inferno server with OpenAI API
chat = ChatOpenAI(
    model="HAI3-raw-Q4_K_M-GGUF",
    api_key="dummy",
    base_url="http://localhost:8000/v1",
    streaming=True,
)

# Use the model
response = chat.invoke([HumanMessage(content="Explain quantum computing in simple terms")])
print(response.content)

Ollama API Example

import requests

# Chat completion with Ollama API
response = requests.post(
    "http://localhost:8000/api/chat",
    json={
        "model": "HAI3-raw-Q4_K_M-GGUF",
        "messages": [
            {"role": "user", "content": "Hello, how are you?"}
        ],
        # Ask for a single JSON reply; Ollama's API streams by default
        "stream": False,
    }
)

print(response.json()["message"]["content"])

# Generate embeddings
response = requests.post(
    "http://localhost:8000/api/embed",
    json={
        "model": "HAI3-raw-Q4_K_M-GGUF",
        "input": "Hello, world!"
    }
)

print(response.json()["embeddings"])

🐍 Native Python Client

Inferno includes a built-in Python client designed as a drop-in replacement for the OpenAI Python client, so code written in the OpenAI client style can target a local Inferno server.

Using the Native Client

from inferno.client import InfernoClient

# Initialize the client
client = InfernoClient(
    api_key="dummy",  # Not used by Inferno but kept for OpenAI compatibility
    api_base="http://localhost:8000/v1",  # Default Inferno API URL
)

# Chat completions
response = client.chat.create(
    model="HAI3-raw-Q4_K_M-GGUF",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello, how are you?"}
    ],
    max_tokens=100,
    temperature=0.7,
)

print(response["choices"][0]["message"]["content"])

# Streaming chat completions
stream = client.chat.create(
    model="HAI3-raw-Q4_K_M-GGUF",
    messages=[
        {"role": "user", "content": "Tell me a joke"}
    ],
    max_tokens=100,
    temperature=0.7,
    stream=True,
)

for chunk in stream:
    if "choices" in chunk and len(chunk["choices"]) > 0:
        if "delta" in chunk["choices"][0] and "content" in chunk["choices"][0]["delta"]:
            content = chunk["choices"][0]["delta"]["content"]
            print(content, end="", flush=True)

# Embeddings
response = client.embeddings.create(
    model="HAI3-raw-Q4_K_M-GGUF",
    input="Hello, world!",
)

print(response["data"][0]["embedding"])

# List models
models = client.models.list()
for model in models["data"]:
    print(model["id"])
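
A common next step after generating embeddings is to compare two of them. The sketch below computes cosine similarity in plain Python; it assumes each embedding is a list of floats, as in the OpenAI response format, and is independent of the Inferno client itself.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Orthogonal vectors are unrelated; parallel vectors are maximally similar
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
```

In practice, `a` and `b` would be the `embedding` lists from two separate embeddings responses.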

Client Features

  • OpenAI Compatibility: Drop-in replacement for the OpenAI Python client
  • Streaming Support: Stream responses for chat completions and text completions
  • Embeddings: Generate embeddings from text
  • Model Management: List and retrieve available models
  • Error Handling: Comprehensive error handling with retries
  • Configuration Options: Customize timeout, retries, and headers

For more details, see the Python Client README.

📦 Requirements

Software Requirements

  • Python 3.9+
  • llama-cpp-python
  • FastAPI
  • Uvicorn
  • Rich (for terminal UI)
  • Typer (for CLI)
  • Hugging Face Hub
  • Pydantic
  • Requests

Hardware Requirements

  • Around 2 GB of RAM is needed for 1B models
  • Around 4 GB of RAM is needed for 3B models
  • You should have at least 8 GB of RAM available to run 7B models
  • 16 GB of RAM is recommended for 13B models
  • 32 GB of RAM is required for 33B models
  • GPU acceleration is recommended for better performance
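
The RAM figures above can be turned into a quick pre-flight check. This is a minimal sketch, not Inferno's actual warning logic: the tier table is copied from the list, and rounding a model up to the next listed size is an assumption made here for illustration.

```python
# (model size in billions of parameters, RAM in GB), taken from the list above
RAM_TIERS = [(1, 2), (3, 4), (7, 8), (13, 16), (33, 32)]


def suggested_ram_gb(params_billion: float):
    """Return the RAM figure for the smallest listed tier that covers the model, or None."""
    for size_b, ram_gb in RAM_TIERS:
        if params_billion <= size_b:
            return ram_gb
    return None  # larger than any size the list covers


def has_enough_ram(params_billion: float, system_ram_gb: float) -> bool:
    """True when the system meets the listed figure for this model size."""
    needed = suggested_ram_gb(params_billion)
    return needed is not None and system_ram_gb >= needed


print(suggested_ram_gb(7))        # 7B tier
print(has_enough_ram(13, 8))      # 13B model on an 8 GB machine
```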

Quantization Types and RAM Usage

| Quantization | Bits/Param | RAM Multiplier | Description |
|---|---|---|---|
| Q2_K | ~2.5 | 1.15× | 2-bit quantization (lowest quality, smallest size) |
| Q3_K_M | ~3.5 | 1.28× | 3-bit quantization (medium) |
| Q4_K_M | ~4.5 | 1.40× | 4-bit quantization (balanced quality/size) |
| Q5_K_M | ~5.5 | 1.65× | 5-bit quantization (better quality) |
| Q6_K | ~6.5 | 1.80× | 6-bit quantization (high quality) |
| Q8_0 | ~8.5 | 2.00× | 8-bit quantization (very high quality) |
| F16 | 16.0 | 2.80× | 16-bit float (highest quality, largest size) |
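
The Bits/Param column gives a rough way to estimate how much space a model's weights occupy: parameters × bits per parameter ÷ 8 bytes. The sketch below applies that arithmetic. It covers weights only, not the context cache or runtime overhead, and it does not use the RAM Multiplier column, since the table does not say what that multiplier is applied to.

```python
# Approximate bits per parameter, copied from the table above
BITS_PER_PARAM = {
    "Q2_K": 2.5, "Q3_K_M": 3.5, "Q4_K_M": 4.5, "Q5_K_M": 5.5,
    "Q6_K": 6.5, "Q8_0": 8.5, "F16": 16.0,
}


def approx_weights_gb(params_billion: float, quant: str) -> float:
    """Estimate weight storage in GB (1 GB = 10**9 bytes): parameters x bits / 8."""
    return params_billion * BITS_PER_PARAM[quant] / 8


for quant in ("Q4_K_M", "Q8_0", "F16"):
    print(quant, round(approx_weights_gb(7, quant), 2), "GB for a 7B model")
```

For a 7B model this gives about 3.94 GB at Q4_K_M and 14 GB at F16, which shows why quantized files fit on machines that cannot hold the full-precision weights.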

🔧 Advanced Configuration

Inferno allows you to configure various aspects of model loading and inference:

GPU Acceleration

# Set number of layers to offload to GPU
inferno serve HAI3-raw-Q4_K_M-GGUF --n_gpu_layers 32

Context Length

# Set custom context length
inferno serve HAI3-raw-Q4_K_M-GGUF --n_ctx 8192

Threading

# Set number of threads for inference
inferno serve HAI3-raw-Q4_K_M-GGUF --n_threads 8

Memory Options

# Use mlock to keep model in memory
inferno serve HAI3-raw-Q4_K_M-GGUF --use_mlock
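
When launching the server from a script, the options above can be assembled programmatically. The sketch below uses only the flags shown in this section and assumes they can be combined in a single invocation, which the README does not state explicitly. `build_serve_command` is a hypothetical helper.

```python
from typing import List, Optional


def build_serve_command(
    model: str,
    n_gpu_layers: Optional[int] = None,
    n_ctx: Optional[int] = None,
    n_threads: Optional[int] = None,
    use_mlock: bool = False,
) -> List[str]:
    """Build an argument list for `inferno serve`, omitting options left unset."""
    cmd = ["inferno", "serve", model]
    options = (
        ("--n_gpu_layers", n_gpu_layers),
        ("--n_ctx", n_ctx),
        ("--n_threads", n_threads),
    )
    for flag, value in options:
        if value is not None:
            cmd += [flag, str(value)]
    if use_mlock:
        cmd.append("--use_mlock")
    return cmd


print(" ".join(build_serve_command("HAI3-raw-Q4_K_M-GGUF", n_gpu_layers=32, n_ctx=8192)))
```

The resulting list can be passed to `subprocess.Popen` to start the server without going through a shell.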

🤝 Contributing

Contributions are welcome! If you'd like to contribute to Inferno, please follow these steps:

  1. Fork the repository
  2. Create a new branch for your feature or bug fix
  3. Make your changes and commit them with descriptive messages
  4. Push your branch to your forked repository
  5. Submit a pull request to the main repository

📄 License

This project is licensed under the HelpingAI Open Source License - a custom license that promotes open innovation and collaboration while ensuring responsible and ethical use of AI technology.


Made with ❤️ by HelpingAI
