

Project description

pyllama-server

A Python wrapper for the llama.cpp server: an OpenAI API compatible backend with automatic device detection and model downloading.

Features

  • Auto Device Detection: Automatically detects and configures the best GPU backend (Vulkan, CUDA, ROCm, Metal)
  • Model Downloading: Download GGUF models from HuggingFace and ModelScope
  • OpenAI API Compatible: Drop-in replacement for the OpenAI API
  • Function Calling: Support for tools and function calling
  • Pre-built Binaries: Automatically downloads pre-built llama.cpp binaries

Installation

pip install pyllama-server
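
To confirm the install and see which hardware pyllama can use, you can run the device detector from Python (the same API shown under Auto Device Detection below):

# Minimal post-install check: list detected devices.
from pyllama import DeviceDetector

detector = DeviceDetector()
for device in detector.detect():
    print(device.name, device.backend.value)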

Quick Start

Command Line

# List available GPUs
pyllama devices

# Run inference with a model
pyllama run ./model.gguf -p "Hello, world!"

# Download a model
pyllama download llama-3.2-3b -q Q4_K_M

# Start an OpenAI-compatible server
pyllama serve llama-3.2-3b -p 8080
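
Because the server is OpenAI API compatible, any standard OpenAI client should be able to talk to it once pyllama serve is running. A sketch using the official openai Python package, assuming the server started above is listening on localhost:8080 and exposes the usual /v1 routes without requiring an API key:

# Sketch: query the local pyllama server with the standard openai client.
# Assumes "pyllama serve llama-3.2-3b -p 8080" is already running.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="llama",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)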

Python API

from pyllama import quick_run, quick_server, Client

# Quick inference
result = quick_run("llama-3.2-3b", "Write a haiku about coding")
print(result)

# Start server with auto-configuration
with quick_server("llama-3.2-3b") as server:
    client = Client(server.base_url)
    
    response = client.chat.completions.create(
        model="llama",
        messages=[{"role": "user", "content": "Hello!"}]
    )
    print(response.choices[0]["message"]["content"])

Auto Device Detection

from pyllama import DeviceDetector, AutoRunner

# Detect available GPUs
detector = DeviceDetector()
devices = detector.detect()
for device in devices:
    print(f"{device.name} ({device.backend.value}, {device.memory_gb:.1f}GB)")

# Get optimal configuration
config = detector.get_best_device(model_size_gb=5.0)
print(f"Best device: {config.device.name}")
print(f"Recommended GPU layers: {config.n_gpu_layers}")

Model Download

from pyllama import ModelDownloader

downloader = ModelDownloader()

# Download from HuggingFace
path = downloader.download("Qwen/Qwen2.5-7B-Instruct-GGUF", "Q4_K_M.gguf")

# Download from ModelScope
path = downloader.download(
    "LLM-Research/Meta-Llama-3-8B-Instruct-GGUF",
    "Meta-Llama-3-8B-Instruct-Q4_K_M.gguf",
    source="modelscope"
)
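
The download call returns a local path, which can be handed straight to the server. A minimal sketch combining the pieces above, assuming the returned value is a filesystem path accepted by LlamaServer:

# Sketch: download a model, then serve and query it.
from pyllama import ModelDownloader, LlamaServer, Client

path = ModelDownloader().download("Qwen/Qwen2.5-7B-Instruct-GGUF", "Q4_K_M.gguf")

with LlamaServer(str(path)) as server:
    client = Client(server.base_url)
    response = client.chat.completions.create(
        model="llama",
        messages=[{"role": "user", "content": "Summarize the GGUF format in one sentence."}],
    )
    print(response.choices[0]["message"]["content"])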

CLI Commands

Command                     Description
pyllama serve               Start OpenAI-compatible server
pyllama run                 Run inference with a model
pyllama chat                Interactive chat with a model
pyllama download            Download a model
pyllama models              List available models
pyllama devices             List GPU devices
pyllama config              Show optimal configuration
pyllama download-binaries   Download pre-built binaries
pyllama build               Build binaries from source
pyllama clear-cache         Clear model/binary cache

Popular Models

Name                  Description       Sizes
llama-3.2-3b          Llama 3.2 3B      Q4_K_M, Q5_K_M, Q8_0
llama-3.1-8b          Llama 3.1 8B      Q4_K_M, Q5_K_M, Q8_0
qwen2.5-7b            Qwen 2.5 7B       Q4_K_M, Q5_K_M, Q8_0
gemma-2-9b            Gemma 2 9B        Q4_K_M, Q5_K_M, Q8_0
mistral-7b            Mistral 7B v0.3   Q4_K_M, Q5_K_M, Q8_0
phi-3.5-mini          Phi 3.5 Mini      Q4_K_M, Q5_K_M, Q8_0
deepseek-coder-6.7b   DeepSeek Coder    Q4_K_M, Q5_K_M, Q8_0

GPU Backends

Backend   Platforms        Description
Vulkan    Windows, Linux   Cross-platform GPU API, works on AMD, Intel, NVIDIA
CUDA      Windows, Linux   NVIDIA GPUs
ROCm      Linux            AMD GPUs
Metal     macOS            Apple Silicon
CPU       All              Fallback, no GPU required

Function Calling

from pyllama import Client, LlamaServer

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get weather for a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string"}
            }
        }
    }
}]

with LlamaServer("model.gguf") as server:
    client = Client(server.base_url)
    
    response = client.chat.completions.create(
        model="llama",
        messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
        tools=tools
    )
    
    if response.choices[0]["message"].get("tool_calls"):
        print("Model wants to call:", response.choices[0]["message"]["tool_calls"])

Requirements

  • Python 3.8+
  • Vulkan SDK (for Vulkan backend)
  • CUDA Toolkit (for CUDA backend)
  • ROCm (for ROCm backend)

License

MIT License - same as llama.cpp

Download files

Download the file for your platform.

Source Distribution

pyllama_server-0.1.0.tar.gz (30.9 kB)


Built Distribution


pyllama_server-0.1.0-py3-none-any.whl (32.9 kB)


File details

Details for the file pyllama_server-0.1.0.tar.gz.

File metadata

  • Download URL: pyllama_server-0.1.0.tar.gz
  • Upload date:
  • Size: 30.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for pyllama_server-0.1.0.tar.gz

Algorithm     Hash digest
SHA256        ed06fe2c266e616307538f24a521a9511248240eb0011c5eca3a43dc247e9eb1
MD5           185165cd8d1b2d5ae44e42b697c6fd45
BLAKE2b-256   b01fa9674c87297dc1124e93e71c4767d1dd5e2d31ed17a3905bc57f4d74630c


File details

Details for the file pyllama_server-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: pyllama_server-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 32.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for pyllama_server-0.1.0-py3-none-any.whl

Algorithm     Hash digest
SHA256        f2aad7d1595b025b852a06844d5f189349cc01c23c96786c4f399373cf4d8132
MD5           2de6ea7c2b168faaf283dd29085ac2fe
BLAKE2b-256   378dbc63a0ead4287447f7c36edae846c9d6b217fe9b8d98d119ec1afaa25ba6

