
llama-cpp-py


Python wrapper for running the llama.cpp server with automatic or manual binary management.
It runs the server in a separate subprocess and supports both synchronous and asynchronous APIs.

Requirements

Python 3.10 or higher.

Installation

From PyPI

pip install llama-cpp-py

From source

git clone https://github.com/sergey21000/llama-cpp-py
cd llama-cpp-py
pip install -e .

Using UV

uv pip install llama-cpp-py

Quick Start

More examples are available in the Google Colab notebook.

1. Set up environment file for llama.cpp

Create a .llama.env file with variables for the llama.cpp server:

# download example env file
wget https://github.com/sergey21000/llama-cpp-py/raw/main/.llama.env

# or create manually
nano .llama.env

See example .llama.env
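For orientation, here is a minimal sketch of what such a file might contain and how KEY=VALUE lines end up as settings. The variable names are llama.cpp server environment variables; the values are illustrative assumptions, and `parse_env_text` is a hypothetical stdlib-only stand-in for what python-dotenv's `load_dotenv` does in the examples below.

```python
SAMPLE_ENV = """\
# model to download and serve (illustrative URL)
LLAMA_ARG_MODEL_URL=https://huggingface.co/bartowski/google_gemma-3-4b-it-GGUF/resolve/main/google_gemma-3-4b-it-Q4_K_S.gguf
# context size, GPU layers and port are example values, not recommendations
LLAMA_ARG_CTX_SIZE=4096
LLAMA_ARG_N_GPU_LAYERS=0
LLAMA_ARG_PORT=8080
"""


def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith('#') or '=' not in line:
            continue
        key, _, value = line.partition('=')
        # drop surrounding quotes, as in LLAMACPP_DIR="/some/path"
        values[key.strip()] = value.strip().strip('"\'')
    return values


print(parse_env_text(SAMPLE_ENV)['LLAMA_ARG_PORT'])  # 8080
```

This sketch does not handle inline comments or multi-line values; use python-dotenv for real files.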

2. Launch the server and send requests

Launching a synchronous server based on the latest llama.cpp release version

import os
from dotenv import load_dotenv
from openai import OpenAI
from llama_cpp_py import LlamaSyncServer


# environment variables for llama.cpp
load_dotenv(dotenv_path='.llama.env')

# auto-download the latest release and start the server
# set verbose=True to display server logs
server = LlamaSyncServer()
server.start(verbose=True)


# sending requests with OpenAI client
client = OpenAI(
    base_url=server.server_url + '/v1',
    api_key='sk-no-key-required',
)
response = client.chat.completions.create(
    model='local',
    messages=[{'role': 'user', 'content': 'Hello!'}]
)
print(response.choices[0].message.content)

# stopping the server
server.stop()

Launching an asynchronous server based on a specific release version

import os
import asyncio
from openai import AsyncOpenAI
from dotenv import load_dotenv
from llama_cpp_py import LlamaAsyncServer, LlamaReleaseManager


# environment variables for llama.cpp
load_dotenv(dotenv_path='.llama.env')

# a) download the release with a specific tag, preferring builds with 'cuda' in the file name
# set tag='latest' to use the latest llama.cpp release version
# optionally specify priority_patterns to prefer certain builds (e.g. 'cuda' or 'cpu')
release_manager = LlamaReleaseManager(tag='b6780', priority_patterns=['cuda'])

# b) or set a specific release url in zip format
# release_manager = LlamaReleaseManager(
#     release_zip_url='https://github.com/ggml-org/llama.cpp/releases/download/b6780/llama-b6780-bin-win-cuda-12.4-x64.zip'
# )

# c) or point to a directory that contains compiled llama.cpp binaries
# release_manager = LlamaReleaseManager(release_dir='/content/llama.cpp/build/bin')

async def main():
    # start llama.cpp server (set verbose=True to display server logs)
    llama_server = LlamaAsyncServer(verbose=False, release_manager=release_manager)
    await llama_server.start()

    # sending requests with OpenAI client
    client = AsyncOpenAI(
        base_url=f'{llama_server.server_url}/v1',
        api_key='sk-no-key-required',
    )
    stream_response = await client.chat.completions.create(
        model='local',
        messages=[{'role': 'user', 'content': 'How are you?'}],
        stream=True,
        temperature=0.8,
        max_tokens=-1,
        extra_body=dict(
            top_k=40,
            reasoning_format='none',
            chat_template_kwargs=dict(
                enable_thinking=True,
            ),
        ),
    )
    full_response = ''
    async for chunk in stream_response:
        if (token := chunk.choices[0].delta.content) is not None:
            full_response += token
            print(token, end='', flush=True)

    # stopping the server
    await llama_server.stop()

if __name__ == '__main__':
    asyncio.run(main())

Using a context manager

import asyncio
import os
from openai import AsyncOpenAI
from llama_cpp_py import LlamaAsyncServer

os.environ['LLAMA_ARG_MODEL_URL'] = 'https://huggingface.co/bartowski/google_gemma-3-4b-it-GGUF/resolve/main/google_gemma-3-4b-it-Q4_K_S.gguf'

async def main():
    # `async with` must run inside a coroutine when used in a script
    async with LlamaAsyncServer() as server:
        client = AsyncOpenAI(
            base_url=f'{server.server_url}/v1',
            api_key='sk-no-key-required',
        )
        stream_response = await client.chat.completions.create(
            model='local',
            messages=[{'role': 'user', 'content': 'Hello!'}],
            stream=True,
        )
        full_response = ''
        async for chunk in stream_response:
            if (token := chunk.choices[0].delta.content) is not None:
                full_response += token
                print(token, end='', flush=True)

if __name__ == '__main__':
    asyncio.run(main())
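The streaming loops in the examples above can be factored into a small helper. The sketch below is one possible shape, not part of llama-cpp-py: `collect_stream` is a hypothetical name, and the fake stream exists only so the helper can be tried without a running server. It also skips chunks that carry no choices, which the inline loops do not guard against.

```python
import asyncio
from types import SimpleNamespace


async def collect_stream(stream) -> str:
    """Concatenate delta.content tokens, skipping chunks without text."""
    parts = []
    async for chunk in stream:
        if not chunk.choices:  # a chunk may arrive with an empty choices list
            continue
        token = chunk.choices[0].delta.content
        if token is not None:
            parts.append(token)
    return ''.join(parts)


def _chunk(content):
    """Build a stand-in object shaped like a streamed chat completion chunk."""
    delta = SimpleNamespace(content=content)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


async def fake_stream():
    """Hypothetical stream used only to exercise collect_stream offline."""
    for item in (_chunk('Hel'), _chunk(None), SimpleNamespace(choices=[]), _chunk('lo!')):
        yield item


print(asyncio.run(collect_stream(fake_stream())))  # Hello!
```

With a real server, the object returned by client.chat.completions.create(..., stream=True) can be passed in place of fake_stream().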

Environment Variables

Environment variables for llama-cpp-py

Note: Function arguments override environment variables. For example,

server = LlamaSyncServer(llama_dir='/path/bin')

takes precedence over the LLAMACPP_DIR variable.

# Server startup wait timeout in seconds.
# Increase if model loading takes a long time.
# (default: 300)
LLAMACPP_SERVER_TIMEOUT_WAIT=900

# llama.cpp release tag. If set to "latest", the most recent release will be downloaded.
# (default: "latest")
LLAMACPP_RELEASE_TAG=b7806

# Direct download link to the archive from the llama.cpp releases page.
# Takes higher priority than LLAMACPP_RELEASE_TAG.
# (default: "")
LLAMACPP_RELEASE_ZIP_URL=https://github.com/ggml-org/llama.cpp/releases/download/b7806/llama-b7806-bin-win-cuda-13.1-x64.zip

# Path to a precompiled llama.cpp directory.
# Takes the highest priority, overriding LLAMACPP_RELEASE_TAG and LLAMACPP_RELEASE_ZIP_URL.
# (default: "")
LLAMACPP_DIR="/content/llama.cpp/build/bin"

# Logging level for llama-cpp-py (uses loguru, default INFO).
# A separate global setup via logger.add() also works.
# (default: "")
LLAMACPP_LOG_LEVEL=DEBUG

# or set global loguru level
LOGURU_LEVEL=WARNING
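The precedence rules described above can be summarised in a short sketch. It only mirrors the documented order (explicit argument over environment variable, and LLAMACPP_DIR over LLAMACPP_RELEASE_ZIP_URL over LLAMACPP_RELEASE_TAG); `setting` and `release_source` are hypothetical helpers, not functions exported by llama-cpp-py, and how the library combines an explicit argument for one option with an environment variable for another is not covered here.

```python
import os


def setting(arg, env_name: str, default, env=os.environ):
    """Return the explicit argument if given, else the env value, else default."""
    if arg is not None:
        return arg
    return env.get(env_name) or default


def release_source(env=os.environ) -> tuple[str, str]:
    """Pick the release source from env vars in the documented priority order."""
    if env.get('LLAMACPP_DIR'):
        return ('dir', env['LLAMACPP_DIR'])
    if env.get('LLAMACPP_RELEASE_ZIP_URL'):
        return ('zip_url', env['LLAMACPP_RELEASE_ZIP_URL'])
    return ('tag', env.get('LLAMACPP_RELEASE_TAG') or 'latest')


example_env = {
    'LLAMACPP_RELEASE_TAG': 'b7806',
    'LLAMACPP_DIR': '/content/llama.cpp/build/bin',
}
print(release_source(example_env))  # ('dir', '/content/llama.cpp/build/bin')
print(setting('/path/bin', 'LLAMACPP_DIR', '', example_env))  # /path/bin
```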

Troubleshooting

If the server fails to start or behaves unexpectedly, check the following:

  • Check that the model path or URL in .llama.env is correct
  • Verify that the port is not already in use
  • Try setting verbose=True to see server logs
llama_server = LlamaAsyncServer(verbose=True)
  • Try passing a direct URL to a llama.cpp release archive
LlamaReleaseManager(release_zip_url=url)
  • Or use the path to the directory with the pre-compiled llama.cpp
LlamaReleaseManager(release_dir=path_to_binaries)
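To check whether the port is already taken, a quick probe with the standard library is usually enough. The sketch below assumes the server listens on 127.0.0.1 and uses 8080 as an example port (the llama.cpp server default unless LLAMA_ARG_PORT is set); `is_port_in_use` is a hypothetical helper, not part of llama-cpp-py.

```python
import socket


def is_port_in_use(port: int, host: str = '127.0.0.1') -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception
        return sock.connect_ex((host, port)) == 0


# example port; adjust to the value you pass to the server
print(is_port_in_use(8080))
```

If this prints True before you start the server, pick another port or stop the process that holds it.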

If the model is being downloaded from a URL and the server times out before it finishes loading, you can:

  • Increase the startup timeout by setting the environment variable
import os
os.environ['LLAMACPP_SERVER_TIMEOUT_WAIT'] = '600'  # default 300; values must be strings

(value is in seconds), or

  • Pre-download the model manually and set its local path in LLAMA_ARG_MODEL
import os
os.environ['LLAMA_ARG_MODEL'] = r'C:\path\to\model.gguf'  # raw string keeps backslashes literal
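To see why the startup timeout matters, it helps to picture the readiness check as a polling loop. The following is a sketch under assumptions, not the library's code: `wait_until_ready` and `health_probe` are hypothetical names, and the probe relies on the llama.cpp server's /health endpoint returning HTTP 200 once the model is loaded.

```python
import time
import urllib.error
import urllib.request


def health_probe(base_url: str) -> bool:
    """Return True if GET {base_url}/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f'{base_url}/health', timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # connection refused, or a non-2xx status while the model is loading
        return False


def wait_until_ready(probe, timeout: float, interval: float = 1.0) -> bool:
    """Call probe() until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval)
    return False


# usage against a locally running server (address is an example):
# wait_until_ready(lambda: health_probe('http://127.0.0.1:8080'), timeout=600)
```

A large model downloaded over a slow connection can easily take longer than the default wait, which is why raising the timeout or pre-downloading the model resolves the issue.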

llama.cpp binary releases are downloaded to:

  • Windows
%LOCALAPPDATA%\llama-cpp-py\releases
  • Linux
~/.local/share/llama-cpp-py/releases
  • macOS
~/Library/Application Support/llama-cpp-py/releases

See platformdirs example output
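If you need to locate or clean the download directory from a script, the paths listed above can be rebuilt with the standard library. This is a sketch that simply reproduces those three paths; `releases_dir` is a hypothetical helper, and it ignores overrides such as XDG_DATA_HOME that platformdirs itself honours.

```python
from pathlib import PurePosixPath, PureWindowsPath


def releases_dir(platform: str, home: str, local_appdata: str = '') -> str:
    """Build the releases path listed above for a given sys.platform value."""
    if platform == 'win32':
        return str(PureWindowsPath(local_appdata) / 'llama-cpp-py' / 'releases')
    if platform == 'darwin':
        base = PurePosixPath(home) / 'Library' / 'Application Support'
        return str(base / 'llama-cpp-py' / 'releases')
    # Linux and other Unix-like systems
    return str(PurePosixPath(home) / '.local' / 'share' / 'llama-cpp-py' / 'releases')


print(releases_dir('linux', '/home/user'))  # /home/user/.local/share/llama-cpp-py/releases
```

In real code, pass sys.platform, the user's home directory, and the value of the LOCALAPPDATA environment variable on Windows.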

Dependencies

  • aiohttp - Asynchronous HTTP client, used to check llama.cpp server readiness and interact with the server in async mode.
  • requests - Synchronous HTTP client, used to check llama.cpp server readiness and interact with the server in sync mode.
  • tqdm - Progress bar utility, used to display download progress when fetching llama.cpp releases.
  • openai-python - OpenAI-compatible client, used to provide an OpenAI-style API interface for the server.
  • python-dotenv - Environment variable loader, used for configuration via .env files.
  • platformdirs - Cross-platform directory management, used to determine cache and data storage locations.
  • pillow - Image processing library, used for multimodal (vision) input support.
  • loguru - Logging library, used for llama-cpp-py log output.
