
llama-cpp-py


Python wrapper for running the llama.cpp server with automatic or manual binary management.
Runs the server in a separate subprocess supporting both synchronous and asynchronous APIs.

Requirements

Python 3.10 or higher.

Installation

From PyPI

pip install llama-cpp-py

From source

git clone https://github.com/sergey21000/llama-cpp-py
cd llama-cpp-py
pip install -e .

Using uv

uv pip install llama-cpp-py

Quick Start

More examples are available in the Google Colab notebook.

1. Set up environment file for llama.cpp

Create an .llama.env file with the environment variables for the llama.cpp server:

# download example env file
wget https://github.com/sergey21000/llama-cpp-py/raw/main/.llama.env

# or create manually
nano .llama.env

See the example .llama.env in the repository.
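A minimal .llama.env might look like the following. LLAMA_ARG_MODEL_URL is the variable used later in this README; the other LLAMA_ARG_* names and values below are shown only as an illustration of llama.cpp server argument variables, not as required settings:

```
# model to download and serve
LLAMA_ARG_MODEL_URL=https://huggingface.co/bartowski/google_gemma-3-4b-it-GGUF/resolve/main/google_gemma-3-4b-it-Q4_K_S.gguf

# server port and context size (illustrative values)
LLAMA_ARG_PORT=8080
LLAMA_ARG_CTX_SIZE=4096
```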

2. Launch the server and send requests

Launching a synchronous server based on the latest llama.cpp release version

from dotenv import load_dotenv
from openai import OpenAI
from llama_cpp_py import LlamaSyncServer


# environment variables for llama.cpp
load_dotenv(dotenv_path='.llama.env')

# auto-download the latest release and start the server
# set verbose=True to display server logs
server = LlamaSyncServer()
server.start(verbose=True)


# sending requests with OpenAI client
client = OpenAI(
    base_url=server.server_url + '/v1',
    api_key='sk-no-key-required',
)
response = client.chat.completions.create(
    model='local',
    messages=[{'role': 'user', 'content': 'Hello!'}]
)

# stopping the server
server.stop()

Launching an asynchronous server based on a specific release version

import asyncio
from dotenv import load_dotenv
from openai import AsyncOpenAI
from llama_cpp_py import LlamaAsyncServer, LlamaReleaseManager


# environment variables for llama.cpp
load_dotenv(dotenv_path='.llama.env')

# a) download a release by a specific tag with the 'cuda' priority in the title
# set tag='latest' to use the latest llama.cpp release version
# optionally specify priority_patterns to prefer certain builds (e.g. 'cuda' or 'cpu')
release_manager = LlamaReleaseManager(tag='b6780', priority_patterns=['cuda'])

# b) or set a specific release url in zip format
# release_manager = LlamaReleaseManager(
#     release_zip_url='https://github.com/ggml-org/llama.cpp/releases/download/b6780/llama-b6780-bin-win-cuda-12.4-x64.zip'
# )

# c) or use a precompiled llama.cpp build directory
# release_manager = LlamaReleaseManager(release_dir='/content/llama.cpp/build/bin')

async def main():
    # start llama.cpp server (set verbose=True to display server logs)
    llama_server = LlamaAsyncServer(verbose=False, release_manager=release_manager)
    await llama_server.start()

    # sending requests with OpenAI client
    client = AsyncOpenAI(
        base_url=f'{llama_server.server_url}/v1',
        api_key='sk-no-key-required',
    )
    stream_response = await client.chat.completions.create(
        model='local',
        messages=[{'role': 'user', 'content': 'How are you?'}],
        stream=True,
        temperature=0.8,
        max_tokens=-1,
        extra_body=dict(
            top_k=40,
            reasoning_format='none',
            chat_template_kwargs=dict(
                enable_thinking=True,
            ),
        ),
    )
    full_response = ''
    async for chunk in stream_response:
        if (token := chunk.choices[0].delta.content) is not None:
            full_response += token
            print(token, end='', flush=True)

    # stopping the server
    await llama_server.stop()

if __name__ == '__main__':
    asyncio.run(main())
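The streaming loop above accumulates per-chunk text deltas, skipping the None delta that arrives with the final stop chunk. A minimal stdlib-only sketch of that pattern, with a hypothetical fake_stream standing in for the OpenAI streaming response:

```python
import asyncio


async def fake_stream():
    # Hypothetical stand-in for the streamed chat completion: real chunks carry
    # their text in chunk.choices[0].delta.content, with None on the stop chunk.
    for token in ['Hel', 'lo', '!', None]:
        yield token


async def collect(stream) -> str:
    # Same accumulation pattern as the loop above: skip None deltas, join the rest
    full_response = ''
    async for token in stream:
        if token is not None:
            full_response += token
    return full_response


print(asyncio.run(collect(fake_stream())))  # prints Hello!
```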

Using a context manager

import os
import asyncio
from openai import AsyncOpenAI
from llama_cpp_py import LlamaAsyncServer

os.environ['LLAMA_ARG_MODEL_URL'] = 'https://huggingface.co/bartowski/google_gemma-3-4b-it-GGUF/resolve/main/google_gemma-3-4b-it-Q4_K_S.gguf'


async def main():
    async with LlamaAsyncServer() as server:
        client = AsyncOpenAI(
            base_url=f'{server.server_url}/v1',
            api_key='sk-no-key-required',
        )
        stream_response = await client.chat.completions.create(
            model='local',
            messages=[{'role': 'user', 'content': 'Hello!'}],
            stream=True,
        )
        full_response = ''
        async for chunk in stream_response:
            if (token := chunk.choices[0].delta.content) is not None:
                full_response += token
                print(token, end='', flush=True)

if __name__ == '__main__':
    asyncio.run(main())
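The context manager wraps the start/stop lifecycle shown in the earlier examples. A hypothetical mock (FakeServer is not part of the library) illustrating the protocol the async context manager implies, with __aenter__ corresponding to start() and __aexit__ to stop():

```python
import asyncio


class FakeServer:
    # Hypothetical mock of the lifecycle: start on enter, stop on exit,
    # even if the body raises an exception.
    def __init__(self):
        self.running = False

    async def __aenter__(self):
        self.running = True   # corresponds to server.start()
        return self

    async def __aexit__(self, exc_type, exc, tb):
        self.running = False  # corresponds to server.stop()


async def demo():
    server = FakeServer()
    async with server:
        assert server.running  # server is up inside the block
    return server.running      # stopped after the block


print(asyncio.run(demo()))  # prints False
```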

Environment Variables

Environment variables for llama-cpp-py

[!NOTE] Function arguments override environment variables. For example:

server = LlamaSyncServer(llama_dir='/path/bin')

will take precedence over the LLAMACPP_DIR environment variable.

# Server startup wait timeout in seconds.
# Increase if model loading takes a long time.
# (default: 300)
LLAMACPP_SERVER_TIMEOUT_WAIT=900

# llama.cpp release tag. If set to "latest", the most recent release will be downloaded.
# (default: "latest")
LLAMACPP_RELEASE_TAG=b7806

# Direct download link to the archive from the llama.cpp releases page.
# Takes higher priority than LLAMACPP_RELEASE_TAG.
# (default: "")
LLAMACPP_RELEASE_ZIP_URL=https://github.com/ggml-org/llama.cpp/releases/download/b7806/llama-b7806-bin-win-cuda-13.1-x64.zip

# Path to a precompiled llama.cpp directory.
# Takes the highest priority, overriding LLAMACPP_RELEASE_TAG and LLAMACPP_RELEASE_ZIP_URL.
# (default: "")
LLAMACPP_DIR="/content/llama.cpp/build/bin"

# Logging level for llama-cpp-py (uses loguru, default INFO).
# A separate global setup via logger.add() also works.
# (default: "")
LLAMACPP_LOG_LEVEL=DEBUG

# or set global loguru level
LOGURU_LEVEL=WARNING

Troubleshooting

If the server fails to start or behaves unexpectedly, check the following:

  • Check that the model path or URL in .llama.env is correct
  • Verify that the port is not already in use
  • Set verbose=True to see the server logs:
    llama_server = LlamaAsyncServer(verbose=True)
  • Try passing a direct release archive URL:
    LlamaReleaseManager(release_zip_url=url)
  • Or use the path to a directory with a pre-compiled llama.cpp:
    LlamaReleaseManager(release_dir=path_to_binaries)
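To check whether a port is already in use, a quick stdlib probe can help (port_in_use is a hypothetical helper, not part of the library; 8080 is only an example port):

```python
import socket


def port_in_use(port: int, host: str = '127.0.0.1') -> bool:
    # True if something is already listening on host:port
    # (e.g. another llama.cpp server instance)
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        return s.connect_ex((host, port)) == 0


if port_in_use(8080):
    print('Port 8080 is busy; choose another port for the server')
```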

If the model is being downloaded from a URL and the server times out before it finishes loading, you can:

  • Increase the startup timeout (in seconds) by setting the environment variable
import os
os.environ['LLAMACPP_SERVER_TIMEOUT_WAIT'] = '600'  # default 300; os.environ values must be strings

or

  • Pre-download the model manually and set its local path:
import os
os.environ['LLAMA_ARG_MODEL'] = r'C:\path\to\model.gguf'  # raw string so backslashes are not treated as escapes

llama.cpp binary releases are downloaded to:

  • Windows
%LOCALAPPDATA%\llama-cpp-py\releases
  • Linux
~/.local/share/llama-cpp-py/releases
  • MacOS
~/Library/Application Support/llama-cpp-py/releases
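The package resolves these locations with platformdirs; a stdlib-only approximation of the same per-platform logic (releases_dir is a hypothetical helper written for this sketch):

```python
import os
import sys
from pathlib import Path


def releases_dir(app_name: str = 'llama-cpp-py') -> Path:
    # Approximates the platformdirs user data dir listed above
    if sys.platform == 'win32':
        base = Path(os.environ.get('LOCALAPPDATA',
                                   str(Path.home() / 'AppData' / 'Local')))
    elif sys.platform == 'darwin':
        base = Path.home() / 'Library' / 'Application Support'
    else:
        base = Path.home() / '.local' / 'share'
    return base / app_name / 'releases'


print(releases_dir())
```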

See the platformdirs example output.

Dependencies

  • aiohttp - Asynchronous HTTP client, used to check llama.cpp server readiness and interact with the server in async mode.
  • requests - Synchronous HTTP client, used to check llama.cpp server readiness and interact with the server in sync mode.
  • tqdm - Progress bar utility, used to display download progress when fetching llama.cpp releases.
  • openai-python - OpenAI-compatible client, used to provide an OpenAI-style API interface for the server.
  • python-dotenv - Environment variable loader, used for configuration via .env files.
  • platformdirs - Cross-platform directory management, used to determine cache and data storage locations.
  • pillow - Image processing library, used for multimodal (vision) input support.
  • loguru - Logging library, used for llama-cpp-py internal logging.
