Python wrapper for running the llama.cpp server

These details have not been verified by PyPI

Project links

Repository

Project description

llama-cpp-py

Python wrapper for running the llama.cpp server with automatic or manual binary management.
Runs the server in a separate subprocess and supports both synchronous and asynchronous APIs.

Requirements

Python 3.10 or higher.

Installation

From PyPI

pip install llama-cpp-py

From source

git clone https://github.com/sergey21000/llama-cpp-py
cd llama-cpp-py
pip install -e .

Using UV

uv pip install llama-cpp-py
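After installing, it can help to confirm that the interpreter meets the Python 3.10 requirement and that the distribution is visible to it. The following is a minimal sketch using only the standard library; the helper names `python_is_supported` and `installed_version` are made up for this example and are not part of llama-cpp-py.

```python
import sys
from importlib import metadata

# Minimum interpreter version, taken from the Requirements section.
MIN_PYTHON = (3, 10)


def python_is_supported(version_info=sys.version_info):
    """Return True if the given version tuple is at least MIN_PYTHON."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def installed_version(distribution='llama-cpp-py'):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


if __name__ == '__main__':
    print('Python OK:', python_is_supported())
    print('llama-cpp-py version:', installed_version() or 'not installed')
```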

Quick Start

More examples in the Google Colab notebook

1. Set up environment file for llama.cpp

Create a .llama.env file with variables for the llama.cpp server

# download example env file
wget https://github.com/sergey21000/llama-cpp-py/raw/main/.llama.env

# or create manually
nano .llama.env

See example .llama.env
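The file can also be generated from Python. The sketch below is illustrative only: `write_env_file` is a hypothetical helper, and the variable set is an assumption rather than the contents of the project's example file. `LLAMA_ARG_MODEL_URL` appears later in this README, while `LLAMA_ARG_CTX_SIZE` and `LLAMA_ARG_PORT` are llama.cpp server variables chosen here for illustration.

```python
from pathlib import Path


def write_env_file(path, variables):
    """Write KEY=VALUE lines, one per variable, and return the file path."""
    lines = [f'{key}={value}' for key, value in variables.items()]
    target = Path(path)
    target.write_text('\n'.join(lines) + '\n', encoding='utf-8')
    return target


# Illustrative values; adjust them to your model and hardware.
example_variables = {
    'LLAMA_ARG_MODEL_URL': 'https://huggingface.co/bartowski/google_gemma-3-4b-it-GGUF/resolve/main/google_gemma-3-4b-it-Q4_K_S.gguf',
    'LLAMA_ARG_CTX_SIZE': 4096,
    'LLAMA_ARG_PORT': 8080,
}
```

Calling `write_env_file('.llama.env', example_variables)` would then produce a file that `load_dotenv(dotenv_path='.llama.env')` can read.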

2. Launch the server and send requests

Launching a synchronous server based on the latest llama.cpp release version

from dotenv import load_dotenv
from openai import OpenAI
from llama_cpp_py import LlamaSyncServer


# load environment variables for llama.cpp
load_dotenv(dotenv_path='.llama.env')

# auto-download the latest release and start the server
# set verbose=True to display server logs
server = LlamaSyncServer()
server.start(verbose=True)


# send a request with the OpenAI client
client = OpenAI(
    base_url=server.server_url + '/v1',
    api_key='sk-no-key-required',
)
response = client.chat.completions.create(
    model='local',
    messages=[{'role': 'user', 'content': 'Hello!'}],
)
print(response.choices[0].message.content)

# stop the server
server.stop()
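Since the server runs in a separate subprocess, `start()` and `stop()` can be pictured as managing a child process. The sketch below shows that general pattern with the standard library only. It is not the library's implementation: `ChildProcess` is a hypothetical class, and a sleeping Python interpreter stands in for the llama.cpp server binary.

```python
import subprocess
import sys


class ChildProcess:
    """Start a command as a child process and stop it on request."""

    def __init__(self, command):
        self.command = command
        self.process = None

    def start(self):
        # Output is discarded here; a verbose mode could inherit it instead.
        self.process = subprocess.Popen(
            self.command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )

    def stop(self, timeout=5.0):
        """Terminate the child, force-kill it if it does not exit in time."""
        if self.process is None:
            return None
        self.process.terminate()
        try:
            self.process.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            self.process.kill()
            self.process.wait()
        exit_code = self.process.returncode
        self.process = None
        return exit_code


# Stand-in command: a Python process that just sleeps.
child = ChildProcess([sys.executable, '-c', 'import time; time.sleep(60)'])
child.start()
running = child.process.poll() is None  # None means "still running"
exit_code = child.stop()
```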

Launching an asynchronous server based on a specific release version

import os
import asyncio
from openai import AsyncOpenAI
from dotenv import load_dotenv
from llama_cpp_py import LlamaAsyncServer, LlamaReleaseManager


# environment variables for llama.cpp
load_dotenv(dotenv_path='.llama.env')

# a) download the release with a specific tag, preferring builds with 'cuda' in the archive name
# set tag='latest' to use the latest llama.cpp release version
# optionally specify priority_patterns to prefer certain builds (e.g. 'cuda' or 'cpu')
release_manager = LlamaReleaseManager(tag='b6780', priority_patterns=['cuda'])

# b) or set a specific release url in zip format
# release_manager = LlamaReleaseManager(
#     release_zip_url='https://github.com/ggml-org/llama.cpp/releases/download/b6780/llama-b6780-bin-win-cuda-12.4-x64.zip'
# )

# c) or point to a directory with compiled llama.cpp binaries
# release_manager = LlamaReleaseManager(release_dir='/content/llama.cpp/build/bin')

async def main():
    # start llama.cpp server (set verbose=True to display server logs)
    llama_server = LlamaAsyncServer(verbose=False, release_manager=release_manager)
    await llama_server.start()

    # sending requests with OpenAI client
    client = AsyncOpenAI(
        base_url=f'{llama_server.server_url}/v1',
        api_key='sk-no-key-required',
    )
    stream_response = await client.chat.completions.create(
        model='local',
        messages=[{'role': 'user', 'content': 'How are you?'}],
        stream=True,
        temperature=0.8,
        max_tokens=-1,
        extra_body=dict(
            top_k=40,
            reasoning_format='none',
            chat_template_kwargs=dict(
                enable_thinking=True,
            ),
        ),
    )
    full_response = ''
    async for chunk in stream_response:
        if (token := chunk.choices[0].delta.content) is not None:
            full_response += token
            print(token, end='', flush=True)

    # stopping the server
    await llama_server.stop()

if __name__ == '__main__':
    asyncio.run(main())
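To make the effect of `priority_patterns` more concrete, here is one way such a preference could be applied to a list of release archive names. This is a sketch under assumptions, not the selection logic of `LlamaReleaseManager`: the function `pick_asset` is hypothetical, the matching rule (case-insensitive substring, first pattern wins) is a guess, and only the CUDA archive name is taken from the example above.

```python
def pick_asset(asset_names, priority_patterns=()):
    """Return the first asset matching the earliest pattern, else the first asset."""
    for pattern in priority_patterns:
        for name in asset_names:
            # Case-insensitive substring match (an assumed rule).
            if pattern.lower() in name.lower():
                return name
    # No pattern matched: fall back to the first candidate, if any.
    return asset_names[0] if asset_names else None


# Illustrative archive names for one platform.
assets = [
    'llama-b6780-bin-win-cpu-x64.zip',
    'llama-b6780-bin-win-cuda-12.4-x64.zip',
    'llama-b6780-bin-win-vulkan-x64.zip',
]
preferred = pick_asset(assets, priority_patterns=['cuda'])
```

With several patterns, for example `['cuda', 'vulkan']`, earlier entries are tried first, so the order of the list expresses the preference.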

Using a context manager

import asyncio
import os
from openai import AsyncOpenAI
from llama_cpp_py import LlamaAsyncServer

os.environ['LLAMA_ARG_MODEL_URL'] = 'https://huggingface.co/bartowski/google_gemma-3-4b-it-GGUF/resolve/main/google_gemma-3-4b-it-Q4_K_S.gguf'


async def main():
    # async with is only valid inside a coroutine (or a notebook cell)
    async with LlamaAsyncServer() as server:
        client = AsyncOpenAI(
            base_url=f'{server.server_url}/v1',
            api_key='sk-no-key-required',
        )
        stream_response = await client.chat.completions.create(
            model='local',
            messages=[{'role': 'user', 'content': 'Hello!'}],
            stream=True,
        )
        full_response = ''
        async for chunk in stream_response:
            if (token := chunk.choices[0].delta.content) is not None:
                full_response += token
                print(token, end='', flush=True)

if __name__ == '__main__':
    asyncio.run(main())
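The value of the context manager form is that the server is stopped even if a request raises. The sketch below demonstrates that guarantee with a stand-in class; `DemoAsyncServer` is hypothetical and only mimics the start/stop shape, it does not reproduce how `LlamaAsyncServer` is implemented.

```python
import asyncio


class DemoAsyncServer:
    """Stand-in server that records its lifecycle events."""

    def __init__(self):
        self.events = []

    async def start(self):
        await asyncio.sleep(0)  # placeholder for real startup work
        self.events.append('start')

    async def stop(self):
        await asyncio.sleep(0)  # placeholder for real shutdown work
        self.events.append('stop')

    async def __aenter__(self):
        await self.start()
        return self

    async def __aexit__(self, exc_type, exc, tb):
        await self.stop()
        return False  # do not swallow exceptions from the block


async def demo():
    server = DemoAsyncServer()
    try:
        async with server:
            server.events.append('request')
            raise RuntimeError('simulated request failure')
    except RuntimeError:
        pass  # the error propagates, but stop() has already run
    return server.events


events = asyncio.run(demo())
```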

Environment Variables

Environment variables for llama-cpp-py

Note: function arguments override environment variables. For example,

server = LlamaSyncServer(llama_dir='/path/bin')

takes precedence over the LLAMACPP_DIR variable.

# Server startup wait timeout in seconds.
# Increase if model loading takes a long time.
# (default: 300)
LLAMACPP_SERVER_TIMEOUT_WAIT=900

# llama.cpp release tag. If set to "latest", the most recent release will be downloaded.
# (default: "latest")
LLAMACPP_RELEASE_TAG=b7806

# Direct download link to the archive from the llama.cpp releases page.
# Takes higher priority than LLAMACPP_RELEASE_TAG.
# (default: "")
LLAMACPP_RELEASE_ZIP_URL=https://github.com/ggml-org/llama.cpp/releases/download/b7806/llama-b7806-bin-win-cuda-13.1-x64.zip

# Path to a precompiled llama.cpp directory.
# Takes the highest priority, overriding LLAMACPP_RELEASE_TAG and LLAMACPP_RELEASE_ZIP_URL.
# (default: "")
LLAMACPP_DIR="/content/llama.cpp/build/bin"

# Logging level for llama-cpp-py (uses loguru, default INFO).
# A separate global setup via logger.add() also works.
# (default: "")
LLAMACPP_LOG_LEVEL=DEBUG

# or set global loguru level
LOGURU_LEVEL=WARNING
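The precedence rules described above (arguments over environment variables, and directory over archive URL over release tag) can be summarized in a few lines. This is a sketch of the documented rules, not the library's code: the functions `resolve_setting` and `resolve_release_source` and their parameter names are hypothetical, and it assumes each setting is first resolved from argument or environment and only then ranked.

```python
import os


def resolve_setting(argument, env_name, default='', env=None):
    """Prefer an explicit argument, then the environment, then the default."""
    env = os.environ if env is None else env
    if argument is not None:
        return argument
    return env.get(env_name, default)


def resolve_release_source(release_dir=None, release_zip_url=None, tag=None, env=None):
    """Return a (kind, value) pair describing where the binaries come from."""
    directory = resolve_setting(release_dir, 'LLAMACPP_DIR', env=env)
    zip_url = resolve_setting(release_zip_url, 'LLAMACPP_RELEASE_ZIP_URL', env=env)
    release_tag = resolve_setting(tag, 'LLAMACPP_RELEASE_TAG', default='latest', env=env)
    if directory:  # highest priority
        return ('dir', directory)
    if zip_url:  # overrides the release tag
        return ('zip_url', zip_url)
    return ('tag', release_tag)
```

For example, `resolve_release_source(env={'LLAMACPP_RELEASE_TAG': 'b7806'})` yields `('tag', 'b7806')`, while also setting `LLAMACPP_DIR` in that mapping switches the result to the directory.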

Troubleshooting

If the server fails to start or behaves unexpectedly, check the following:

Check that the model path or URL in .llama.env is correct
Verify that the port is not already in use
Try setting verbose=True to see server logs

llama_server = LlamaAsyncServer(verbose=True)

Pass a link to the llama.cpp release archive appropriate for your system via

LlamaReleaseManager(release_zip_url=url)

Or use the path to the directory with the pre-compiled llama.cpp

LlamaReleaseManager(release_dir=path_to_binaries)
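For the port check in the list above, a quick test from Python is to try binding the port locally. This is a minimal sketch with the standard library; `port_is_free` is a hypothetical helper, and the default host of `127.0.0.1` is an assumption about where the server listens.

```python
import socket


def port_is_free(port, host='127.0.0.1'):
    """Return True if a TCP socket can be bound to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "address already in use" or a permission error.
            return False
    return True
```

For instance, `port_is_free(8080)` checks the port that the llama.cpp server uses by default unless another one is configured, for example through `LLAMA_ARG_PORT`.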

If the model is being downloaded from a URL and the server times out before it finishes loading, you can:

Increase the startup timeout by setting the environment variable

import os
os.environ['LLAMACPP_SERVER_TIMEOUT_WAIT'] = '600'  # default 300; environment values must be strings

(value is in seconds), or

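Pre-download the model manually and set its local path in the LLAMA_ARG_MODEL variable

import os
# raw string so that backslashes in the Windows path are not treated as escapes
os.environ['LLAMA_ARG_MODEL'] = r'C:\path\to\model.gguf'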
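The startup timeout bounds a wait loop of roughly the following shape: poll the server until it answers or the deadline passes. The sketch below is an assumption about that general pattern, not the library's actual readiness check. It uses the llama.cpp server's `/health` endpoint through the standard library, and the function names and the `probe` parameter are hypothetical.

```python
import time
import urllib.error
import urllib.request


def is_server_ready(base_url, timeout=2.0):
    """Return True if GET {base_url}/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f'{base_url}/health', timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx answer while loading.
        return False


def wait_until_ready(base_url, timeout_wait=300.0, poll_interval=1.0, probe=is_server_ready):
    """Poll until probe(base_url) is true or timeout_wait seconds have passed."""
    deadline = time.monotonic() + timeout_wait
    while time.monotonic() < deadline:
        if probe(base_url):
            return True
        time.sleep(poll_interval)
    return False
```

A large model downloaded from a URL keeps such a loop waiting for the whole download, which is why raising the timeout or pre-downloading the model both help.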

llama.cpp binary releases are downloaded to:

Windows

%LOCALAPPDATA%\llama-cpp-py\releases

Linux

~/.local/share/llama-cpp-py/releases

macOS

~/Library/Application Support/llama-cpp-py/releases

See platformdirs example output
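To locate the download directory programmatically (for example, to clear old releases), the three locations above can be reproduced with the standard library. This sketch mirrors the listed paths rather than calling platformdirs; `default_releases_dir` is a hypothetical helper, and the `XDG_DATA_HOME` override on Linux is included on the assumption that the usual XDG convention applies.

```python
import os
import platform
from pathlib import Path


def default_releases_dir(system=None, home=None, env=None):
    """Return the expected releases directory for the given platform."""
    system = system or platform.system()
    home = Path(home) if home else Path.home()
    env = os.environ if env is None else env
    if system == 'Windows':
        base = Path(env.get('LOCALAPPDATA') or home / 'AppData' / 'Local')
    elif system == 'Darwin':  # macOS
        base = home / 'Library' / 'Application Support'
    else:  # Linux and other Unix-like systems
        base = Path(env.get('XDG_DATA_HOME') or home / '.local' / 'share')
    return base / 'llama-cpp-py' / 'releases'


if __name__ == '__main__':
    print(default_releases_dir())
```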

Dependencies

aiohttp - Asynchronous HTTP client, used to check llama.cpp server readiness and interact with the server in async mode.
requests - Synchronous HTTP client, used to check llama.cpp server readiness and interact with the server in sync mode.
tqdm - Progress bar utility, used to display download progress when fetching llama.cpp releases.
openai-python - OpenAI-compatible client, used to provide an OpenAI-style API interface for the server.
python-dotenv - Environment variable loader, used for configuration via .env files.
platformdirs - Cross-platform directory management, used to determine cache and data storage locations.
pillow - Image processing library, used for multimodal (vision) input support.
loguru - Logging library, used for llama-cpp-py log output.

Release history

0.1.32 - Apr 22, 2026 (this version)
0.1.31 - Mar 1, 2026
0.1.30 - Feb 26, 2026
0.1.29 - Feb 24, 2026
0.1.28 - Feb 23, 2026
0.1.27 - Feb 8, 2026
0.1.26 - Feb 8, 2026
0.1.25 - Feb 6, 2026
0.1.24 - Jan 29, 2026
0.1.23 - Jan 22, 2026
0.1.22 - Jan 11, 2026
0.1.21 - Jan 11, 2026
0.1.20 - Dec 9, 2025
0.1.19 - Dec 9, 2025
0.1.18 - Dec 8, 2025
0.1.17 - Dec 6, 2025
0.1.16 - Dec 5, 2025
0.1.15 - Nov 12, 2025
0.1.14 - Nov 9, 2025
0.1.13 - Oct 22, 2025
0.1.12 - Oct 20, 2025
0.1.11 - Oct 20, 2025

Download files

Source Distribution

llama_cpp_py-0.1.32.tar.gz (26.3 kB)

Uploaded Apr 22, 2026 Source

Built Distribution

llama_cpp_py-0.1.32-py3-none-any.whl (30.3 kB)

Uploaded Apr 22, 2026 Python 3

File details

Details for the file llama_cpp_py-0.1.32.tar.gz.

File metadata

Download URL: llama_cpp_py-0.1.32.tar.gz
Upload date: Apr 22, 2026
Size: 26.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for llama_cpp_py-0.1.32.tar.gz
Algorithm	Hash digest
SHA256	`f5bc6e0c012e390af7d4b822d255dfe20f90f2041754017858c69ea65e0aaa5f`
MD5	`0cd40e2cef0ba8cc0946118c29967396`
BLAKE2b-256	`673d8afe88e7d4f576a569426742f3aed1e21962a06eb9b9e3d4ff3eeb361af2`

File details

Details for the file llama_cpp_py-0.1.32-py3-none-any.whl.

File metadata

Download URL: llama_cpp_py-0.1.32-py3-none-any.whl
Upload date: Apr 22, 2026
Size: 30.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for llama_cpp_py-0.1.32-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a736227610a2ea4035da213f9c7d26379b43e05dfa9dcbe9e50c91cd6ff99d79`
MD5	`ad37c21e3bdbe6d76dd750b2eacb7f9c`
BLAKE2b-256	`87a589ca70d6661cf797107012b2810702207dac9d21b446e9a9a96c1317fff1`
