Skip to main content

Ollama-like CLI wrapper around llama.cpp

Project description

llamacpp-cli

Ollama-like CLI wrapper around llama.cpp. Provides a simple command-line interface that mirrors Ollama's subcommands but powered by llama.cpp as the backend inference engine.

Features

  • pull - Download GGUF models from Hugging Face
  • run - Run models interactively using llama.cpp
  • serve - Start the llama.cpp server
  • lb-proxy - Multi-backend load balancer proxy (NEW!)
  • list - List downloaded models
  • ps - Show running llama.cpp processes
  • rm - Remove a downloaded model
  • search - Search Hugging Face for GGUF models
  • install - Install/update llama.cpp binaries

Installation

From PyPI

pip install llamacpp-cli

From Source

pip install -e .

Quick Start

1. Install llama.cpp binaries

llamacpp install

This downloads the latest llama.cpp release to ~/.llamacpp/bin/.

2. Pull a model

llamacpp pull unsloth/gemma-3-270m-it-GGUF:Q4_K_M

Or use a short alias:

llamacpp pull gemma3:270m

3. Run interactively

llamacpp run gemma3:270m

4. Start the server

llamacpp serve -m gemma3:270m

The server runs at http://0.0.0.0:8080 with OpenAI-compatible API.

CPU-Optimized Presets

For CPU-only servers, use presets optimized for different workloads:

# Code tasks (default): 16K context, 2-4 parallel requests
llamacpp serve --preset code

# Chat/conversational: 8K context, 4-6 parallel requests
llamacpp serve --preset chat

# Fast queries: 4K context, 6-8 parallel requests
llamacpp serve --preset fast

# Large codebases: 32K context, 1 parallel request (slower)
llamacpp serve --preset max-context

See CPU_OPTIMIZATION.md for detailed tuning guide.

Commands

llamacpp pull <model>      Download GGUF model from Hugging Face
llamacpp run <model>       Run a model interactively
llamacpp serve             Start the llama.cpp server
llamacpp lb-proxy          Start multi-backend load balancer (see LB_PROXY.md)
llamacpp list              List downloaded models
llamacpp ps                Show running processes
llamacpp rm <model>        Remove a model
llamacpp search <query>    Search for models on Hugging Face
llamacpp install           Install/update llama.cpp binaries

Load Balancer Proxy

For distributing requests across multiple machines, use the load balancer:

# Auto-discover backends on your network
llamacpp lb-proxy --discover-subnet 192.168.1.0/24

# Or specify backends manually
llamacpp lb-proxy -b http://machine1:8000 -b http://machine2:8000

See LB_PROXY.md for detailed documentation on:

  • Model-aware routing
  • Least-connections load balancing
  • Auto-discovery and health checks
  • Configuration options

Model Names

Model names can be specified in multiple ways:

  • Full Hugging Face path: unsloth/gemma-3-270m-it-GGUF:Q4_K_M
  • Short format: namespace/model:quantization (e.g., gemma3:270m)
  • Short name: gemma3:270m, qwen3, llama3:8b

Alias support is planned for future releases.

Configuration

  • Models are stored in ~/.llamacpp/models/
  • Binaries are installed to ~/.llamacpp/bin/
  • Database (SQLite) is at ~/.llamacpp/llamacpp.db

Environment Variables

Variable Description Default
LLAMACPP_BIN_DIR Directory for llama.cpp binaries ~/.llamacpp/bin
LLAMACPP_MODEL_DIR Directory for models ~/.llamacpp/models

Usage with LLM CLI

This package also registers as an LLM plugin for the llm CLI:

# Install the plugin (requires llm and llama-cpp-python)
pip install llm-llama-cpp llama-cpp-python

# Register a model
llm llama-cpp add-model ~/.llamacpp/models/gemma-3-270m-it-Q4_K_M.gguf --alias gemma3:270m

# Use with llm
llm -m gemma3:270m "Your prompt here"

Development

# Install in editable mode with dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run a single test file
pytest tests/test_foo.py

# Lint
ruff check .

# Format
ruff format .

Publishing to PyPI

Prerequisites

  1. Create a PyPI account at https://pypi.org/
  2. Install build tools:
pip install build twine

Build and Publish

  1. Update version in pyproject.toml:
[project]
version = "0.1.0"
  1. Build the package:
python -m build

This creates distributable archives in dist/.

  1. Upload to PyPI:
twine upload dist/*

You'll be prompted for your PyPI username and password.

For Test PyPI (testing first):

twine upload --repository testpypi dist/*

Using uv (Alternative)

# Install uv if not already
pip install uv

# Build
uv build

# Publish to PyPI
uv publish

# Or Test PyPI
uv publish --test

License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

llamacpp_cli-0.1.8.tar.gz (127.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

llamacpp_cli-0.1.8-py3-none-any.whl (89.2 kB view details)

Uploaded Python 3

File details

Details for the file llamacpp_cli-0.1.8.tar.gz.

File metadata

  • Download URL: llamacpp_cli-0.1.8.tar.gz
  • Upload date:
  • Size: 127.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for llamacpp_cli-0.1.8.tar.gz
Algorithm Hash digest
SHA256 5e9d7c10373f11132475b34a9e1b1347389e81c60a0b98797af3af98178c6c30
MD5 23cd7520163817eba851d823287a58cd
BLAKE2b-256 024dac91ab14d1e8370c725940181015d3856c5f152e8c6c9548875a6db49116

See more details on using hashes here.

Provenance

The following attestation bundles were made for llamacpp_cli-0.1.8.tar.gz:

Publisher: publish.yml on joeyjiaojg/llamacpp-cli

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file llamacpp_cli-0.1.8-py3-none-any.whl.

File metadata

  • Download URL: llamacpp_cli-0.1.8-py3-none-any.whl
  • Upload date:
  • Size: 89.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for llamacpp_cli-0.1.8-py3-none-any.whl
Algorithm Hash digest
SHA256 baafc5c432e4757c1a7ae70cb1b525f98a2bf1e69a235e79f6e8cc8e8d7096ff
MD5 ca43f9eee60c21b8313fa6dc903b8521
BLAKE2b-256 0797706b7621d44cc09b81d637fc796ad9e6f2b0261f9caad1e180b0bf554525

See more details on using hashes here.

Provenance

The following attestation bundles were made for llamacpp_cli-0.1.8-py3-none-any.whl:

Publisher: publish.yml on joeyjiaojg/llamacpp-cli

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page