Ollama-like CLI wrapper around llama.cpp
Project description
llamacpp-cli
Ollama-like CLI wrapper around llama.cpp. Provides a simple command-line interface that mirrors Ollama's subcommands but powered by llama.cpp as the backend inference engine.
Features
- pull - Download GGUF models from Hugging Face
- run - Run models interactively using llama.cpp
- serve - Start the llama.cpp server
- lb-proxy - Multi-backend load balancer proxy (NEW!)
- list - List downloaded models
- ps - Show running llama.cpp processes
- rm - Remove a downloaded model
- search - Search Hugging Face for GGUF models
- install - Install/update llama.cpp binaries
Installation
From PyPI
pip install llamacpp-cli
From Source
pip install -e .
Quick Start
1. Install llama.cpp binaries
llamacpp install
This downloads the latest llama.cpp release to ~/.llamacpp/bin/.
2. Pull a model
llamacpp pull unsloth/gemma-3-270m-it-GGUF:Q4_K_M
Or use a short alias:
llamacpp pull gemma3:270m
3. Run interactively
llamacpp run gemma3:270m
4. Start the server
llamacpp serve -m gemma3:270m
The server runs at http://0.0.0.0:8080 with OpenAI-compatible API.
CPU-Optimized Presets
For CPU-only servers, use presets optimized for different workloads:
# Code tasks (default): 16K context, 2-4 parallel requests
llamacpp serve --preset code
# Chat/conversational: 8K context, 4-6 parallel requests
llamacpp serve --preset chat
# Fast queries: 4K context, 6-8 parallel requests
llamacpp serve --preset fast
# Large codebases: 32K context, 1 parallel request (slower)
llamacpp serve --preset max-context
See CPU_OPTIMIZATION.md for detailed tuning guide.
Commands
llamacpp pull <model> Download GGUF model from Hugging Face
llamacpp run <model> Run a model interactively
llamacpp serve Start the llama.cpp server
llamacpp lb-proxy Start multi-backend load balancer (see LB_PROXY.md)
llamacpp list List downloaded models
llamacpp ps Show running processes
llamacpp rm <model> Remove a model
llamacpp search <query> Search for models on Hugging Face
llamacpp install Install/update llama.cpp binaries
Load Balancer Proxy
For distributing requests across multiple machines, use the load balancer:
# Auto-discover backends on your network
llamacpp lb-proxy --discover-subnet 192.168.1.0/24
# Or specify backends manually
llamacpp lb-proxy -b http://machine1:8000 -b http://machine2:8000
See LB_PROXY.md for detailed documentation on:
- Model-aware routing
- Least-connections load balancing
- Auto-discovery and health checks
- Configuration options
Model Names
Model names can be specified in multiple ways:
- Full Hugging Face path:
unsloth/gemma-3-270m-it-GGUF:Q4_K_M - Short format:
namespace/model:quantization(e.g.,gemma3:270m) - Short name:
gemma3:270m,qwen3,llama3:8b
Alias support is planned for future releases.
Configuration
- Models are stored in
~/.llamacpp/models/ - Binaries are installed to
~/.llamacpp/bin/ - Database (SQLite) is at
~/.llamacpp/llamacpp.db
Environment Variables
| Variable | Description | Default |
|---|---|---|
LLAMACPP_BIN_DIR |
Directory for llama.cpp binaries | ~/.llamacpp/bin |
LLAMACPP_MODEL_DIR |
Directory for models | ~/.llamacpp/models |
Usage with LLM CLI
This package also registers as an LLM plugin for the llm CLI:
# Install the plugin (requires llm and llama-cpp-python)
pip install llm-llama-cpp llama-cpp-python
# Register a model
llm llama-cpp add-model ~/.llamacpp/models/gemma-3-270m-it-Q4_K_M.gguf --alias gemma3:270m
# Use with llm
llm -m gemma3:270m "Your prompt here"
Development
# Install in editable mode with dev dependencies
pip install -e ".[dev]"
# Run tests
pytest
# Run a single test file
pytest tests/test_foo.py
# Lint
ruff check .
# Format
ruff format .
Publishing to PyPI
Prerequisites
- Create a PyPI account at https://pypi.org/
- Install build tools:
pip install build twine
Build and Publish
- Update version in
pyproject.toml:
[project]
version = "0.1.0"
- Build the package:
python -m build
This creates distributable archives in dist/.
- Upload to PyPI:
twine upload dist/*
You'll be prompted for your PyPI username and password.
For Test PyPI (testing first):
twine upload --repository testpypi dist/*
Using uv (Alternative)
# Install uv if not already
pip install uv
# Build
uv build
# Publish to PyPI
uv publish
# Or Test PyPI
uv publish --test
License
MIT
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file llamacpp_cli-0.1.7.tar.gz.
File metadata
- Download URL: llamacpp_cli-0.1.7.tar.gz
- Upload date:
- Size: 126.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
374d5acbe22859fe800a93bb527be3bc1d297b8a9f014218ef9efdc97d239457
|
|
| MD5 |
b492ea210017e15de5a103e8ca965583
|
|
| BLAKE2b-256 |
f733d272498c8510a7acdab9c566bd0c93e1e188b284d70716dddc570478588b
|
Provenance
The following attestation bundles were made for llamacpp_cli-0.1.7.tar.gz:
Publisher:
publish.yml on joeyjiaojg/llamacpp-cli
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
llamacpp_cli-0.1.7.tar.gz -
Subject digest:
374d5acbe22859fe800a93bb527be3bc1d297b8a9f014218ef9efdc97d239457 - Sigstore transparency entry: 1669711525
- Sigstore integration time:
-
Permalink:
joeyjiaojg/llamacpp-cli@4c44541f02b8e4758b301014ca1764170ce7ea4f -
Branch / Tag:
refs/tags/v0.1.7 - Owner: https://github.com/joeyjiaojg
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@4c44541f02b8e4758b301014ca1764170ce7ea4f -
Trigger Event:
push
-
Statement type:
File details
Details for the file llamacpp_cli-0.1.7-py3-none-any.whl.
File metadata
- Download URL: llamacpp_cli-0.1.7-py3-none-any.whl
- Upload date:
- Size: 88.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
d4e65865cc4340538088b1a6186840a4bc36fae6e8e2ee989146d7952ed134b4
|
|
| MD5 |
92fa9349e6a4e0f19f55dc796fee801d
|
|
| BLAKE2b-256 |
f0853e570b3a3b00172dbfcaab8f588ee9d09a32eaf3b58201d0fa194d37407e
|
Provenance
The following attestation bundles were made for llamacpp_cli-0.1.7-py3-none-any.whl:
Publisher:
publish.yml on joeyjiaojg/llamacpp-cli
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
llamacpp_cli-0.1.7-py3-none-any.whl -
Subject digest:
d4e65865cc4340538088b1a6186840a4bc36fae6e8e2ee989146d7952ed134b4 - Sigstore transparency entry: 1669711623
- Sigstore integration time:
-
Permalink:
joeyjiaojg/llamacpp-cli@4c44541f02b8e4758b301014ca1764170ce7ea4f -
Branch / Tag:
refs/tags/v0.1.7 - Owner: https://github.com/joeyjiaojg
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@4c44541f02b8e4758b301014ca1764170ce7ea4f -
Trigger Event:
push
-
Statement type: