Deploy GGUF models to RunPod or Replicate with one command.
infera
Deploy GGUF (llama-cpp-python) models to RunPod or Replicate with one command.
pip install infera-deploy
infera init my-project
cd my-project
cp ~/Downloads/llama.gguf models/
infera deploy runpod # or: replicate
That's it. No Dockerfile, no Cog config, no GraphQL — infera writes the runtime, builds the image, uploads the model, and registers the serverless endpoint.
The package name on PyPI is infera-deploy; the Python module and the CLI are both infera.
What you'll need
- Python 3.10+
- A .gguf model file (e.g. from TheBloke on Hugging Face)
- For RunPod: Docker daemon, RunPod API key, Docker Hub login (docker login)
- For Replicate: cog (Linux/macOS or WSL), cog login
What infera deploy actually does
- Bundles a runtime tailored to the provider (Dockerfile + handler for RunPod, predict.py + cog.yaml for Replicate)
- Builds and pushes the container image
- (RunPod) Creates a network volume and uploads .gguf files to it; the upload is idempotent and skips unchanged models via MD5
- Registers / upserts the serverless endpoint
- Smoke-tests it and prints the URL
Re-runs are idempotent: same template, same volume, only changed bits get re-shipped.
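The MD5-based skip can be pictured roughly as follows. This is a minimal sketch, not infera's actual code: file_md5, models_to_upload, and the remote_hashes mapping (filename to the hex digest already on the volume) are hypothetical names introduced for illustration.

```python
import hashlib
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so large models never sit fully in memory."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def models_to_upload(models_dir: Path, remote_hashes: dict[str, str]) -> list[Path]:
    """Return the .gguf files whose MD5 is missing from, or differs from, remote_hashes."""
    changed = []
    for path in sorted(models_dir.glob("*.gguf")):
        # Hypothetical manifest lookup: an unknown file or a changed digest means re-upload.
        if remote_hashes.get(path.name) != file_md5(path):
            changed.append(path)
    return changed
```

With a manifest that already lists every local file at its current digest, models_to_upload returns an empty list, which is the "nothing to re-ship" case.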
Calling a deployed endpoint
The job input is OpenAI-ish:
{
"input": {
"messages": [{"role": "user", "content": "Hello"}],
"model": "llama",
"temperature": 0.7,
"max_tokens": 512
}
}
model is optional — it's the filename stem (e.g. llama-3.2-1b for llama-3.2-1b.gguf). If omitted, the first model alphabetically is used.
For embeddings: "endpoint": "embeddings" and "input": "text" (or a list).
For function calling / structured output: pass tools, response_format, or grammar (GBNF) the same way you would to OpenAI.
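As an illustration, a job that offers the model one tool could look like this. The tool schema follows the OpenAI function-calling format; get_weather is a made-up function, and putting tools next to messages inside "input" is an assumption based on the sentence above.

```python
import json

# OpenAI-style tool definition; get_weather is a made-up function for illustration.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

# Assumption: tools sits alongside messages in the job input.
job = {
    "input": {
        "messages": [{"role": "user", "content": "What's the weather in Bucharest?"}],
        "tools": [weather_tool],
        "max_tokens": 256,
    }
}

print(json.dumps(job, indent=2))
```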
RunPod: POST https://api.runpod.ai/v2/<endpoint>/runsync with Authorization: Bearer <RUNPOD_KEY>.
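A standard-library sketch of that request is shown here. build_runsync_request is a hypothetical helper, the endpoint ID and key are placeholders, and the Content-Type header is an assumption (the README only mentions Authorization). The request is built but not sent, so the snippet runs without a live endpoint.

```python
import json
import urllib.request


def build_runsync_request(
    endpoint_id: str, api_key: str, job_input: dict
) -> urllib.request.Request:
    """Prepare (but do not send) a POST to the RunPod runsync URL."""
    url = f"https://api.runpod.ai/v2/{endpoint_id}/runsync"
    body = json.dumps({"input": job_input}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",  # assumed; not stated in the README
        },
    )


request = build_runsync_request(
    "my-endpoint-id",  # placeholder endpoint ID
    "rp_example_key",  # placeholder API key
    {"messages": [{"role": "user", "content": "Hello"}], "max_tokens": 512},
)

# To actually send it (needs a live endpoint and a real key):
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```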
Replicate: standard Replicate API. messages and tools are JSON-encoded strings (Cog limitation).
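In practice this means serializing those two fields before handing the input to Replicate. A minimal sketch follows; to_replicate_input is a hypothetical helper, and it assumes every other field is passed through unchanged.

```python
import json


def to_replicate_input(job_input: dict) -> dict:
    """Copy job_input, JSON-encoding the messages and tools fields as strings."""
    encoded = dict(job_input)
    for key in ("messages", "tools"):
        # Leave values that are already strings alone to avoid double encoding.
        if key in encoded and not isinstance(encoded[key], str):
            encoded[key] = json.dumps(encoded[key])
    return encoded


replicate_input = to_replicate_input(
    {"messages": [{"role": "user", "content": "Hello"}], "temperature": 0.7}
)
print(replicate_input["messages"])
```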
Adding a model to a deployed project
cp another.gguf models/
infera deploy runpod
Idempotent — only the new .gguf gets uploaded. Multiple models live side-by-side on the volume; pick one per request via the model field.
Provider configs
The first run of infera deploy <provider> writes <provider>.yaml to the project root. Edit it and re-deploy to apply the changes.
# runpod.yaml
gpu: AMPERE_16,AMPERE_24
gpu_vram_min: 8
workers_min: 0
workers_max: 1
idle_timeout: 5
datacenter: EU-RO-1
Using the engine locally (advanced)
from infera import Engine
engine = Engine("./models")
print(engine.chat([{"role": "user", "content": "Hello"}]))
License
MIT