infera

Deploy GGUF (llama-cpp-python) models to RunPod or Replicate with one command.

pip install infera-deploy

infera init my-project
cd my-project
cp ~/Downloads/llama.gguf models/
infera deploy runpod        # or: replicate

That's it. No Dockerfile, no Cog config, no GraphQL — infera writes the runtime, builds the image, uploads the model, and registers the serverless endpoint.

Package name on PyPI is infera-deploy; the Python module and CLI are both infera.

What you'll need

  • Python 3.10+
  • A .gguf model file (e.g. from TheBloke on Hugging Face)
  • For RunPod: a running Docker daemon, a RunPod API key, and a Docker Hub login (docker login)
  • For Replicate: cog (Linux/macOS or WSL) and an authenticated session (cog login)

What infera deploy actually does

  1. Bundles a runtime tailored to the provider (Dockerfile + handler for RunPod, predict.py + cog.yaml for Replicate)
  2. Builds and pushes the container image
  3. (RunPod) Creates a network volume and uploads .gguf files to it — idempotent, skips unchanged models via MD5
  4. Registers / upserts the serverless endpoint
  5. Smoke-tests it and prints the URL

Re-runs are idempotent: same template, same volume, only changed bits get re-shipped.
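
The MD5-based skip in step 3 can be pictured with a small sketch. This is not infera's actual code: the remote_md5 manifest and both function names are hypothetical, and only the idea (hash each local .gguf, ship it only when the digest differs from what the volume already holds) comes from the description above.

```python
import hashlib
from pathlib import Path


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so large .gguf files never sit fully in memory."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def models_to_upload(models_dir: Path, remote_md5: dict[str, str]) -> list[Path]:
    """Return the .gguf files that are new or whose MD5 differs from the remote copy.

    remote_md5 maps file name -> MD5 hex digest already on the volume.
    It is a hypothetical manifest; infera's real bookkeeping may differ.
    """
    changed = []
    for path in sorted(models_dir.glob("*.gguf")):
        if remote_md5.get(path.name) != md5_of(path):
            changed.append(path)
    return changed
```

With an up-to-date manifest the function returns an empty list, which is the behaviour a no-op re-run relies on.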

Calling a deployed endpoint

The job input is OpenAI-ish:

{
  "input": {
    "messages":    [{"role": "user", "content": "Hello"}],
    "model":       "llama",
    "temperature": 0.7,
    "max_tokens":  512
  }
}

model is optional — it's the filename stem (e.g. llama-3.2-1b for llama-3.2-1b.gguf). If omitted, the first model alphabetically is used.

For embeddings: set "endpoint": "embeddings" and pass the text to embed as "input" (a single string or a list of strings).

For function calling / structured output: pass tools or response_format the same way you would to OpenAI, or a GBNF grammar via grammar.
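
A few example payloads may make these variants concrete. The get_weather tool is hypothetical and follows the OpenAI function-calling schema. The nesting of "endpoint" and the text "input" inside the job-level "input" object is an assumption based on the chat payload above, so check it against your deployed handler.

```python
import json

# Hypothetical tool definition in the OpenAI function-calling schema.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

# Chat request that offers the model one tool.
tool_job = {
    "input": {
        "messages": [{"role": "user", "content": "What's the weather in Oslo?"}],
        "tools": [weather_tool],
        "max_tokens": 256,
    }
}

# Chat request that asks for a JSON object, using OpenAI's response_format shape.
json_job = {
    "input": {
        "messages": [{"role": "user", "content": "List three colours as JSON."}],
        "response_format": {"type": "json_object"},
    }
}

# Assumption: both keys live inside the job-level "input" wrapper.
embeddings_job = {
    "input": {
        "endpoint": "embeddings",
        "input": ["first sentence", "second sentence"],
    }
}

for job in (tool_job, json_job, embeddings_job):
    print(json.dumps(job, indent=2))
```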

RunPod: POST https://api.runpod.ai/v2/<endpoint>/runsync with Authorization: Bearer <RUNPOD_KEY>.

Replicate: use the standard Replicate API. There, messages and tools must be passed as JSON-encoded strings (a Cog limitation).
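
As a rough sketch, the RunPod call can be built with nothing but the standard library. The helper names are hypothetical; the URL and header come from the line above. The second helper shows one way to JSON-encode messages and tools before handing the input to Replicate.

```python
import json
import urllib.request


def build_runsync_request(endpoint_id: str, api_key: str, job_input: dict) -> urllib.request.Request:
    """Build (but do not send) a POST to a RunPod endpoint's /runsync route."""
    url = f"https://api.runpod.ai/v2/{endpoint_id}/runsync"
    body = json.dumps({"input": job_input}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def encode_for_replicate(job_input: dict) -> dict:
    """Return a copy of job_input with messages and tools serialised to JSON strings."""
    encoded = dict(job_input)
    for key in ("messages", "tools"):
        if key in encoded:
            encoded[key] = json.dumps(encoded[key])
    return encoded


# To actually send the RunPod request:
#   with urllib.request.urlopen(request, timeout=120) as response:
#       print(json.load(response))
```

Keeping request construction separate from sending makes it easy to inspect the payload before spending GPU time on it.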

Adding a model to a deployed project

cp another.gguf models/
infera deploy runpod

Idempotent — only the new .gguf gets uploaded. Multiple models live side-by-side on the volume; pick one per request via the model field.

Provider configs

The first infera deploy <provider> run drops <provider>.yaml into the project root. Edit it and re-deploy to apply the changes.

# runpod.yaml
gpu:           AMPERE_16,AMPERE_24
gpu_vram_min:  8
workers_min:   0
workers_max:   1
idle_timeout:  5
datacenter:    EU-RO-1
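
Before re-deploying, it can help to sanity-check an edited config. The sketch below is not part of infera. It reads only flat key: value lines like the ones above (a stand-in for a real YAML parser), and both the checks and the fallback defaults are assumptions about what a sensible config looks like.

```python
def parse_flat_config(text: str) -> dict[str, str]:
    """Parse flat 'key: value' lines, ignoring blank lines and # comments."""
    config = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        key, _, value = line.partition(":")
        config[key.strip()] = value.strip()
    return config


def check_runpod_config(config: dict[str, str]) -> list[str]:
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    # Fallback values here are illustrative only, not infera's documented defaults.
    workers_min = int(config.get("workers_min", "0"))
    workers_max = int(config.get("workers_max", "1"))
    if workers_min < 0:
        problems.append("workers_min must not be negative")
    if workers_max < workers_min:
        problems.append("workers_max must be >= workers_min")
    gpus = [g.strip() for g in config.get("gpu", "").split(",") if g.strip()]
    if not gpus:
        problems.append("gpu must name at least one GPU type")
    return problems
```

Run against the example file above, check_runpod_config(parse_flat_config(text)) reports no problems.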

Using the engine locally (advanced)

from infera import Engine

engine = Engine("./models")
print(engine.chat([{"role": "user", "content": "Hello"}]))

Support

If infera saved you an afternoon of Dockerfile yak-shaving, consider buying me a coffee.

License

MIT
