
llm-launchpad

One-click personal LLM deployment with coding agent + chat UI.

Qwen3‑Coder GGUF on Modal (llama.cpp)

Run unsloth/Qwen3-Coder-480B-A35B-Instruct-1M-GGUF on Modal using llama.cpp's HTTP server.

Prerequisites

  • Python 3.11+ and the Modal CLI installed: pip install modal
  • Log in and configure Modal: modal setup
  • Optional (if Hugging Face downloads are rate-limited or the repo is private): run huggingface-cli login, or set HUGGINGFACE_HUB_TOKEN

Files

  • Server entrypoint: qwen3-coder-llamacpp.py

1) Preload/download model weights (optional but recommended)

This downloads GGUF weights into a persistent Volume (llamacpp-cache).

modal run qwen3-coder-llamacpp.py

Common flags (defaults shown):

  • --preload True
  • --repo-id "unsloth/Qwen3-Coder-480B-A35B-Instruct-1M-GGUF"
  • --quant "Q4_K_M"
  • --revision None

Example without preloading:

modal run qwen3-coder-llamacpp.py --preload False

2) Deploy the HTTP server

Builds llama.cpp with CUDA and serves an OpenAI-compatible API on port 8080.

modal deploy qwen3-coder-llamacpp.py

Notes:

  • First cold start can take many minutes; long timeouts are configured.
  • During warmup you may see 503 responses; retry after a few minutes.
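Rather than retrying by hand during warmup, you can poll the server's /health endpoint (which llama.cpp's HTTP server exposes) until it stops returning 503. A minimal standard-library sketch; the _fetch hook is not part of the project and exists only so the helper can be exercised without a live server:

```python
import time
import urllib.error
import urllib.request


def wait_for_server(url, timeout_s=900, interval_s=15, _fetch=None):
    """Poll url until it returns HTTP 200, treating 503 as "still warming up".

    Returns True once the server is ready, False if timeout_s elapses first.
    """
    fetch = _fetch or (lambda u: urllib.request.urlopen(u, timeout=10).status)
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            if fetch(url) == 200:
                return True
        except urllib.error.HTTPError as err:
            if err.code != 503:  # 503 = cold start in progress; anything else is a real error
                raise
        except OSError:
            pass  # connection refused/reset while the container is booting
        time.sleep(interval_s)
    return False


# wait_for_server("https://<user>--qwen3-coder-llamacpp-serve.modal.run/health")
```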

Get the public URL:

  • Copy the web function URL printed by modal deploy (e.g. https://<user>--qwen3-coder-llamacpp-serve.modal.run).

Tail logs:

modal logs -f qwen3-coder-llamacpp.serve

3) Call the API

Set the server URL (replace with yours):

export SERVER_URL="https://<user>--qwen3-coder-llamacpp-serve.modal.run"

Completions endpoint:

curl -s -X POST \
  -H 'Content-Type: application/json' \
  -d '{"model": "default", "prompt": "Hello Qwen!"}' \
  "$SERVER_URL"/v1/completions

Chat completions endpoint:

curl -s -X POST \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "default",
        "messages": [
          {"role": "user", "content": "Write a Python function that reverses a string."}
        ]
      }' \
  "$SERVER_URL"/v1/chat/completions

Tuning and configuration

  • GPU type: edit GPU_CONFIG in qwen3-coder-llamacpp.py.
  • Quantization: edit QUANT (default: "Q4_K_M").
  • Server args: edit DEFAULT_SERVER_ARGS (e.g., --ctx-size, --threads).
  • If VRAM is insufficient, reduce GPU offload by lowering --n-gpu-layers, or set GPU_CONFIG = None to run on CPU.
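For orientation, the tunables named above might look roughly like the following at the top of qwen3-coder-llamacpp.py. The constant names come from this README; the values shown are illustrative assumptions, not the project's actual defaults:

```python
# Illustrative sketch only: constant names are from the README,
# the values are NOT the project's actual settings.
GPU_CONFIG = "H100:8"             # Modal GPU spec; None runs llama.cpp on CPU
QUANT = "Q4_K_M"                  # GGUF quantization to download and serve
DEFAULT_SERVER_ARGS = [
    "--ctx-size", "32768",        # context window
    "--threads", "16",
    "--n-gpu-layers", "999",      # lower this if VRAM runs out
]
```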

Volumes

  • Weights cache volume: llamacpp-cache
    • List files: modal volume ls llamacpp-cache
    • Explore: modal shell --volume llamacpp-cache (then cd /mnt)

Troubleshooting

  • Slow downloads: ensure HF_HUB_ENABLE_HF_TRANSFER=1.
  • HF auth errors: login with huggingface-cli login.
  • Build errors: ensure host CUDA >= 12.4, or switch to CPU.
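On the fast-download point: HF_HUB_ENABLE_HF_TRANSFER=1 only takes effect if the hf_transfer package is installed alongside huggingface_hub, so a typical local setup looks like:

```shell
# Enable Hugging Face's accelerated (Rust-based) downloader.
# Requires the companion package: pip install hf_transfer
export HF_HUB_ENABLE_HF_TRANSFER=1
```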


