# llm-launchpad

One-click personal LLM deployment with coding agent + chat UI.
## Qwen3‑Coder GGUF on Modal (llama.cpp)

Run `unsloth/Qwen3-Coder-480B-A35B-Instruct-1M-GGUF` on Modal using llama.cpp's HTTP server.
### Prerequisites

- Python 3.11+ and the Modal CLI installed: `pip install modal`
- Log in / configure Modal: `modal setup`
- Optional (if HF is rate-limited or the repo is private): `huggingface-cli login` or set `HUGGINGFACE_HUB_TOKEN` (one way to pass it through Modal is sketched below)
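On Modal, the token has to reach the container, not just your laptop. A minimal sketch of one way to do that, assuming a Secret named `huggingface` created with `modal secret create huggingface HUGGINGFACE_HUB_TOKEN=hf_...` (the actual app may wire auth differently):

```python
# Hypothetical sketch: expose HUGGINGFACE_HUB_TOKEN inside a Modal container
# via a Secret. The real qwen3-coder-llamacpp.py may handle auth differently.
import modal

app = modal.App("hf-token-demo")

@app.function(secrets=[modal.Secret.from_name("huggingface")])
def check_token():
    import os
    # huggingface_hub reads HUGGINGFACE_HUB_TOKEN from the environment.
    print("token set:", "HUGGINGFACE_HUB_TOKEN" in os.environ)
```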
### Files

- Server entrypoint: `qwen3-coder-llamacpp.py`
### 1) Preload/download model weights (optional, recommended)

This downloads GGUF weights into a persistent Volume (`llamacpp-cache`):

```bash
modal run qwen3-coder-llamacpp.py
```

Common flags (defaults shown):

- `--preload True`
- `--repo-id "unsloth/Qwen3-Coder-480B-A35B-Instruct-1M-GGUF"`
- `--quant "Q4_K_M"`
- `--revision None`

Example without preloading:

```bash
modal run qwen3-coder-llamacpp.py --preload False
```
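For orientation, here is a minimal sketch of how a preload step like this is typically written with Modal and `huggingface_hub`; the function, image, and path names are assumptions, not the actual contents of `qwen3-coder-llamacpp.py`:

```python
# Hypothetical preload sketch -- names, paths, and defaults are assumptions.
import modal

app = modal.App("qwen3-coder-llamacpp-preload-demo")
cache = modal.Volume.from_name("llamacpp-cache", create_if_missing=True)
image = (
    modal.Image.debian_slim()
    .pip_install("huggingface_hub", "hf_transfer")
    .env({"HF_HUB_ENABLE_HF_TRANSFER": "1"})  # faster Hub downloads
)

@app.function(image=image, volumes={"/cache": cache}, timeout=60 * 60)
def preload(repo_id: str = "unsloth/Qwen3-Coder-480B-A35B-Instruct-1M-GGUF",
            quant: str = "Q4_K_M"):
    from huggingface_hub import snapshot_download
    # Fetch only the shards matching the requested quantization.
    snapshot_download(repo_id=repo_id,
                      allow_patterns=[f"*{quant}*"],
                      local_dir="/cache/models")
    cache.commit()  # persist the writes so serving containers see them
```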
### 2) Deploy the HTTP server

This builds llama.cpp with CUDA and serves an OpenAI-compatible API on port 8080:

```bash
modal deploy qwen3-coder-llamacpp.py
```
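For orientation, a rough sketch of what such a serving entrypoint can look like. This assumes a prebuilt llama.cpp CUDA image and an assumed model path, whereas the real file builds llama.cpp itself and may differ in every name:

```python
# Hypothetical serving sketch -- the real serve() builds llama.cpp with CUDA;
# a prebuilt image, GPU spec, and model path are assumed here for brevity.
import subprocess
import modal

app = modal.App("qwen3-coder-llamacpp-serve-demo")
cache = modal.Volume.from_name("llamacpp-cache", create_if_missing=True)
image = modal.Image.from_registry(
    "ghcr.io/ggml-org/llama.cpp:server-cuda", add_python="3.11"
)

@app.function(image=image, gpu="H100", volumes={"/cache": cache}, timeout=60 * 60)
@modal.web_server(port=8080, startup_timeout=60 * 30)
def serve():
    # llama-server speaks the OpenAI-compatible API on the proxied port.
    subprocess.Popen([
        "/app/llama-server",                 # binary location in this image (assumed)
        "--model", "/cache/models/qwen3-coder-Q4_K_M.gguf",  # assumed filename
        "--host", "0.0.0.0",
        "--port", "8080",
        "--n-gpu-layers", "999",             # offload everything to the GPU
    ])
```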
Notes:

- The first cold start can take many minutes; long timeouts are configured.
- During warmup you may see 503 responses; retry after a few minutes, or use the readiness poll sketched below.
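As a sketch: llama.cpp's HTTP server exposes `GET /health`, which returns 503 while the model is loading and 200 once it is ready, so warmup can be polled instead of retried by hand:

```python
# Poll the server's /health endpoint until the model has finished loading.
import os
import time

import requests

url = os.environ["SERVER_URL"] + "/health"
for _ in range(60):                      # up to ~10 minutes
    try:
        if requests.get(url, timeout=10).status_code == 200:
            print("server ready")
            break
    except requests.RequestException:
        pass                             # container may still be cold-starting
    time.sleep(10)
else:
    print("server did not become ready in time")
```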
Get the public URL:

- Copy the web function URL printed by `modal deploy` (e.g. `https://<user>--qwen3-coder-llamacpp-serve.modal.run`).

Tail logs:

```bash
modal logs -f qwen3-coder-llamacpp.serve
```
### 3) Call the API

Set the server URL (replace with yours):

```bash
export SERVER_URL="https://<user>--qwen3-coder-llamacpp-serve.modal.run"
```
Completions endpoint:

```bash
curl -s -X POST \
  -H 'Content-Type: application/json' \
  -d '{"model": "default", "prompt": "Hello Qwen!"}' \
  "$SERVER_URL"/v1/completions
```
Chat completions endpoint:

```bash
curl -s -X POST \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "default",
    "messages": [
      {"role": "user", "content": "Write a Python function that reverses a string."}
    ]
  }' \
  "$SERVER_URL"/v1/chat/completions
```
### Tuning and configuration

- GPU type: edit `GPU_CONFIG` in `qwen3-coder-llamacpp.py`.
- Quantization: edit `QUANT` (default: `"Q4_K_M"`).
- Server args: edit `DEFAULT_SERVER_ARGS` (e.g., `--ctx-size`, `--threads`).
- If VRAM is insufficient, reduce GPU offload by lowering `--n-gpu-layers`, or set `GPU_CONFIG = None` for CPU. (Illustrative values are sketched below.)
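A purely illustrative sketch of these knobs; the real constants live in `qwen3-coder-llamacpp.py` and may be named or shaped differently:

```python
# Illustrative only -- check qwen3-coder-llamacpp.py for the real values.
GPU_CONFIG = "H100"            # Modal GPU spec; None would mean CPU-only
QUANT = "Q4_K_M"               # which GGUF quantization to download/serve

DEFAULT_SERVER_ARGS = [
    "--ctx-size", "32768",     # context window; larger needs more VRAM
    "--n-gpu-layers", "999",   # full GPU offload; lower this if VRAM runs out
    "--threads", "16",         # CPU threads for any non-offloaded work
]
```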
### Volumes

- Weights cache volume: `llamacpp-cache`
- List files: `modal volume ls llamacpp-cache`
- Explore: `modal shell --volume llamacpp-cache` (then `cd /mnt`)
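The cache can also be inspected from Python through Modal's Volume API, for example:

```python
# List the cached GGUF files without opening a shell.
import modal

vol = modal.Volume.from_name("llamacpp-cache")
for entry in vol.listdir("/", recursive=True):
    print(entry.path)
```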
### Troubleshooting

- Slow downloads: ensure `HF_HUB_ENABLE_HF_TRANSFER=1`.
- HF auth errors: log in with `huggingface-cli login`.
- Build errors: ensure host CUDA >= 12.4, or switch to CPU.
## Download files
- Source Distribution: `llm_launchpad-0.0.1.tar.gz`
- Built Distribution: `llm_launchpad-0.0.1-py3-none-any.whl`
### File details

Details for the file `llm_launchpad-0.0.1.tar.gz`.

File metadata:

- Download URL: llm_launchpad-0.0.1.tar.gz
- Upload date:
- Size: 39.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.2

File hashes:

| Algorithm | Hash digest |
|---|---|
| SHA256 | `af487de84440fbd8a6ba9b5bd113cc67e35b27ad7fd54b5d8693a7fc4f4f490f` |
| MD5 | `cc5456dbc6b4bb162063960adefa4d4e` |
| BLAKE2b-256 | `7f709a05f4aef6194b2b4c7fd4961ebcab149e94a95d844d178d01161d66689b` |
### File details

Details for the file `llm_launchpad-0.0.1-py3-none-any.whl`.

File metadata:

- Download URL: llm_launchpad-0.0.1-py3-none-any.whl
- Upload date:
- Size: 26.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.2

File hashes:

| Algorithm | Hash digest |
|---|---|
| SHA256 | `cc065288311ecf054f641cd3f06396e1c19e40f6132ad4274310e7f666c8e24a` |
| MD5 | `9686545581d3e7ea386ce5ff9daa9574` |
| BLAKE2b-256 | `c2d3c9212e275142c4dcf1475c3024a19a858778a9261916dfa96b7936cd96c0` |