Free Open-Source LLM VPS — Ultra-Fast Inference on Kaggle
⚡ Endpoint VPS
Turn Kaggle's Free Tier into a Production-Grade LLM Inference Server
Deploy an OpenAI-compatible LLM inference server on Kaggle with a single command — zero cost, zero cloud bills.
Quick Start • Commands • Configuration • API Reference • Installation
uvx endpoint boot # One-shot deploy (no install)
pip install endpoint # Or install globally
endpoint init # Setup wizard
endpoint boot # Deploy the VPS
Table of Contents
- What is Endpoint?
- Features
- Quick Start
- Architecture
- Command Reference
- Configuration
- API Reference
- Environment Variables
- Project Structure
- Development
- License
What is Endpoint?
Endpoint is a Python CLI tool that provisions a free, persistent, OpenAI-compatible LLM inference server on Kaggle's infrastructure. It combines three components into one seamless workflow:
- Kaggle Notebook — Builds and runs `llama.cpp` (or `sd-server` for images/video) on Kaggle's free GPU (T4 x2, P100) or CPU with automatic model loading
- Cloudflare Tunnel — Exposes the server via a secure HTTPS tunnel (trycloudflare.com) with an optional Cloudflare Worker proxy for custom domains and rate limiting
- CLI — Manages the full lifecycle: boot, stop, monitor, pull models, update settings, and stream logs — all from your terminal
Key differentiator: Unlike managed API services, you retain full control — choose any GGUF model, tune every inference parameter, and pay nothing. llama.cpp runs blazing fast on Kaggle's multi-core CPUs or optional GPU accelerators.
Features
Core Platform
- Zero Cost — Runs entirely on Kaggle's free tier (CPU: unlimited hours, GPU: 30h/week, TPU: 20h/week)
- OpenAI-Compatible API — Full `/v1/chat/completions`, `/v1/completions`, `/v1/embeddings`, `/v1/models` with streaming SSE
- Multi-Model — Pull any GGUF model from HuggingFace (text, image, video, embeddings, voice, multimodal)
- Model Hot-Swap — Switch models at runtime without redeploying
- Interactive Boot Wizard — Guided model selection with real-time HuggingFace metadata fetching
- Persistent Storage — Models and settings survive session restarts via Kaggle datasets
Inference Capabilities
- Chat Completions — Full OpenAI API with streaming (SSE), function calling, logprobs, stop sequences
- Text Completions — Legacy completions endpoint
- Embeddings — Generate vector embeddings for RAG pipelines
- Image Generation — Stable Diffusion via sd-server (txt2img, img2img)
- Video Generation — Wan/LTX video models via sd-server
- Tokenization — Tokenize/detokenize, content moderation, reranking
- Reasoning Models — DeepSeek-R1 and reasoning-aware model support
Performance & Optimization
- llama.cpp — State-of-the-art CPU/GPU inference with `-march=native` + `-O3` compilation
- Flash Attention — Reduced VRAM usage and faster context processing
- Multi-GPU — Tensor split support for T4 x2 configurations
- KV Cache — Configurable cache size, quantization, and management
- Auto-Batching — Dynamic batch size calculation based on context length
- Connection Pooling — Persistent HTTP sessions with connection reuse
Operations
- Real-Time Logs — SSE-based log streaming from the engine
- Live Status — Watch boot progress and engine health in real-time
- Idle Auto-Termination — Configurable timeout shuts down idle VPS to save resources
- Rate Limiting — Per-IP rate limiting middleware in both CLI and Cloudflare proxy
- Graceful Shutdown — SIGINT/SIGTERM handling with clean resource cleanup
- Shell Completions — Bash and Zsh tab completion for all commands and flags
Security & Reliability
- Cloudflare Proxy — Optional Worker-based proxy with KV-backed tunnel map and SHA-256 API key hashing
- API Key Auth — Bearer token authentication on all inference endpoints
- GPU/TPU Detection — Automatic compute capability detection for optimal binary selection
- Auto-Retry — Upstream proxy retries on transient network failures
- Atomic Config Writes — Crash-safe configuration file updates
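The "Atomic Config Writes" item can be illustrated with the usual write-to-temp-then-rename pattern. This is a minimal sketch and not Endpoint's actual code; the function name `atomic_write_text` and the temp-file naming are assumptions made for the example.

```python
import os
import tempfile


def atomic_write_text(path: str, text: str) -> None:
    """Write text so readers see either the old file or the new one, never a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the destination directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-", suffix=".yaml")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)  # atomic rename over the destination
    except BaseException:
        # Remove the temp file if anything failed before the rename completed.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

A crash before `os.replace` leaves the previous config untouched, which is the property the feature list describes.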
Quick Start
Prerequisites
- Kaggle account (phone verified for API access)
- Kaggle API token saved to `~/.kaggle/kaggle.json`: `{"username":"your-username","key":"kgat_xxxxxxxxxxxxxxxx"}`
- Python 3.12+
Installation
Choose your preferred method:
# Run instantly without installing (recommended for first try):
uvx endpoint boot
# Install globally with pip:
pip install endpoint
# Or with uv:
uv add endpoint
# Or with pipx (isolated environment):
pipx install endpoint
# From source:
git clone https://github.com/shesher/endpoint.git
cd endpoint
make install
Configure
# Set your Kaggle API token:
export KAGGLE_API_TOKEN='kgat_xxxx'
# Run the interactive setup wizard:
endpoint init
The wizard prompts for your Kaggle username, kernel slug, and default model selection. It creates ~/.config/endpoint/endpoint-config.yaml with all settings.
Deploy
# Interactive boot with model selection and configuration:
endpoint boot
# Boot with GPU acceleration (T4 x2, up to 70B param models):
endpoint boot --gpu
# Boot with TPU acceleration (v5e-8, up to 100B param models):
endpoint boot --tpu
# Boot without live streaming status:
endpoint boot --no-watch
Verify & Use
# Show API endpoint with example curl commands:
endpoint base-url
# Test API connectivity:
endpoint connect
# List deployed models:
endpoint models
# Stream engine logs:
endpoint logs
Stop
# Graceful shutdown with cache cleanup:
endpoint stop
Architecture
┌─────────────────────────────────────────────────────────────────┐
│ YOUR TERMINAL │
│ ┌──────────────┐ ┌──────────────┐ ┌────────────────────────┐ │
│ │ endpoint CLI │ │ Shell Complet│ │ Config (YAML) │ │
│ │ (commands.py)│ │ (bash/zsh) │ │ ~/.config/endpoint/ │ │
│ └──────┬───────┘ └──────────────┘ └────────────────────────┘ │
└─────────┼───────────────────────────────────────────────────────┘
│ ① boot / stop / pull / settings
▼
┌─────────────────────────────────────────────────────────────────┐
│ KAGGLE NOTEBOOK (VPS) │
│ │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ FastAPI Engine │ │
│ │ ┌─────────────┐ ┌──────────────┐ ┌──────────────────┐ │ │
│ │ │ Chat/Embed │ │ Image Gen │ │ Video Gen │ │ │
│ │ │ /v1/chat │ │ /v1/images │ │ /v1/video │ │ │
│ │ │ /v1/embed │ │ sd-server │ │ sd-server │ │ │
│ │ └──────┬──────┘ └──────┬───────┘ └───────┬──────────┘ │ │
│ │ │ │ │ │ │
│ │ ┌──────┴────────────────┴──────────────────┴──────────┐ │ │
│ │ │ llama.cpp backend │ │ │
│ │ │ GGUF models • KV cache • Flash attn • Batching │ │ │
│ │ └─────────────────────┬───────────────────────────────┘ │ │
│ └────────────────────────┼───────────────────────────────────┘ │
│ │ │
│ ┌────────────────────────┴────────────────────────────────┐ │
│ │ cloudflared tunnel │ │
│ │ https://xxxx.trycloudflare.com → engine:5003 │ │
│ └────────────────────────┬────────────────────────────────┘ │
└───────────────────────────┼─────────────────────────────────────┘
│ ③ proxy requests
▼
┌─────────────────────────────────────────────────────────────────┐
│ CLOUDFLARE WORKER (Optional) │
│ │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ proxy-worker.js │ │
│ │ • Rate limiting (100 req/min per IP) │ │
│ │ • KV-backed tunnel map (SHA-256 hashed API keys) │ │
│ │ • Request proxying with retry + timeout │ │
│ │ • CORS headers for browser clients │ │
│ └────────────────────────────────────────────────────────────┘ │
│ Domain: api.endpoint.dpdns.org → tunnel │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────┐
│ YOUR APPLICATION │
│ OpenAI SDK / curl / etc│
└─────────────────────────┘
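The diagram lists "100 req/min per IP" rate limiting in the proxy worker. The real worker is JavaScript (`proxy-worker.js`) and its algorithm is not documented here, so the following is only a sketch of one common approach, a sliding-window counter, written in Python. The class name and the injectable clock are assumptions for illustration.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window_s` seconds for each client IP."""

    def __init__(self, limit: int = 100, window_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = limit
        self.window_s = window_s
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, ip: str) -> bool:
        now = self.clock()
        hits = self._hits[ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        if len(hits) >= self.limit:
            return False  # caller would answer with HTTP 429
        hits.append(now)
        return True
```

A sliding window avoids the burst at window boundaries that a fixed per-minute counter permits, at the cost of storing one timestamp per recent request.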
Signal Flow
The CLI and notebook communicate via ntfy.sh pub/sub for push-based signaling:
CLI ──boot──▶ Kaggle ──STATUS──▶ ntfy.sh ──▶ CLI (streamed in terminal)
CLI ──stop──▶ ntfy.sh ──KILL────▶ Kaggle ──shutdown──▶ ntfy.sh ──▶ CLI
CLI ◀──WS URL──────────────────── ntfy.sh ◀──base64── Kaggle (tunnel acquired)
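To make the signal flow concrete, here is a hedged sketch of a subscriber for ntfy.sh's JSON stream (`GET /<topic>/json`, one JSON object per line). The topic format `<prefix>-<username>` is inferred from the `endpoint watch` sample output, and the assumption that the tunnel URL arrives as a bare base64 string is mine; the real message format may differ.

```python
import base64
import binascii
import json
import urllib.request
from typing import Optional


def topic_name(prefix: str, username: str) -> str:
    # Assumed naming scheme, e.g. "endpoint-shesher".
    return f"{prefix}-{username}"


def parse_ntfy_line(line: str) -> Optional[str]:
    """Return the message text of one stream line, or None for open/keepalive events."""
    try:
        event = json.loads(line)
    except json.JSONDecodeError:
        return None
    if not isinstance(event, dict) or event.get("event") != "message":
        return None
    return event.get("message")


def decode_tunnel_url(text: str) -> Optional[str]:
    """Return the URL if text is base64 for an https:// URL, else None."""
    try:
        decoded = base64.b64decode(text, validate=True).decode("utf-8")
    except (binascii.Error, UnicodeDecodeError, ValueError):
        return None
    return decoded if decoded.startswith("https://") else None


def watch(topic: str, server: str = "https://ntfy.sh") -> None:
    """Print status messages until interrupted (performs network I/O)."""
    with urllib.request.urlopen(f"{server}/{topic}/json") as stream:
        for raw in stream:
            text = parse_ntfy_line(raw.decode("utf-8"))
            if text is None:
                continue
            url = decode_tunnel_url(text)
            print(f"Tunnel: {url}" if url else text)
```

Calling `watch(topic_name("endpoint", "your-username"))` would roughly mirror what `endpoint watch` displays, under the assumptions above.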
Command Reference
All 19 commands with complete flag documentation.
Global Flags
Available before any subcommand:
| Flag | Alias | Description |
|---|---|---|
| `--help` | `-h` | Show help message and exit |
| `--version` | `-v` | Show version and build information |
| `--gpu` | `-g` | GPU accelerator (T4 x2; max 70B params). Mutually exclusive with `--tpu` |
| `--tpu` | `-t` | TPU v5e-8 accelerator (max 100B params). Mutually exclusive with `--gpu` |
endpoint boot
Deploy the LLM VPS to Kaggle — interactive model selection, configuration, and deployment.
Flags:
| Flag | Description |
|---|---|
| `--no-watch` | Build and push without streaming status signals |
| `--p100` | Use P100 GPU accelerator instead of default T4 x2 |
Use case: Starting a fresh inference session. Run this first after configuration.
Examples:
endpoint boot # Interactive boot (recommended)
endpoint boot --gpu # Boot with T4 x2 GPU accelerator
endpoint boot --tpu # Boot with TPU v5e-8 accelerator
endpoint boot --no-watch # Boot without streaming status
endpoint boot --p100 # Boot with P100 GPU accelerator
$ endpoint boot
MODEL SELECTION
Current default: bartowski/Qwen2.5-Coder-3B-Instruct-abliterated-GGUF
Enter any HuggingFace GGUF model ID (e.g. Qwen/Qwen2.5-1.5B-Instruct-GGUF)
Model ID [Enter to keep default]: Qwen/Qwen2.5-1.5B-Instruct-GGUF
Type: text
Context: 32,768 tokens
Available GGUF files:
1) Qwen2.5-1.5B-Instruct-Q4_K_M.gguf (0.99 GB)
2) Qwen2.5-1.5B-Instruct-Q5_K_M.gguf (1.14 GB)
3) Qwen2.5-1.5B-Instruct-Q6_K.gguf (1.32 GB)
4) Qwen2.5-1.5B-Instruct-Q8_0.gguf (1.70 GB)
Enter number or filename [1]: 1
MODEL CONFIGURATION
max_tokens [2048]:
temperature [0.7]:
top_p [0.9]:
context_length [32768]:
✓ Model and configuration saved.
DEPLOYMENT
Kernel "shesher/endpoint-llm-vps" is "none"
Building notebook...
Pushing to Kaggle...
...
endpoint stop
Stop the running VPS instance — sends kill signal, deletes kernel, clears cached tunnel URL and API key.
Use case: Tear down the VPS when done to free Kaggle resources and stop consuming your weekly accelerator quota.
Flags: None.
Examples:
endpoint stop
✓ Stop signal sent.
✓ Kernel deleted.
✓ Cached credentials cleared.
endpoint status
Show VPS kernel state, tunnel URL, and deployed models using cached data with background refresh.
Use case: Quickly check whether your VPS is running, what models are deployed, and get quick-access links.
Flags: None.
Examples:
endpoint status
Kernel: running (shesher/endpoint-llm-vps)
URL: https://random-1234.trycloudflare.com
Deployed Models:
• bartowski/Qwen2.5-Coder-3B-Instruct-abliterated-GGUF
• Qwen/Qwen2.5-1.5B-Instruct-GGUF
Quick links:
endpoint base-url — Show API endpoint
endpoint logs — Stream engine logs
endpoint stop — Stop VPS
endpoint watch
Stream boot and engine status signals in real-time via ntfy.sh.
Use case: Monitor deployment progress in a separate terminal while endpoint boot --no-watch runs.
Flags: None.
Examples:
endpoint watch
$ endpoint watch
WATCHING endpoint-shesher...
→ notebook-cell-execution-begins
→ Tunnel: https://random-1234.trycloudflare.com
→ Endpoint IS ONLINE
^C Stopped.
endpoint kill-all
Terminate all running Kaggle kernels. Lists active kernels, asks for confirmation, then stops them in parallel.
Use case: Clean slate — stop every running Kaggle kernel when you have multiple stale sessions, or automate cleanup in scripts.
Flags:
| Flag | Alias | Description |
|---|---|---|
| `--yes` | `-y` | Skip interactive confirmation prompt |
| `--all` | — | Also delete inactive (completed/error) kernels after stopping active ones |
Examples:
endpoint kill-all # Interactive mode
endpoint kill-all --yes # Non-interactive (script-friendly)
endpoint kill-all --yes --all # Kill everything including completed
$ endpoint kill-all --all
Checking kernels...
Active: shesher/endpoint-llm-vps, shesher/experiment-01
Inactive: shesher/old-test
Stop these 2 active kernels? [y/N] y
✓ Stopped shesher/endpoint-llm-vps
✓ Stopped shesher/experiment-01
✓ Deleted shesher/old-test
Done. Stopped 2, errors 0.
endpoint register
Register the tunnel URL with the Cloudflare proxy. Fetches the API key from the engine and registers the tunnel so the proxy forwards requests.
Use case: After booting, if proxy registration failed or needs to be re-done (e.g., tunnel URL changed, proxy was restarted).
Flags:
| Flag | Alias | Description |
|---|---|---|
| `--force` | `-f` | Skip the proxy health check and register anyway |
Examples:
endpoint register # Normal registration
endpoint register --force # Force-register even if proxy is unresponsive
✓ Registered tunnel with proxy.
Test with:
curl https://your-proxy.workers.dev/v1/models \
-H "Authorization: Bearer sk-xxxxxxxxxxxx"
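The architecture section describes the proxy's tunnel map as KV-backed with SHA-256 hashed API keys. The sketch below shows that idea in Python with an in-memory dict standing in for Cloudflare KV. It is not the worker's implementation, and `TunnelMap` and its methods are hypothetical names.

```python
import hashlib
from typing import Dict, Optional


def hash_api_key(api_key: str) -> str:
    # Only the digest is stored, so the map never holds the raw key.
    return hashlib.sha256(api_key.encode("utf-8")).hexdigest()


class TunnelMap:
    """Map hashed API keys to tunnel URLs (a dict stands in for the KV store)."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}

    def register(self, api_key: str, tunnel_url: str) -> None:
        self._store[hash_api_key(api_key)] = tunnel_url.rstrip("/")

    def unregister(self, api_key: str) -> None:
        self._store.pop(hash_api_key(api_key), None)

    def resolve(self, authorization_header: str) -> Optional[str]:
        """Return the tunnel URL for an 'Authorization: Bearer <key>' header value."""
        scheme, _, token = authorization_header.partition(" ")
        if scheme.lower() != "bearer" or not token:
            return None
        return self._store.get(hash_api_key(token.strip()))
```

With this layout the proxy can route a request by hashing the bearer token it receives, and a leaked copy of the map would expose digests rather than usable keys.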
endpoint base-url
Show the API endpoint URL with example curl commands for chat, embeddings, and model listing.
Use case: Get the URL and auth header needed to call the API from curl, Python, or any OpenAI-compatible client.
Flags: None.
Examples:
endpoint base-url
$ endpoint base-url
Endpoint: https://random-1234.trycloudflare.com
Chat Completion:
curl -X POST https://random-1234.trycloudflare.com/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer <key>" \
-d '{"model":"default","messages":[{"role":"user","content":"Hello"}]}'
List Models:
curl https://random-1234.trycloudflare.com/v1/models \
-H "Authorization: Bearer <key>"
Embeddings:
curl -X POST https://random-1234.trycloudflare.com/v1/embeddings \
-H "Content-Type: application/json" \
-H "Authorization: Bearer <key>" \
-d '{"model":"default","input":"Hello world"}'
endpoint connect
Test API connectivity by querying /v1/models. Prints the raw JSON response on success.
Use case: Quick health check — verify the VPS API is responding before sending inference requests.
Flags: None.
Examples:
endpoint connect
$ endpoint connect
CONNECTIVITY TEST
✓ API is responding!
{
"object": "list",
"data": [
{"id": "default", "object": "model", ...}
]
}
endpoint logs
Stream real-time VPS engine logs via Server-Sent Events (SSE). Press Ctrl+C to stop.
Use case: Debug model loading, monitor inference requests, troubleshoot errors on the VPS in real-time.
Flags: None.
Examples:
endpoint logs
$ endpoint logs
ENGINE LOGS (Ctrl+C to stop)
[2026-05-30 12:00:00] INFO Starting llama.cpp server...
[2026-05-30 12:00:05] INFO Loading model default...
[2026-05-30 12:01:00] INFO POST /v1/chat/completions 200 2.3s
^C Log stream ended.
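For readers who want to consume the log stream without the CLI, the following sketch parses Server-Sent Events by hand using only the standard library. It assumes the `/v1/logs/stream` path from the API reference and that each log line arrives in a `data:` field; the exact event layout is not documented here.

```python
import urllib.request
from typing import Iterable, Iterator, List


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each complete SSE event."""
    buffer: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keepalive
        field, _, value = line.partition(":")
        if field == "data":
            buffer.append(value[1:] if value.startswith(" ") else value)
    # An unterminated trailing event is discarded, as the SSE format specifies.


def stream_logs(base_url: str, api_key: str) -> None:
    """Print log entries until the connection closes (performs network I/O)."""
    request = urllib.request.Request(
        f"{base_url.rstrip('/')}/v1/logs/stream",
        headers={"Authorization": f"Bearer {api_key}", "Accept": "text/event-stream"},
    )
    with urllib.request.urlopen(request) as response:
        for entry in iter_sse_data(line.decode("utf-8") for line in response):
            print(entry)
```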
endpoint models
List deployed models on the VPS. Without flags, queries the running VPS for its current model roster.
Use case: See what models are available for inference, verify a model was pulled successfully, or inspect model metadata.
Flags:
| Flag | Alias | Description |
|---|---|---|
| `--builtin` | `-b` | Show built-in models from local config (no VPS query needed) |
Examples:
endpoint models # Query VPS for deployed models
endpoint models --builtin # Show models defined in config
$ endpoint models --builtin
# Name Size HuggingFace Repo
1 Qwen2.5-Coder-3B-Instruct-abliterated (default) 1.99 GB bartowski/Qwen2.5-Coder-3B-Instruct-abliterated-GGUF
2 Qwen2.5-1.5B-Instruct 0.95 GB Qwen/Qwen2.5-1.5B-Instruct-GGUF
Hint:
endpoint pull <hf-repo> Deploy a new model
endpoint boot Start the VPS with these models
endpoint pull [model]
Pull a GGUF model from HuggingFace onto the VPS. The model is downloaded in the background on the VPS.
Use case: Add a new model (e.g., a fine-tune, larger/smaller variant, or different model family) without re-deploying the VPS.
Arguments:
| Argument | Required | Description |
|---|---|---|
| `model` | No | HuggingFace model ID (e.g. Qwen/Qwen2.5-1.5B-Instruct-GGUF). Prompts if omitted |
Examples:
endpoint pull Qwen/Qwen2.5-1.5B-Instruct-GGUF
endpoint pull # Interactive prompt
endpoint remove [model]
Remove a deployed model from the VPS to free disk space.
Use case: Free up storage on the VPS by removing unused models, or replace a model with a different quantization.
Arguments:
| Argument | Required | Description |
|---|---|---|
| `model` | No | Model ID to remove. Prompts if omitted |
Examples:
endpoint remove Qwen/Qwen2.5-1.5B-Instruct-GGUF
Removing model: Qwen/Qwen2.5-1.5B-Instruct-GGUF
✓ Model removed.
endpoint settings [action] [key] [value]
View or update engine parameters on the running VPS.
Use case: Tweak inference parameters (temperature, context length, threads) without restarting the VPS, or inspect current configuration.
Arguments:
| Action | Required | Description |
|---|---|---|
| (none) | — | View all current settings (default when no arguments given) |
| `update` | Yes | Change a setting. Requires `key` and `value` arguments |
Usage:
endpoint settings # View all current settings
endpoint settings update <key> <value> # Update a specific setting
Supported settings keys:
| Key | Type | Description |
|---|---|---|
| `context_length` | int | Context window size in tokens |
| `batch_size` | int | Batch size |
| `threads` | int | CPU inference threads |
| `temperature` | float | Sampling temperature (0.0–2.0) |
| `top_p` | float | Nucleus sampling threshold (0.0–1.0) |
| `top_k` | int | Top-K sampling |
| `repeat_penalty` | float | Repetition penalty (1.0 = none) |
| `min_p` | float | Min-P sampling threshold |
| `typical_p` | float | Typical sampling threshold |
| `flash_attn` | bool | Enable flash attention |
| `max_tokens` | int | Maximum tokens to generate |
| `ngl` | int | GPU layers to offload (0=CPU, 999=max) |
| `mlock` | bool | Lock model in physical RAM |
| `no_mmap` | bool | Disable memory-mapped model loading |
| `cache_size` | int | KV cache size |
| `ignore_eos` | bool | Ignore end-of-sequence token |
| `seed` | int | Random seed for reproducibility |
| `presence_penalty` | float | Presence penalty (-2.0–2.0) |
| `frequency_penalty` | float | Frequency penalty (-2.0–2.0) |
Examples:
endpoint settings update temperature 0.3
endpoint settings update flash_attn true
endpoint settings update context_length 16384
endpoint settings update threads 8
$ endpoint settings
SETTINGS
{
"context_length": 8192,
"batch_size": 512,
"temperature": 0.7,
"top_p": 0.9,
...
}
Use: endpoint settings update <key> <value>
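Because `endpoint settings update` takes the value as a command-line string, a script that wraps it may want to validate the value first. The sketch below coerces a string according to the types and ranges in the settings table above. The accepted boolean spellings and the function name are assumptions, and the CLI's own validation may differ.

```python
from typing import Union

# Types copied from the supported settings table.
SETTING_TYPES = {
    "context_length": int, "batch_size": int, "threads": int,
    "temperature": float, "top_p": float, "top_k": int,
    "repeat_penalty": float, "min_p": float, "typical_p": float,
    "flash_attn": bool, "max_tokens": int, "ngl": int,
    "mlock": bool, "no_mmap": bool, "cache_size": int,
    "ignore_eos": bool, "seed": int,
    "presence_penalty": float, "frequency_penalty": float,
}
# Ranges are stated in the table only for these keys.
RANGES = {
    "temperature": (0.0, 2.0), "top_p": (0.0, 1.0),
    "presence_penalty": (-2.0, 2.0), "frequency_penalty": (-2.0, 2.0),
}
TRUE_WORDS = {"1", "true", "yes", "on"}    # assumed spellings
FALSE_WORDS = {"0", "false", "no", "off"}  # assumed spellings


def coerce_setting(key: str, raw: str) -> Union[int, float, bool]:
    """Convert a CLI string to the type the settings table lists for key."""
    if key not in SETTING_TYPES:
        raise ValueError(f"unknown setting: {key}")
    kind = SETTING_TYPES[key]
    if kind is bool:
        word = raw.strip().lower()
        if word in TRUE_WORDS:
            return True
        if word in FALSE_WORDS:
            return False
        raise ValueError(f"{key} expects a boolean, got {raw!r}")
    value = kind(raw)  # int("8") or float("0.3"); raises ValueError on bad input
    if key in RANGES:
        low, high = RANGES[key]
        if not low <= value <= high:
            raise ValueError(f"{key} must be between {low} and {high}, got {value}")
    return value
```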
endpoint init
Interactive setup wizard for first-time configuration. Generates ~/.config/endpoint/endpoint-config.yaml.
Use case: First-time setup — creates the configuration file with your Kaggle username, kernel slug, and default model selection.
Flags: None.
Examples:
endpoint init
$ endpoint init
ENDPOINT SETUP WIZARD
Kaggle username [shesher]:
Kernel slug [endpoint-llm-vps]:
Default model:
1) Qwen2.5-Coder-3B-Instruct-abliterated (default, 1.99 GB)
2) Qwen2.5-1.5B-Instruct (0.95 GB)
Enter number or HF repo ID [1]: 2
✓ Configuration saved to ~/.config/endpoint/endpoint-config.yaml
Next steps:
endpoint doctor — Check dependencies
endpoint boot — Deploy the VPS
endpoint doctor
Run system diagnostics. Checks for required CLI tools (kaggle, curl, jq, git), Python modules, Kaggle API token validity, config status, and proxy health.
Use case: Verify your environment is ready before deploying, or diagnose issues when something isn't working.
Flags: None.
Examples:
endpoint doctor
$ endpoint doctor
CLI Tools:
✓ kaggle found
✓ curl found
✓ jq found
✓ git found
Python:
✓ yaml module available
Kaggle API:
✓ Token found
✓ API working (1 kernel listed)
Config:
✓ Username: shesher
✓ Kernel ID: shesher/endpoint-llm-vps
Proxy:
✓ Proxy healthy — 200 OK
✓ All checks passed.
endpoint provider-config
Show the MYTH Org provider identity, live API key, and deployed models from the VPS with metadata (context length, type, reasoning capability). Optionally filter by model name and see usage examples.
Use case: Inspect provider details for integration with OpenAI-compatible clients, or get ready-to-use curl/Python examples for a specific model.
Flags:
| Flag | Alias | Description |
|---|---|---|
| `--model` | `-m` | Filter displayed models (exact, prefix, or substring match) |
Examples:
endpoint provider-config # Show all provider info
endpoint provider-config --model Qwen # Filter to Qwen models
endpoint provider-config -m "Qwen2.5-Coder" # Exact match with examples
PROVIDER CONFIGURATION
Provider ID: myth
Provider Name: MYTH Org
Base URL: https://random-1234.trycloudflare.com
API Key: sk-xxxxxxxxxxxx
Deployed Models:
Model ID Display Name Context Type Reasoning
Qwen2.5-Coder-3B-Instruct-... Qwen 2.5 Coder 3B Abliterated 32K text no
Usage Examples (curl, Python):
curl -X POST https://random-1234.trycloudflare.com/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-xxxxxxxxxxxx" \
-d '{"model":"Qwen2.5-Coder-3B-Instruct-...","messages":[{"role":"user","content":"Hello"}]}'
endpoint autocomplete [shell]
Install shell completion for bash or zsh. Appends the completion function to ~/.bashrc or ~/.zshrc.
Use case: Enable tab-completion for all endpoint commands, subcommands, and flags in your shell.
Arguments:
| Argument | Required | Default | Description |
|---|---|---|---|
| `shell` | No | `bash` | Shell type: `bash` or `zsh` |
Examples:
endpoint autocomplete # Install bash completions (default)
source ~/.bashrc # Reload to activate
endpoint autocomplete zsh # Install zsh completions
source ~/.zshrc # Reload to activate
✓ Bash completion installed. Restart your shell or run: source ~/.bashrc
endpoint help [command]
Show help for a specific command. Without arguments, prints the general help listing all commands.
Use case: Quick reference for a command's flags and usage without checking the full documentation.
Arguments:
| Argument | Required | Description |
|---|---|---|
| `command` | No | Command name to get help for |
Examples:
endpoint help # General help
endpoint help boot # Help for boot command
endpoint help kill-all # Help for kill-all
endpoint help settings # Help for settings
$ endpoint help boot
usage: endpoint boot [-h] [--no-watch] [--p100]
Deploy LLM VPS to Kaggle and stream status
options:
-h, --help show this help message and exit
--no-watch Build and push without streaming
--p100 Use P100 GPU (default T4 x2)
endpoint version
Show version and build information.
Use case: Verify which version of Endpoint you have installed, confirm the active config path, and check the kernel name.
Flags: None.
Examples:
endpoint version
$ endpoint version
Endpoint CLI v0.1.0
License: GPLv3+
Engine version: 0.1.0
Config: /home/user/.config/endpoint/endpoint-config.yaml
Kernel: shesher/endpoint-llm-vps
Configuration
Endpoint uses a YAML configuration file loaded with layered precedence:
- Bundled defaults — `endpoint/data/endpoint-config.yaml`
- User config — `~/.config/endpoint/endpoint-config.yaml`
- Local override — `./endpoint-config.yaml` (project root, gitignored)
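Layered precedence of this kind is typically implemented as a recursive dictionary merge. The sketch below assumes that later layers override earlier ones (bundled, then user, then local) and that nested sections merge key by key. Both are assumptions about Endpoint's behaviour, and the function names are illustrative. Each layer would come from parsing one YAML file, for example with `yaml.safe_load`.

```python
from typing import Any, Dict, Mapping, Optional


def deep_merge(base: Mapping[str, Any], override: Mapping[str, Any]) -> Dict[str, Any]:
    """Return a new dict where override wins, merging nested mappings recursively."""
    merged: Dict[str, Any] = dict(base)
    for key, value in override.items():
        if isinstance(value, Mapping) and isinstance(merged.get(key), Mapping):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value  # scalars and lists are replaced wholesale
    return merged


def resolve_config(*layers: Optional[Mapping[str, Any]]) -> Dict[str, Any]:
    """Merge layers in order; a missing file can be passed as None."""
    result: Dict[str, Any] = {}
    for layer in layers:
        result = deep_merge(result, layer or {})
    return result
```

For instance, `resolve_config(bundled, user, local)` keeps every bundled default that neither the user config nor the local override redefines.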
Full Schema
# ── Identity ──────────────────────────────────────────────────────
identity:
  kaggle_username: "your-username"   # Kaggle username
  kernel_slug: "endpoint-llm-vps"    # Kaggle kernel slug
  api_key: ""                        # API key (auto-provisioned on boot)

# ── Engine ────────────────────────────────────────────────────────
engine:
  version: "0.1.0"                   # Engine version
  engine_port: 5003                  # FastAPI server port
  accelerator: "cpu"                 # cpu | gpu_t4 | gpu_p100 | tpu

# ── Signal (ntfy.sh) ─────────────────────────────────────────────
signal:
  topic_prefix: "endpoint"           # ntfy.sh topic prefix

# ── LLM Inference ─────────────────────────────────────────────────
llm:
  port: 8080                         # llama.cpp server port
  context_length: 8192               # Context window (tokens)
  batch_size: 512                    # Batch size
  threads: 4                         # CPU threads
  flash_attn: true                   # Flash attention
  max_tokens: 2048                   # Max generation tokens
  temperature: 0.7                   # Sampling temperature
  top_p: 0.9                         # Nucleus sampling
  top_k: 40                          # Top-K sampling
  repeat_penalty: 1.1                # Repetition penalty
  min_p: 0.0                         # Min-P sampling
  typical_p: 0.0                     # Typical sampling
  ngl: 0                             # GPU layers (0=CPU, 999=max)
  tensor_split: ""                   # Multi-GPU split: "1,1"
  mlock: false                       # Lock model in RAM
  no_mmap: false                     # Disable memory mapping
  cache_size: 0                      # KV cache size (0=auto)
  chat_template: ""                  # Jinja2 template override
  ignore_eos: false                  # Ignore EOS token

# ── Image Generation (sd-server) ─────────────────────────────────
image:
  port: 8081                         # sd-server port
  steps: 20                          # Denoising steps
  cfg_scale: 7.0                     # Guidance scale
  width: 1024                        # Output width
  height: 1024                       # Output height
  sampler: "euler"                   # Sampler type

# ── Video Generation ─────────────────────────────────────────────
video:
  port: 8081                         # sd-server port (shared)
  fps: 12                            # Frames per second
  frames: 41                         # Total frames
  cfg_scale: 5.0                     # Video CFG scale
  steps: 20                          # Denoising steps

# ── Models ────────────────────────────────────────────────────────
default_model: "bartowski/Qwen2.5-Coder-3B-Instruct-abliterated-GGUF"
default_model_file: "*.Q4_K_M.gguf"
default_model_index: 0
models:
  - name: "Qwen2.5-Coder-3B-Instruct-abliterated"
    hf_repo: "bartowski/Qwen2.5-Coder-3B-Instruct-abliterated-GGUF"
    hf_file: "*.Q4_K_M.gguf"
    size_gb: 1.99
    temperature: 0.7
    top_p: 0.9
    top_k: 40
  - name: "Qwen2.5-1.5B-Instruct"
    hf_repo: "Qwen/Qwen2.5-1.5B-Instruct-GGUF"
    hf_file: "Qwen2.5-1.5B-Instruct-Q4_K_M.gguf"
    size_gb: 0.95
    temperature: 0.7
    top_p: 0.9
    top_k: 40

# ── Proxy ─────────────────────────────────────────────────────────
proxy:
  enabled: true
  domain: "api.endpoint.dpdns.org"
  url: "https://api.endpoint.dpdns.org"
  base_url: "https://api.endpoint.dpdns.org/v1"
  register_url: "https://api.endpoint.dpdns.org/__register"
  unregister_url: "https://api.endpoint.dpdns.org/__unregister"

# ── VPS Packages ─────────────────────────────────────────────────
packages:
  system:
    - curl
    - wget
    - tar
    - jq
    - python3-dev
    - pip
  python:
    - huggingface_hub
    - requests
    - fastapi
    - uvicorn[standard]
    - pydantic
Accelerator Reference
| Accelerator | CLI Flag | Kaggle Type | Max Model Size | Quota |
|---|---|---|---|---|
| CPU | (default) | CPU only | 20B params | Unlimited |
| GPU T4 x2 | `--gpu` / `-g` | `NvidiaTeslaT4` | 70B params | 30h/week |
| GPU P100 | `--p100` | `NvidiaTeslaP100` | 40B params | 30h/week |
| TPU v5e-8 | `--tpu` / `-t` | `TpuV5E8` | 100B params | 20h/week |
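The table can be turned into a small lookup that picks the accelerator with the lowest ceiling still large enough for a model. This is only an illustration built from the parameter limits listed above: it ignores quantization and weekly quota, and the helper name is hypothetical. The accelerator identifiers reuse the `engine.accelerator` values from the config schema.

```python
from typing import List, Optional, Tuple

# (accelerator id, CLI flag or None for the default, max parameters in billions)
ACCELERATORS: List[Tuple[str, Optional[str], float]] = [
    ("cpu", None, 20),
    ("gpu_p100", "--p100", 40),
    ("gpu_t4", "--gpu", 70),
    ("tpu", "--tpu", 100),
]


def smallest_accelerator(params_b: float) -> Tuple[str, Optional[str]]:
    """Return (accelerator id, CLI flag) for the lowest tier that fits params_b."""
    for name, flag, max_params_b in ACCELERATORS:
        if params_b <= max_params_b:
            return name, flag
    raise ValueError(f"{params_b}B parameters exceeds every listed accelerator limit")
```

For a 34B model this returns `("gpu_p100", "--p100")`, which corresponds to `endpoint boot --p100`.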
API Reference
Once deployed, the engine exposes a full OpenAI-compatible REST API.
Public Endpoints (no auth required)
| Method | Path | Description |
|---|---|---|
| `GET` | `/` | Root status: service info, docs link |
| `GET` | `/health` | Health check: backend status, model loaded, uptime |
| `GET` | `/tunnel` | Tunnel URL and status |
| `GET` | `/metrics` | Request count, latency, error rate, latency buckets |
| `GET` | `/docs` | Swagger UI documentation |
| `GET` | `/openapi.json` | OpenAPI 3.0 schema |
| `GET` | `/v1/apikey` | Get the current API key |
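Since `/health` needs no authentication, a client can poll it to wait for the engine after boot. The response schema is not documented here, so this sketch treats any successful JSON response as healthy. That is an assumption, as are the function names and retry defaults.

```python
import json
import time
import urllib.error
import urllib.request
from typing import Any, Callable, Optional


def fetch_health(base_url: str, timeout: float = 10.0) -> Optional[Any]:
    """Return the parsed /health body, or None if the request or parsing fails."""
    try:
        with urllib.request.urlopen(f"{base_url.rstrip('/')}/health", timeout=timeout) as resp:
            return json.load(resp)
    except (urllib.error.URLError, TimeoutError, json.JSONDecodeError):
        return None


def wait_until_healthy(
    base_url: str,
    attempts: int = 30,
    delay_s: float = 5.0,
    fetch: Callable[[str], Optional[Any]] = fetch_health,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until fetch returns a body or attempts run out."""
    for attempt in range(attempts):
        if fetch(base_url) is not None:
            return True
        if attempt < attempts - 1:
            sleep(delay_s)  # no sleep after the final attempt
    return False
```

The `fetch` and `sleep` parameters exist so the loop can be exercised without a live tunnel.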
Authenticated Endpoints (Bearer token required)
Chat & Text
| Method | Path | Description |
|---|---|---|
| `POST` | `/v1/chat/completions` | Chat completions with streaming SSE support |
| `POST` | `/v1/completions` | Legacy text completions |
Chat requests accept the following OpenAI parameters: model, messages, stream, temperature, top_p, max_tokens, stop, frequency_penalty, presence_penalty, logprobs, top_logprobs, seed, n, user, and reasoning_effort.
Models
| Method | Path | Description |
|---|---|---|
| `GET` | `/v1/models` | List all models with metadata (type, context, size, reasoning) |
| `GET` | `/v1/models/{model_id}` | Get specific model details |
| `POST` | `/v1/models/pull?model=<id>` | Pull model from HuggingFace |
| `DELETE` | `/v1/models/{model_id}` | Remove a model |
Embeddings & Rerank
| Method | Path | Description |
|---|---|---|
| `POST` | `/v1/embeddings` | Generate text embeddings |
| `POST` | `/v1/rerank` | Re-rank documents by relevance |
Image & Video
| Method | Path | Description |
|---|---|---|
| `POST` | `/v1/images/generations` | Generate images (txt2img) |
| `POST` | `/v1/video/generations` | Generate video |
| `POST` | `/v1/video/edits` | Edit video |
Tokenization & Moderation
| Method | Path | Description |
|---|---|---|
| `POST` | `/v1/tokenize` | Tokenize text |
| `POST` | `/v1/detokenize` | Detokenize token IDs |
| `POST` | `/v1/moderations` | Content moderation |
Settings & Logs
| Method | Path | Description |
|---|---|---|
| `GET` | `/v1/settings` | Get current engine settings |
| `POST` | `/v1/settings` | Update engine settings |
| `GET` | `/v1/logs` | Get recent log lines |
| `GET` | `/v1/logs/stream` | SSE stream of live logs |
Usage Examples
# Chat with streaming
curl -X POST https://your-tunnel.trycloudflare.com/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-xxxx" \
-d '{
"model": "default",
"messages": [{"role": "user", "content": "Hello!"}],
"stream": true
}'
# Chat with streaming (OpenAI Python SDK)
from openai import OpenAI

client = OpenAI(
    base_url="https://your-tunnel.trycloudflare.com/v1",
    api_key="sk-xxxx",
)

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in response:
    # Some chunks carry no choices or an empty delta; skip those.
    if chunk.choices:
        print(chunk.choices[0].delta.content or "", end="", flush=True)
print()

# Generate embeddings
response = client.embeddings.create(
    model="default",
    input=["Hello world", "How are you?"],
)
print(response.data[0].embedding[:5])  # First 5 dimensions
Environment Variables
CLI Variables
| Variable | Required | Default | Description |
|---|---|---|---|
| `KAGGLE_API_TOKEN` | Yes* | — | Kaggle API token (kgat_...). Alternative to kaggle.json |
| `ENDPOINT_API_KEY` | Auto-set | — | Engine API key (set by CLI during boot) |
| `CONFIG` | No | — | Path to YAML config file override |
*Required if ~/.kaggle/kaggle.json is not present.
Engine Variables (set on the Kaggle VPS)
| Variable | Default | Description |
|---|---|---|
| `LLAMA_PORT` | `8080` | llama.cpp server port |
| `SD_PORT` | `8081` | sd-server port |
| `ENGINE_PORT` | `5003` | FastAPI engine port |
| `LLM_SIGNAL_TOPIC` | `""` | ntfy.sh signal topic (auto-derived) |
| `LLM_CONTROL_TOPIC` | `""` | ntfy.sh control topic (auto-derived) |
| `SESSION_ID` | `"default"` | Session identifier |
| `ENGINE_VERSION` | `"0.1.0"` | Engine version string |
| `LLM_ACCELERATOR` | — | Accelerator type (cpu, gpu_t4, etc.) |
| `LLM_IDLE_TIMEOUT` | `3600` | Idle shutdown timeout in seconds (0 = disable) |
| `RATE_LIMIT` | `"1"` | Enable rate limiting (1/true/yes) |
| `LLM_DEBUG` | — | Enable verbose error messages |
| `LLM_CONTEXT_LEN` | — | Context length override |
| `LLM_BATCH_SIZE` | — | Batch size override |
| `LLM_MAX_TOKENS` | — | Max tokens override |
| `LLM_NGL` | — | GPU layers override |
| `LLM_TENSOR_SPLIT` | — | Tensor split override |
| `DEFAULT_MODEL` | — | Default model ID override |
| `DEFAULT_MODEL_FILE` | — | Default model filename override |
| `MODEL_TYPE` | — | Model type override |
| `HF_TOKEN` | — | HuggingFace token for gated model downloads |
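The `LLM_IDLE_TIMEOUT` semantics (seconds, with 0 disabling shutdown) can be sketched as a small tracker that the engine would update on every request. This illustrates the behaviour described in the table and is not the engine's code; the class, the helper, and the fallback to the default on unparsable values are assumptions.

```python
import time
from typing import Callable, Mapping


def timeout_from_env(environ: Mapping[str, str], default: int = 3600) -> int:
    """Read LLM_IDLE_TIMEOUT, falling back to the default when unset or invalid."""
    try:
        return int(environ.get("LLM_IDLE_TIMEOUT", ""))
    except ValueError:
        return default


class IdleTracker:
    """Track the time since the last request and report when the timeout has elapsed."""

    def __init__(self, timeout_s: int = 3600,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_activity = clock()

    def touch(self) -> None:
        # Call this from request-handling middleware.
        self.last_activity = self.clock()

    def expired(self) -> bool:
        if self.timeout_s <= 0:
            return False  # 0 disables idle shutdown
        return self.clock() - self.last_activity >= self.timeout_s
```

A background task could check `expired()` periodically and trigger the same shutdown path as `endpoint stop`.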
Project Structure
endpoint/
├── endpoint/ # CLI package (pip-installable)
│ ├── __init__.py # Version string
│ ├── __main__.py # python -m endpoint support
│ ├── main.py # Argparse parser, dispatch
│ ├── commands.py # All 19 command implementations
│ ├── core.py # Config, VPSClient, console, signals, cache
│ ├── py.typed # PEP 561 type marker
│ └── data/ # Bundled defaults
│ ├── endpoint-config.yaml
│ └── endpoint-config.example.yaml
├── engine/ # Inference server (runs on Kaggle)
│ ├── engine.py # FastAPI app: all API endpoints, lifecycle
│ └── models_config.py # Model management, GGUF parsing, settings
├── scripts/ # Build & automation
│ ├── master_build_notebook.py # Kaggle notebook generator (962 lines)
│ ├── update_and_embed.py # Engine payload sync into notebook
│ ├── lint.py # Code quality pipeline (ruff + mypy + pytest)
│ └── release.py # PyPI and GitHub release automation
├── cloudflare/ # Cloudflare proxy worker
│ ├── deploy.py # Wrangler-based deployment orchestration
│ ├── proxy-worker.js # CF Worker: rate limiting, KV tunnel map
│ ├── wrangler.toml # Wrangler configuration
│ └── wrangler.example.toml # Example wrangler config
├── tests/
│ └── sanity_test.py # 26-test static analysis suite
├── man/
│ └── endpoint.1 # Man page (roff format)
├── Makefile # Build, lint, test, publish targets
├── pyproject.toml # Package metadata + tool configuration
├── uv.lock # Dependency lock file
├── endpoint-config.yaml # Local config (gitignored)
└── README.md
Development
Setup
git clone https://github.com/shesher/endpoint.git
cd endpoint
make install # or: uv sync
Commands
make all # Full pipeline: lint + test (default target)
make lint # Run ruff linter (format check + lint)
make test # Run pytest test suite
make build # Build PyPI wheel + sdist
make install # Editable install with uv
make release # Dry-run release build
make publish # Build & publish to PyPI
make clean # Remove build artifacts
Quality Standards
- Linting: ruff with full ruleset (E, F, I, N, W, UP, B, SIM, ARG, etc.)
- Formatting: ruff formatter (compatible with Black)
- Type Checking: mypy for CLI, engine, scripts, and tests
- Testing: pytest with 26 static analysis tests covering syntax, security, naming, config schema, base64 roundtrip, version consistency, import resolution, license headers, file sizes, and more
- CI: `make all` runs the full pipeline: lint → test
License
GNU General Public License v3.0 or later © 2024-2026 Shesher Hasan.
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.