Free Open-Source LLM VPS — Ultra-Fast Inference on Kaggle

⚡ Endpoint VPS

Turn Kaggle's Free Tier into a Production-Grade LLM Inference Server

Deploy an OpenAI-compatible LLM inference server on Kaggle with a single command — zero cost, zero cloud bills.

Quick Start · Commands · Configuration · API Reference · Installation


uvx endpoint boot           # One-shot deploy (no install)
pip install endpoint        # Or install globally
endpoint init               # Setup wizard
endpoint boot               # Deploy the VPS

What is Endpoint?

Endpoint is a Python CLI tool that provisions a free, persistent, OpenAI-compatible LLM inference server on Kaggle's infrastructure. It combines three components into one seamless workflow:

  1. Kaggle Notebook — Builds and runs llama.cpp (or sd-server for images/video) on Kaggle's free GPU (T4 x2, P100) or CPU with automatic model loading
  2. Cloudflare Tunnel — Exposes the server via a secure HTTPS tunnel (trycloudflare.com) with an optional Cloudflare Worker proxy for custom domains and rate limiting
  3. CLI — Manages the full lifecycle: boot, stop, monitor, pull models, update settings, and stream logs — all from your terminal

Key differentiator: Unlike managed API services, you retain full control: choose any GGUF model, tune every inference parameter, and pay nothing. llama.cpp runs on Kaggle's multi-core CPUs by default and can use the optional GPU accelerators.


Features

Core Platform

  • Zero Cost — Runs entirely on Kaggle's free tier (CPU: unlimited hours, GPU: 30h/week, TPU: 20h/week)
  • OpenAI-Compatible API — Full /v1/chat/completions, /v1/completions, /v1/embeddings, /v1/models with streaming SSE
  • Multi-Model — Pull any GGUF model from HuggingFace (text, image, video, embeddings, voice, multimodal)
  • Model Hot-Swap — Switch models at runtime without redeploying
  • Interactive Boot Wizard — Guided model selection with real-time HuggingFace metadata fetching
  • Persistent Storage — Models and settings survive session restarts via Kaggle datasets

Inference Capabilities

  • Chat Completions — Full OpenAI API with streaming (SSE), function calling, logprobs, stop sequences
  • Text Completions — Legacy completions endpoint
  • Embeddings — Generate vector embeddings for RAG pipelines
  • Image Generation — Stable Diffusion via sd-server (txt2img, img2img)
  • Video Generation — Wan/LTX video models via sd-server
  • Tokenization — Tokenize/detokenize, content moderation, reranking
  • Reasoning Models — DeepSeek-R1 and reasoning-aware model support

Performance & Optimization

  • llama.cpp — State-of-the-art CPU/GPU inference, compiled with -march=native and -O3
  • Flash Attention — Reduced VRAM usage and faster context processing
  • Multi-GPU — Tensor split support for T4 x2 configurations
  • KV Cache — Configurable cache size, quantization, and management
  • Auto-Batching — Dynamic batch size calculation based on context length
  • Connection Pooling — Persistent HTTP sessions with connection reuse

Operations

  • Real-Time Logs — SSE-based log streaming from the engine
  • Live Status — Watch boot progress and engine health in real-time
  • Idle Auto-Termination — Configurable timeout shuts down idle VPS to save resources
  • Rate Limiting — Per-IP rate limiting middleware in both CLI and Cloudflare proxy
  • Graceful Shutdown — SIGINT/SIGTERM handling with clean resource cleanup
  • Shell Completions — Bash and Zsh tab completion for all commands and flags
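
The per-IP rate limiting listed above could be sketched as a sliding-window counter. The class name, the injectable clock, and the default of 100 requests per 60 seconds (taken from the proxy figure in the architecture diagram) are illustrative assumptions, not the project's actual middleware.

```python
import time
from collections import defaultdict, deque
from typing import Callable


class SlidingWindowLimiter:
    """Hypothetical per-IP limiter: at most `limit` requests per `window` seconds."""

    def __init__(self, limit: int = 100, window: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = limit
        self.window = window
        self.clock = clock
        self._hits: dict[str, deque[float]] = defaultdict(deque)

    def allow(self, ip: str) -> bool:
        """Record a request from `ip` and report whether it is within the limit."""
        now = self.clock()
        hits = self._hits[ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

Passing the clock in as a parameter keeps the limiter testable without sleeping; a middleware would call `allow(client_ip)` and return HTTP 429 when it yields `False`.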

Security & Reliability

  • Cloudflare Proxy — Optional Worker-based proxy with KV-backed tunnel map and SHA-256 API key hashing
  • API Key Auth — Bearer token authentication on all inference endpoints
  • GPU/TPU Detection — Automatic compute capability detection for optimal binary selection
  • Auto-Retry — Upstream proxy retries on transient network failures
  • Atomic Config Writes — Crash-safe configuration file updates
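
A crash-safe config update is commonly done by writing to a temporary file and renaming it over the target. The sketch below shows that general technique; the function name is hypothetical and this is not necessarily how Endpoint implements it.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_text(path: Path, text: str) -> None:
    """Write `text` to `path` so readers see either the old or the new file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # Keep the temp file in the same directory so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(text)
            tmp.flush()
            os.fsync(tmp.fileno())  # Push data to disk before the rename.
        os.replace(tmp_name, path)  # Atomic on POSIX when it succeeds.
    except BaseException:
        # Remove the partial temp file; the original file stays untouched.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```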

Quick Start

Prerequisites

  • Kaggle account (phone verified for API access)
  • Kaggle API token saved to ~/.kaggle/kaggle.json:
    {"username":"your-username","key":"kgat_xxxxxxxxxxxxxxxx"}
    
  • Python 3.12+

Installation

Choose your preferred method:

# Run instantly without installing (recommended for first try):
uvx endpoint boot

# Install globally with pip:
pip install endpoint

# Or with uv:
uv add endpoint

# Or with pipx (isolated environment):
pipx install endpoint

# From source:
git clone https://github.com/shesher/endpoint.git
cd endpoint
make install

Configure

# Set your Kaggle API token:
export KAGGLE_API_TOKEN='kgat_xxxx'

# Run the interactive setup wizard:
endpoint init

The wizard prompts for your Kaggle username, kernel slug, and default model selection. It creates ~/.config/endpoint/endpoint-config.yaml with all settings.

Deploy

# Interactive boot with model selection and configuration:
endpoint boot

# Boot with GPU acceleration (T4 x2, up to 70B param models):
endpoint boot --gpu

# Boot with TPU acceleration (v5e-8, up to 100B param models):
endpoint boot --tpu

# Boot without live streaming status:
endpoint boot --no-watch

Verify & Use

# Show API endpoint with example curl commands:
endpoint base-url

# Test API connectivity:
endpoint connect

# List deployed models:
endpoint models

# Stream engine logs:
endpoint logs

Stop

# Graceful shutdown with cache cleanup:
endpoint stop

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        YOUR TERMINAL                            │
│  ┌──────────────┐  ┌──────────────┐  ┌────────────────────────┐ │
│  │ endpoint CLI │  │ Shell Complet│  │ Config (YAML)          │ │
│  │ (commands.py)│  │ (bash/zsh)   │  │ ~/.config/endpoint/    │ │
│  └──────┬───────┘  └──────────────┘  └────────────────────────┘ │
└─────────┼───────────────────────────────────────────────────────┘
          │ ① boot / stop / pull / settings
          ▼
┌─────────────────────────────────────────────────────────────────┐
│                    KAGGLE NOTEBOOK (VPS)                        │
│                                                                 │
│  ┌────────────────────────────────────────────────────────────┐ │
│  │                    FastAPI Engine                          │ │
│  │  ┌─────────────┐  ┌──────────────┐  ┌──────────────────┐   │ │
│  │  │ Chat/Embed  │  │ Image Gen    │  │ Video Gen        │   │ │
│  │  │ /v1/chat    │  │ /v1/images   │  │ /v1/video        │   │ │
│  │  │ /v1/embed   │  │ sd-server    │  │ sd-server        │   │ │
│  │  └──────┬──────┘  └──────┬───────┘  └───────┬──────────┘   │ │
│  │         │                │                  │              │ │
│  │  ┌──────┴────────────────┴──────────────────┴──────────┐   │ │
│  │  │               llama.cpp backend                     │   │ │
│  │  │  GGUF models • KV cache • Flash attn • Batching     │   │ │
│  │  └─────────────────────┬───────────────────────────────┘   │ │
│  └────────────────────────┼───────────────────────────────────┘ │
│                           │                                     │
│  ┌────────────────────────┴────────────────────────────────┐    │
│  │              cloudflared tunnel                         │    │
│  │  https://xxxx.trycloudflare.com → engine:5003           │    │
│  └────────────────────────┬────────────────────────────────┘    │
└───────────────────────────┼─────────────────────────────────────┘
                            │ ③ proxy requests
                            ▼
┌─────────────────────────────────────────────────────────────────┐
│                  CLOUDFLARE WORKER (Optional)                   │
│                                                                 │
│  ┌────────────────────────────────────────────────────────────┐ │
│  │  proxy-worker.js                                           │ │
│  │  • Rate limiting (100 req/min per IP)                      │ │
│  │  • KV-backed tunnel map (SHA-256 hashed API keys)          │ │
│  │  • Request proxying with retry + timeout                   │ │
│  │  • CORS headers for browser clients                        │ │
│  └────────────────────────────────────────────────────────────┘ │
│  Domain: api.endpoint.dpdns.org → tunnel                        │
└─────────────────────────────────────────────────────────────────┘
                            │
                            ▼
              ┌─────────────────────────┐
              │  YOUR APPLICATION       │
              │  OpenAI SDK / curl / etc│
              └─────────────────────────┘

Signal Flow

The CLI and notebook communicate via ntfy.sh pub/sub for push-based signaling:

CLI ──boot──▶ Kaggle ──STATUS──▶ ntfy.sh ──▶ CLI (streamed in terminal)
CLI ──stop──▶ ntfy.sh ──KILL────▶ Kaggle ──shutdown──▶ ntfy.sh ──▶ CLI
CLI ◀──WS URL──────────────────── ntfy.sh ◀──base64── Kaggle (tunnel acquired)

Command Reference

All 19 commands with complete flag documentation.

Global Flags

Available before any subcommand:

Flag       Alias  Description
--help     -h     Show help message and exit
--version  -v     Show version and build information
--gpu      -g     GPU accelerator (T4 x2; max 70B params). Mutually exclusive with --tpu
--tpu      -t     TPU v5e-8 accelerator (max 100B params). Mutually exclusive with --gpu

endpoint boot

Deploy the LLM VPS to Kaggle — interactive model selection, configuration, and deployment.

Flags:

Flag        Description
--no-watch  Build and push without streaming status signals
--p100      Use P100 GPU accelerator instead of default T4 x2

Use case: Starting a fresh inference session. Run this first after configuration.

Examples:

endpoint boot                  # Interactive boot (recommended)
endpoint boot --gpu            # Boot with T4 x2 GPU accelerator
endpoint boot --tpu            # Boot with TPU v5e-8 accelerator
endpoint boot --no-watch       # Boot without streaming status
endpoint boot --p100           # Boot with P100 GPU accelerator

$ endpoint boot

MODEL SELECTION
  Current default: bartowski/Qwen2.5-Coder-3B-Instruct-abliterated-GGUF
  Enter any HuggingFace GGUF model ID (e.g. Qwen/Qwen2.5-1.5B-Instruct-GGUF)

  Model ID [Enter to keep default]: Qwen/Qwen2.5-1.5B-Instruct-GGUF
  Type:         text
  Context:      32,768 tokens

  Available GGUF files:
    1) Qwen2.5-1.5B-Instruct-Q4_K_M.gguf (0.99 GB)
    2) Qwen2.5-1.5B-Instruct-Q5_K_M.gguf (1.14 GB)
    3) Qwen2.5-1.5B-Instruct-Q6_K.gguf (1.32 GB)
    4) Qwen2.5-1.5B-Instruct-Q8_0.gguf (1.70 GB)

  Enter number or filename [1]: 1

MODEL CONFIGURATION
  max_tokens [2048]:
  temperature [0.7]:
  top_p [0.9]:
  context_length [32768]:

✓ Model and configuration saved.
DEPLOYMENT
  Kernel "shesher/endpoint-llm-vps" is "none"
  Building notebook...
  Pushing to Kaggle...
  ...

endpoint stop

Stop the running VPS instance — sends kill signal, deletes kernel, clears cached tunnel URL and API key.

Use case: Tear down the VPS when done to free Kaggle resources and avoid using up accelerator quota while idle.

Flags: None.

Examples:

endpoint stop

✓ Stop signal sent.
✓ Kernel deleted.
✓ Cached credentials cleared.

endpoint status

Show VPS kernel state, tunnel URL, and deployed models using cached data with background refresh.

Use case: Quickly check whether your VPS is running, what models are deployed, and get quick-access links.

Flags: None.

Examples:

endpoint status

Kernel: running (shesher/endpoint-llm-vps)
  URL: https://random-1234.trycloudflare.com

Deployed Models:
  • bartowski/Qwen2.5-Coder-3B-Instruct-abliterated-GGUF
  • Qwen/Qwen2.5-1.5B-Instruct-GGUF

Quick links:
  endpoint base-url    — Show API endpoint
  endpoint logs        — Stream engine logs
  endpoint stop        — Stop VPS

endpoint watch

Stream boot and engine status signals in real-time via ntfy.sh.

Use case: Monitor deployment progress in a separate terminal while endpoint boot --no-watch runs.

Flags: None.

Examples:

endpoint watch

$ endpoint watch
WATCHING endpoint-shesher...
  → notebook-cell-execution-begins
  → Tunnel: https://random-1234.trycloudflare.com
  → Endpoint IS ONLINE
^C Stopped.

endpoint kill-all

Terminate all running Kaggle kernels. Lists active kernels, asks for confirmation, then stops them in parallel.

Use case: Clean slate — stop every running Kaggle kernel when you have multiple stale sessions, or automate cleanup in scripts.

Flags:

Flag   Alias   Description
--yes  -y      Skip interactive confirmation prompt
--all  (none)  Also delete inactive (completed/error) kernels after stopping active ones

Examples:

endpoint kill-all                    # Interactive mode
endpoint kill-all --yes              # Non-interactive (script-friendly)
endpoint kill-all --yes --all        # Kill everything including completed

$ endpoint kill-all --all
Checking kernels...
  Active: shesher/endpoint-llm-vps, shesher/experiment-01
  Inactive: shesher/old-test

Stop these 2 active kernels? [y/N] y
  ✓ Stopped shesher/endpoint-llm-vps
  ✓ Stopped shesher/experiment-01
  ✓ Deleted shesher/old-test
Done. Stopped 2, errors 0.

endpoint register

Register the tunnel URL with the Cloudflare proxy. Fetches the API key from the engine and registers the tunnel so the proxy forwards requests.

Use case: After booting, if proxy registration failed or needs to be re-done (e.g., tunnel URL changed, proxy was restarted).

Flags:

Flag     Alias  Description
--force  -f     Skip the proxy health check and register anyway

Examples:

endpoint register                    # Normal registration
endpoint register --force            # Force-register even if proxy is unresponsive

✓ Registered tunnel with proxy.
Test with:
  curl https://your-proxy.workers.dev/v1/models \
    -H "Authorization: Bearer sk-xxxxxxxxxxxx"

endpoint base-url

Show the API endpoint URL with example curl commands for chat, embeddings, and model listing.

Use case: Get the URL and auth header needed to call the API from curl, Python, or any OpenAI-compatible client.

Flags: None.

Examples:

endpoint base-url

$ endpoint base-url
Endpoint: https://random-1234.trycloudflare.com

Chat Completion:
  curl -X POST https://random-1234.trycloudflare.com/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer <key>" \
    -d '{"model":"default","messages":[{"role":"user","content":"Hello"}]}'

List Models:
  curl https://random-1234.trycloudflare.com/v1/models \
    -H "Authorization: Bearer <key>"

Embeddings:
  curl -X POST https://random-1234.trycloudflare.com/v1/embeddings \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer <key>" \
    -d '{"model":"default","input":"Hello world"}'

endpoint connect

Test API connectivity by querying /v1/models. Prints the raw JSON response on success.

Use case: Quick health check — verify the VPS API is responding before sending inference requests.

Flags: None.

Examples:

endpoint connect

$ endpoint connect
CONNECTIVITY TEST
✓ API is responding!
{
  "object": "list",
  "data": [
    {"id": "default", "object": "model", ...}
  ]
}
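
The same check can be reproduced from Python with only the standard library. This is a sketch of an equivalent request against /v1/models, not the CLI's own code; both function names are hypothetical, and the base URL and key are whatever endpoint base-url reports for your session.

```python
import json
import urllib.request


def build_models_request(base_url: str, api_key: str) -> urllib.request.Request:
    """Build a GET request for /v1/models with bearer authentication."""
    url = base_url.rstrip("/") + "/v1/models"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {api_key}"})


def list_model_ids(base_url: str, api_key: str, timeout: float = 10.0) -> list[str]:
    """Fetch /v1/models and return the model IDs (performs a network call)."""
    request = build_models_request(base_url, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return [model["id"] for model in body.get("data", [])]
```

Calling `list_model_ids("https://random-1234.trycloudflare.com", "<key>")` against a running VPS should return the IDs shown in the JSON response above.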

endpoint logs

Stream real-time VPS engine logs via Server-Sent Events (SSE). Press Ctrl+C to stop.

Use case: Debug model loading, monitor inference requests, troubleshoot errors on the VPS in real-time.

Flags: None.

Examples:

endpoint logs

$ endpoint logs
ENGINE LOGS (Ctrl+C to stop)
[2026-05-30 12:00:00] INFO     Starting llama.cpp server...
[2026-05-30 12:00:05] INFO     Loading model default...
[2026-05-30 12:01:00] INFO     POST /v1/chat/completions 200 2.3s
^C Log stream ended.
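
A client reading the log stream directly has to split Server-Sent Events frames itself: "data:" lines accumulate, and a blank line ends the event. The generator below is a small parser for that framing. It assumes the engine sends each log line in a plain data field, which the source does not spell out.

```python
from collections.abc import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each complete SSE event from text lines."""
    buffer: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the event collected so far.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
        elif line.startswith(":"):
            continue  # Comment line, often used as a keepalive.
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # The SSE format strips one leading space after the colon.
            buffer.append(value[1:] if value.startswith(" ") else value)
```

Any iterable of decoded lines works as input, for example the lines of an open HTTP response to /v1/logs/stream.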

endpoint models

List deployed models on the VPS. Without flags, queries the running VPS for its current model roster.

Use case: See what models are available for inference, verify a model was pulled successfully, or inspect model metadata.

Flags:

Flag       Alias  Description
--builtin  -b     Show built-in models from local config (no VPS query needed)

Examples:

endpoint models                      # Query VPS for deployed models
endpoint models --builtin            # Show models defined in config

$ endpoint models --builtin
#   Name                                             Size    HuggingFace Repo
1   Qwen2.5-Coder-3B-Instruct-abliterated (default)  1.99 GB bartowski/Qwen2.5-Coder-3B-Instruct-abliterated-GGUF
2   Qwen2.5-1.5B-Instruct                            0.95 GB Qwen/Qwen2.5-1.5B-Instruct-GGUF

Hint:
  endpoint pull <hf-repo>   Deploy a new model
  endpoint boot             Start the VPS with these models

endpoint pull [model]

Pull a GGUF model from HuggingFace onto the VPS. The model is downloaded in the background on the VPS.

Use case: Add a new model (e.g., a fine-tune, larger/smaller variant, or different model family) without re-deploying the VPS.

Arguments:

Argument  Required  Description
model     No        HuggingFace model ID (e.g. Qwen/Qwen2.5-1.5B-Instruct-GGUF). Prompts if omitted

Examples:

endpoint pull Qwen/Qwen2.5-1.5B-Instruct-GGUF
endpoint pull                          # Interactive prompt

endpoint remove [model]

Remove a deployed model from the VPS to free disk space.

Use case: Free up storage on the VPS by removing unused models, or replace a model with a different quantization.

Arguments:

Argument  Required  Description
model     No        Model ID to remove. Prompts if omitted

Examples:

endpoint remove Qwen/Qwen2.5-1.5B-Instruct-GGUF

Removing model: Qwen/Qwen2.5-1.5B-Instruct-GGUF
✓ Model removed.

endpoint settings [action] [key] [value]

View or update engine parameters on the running VPS.

Use case: Tweak inference parameters (temperature, context length, threads) without restarting the VPS, or inspect current configuration.

Arguments:

Action  Arguments required  Description
(none)  No                  View all current settings (default when no arguments given)
update  Yes                 Change a setting. Requires key and value arguments

Usage:

endpoint settings                          # View all current settings
endpoint settings update <key> <value>     # Update a specific setting

Supported settings keys:

Key                Type   Description
context_length     int    Context window size in tokens
batch_size         int    Batch size
threads            int    CPU inference threads
temperature        float  Sampling temperature (0.0–2.0)
top_p              float  Nucleus sampling threshold (0.0–1.0)
top_k              int    Top-K sampling
repeat_penalty     float  Repetition penalty (1.0 = none)
min_p              float  Min-P sampling threshold
typical_p          float  Typical sampling threshold
flash_attn         bool   Enable flash attention
max_tokens         int    Maximum tokens to generate
ngl                int    GPU layers to offload (0=CPU, 999=max)
mlock              bool   Lock model in physical RAM
no_mmap            bool   Disable memory-mapped model loading
cache_size         int    KV cache size
ignore_eos         bool   Ignore end-of-sequence token
seed               int    Random seed for reproducibility
presence_penalty   float  Presence penalty (-2.0–2.0)
frequency_penalty  float  Frequency penalty (-2.0–2.0)

Examples:

endpoint settings update temperature 0.3
endpoint settings update flash_attn true
endpoint settings update context_length 16384
endpoint settings update threads 8

$ endpoint settings
SETTINGS
{
  "context_length": 8192,
  "batch_size": 512,
  "temperature": 0.7,
  "top_p": 0.9,
  ...
}
Use: endpoint settings update <key> <value>

endpoint init

Interactive setup wizard for first-time configuration. Generates ~/.config/endpoint/endpoint-config.yaml.

Use case: First-time setup — creates the configuration file with your Kaggle username, kernel slug, and default model selection.

Flags: None.

Examples:

endpoint init

$ endpoint init
ENDPOINT SETUP WIZARD

Kaggle username [shesher]:
Kernel slug [endpoint-llm-vps]:

Default model:
  1) Qwen2.5-Coder-3B-Instruct-abliterated (default, 1.99 GB)
  2) Qwen2.5-1.5B-Instruct (0.95 GB)
Enter number or HF repo ID [1]: 2

✓ Configuration saved to ~/.config/endpoint/endpoint-config.yaml

Next steps:
  endpoint doctor       — Check dependencies
  endpoint boot         — Deploy the VPS

endpoint doctor

Run system diagnostics. Checks for required CLI tools (kaggle, curl, jq, git), Python modules, Kaggle API token validity, config status, and proxy health.

Use case: Verify your environment is ready before deploying, or diagnose issues when something isn't working.

Flags: None.

Examples:

endpoint doctor

$ endpoint doctor
CLI Tools:
  ✓ kaggle     found
  ✓ curl       found
  ✓ jq         found
  ✓ git        found

Python:
  ✓ yaml       module available

Kaggle API:
  ✓ Token found
  ✓ API working (1 kernel listed)

Config:
  ✓ Username:   shesher
  ✓ Kernel ID:  shesher/endpoint-llm-vps

Proxy:
  ✓ Proxy healthy — 200 OK

✓ All checks passed.

endpoint provider-config

Show MYTH Org provider identity, live API key, and deployed models from the VPS with metadata (context length, type, reasoning capability). Optionally filter by model name and see usage examples.

Use case: Inspect provider details for integration with OpenAI-compatible clients, or get ready-to-use curl/Python examples for a specific model.

Flags:

Flag     Alias  Description
--model  -m     Filter displayed models (exact, prefix, or substring match)

Examples:

endpoint provider-config                         # Show all provider info
endpoint provider-config --model Qwen            # Filter to Qwen models
endpoint provider-config -m "Qwen2.5-Coder"     # Exact match with examples

PROVIDER CONFIGURATION
  Provider ID:     myth
  Provider Name:   MYTH Org
  Base URL:        https://random-1234.trycloudflare.com
  API Key:         sk-xxxxxxxxxxxx

Deployed Models:
  Model ID                        Display Name                      Context    Type    Reasoning
  Qwen2.5-Coder-3B-Instruct-...   Qwen 2.5 Coder 3B Abliterated     32K        text    no

Usage Examples (curl, Python):
  curl -X POST https://random-1234.trycloudflare.com/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer sk-xxxxxxxxxxxx" \
    -d '{"model":"Qwen2.5-Coder-3B-Instruct-...","messages":[{"role":"user","content":"Hello"}]}'

endpoint autocomplete [shell]

Install shell completion for bash or zsh. Appends the completion function to ~/.bashrc or ~/.zshrc.

Use case: Enable tab-completion for all endpoint commands, subcommands, and flags in your shell.

Arguments:

Argument  Required  Default  Description
shell     No        bash     Shell type: bash or zsh

Examples:

endpoint autocomplete                # Install bash completions (default)
source ~/.bashrc                     # Reload to activate

endpoint autocomplete zsh            # Install zsh completions
source ~/.zshrc                      # Reload to activate

✓ Bash completion installed. Restart your shell or run: source ~/.bashrc

endpoint help [command]

Show help for a specific command. Without arguments, prints the general help listing all commands.

Use case: Quick reference for a command's flags and usage without checking the full documentation.

Arguments:

Argument  Required  Description
command   No        Command name to get help for

Examples:

endpoint help                        # General help
endpoint help boot                   # Help for boot command
endpoint help kill-all               # Help for kill-all
endpoint help settings               # Help for settings

$ endpoint help boot
usage: endpoint boot [-h] [--no-watch] [--p100]

Deploy LLM VPS to Kaggle and stream status

options:
  -h, --help   show this help message and exit
  --no-watch   Build and push without streaming
  --p100       Use P100 GPU (default T4 x2)

endpoint version

Show version and build information.

Use case: Verify which version of Endpoint you have installed, confirm the active config path, and check the kernel name.

Flags: None.

Examples:

endpoint version

$ endpoint version
Endpoint CLI v0.1.0
License: GPLv3+
Engine version: 0.1.0
Config: /home/user/.config/endpoint/endpoint-config.yaml
Kernel: shesher/endpoint-llm-vps

Configuration

Endpoint uses a YAML configuration file loaded with layered precedence:

  1. Bundled defaults: endpoint/data/endpoint-config.yaml
  2. User config: ~/.config/endpoint/endpoint-config.yaml
  3. Local override: ./endpoint-config.yaml (project root, gitignored)
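
Layered loading of this kind is usually a recursive merge in which each later layer overrides matching keys from the earlier ones. The sketch below illustrates that with plain dictionaries in place of parsed YAML. It assumes the list above runs from lowest to highest precedence; the function names are hypothetical.

```python
from typing import Any


def deep_merge(base: dict[str, Any], override: dict[str, Any]) -> dict[str, Any]:
    """Return a new dict where `override` wins, merging nested dicts key by key."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def resolve_config(*layers: dict[str, Any]) -> dict[str, Any]:
    """Merge layers in order; later layers take precedence over earlier ones."""
    result: dict[str, Any] = {}
    for layer in layers:
        result = deep_merge(result, layer)
    return result
```

With this approach a local override file only needs the keys it changes, for example a single `llm.temperature` entry, and every other value falls through from the user config or the bundled defaults.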

Full Schema

# ── Identity ──────────────────────────────────────────────────────
identity:
  kaggle_username: "your-username"         # Kaggle username
  kernel_slug: "endpoint-llm-vps"          # Kaggle kernel slug
  api_key: ""                              # API key (auto-provisioned on boot)

# ── Engine ────────────────────────────────────────────────────────
engine:
  version: "0.1.0"                         # Engine version
  engine_port: 5003                        # FastAPI server port
  accelerator: "cpu"                       # cpu | gpu_t4 | gpu_p100 | tpu

# ── Signal (ntfy.sh) ─────────────────────────────────────────────
signal:
  topic_prefix: "endpoint"                 # ntfy.sh topic prefix

# ── LLM Inference ─────────────────────────────────────────────────
llm:
  port: 8080                               # llama.cpp server port
  context_length: 8192                     # Context window (tokens)
  batch_size: 512                          # Batch size
  threads: 4                               # CPU threads
  flash_attn: true                         # Flash attention
  max_tokens: 2048                         # Max generation tokens
  temperature: 0.7                         # Sampling temperature
  top_p: 0.9                               # Nucleus sampling
  top_k: 40                                # Top-K sampling
  repeat_penalty: 1.1                      # Repetition penalty
  min_p: 0.0                               # Min-P sampling
  typical_p: 0.0                           # Typical sampling
  ngl: 0                                   # GPU layers (0=CPU, 999=max)
  tensor_split: ""                         # Multi-GPU split: "1,1"
  mlock: false                             # Lock model in RAM
  no_mmap: false                           # Disable memory mapping
  cache_size: 0                            # KV cache size (0=auto)
  chat_template: ""                        # Jinja2 template override
  ignore_eos: false                        # Ignore EOS token

# ── Image Generation (sd-server) ─────────────────────────────────
image:
  port: 8081                               # sd-server port
  steps: 20                                # Denoising steps
  cfg_scale: 7.0                           # Guidance scale
  width: 1024                              # Output width
  height: 1024                             # Output height
  sampler: "euler"                         # Sampler type

# ── Video Generation ─────────────────────────────────────────────
video:
  port: 8081                               # sd-server port (shared)
  fps: 12                                  # Frames per second
  frames: 41                               # Total frames
  cfg_scale: 5.0                           # Video CFG scale
  steps: 20                                # Denoising steps

# ── Models ────────────────────────────────────────────────────────
default_model: "bartowski/Qwen2.5-Coder-3B-Instruct-abliterated-GGUF"
default_model_file: "*.Q4_K_M.gguf"
default_model_index: 0

models:
  - name: "Qwen2.5-Coder-3B-Instruct-abliterated"
    hf_repo: "bartowski/Qwen2.5-Coder-3B-Instruct-abliterated-GGUF"
    hf_file: "*.Q4_K_M.gguf"
    size_gb: 1.99
    temperature: 0.7
    top_p: 0.9
    top_k: 40
  - name: "Qwen2.5-1.5B-Instruct"
    hf_repo: "Qwen/Qwen2.5-1.5B-Instruct-GGUF"
    hf_file: "Qwen2.5-1.5B-Instruct-Q4_K_M.gguf"
    size_gb: 0.95
    temperature: 0.7
    top_p: 0.9
    top_k: 40

# ── Proxy ─────────────────────────────────────────────────────────
proxy:
  enabled: true
  domain: "api.endpoint.dpdns.org"
  url: "https://api.endpoint.dpdns.org"
  base_url: "https://api.endpoint.dpdns.org/v1"
  register_url: "https://api.endpoint.dpdns.org/__register"
  unregister_url: "https://api.endpoint.dpdns.org/__unregister"

# ── VPS Packages ─────────────────────────────────────────────────
packages:
  system:
    - curl
    - wget
    - tar
    - jq
    - python3-dev
    - pip
  python:
    - huggingface_hub
    - requests
    - fastapi
    - uvicorn[standard]
    - pydantic

Accelerator Reference

Accelerator  CLI Flag    Kaggle Type      Max Model Size  Quota
CPU          (default)   CPU only         20B params      Unlimited
GPU T4 x2    --gpu / -g  NvidiaTeslaT4    70B params      30h/week
GPU P100     --p100      NvidiaTeslaP100  40B params      30h/week
TPU v5e-8    --tpu / -t  TpuV5E8          100B params     20h/week
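
The size limits in the table can be turned into a simple lookup that picks the smallest accelerator able to hold a model. This helper is purely illustrative and not part of the CLI; the accelerator names reuse the values from the `accelerator` config key, and the ordering by capacity is a design choice made for this sketch.

```python
# (config name, CLI flag, max model size in billions of parameters), smallest first.
ACCELERATORS: list[tuple[str, str, int]] = [
    ("cpu", "", 20),
    ("gpu_p100", "--p100", 40),
    ("gpu_t4", "--gpu", 70),
    ("tpu", "--tpu", 100),
]


def pick_accelerator(params_billion: float) -> tuple[str, str]:
    """Return (config name, boot flag) for the smallest accelerator that fits."""
    for name, flag, max_billion in ACCELERATORS:
        if params_billion <= max_billion:
            return name, flag
    raise ValueError(f"no accelerator in the table fits {params_billion}B parameters")
```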

API Reference

Once deployed, the engine exposes a full OpenAI-compatible REST API.

Public Endpoints (no auth required)

Method  Path           Description
GET     /              Root status: service info, docs link
GET     /health        Health check: backend status, model loaded, uptime
GET     /tunnel        Tunnel URL and status
GET     /metrics       Request count, latency, error rate, latency buckets
GET     /docs          Swagger UI documentation
GET     /openapi.json  OpenAPI 3.0 schema
GET     /v1/apikey     Get the current API key

Authenticated Endpoints (Bearer token required)

Chat & Text

Method  Path                  Description
POST    /v1/chat/completions  Chat completions with streaming SSE support
POST    /v1/completions       Legacy text completions

Chat requests accept the following OpenAI parameters: model, messages, stream, temperature, top_p, max_tokens, stop, frequency_penalty, presence_penalty, logprobs, top_logprobs, seed, n, user, and reasoning_effort.

Models

Method  Path                        Description
GET     /v1/models                  List all models with metadata (type, context, size, reasoning)
GET     /v1/models/{model_id}       Get specific model details
POST    /v1/models/pull?model=<id>  Pull model from HuggingFace
DELETE  /v1/models/{model_id}       Remove a model
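
HuggingFace model IDs contain a slash, so they need escaping when placed in a query string or a path segment. The helpers below build the pull and delete URLs with the standard library. Whether the engine expects the slash percent-encoded in the path form is an assumption, and both function names are hypothetical.

```python
from urllib.parse import quote, urlencode


def pull_url(base_url: str, model_id: str) -> str:
    """URL for POST /v1/models/pull with the model ID as a query parameter."""
    return f"{base_url.rstrip('/')}/v1/models/pull?{urlencode({'model': model_id})}"


def model_url(base_url: str, model_id: str) -> str:
    """URL for GET or DELETE /v1/models/{model_id} with the ID fully escaped."""
    # safe="" also escapes "/", keeping the ID inside a single path segment.
    return f"{base_url.rstrip('/')}/v1/models/{quote(model_id, safe='')}"
```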

Embeddings & Rerank

Method  Path            Description
POST    /v1/embeddings  Generate text embeddings
POST    /v1/rerank      Re-rank documents by relevance

Image & Video

Method  Path                    Description
POST    /v1/images/generations  Generate images (txt2img)
POST    /v1/video/generations   Generate video
POST    /v1/video/edits         Edit video

Tokenization & Moderation

Method  Path             Description
POST    /v1/tokenize     Tokenize text
POST    /v1/detokenize   Detokenize token IDs
POST    /v1/moderations  Content moderation

Settings & Logs

Method  Path             Description
GET     /v1/settings     Get current engine settings
POST    /v1/settings     Update engine settings
GET     /v1/logs         Get recent log lines
GET     /v1/logs/stream  SSE stream of live logs

Usage Examples

# Chat with streaming
curl -X POST https://your-tunnel.trycloudflare.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-xxxx" \
  -d '{
    "model": "default",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'

from openai import OpenAI

# Point the OpenAI SDK at the tunnel URL and the engine API key.
client = OpenAI(
    base_url="https://your-tunnel.trycloudflare.com/v1",
    api_key="sk-xxxx",
)

# Stream a chat completion and print tokens as they arrive.
response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in response:
    # A chunk can arrive without choices, so guard before indexing.
    if chunk.choices:
        print(chunk.choices[0].delta.content or "", end="", flush=True)

# Generate embeddings
response = client.embeddings.create(
    model="default",
    input=["Hello world", "How are you?"],
)
print(response.data[0].embedding[:5])  # First 5 dimensions

Environment Variables

CLI Variables

Variable          Required  Default  Description
KAGGLE_API_TOKEN  Yes*      (none)   Kaggle API token (kgat_...). Alternative to kaggle.json
ENDPOINT_API_KEY  Auto-set  (none)   Engine API key (set by CLI during boot)
CONFIG            No        (none)   Path to YAML config file override

*Required if ~/.kaggle/kaggle.json is not present.

Engine Variables (set on the Kaggle VPS)

Variable              Default      Description
LLAMA_PORT            8080         llama.cpp server port
SD_PORT               8081         sd-server port
ENGINE_PORT           5003         FastAPI engine port
LLM_SIGNAL_TOPIC      ""           ntfy.sh signal topic (auto-derived)
LLM_CONTROL_TOPIC     ""           ntfy.sh control topic (auto-derived)
SESSION_ID            "default"    Session identifier
ENGINE_VERSION        "0.1.0"      Engine version string
LLM_ACCELERATOR       -            Accelerator type (cpu, gpu_t4, etc.)
LLM_IDLE_TIMEOUT      3600         Idle shutdown timeout in seconds (0 = disable)
RATE_LIMIT            "1"          Enable rate limiting (1/true/yes)
LLM_DEBUG             -            Enable verbose error messages
LLM_CONTEXT_LEN       -            Context length override
LLM_BATCH_SIZE        -            Batch size override
LLM_MAX_TOKENS        -            Max tokens override
LLM_NGL               -            GPU layers override
LLM_TENSOR_SPLIT      -            Tensor split override
DEFAULT_MODEL         -            Default model ID override
DEFAULT_MODEL_FILE    -            Default model filename override
MODEL_TYPE            -            Model type override
HF_TOKEN              -            HuggingFace token for gated model downloads
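
To show how the defaults listed above might be applied, here is a minimal sketch that reads a few of these variables into a typed object. It is not the engine's implementation: `EngineEnv` and `load_engine_env` are hypothetical names, only a subset of the variables is covered, and treating an unset override as `None` is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Values the table documents as enabling rate limiting
TRUTHY = {"1", "true", "yes"}


@dataclass(frozen=True)
class EngineEnv:
    llama_port: int
    sd_port: int
    engine_port: int
    idle_timeout: int            # seconds; 0 disables idle shutdown
    rate_limit: bool
    context_len: Optional[int]   # None means no override was set


def load_engine_env(env: Mapping[str, str] = os.environ) -> EngineEnv:
    """Build an EngineEnv from environment variables, using the documented defaults."""

    def optional_int(name: str) -> Optional[int]:
        raw = env.get(name, "").strip()
        return int(raw) if raw else None

    return EngineEnv(
        llama_port=int(env.get("LLAMA_PORT", "8080")),
        sd_port=int(env.get("SD_PORT", "8081")),
        engine_port=int(env.get("ENGINE_PORT", "5003")),
        idle_timeout=int(env.get("LLM_IDLE_TIMEOUT", "3600")),
        rate_limit=env.get("RATE_LIMIT", "1").strip().lower() in TRUTHY,
        context_len=optional_int("LLM_CONTEXT_LEN"),
    )
```

For example, `load_engine_env({"LLM_IDLE_TIMEOUT": "0"})` yields an object whose `idle_timeout` is 0, which the table describes as disabling idle shutdown.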

Project Structure

endpoint/
├── endpoint/                        # CLI package (pip-installable)
│   ├── __init__.py                  # Version string
│   ├── __main__.py                  # python -m endpoint support
│   ├── main.py                      # Argparse parser, dispatch
│   ├── commands.py                  # All 19 command implementations
│   ├── core.py                      # Config, VPSClient, console, signals, cache
│   ├── py.typed                     # PEP 561 type marker
│   └── data/                        # Bundled defaults
│       ├── endpoint-config.yaml
│       └── endpoint-config.example.yaml
├── engine/                          # Inference server (runs on Kaggle)
│   ├── engine.py                    # FastAPI app: all API endpoints, lifecycle
│   └── models_config.py             # Model management, GGUF parsing, settings
├── scripts/                         # Build & automation
│   ├── master_build_notebook.py     # Kaggle notebook generator (962 lines)
│   ├── update_and_embed.py          # Engine payload sync into notebook
│   ├── lint.py                      # Code quality pipeline (ruff + mypy + pytest)
│   └── release.py                   # PyPI and GitHub release automation
├── cloudflare/                      # Cloudflare proxy worker
│   ├── deploy.py                    # Wrangler-based deployment orchestration
│   ├── proxy-worker.js              # CF Worker: rate limiting, KV tunnel map
│   ├── wrangler.toml                # Wrangler configuration
│   └── wrangler.example.toml        # Example wrangler config
├── tests/
│   └── sanity_test.py               # 26-test static analysis suite
├── man/
│   └── endpoint.1                   # Man page (roff format)
├── Makefile                         # Build, lint, test, publish targets
├── pyproject.toml                   # Package metadata + tool configuration
├── uv.lock                          # Dependency lock file
├── endpoint-config.yaml             # Local config (gitignored)
└── README.md

Development

Setup

git clone https://github.com/shesher/endpoint.git
cd endpoint
make install      # or: uv sync

Commands

make all          # Full pipeline: lint + test (default target)
make lint         # Run ruff linter (format check + lint)
make test         # Run pytest test suite
make build        # Build PyPI wheel + sdist
make install      # Editable install with uv
make release      # Dry-run release build
make publish      # Build & publish to PyPI
make clean        # Remove build artifacts

Quality Standards

  • Linting: ruff with full ruleset (E, F, I, N, W, UP, B, SIM, ARG, etc.)
  • Formatting: ruff formatter (compatible with Black)
  • Type Checking: mypy for CLI, engine, scripts, and tests
  • Testing: pytest with 26 static analysis tests covering syntax, security, naming, config schema, base64 roundtrip, version consistency, import resolution, license headers, file sizes, and more
  • CI: make all runs the full pipeline: lint → test
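
As an illustration of what a static analysis test of the "syntax" kind can look like, the sketch below parses every Python file under a directory without importing or running it. It is not taken from tests/sanity_test.py; `find_syntax_errors` is a hypothetical helper written for this example.

```python
import ast
from pathlib import Path
from typing import List


def find_syntax_errors(root: Path) -> List[str]:
    """Return one 'path:line: message' entry per .py file under root that fails to parse."""
    failures = []
    for path in sorted(root.rglob("*.py")):
        try:
            # ast.parse only builds the syntax tree; no code is executed
            ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
        except SyntaxError as exc:
            failures.append(f"{path}:{exc.lineno}: {exc.msg}")
    return failures
```

In a pytest suite this could be wrapped in a test function that asserts `find_syntax_errors(Path("endpoint")) == []`, so a broken file fails the run with its path and line number in the message.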

License

GNU General Public License v3.0 or later © 2024-2026 Shesher Hasan.

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
