Free Open-Source LLM VPS — Ultra-Fast Inference on Kaggle

⚡ Endpoint VPS

Turn Kaggle's Free Tier into a Production-Grade LLM Inference Server

Deploy an OpenAI-compatible LLM inference server on Kaggle with a single command — zero cost, zero cloud bills.

Quick Start • Commands • Configuration • API Reference • Installation

uvx endpoint boot           # One-shot deploy (no install)
pip install endpoint        # Or install globally
endpoint init               # Setup wizard
endpoint boot               # Deploy the VPS

What is Endpoint?
Features
Quick Start
Architecture
Command Reference
Configuration
API Reference
Environment Variables
Project Structure
Development
License

What is Endpoint?

Endpoint is a Python CLI tool that provisions a free, persistent, OpenAI-compatible LLM inference server on Kaggle's infrastructure. It combines three components into one seamless workflow:

Kaggle Notebook — Builds and runs llama.cpp (or sd-server for images/video) on Kaggle's free GPU (T4 x2, P100) or CPU with automatic model loading
Cloudflare Tunnel — Exposes the server via a secure HTTPS tunnel (trycloudflare.com) with an optional Cloudflare Worker proxy for custom domains and rate limiting
CLI — Manages the full lifecycle: boot, stop, monitor, pull models, update settings, and stream logs — all from your terminal

Key differentiator: Unlike managed API services, you retain full control: choose any GGUF model, tune every inference parameter, and pay nothing. llama.cpp runs on Kaggle's multi-core CPUs or, optionally, on its GPU accelerators.

Features

Core Platform

Zero Cost — Runs entirely on Kaggle's free tier (CPU: unlimited hours, GPU: 30h/week, TPU: 20h/week)
OpenAI-Compatible API — Full /v1/chat/completions, /v1/completions, /v1/embeddings, /v1/models with streaming SSE
Multi-Model — Pull any GGUF model from HuggingFace (text, image, video, embeddings, voice, multimodal)
Model Hot-Swap — Switch models at runtime without redeploying
Interactive Boot Wizard — Guided model selection with real-time HuggingFace metadata fetching
Persistent Storage — Models and settings survive session restarts via Kaggle datasets

Inference Capabilities

Chat Completions — Full OpenAI API with streaming (SSE), function calling, logprobs, stop sequences
Text Completions — Legacy completions endpoint
Embeddings — Generate vector embeddings for RAG pipelines
Image Generation — Stable Diffusion via sd-server (txt2img, img2img)
Video Generation — Wan/LTX video models via sd-server
Tokenization — Tokenize/detokenize, content moderation, reranking
Reasoning Models — DeepSeek-R1 and reasoning-aware model support

Performance & Optimization

llama.cpp — State-of-the-art CPU/GPU inference, compiled with -march=native and -O3
Flash Attention — Reduced VRAM usage and faster context processing
Multi-GPU — Tensor split support for T4 x2 configurations
KV Cache — Configurable cache size, quantization, and management
Auto-Batching — Dynamic batch size calculation based on context length
Connection Pooling — Persistent HTTP sessions with connection reuse

Operations

Real-Time Logs — SSE-based log streaming from the engine
Live Status — Watch boot progress and engine health in real-time
Idle Auto-Termination — A configurable timeout shuts down an idle VPS to save resources
Rate Limiting — Per-IP rate limiting middleware in both CLI and Cloudflare proxy
Graceful Shutdown — SIGINT/SIGTERM handling with clean resource cleanup
Shell Completions — Bash and Zsh tab completion for all commands and flags
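Per-IP rate limiting can be pictured as a sliding-window counter. The sketch below is illustrative only and is not taken from Endpoint's source: the class name `SlidingWindowLimiter` and its parameters are hypothetical, and the default of 100 requests per 60 seconds simply mirrors the figure shown for the Cloudflare Worker in the architecture diagram.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Hypothetical per-IP limiter: at most `limit` requests per `window` seconds."""

    def __init__(self, limit=100, window=60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._hits = defaultdict(deque)  # ip -> timestamps of accepted requests

    def allow(self, ip):
        """Return True and record the request if `ip` is still under its limit."""
        now = self.clock()
        hits = self._hits[ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

A middleware would call `allow(client_ip)` once per request and answer with HTTP 429 when it returns `False`.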

Security & Reliability

Cloudflare Proxy — Optional Worker-based proxy with KV-backed tunnel map and SHA-256 API key hashing
API Key Auth — Bearer token authentication on all inference endpoints
GPU/TPU Detection — Automatic compute capability detection for optimal binary selection
Auto-Retry — Upstream proxy retries on transient network failures
Atomic Config Writes — Crash-safe configuration file updates
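A crash-safe config update is usually done by writing to a temporary file and renaming it over the target. The following is a minimal sketch of that general pattern, not Endpoint's actual code; `atomic_write_text` is a hypothetical helper name.

```python
import os
import tempfile


def atomic_write_text(path, text):
    """Write `text` to `path` so readers never observe a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory (same filesystem) as the
    # target, otherwise the final rename cannot be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-", suffix=".yaml")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())  # push the bytes to disk before the rename
        os.replace(tmp_path, path)  # atomic rename on POSIX filesystems
    except BaseException:
        # Leave the original file untouched and clean up the temp file.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

If the process dies before `os.replace`, the previous config file is still intact; only a stray temp file may be left behind.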

Quick Start

Prerequisites

Kaggle account (phone verified for API access)

Kaggle API token saved to ~/.kaggle/kaggle.json:

{"username":"your-username","key":"kgat_xxxxxxxxxxxxxxxx"}

Python 3.12+
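The token file from the prerequisites can also be created from Python. This is a minimal sketch that assumes POSIX-style file permissions; `write_kaggle_token` is a hypothetical helper and is not part of Endpoint or the Kaggle client.

```python
import json
import os


def write_kaggle_token(username, key, directory):
    """Write kaggle.json into `directory` with owner-only permissions."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, "kaggle.json")
    # Create the file with mode 0600 so the key is not readable by other users.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump({"username": username, "key": key}, handle)
    os.chmod(path, 0o600)  # also tightens a pre-existing file (POSIX only)
    return path
```

Typical use would be `write_kaggle_token("your-username", "kgat_xxxxxxxxxxxxxxxx", os.path.expanduser("~/.kaggle"))`.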

Installation

Choose your preferred method:

# Run instantly without installing (recommended for first try):
uvx endpoint boot

# Install globally with pip:
pip install endpoint

# Or add it to a uv-managed project:
uv add endpoint

# Or with pipx (isolated environment):
pipx install endpoint

# From source:
git clone https://github.com/shesher/endpoint.git
cd endpoint
make install

Configure

# Set your Kaggle API token:
export KAGGLE_API_TOKEN='kgat_xxxx'

# Run the interactive setup wizard:
endpoint init

The wizard prompts for your Kaggle username, kernel slug, and default model selection. It creates ~/.config/endpoint/endpoint-config.yaml with all settings.

Deploy

# Interactive boot with model selection and configuration:
endpoint boot

# Boot with GPU acceleration (T4 x2, up to 70B param models):
endpoint boot --gpu

# Boot with TPU acceleration (v5e-8, up to 100B param models):
endpoint boot --tpu

# Boot without live streaming status:
endpoint boot --no-watch

Verify & Use

# Show API endpoint with example curl commands:
endpoint base-url

# Test API connectivity:
endpoint connect

# List deployed models:
endpoint models

# Stream engine logs:
endpoint logs

Stop

# Graceful shutdown with cache cleanup:
endpoint stop

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        YOUR TERMINAL                            │
│  ┌──────────────┐  ┌──────────────┐  ┌────────────────────────┐ │
│  │ endpoint CLI │  │ Shell Complet│  │ Config (YAML)          │ │
│  │ (commands.py)│  │ (bash/zsh)   │  │ ~/.config/endpoint/    │ │
│  └──────┬───────┘  └──────────────┘  └────────────────────────┘ │
└─────────┼───────────────────────────────────────────────────────┘
          │ ① boot / stop / pull / settings
          ▼
┌─────────────────────────────────────────────────────────────────┐
│                    KAGGLE NOTEBOOK (VPS)                        │
│                                                                 │
│  ┌────────────────────────────────────────────────────────────┐ │
│  │                    FastAPI Engine                          │ │
│  │  ┌─────────────┐  ┌──────────────┐  ┌──────────────────┐   │ │
│  │  │ Chat/Embed  │  │ Image Gen    │  │ Video Gen        │   │ │
│  │  │ /v1/chat    │  │ /v1/images   │  │ /v1/video        │   │ │
│  │  │ /v1/embed   │  │ sd-server    │  │ sd-server        │   │ │
│  │  └──────┬──────┘  └──────┬───────┘  └───────┬──────────┘   │ │
│  │         │                │                  │              │ │
│  │  ┌──────┴────────────────┴──────────────────┴──────────┐   │ │
│  │  │               llama.cpp backend                     │   │ │
│  │  │  GGUF models • KV cache • Flash attn • Batching     │   │ │
│  │  └─────────────────────┬───────────────────────────────┘   │ │
│  └────────────────────────┼───────────────────────────────────┘ │
│                           │                                     │
│  ┌────────────────────────┴────────────────────────────────┐    │
│  │              cloudflared tunnel                         │    │
│  │  https://xxxx.trycloudflare.com → engine:5003           │    │
│  └────────────────────────┬────────────────────────────────┘    │
└───────────────────────────┼─────────────────────────────────────┘
                            │ ③ proxy requests
                            ▼
┌─────────────────────────────────────────────────────────────────┐
│                  CLOUDFLARE WORKER (Optional)                   │
│                                                                 │
│  ┌────────────────────────────────────────────────────────────┐ │
│  │  proxy-worker.js                                           │ │
│  │  • Rate limiting (100 req/min per IP)                      │ │
│  │  • KV-backed tunnel map (SHA-256 hashed API keys)          │ │
│  │  • Request proxying with retry + timeout                   │ │
│  │  • CORS headers for browser clients                        │ │
│  └────────────────────────────────────────────────────────────┘ │
│  Domain: api.endpoint.dpdns.org → tunnel                        │
└─────────────────────────────────────────────────────────────────┘
                            │
                            ▼
              ┌─────────────────────────┐
              │  YOUR APPLICATION       │
              │  OpenAI SDK / curl / etc│
              └─────────────────────────┘
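The "KV-backed tunnel map (SHA-256 hashed API keys)" box in the diagram can be modeled in a few lines. The sketch below is a Python illustration of the idea, not the code in proxy-worker.js: `TunnelMap` is a hypothetical in-memory stand-in for the Worker's KV store, and the exact key layout is an assumption.

```python
import hashlib


def hash_api_key(api_key):
    """Return the hex SHA-256 digest used as the lookup key (never the raw key)."""
    return hashlib.sha256(api_key.encode("utf-8")).hexdigest()


class TunnelMap:
    """Hypothetical in-memory stand-in for the proxy's KV namespace."""

    def __init__(self):
        self._store = {}  # sha256(api_key) -> tunnel URL

    def register(self, api_key, tunnel_url):
        self._store[hash_api_key(api_key)] = tunnel_url

    def unregister(self, api_key):
        self._store.pop(hash_api_key(api_key), None)

    def resolve(self, authorization_header):
        """Map an `Authorization: Bearer <key>` value to a tunnel URL, or None."""
        scheme, _, token = authorization_header.partition(" ")
        if scheme.lower() != "bearer" or not token:
            return None
        return self._store.get(hash_api_key(token.strip()))
```

Storing only the digest means that a leak of the map does not reveal usable API keys.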

Signal Flow

The CLI and notebook communicate via ntfy.sh pub/sub for push-based signaling:

CLI ──boot──▶ Kaggle ──STATUS──▶ ntfy.sh ──▶ CLI (streamed in terminal)
CLI ──stop──▶ ntfy.sh ──KILL────▶ Kaggle ──shutdown──▶ ntfy.sh ──▶ CLI
CLI ◀──WS URL──────────────────── ntfy.sh ◀──base64── Kaggle (tunnel acquired)
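The receiving side of this flow can be sketched with ntfy.sh's JSON stream, which emits one JSON object per line. The helpers below are hypothetical and only illustrate the decoding steps; the topic name and the assumption that the tunnel URL arrives as a bare base64 string are guesses, not Endpoint's documented wire format.

```python
import base64
import json

NTFY_BASE = "https://ntfy.sh"


def subscribe_url(topic):
    """ntfy.sh streams one JSON object per line from <base>/<topic>/json."""
    return f"{NTFY_BASE}/{topic}/json"


def message_text(json_line):
    """Return the body of a ntfy.sh stream line, or None for non-message events."""
    event = json.loads(json_line)
    if event.get("event") != "message":  # skip "open" and "keepalive" events
        return None
    return event.get("message", "")


def decode_tunnel_url(encoded):
    """Decode a base64-encoded tunnel URL (payload format assumed)."""
    return base64.b64decode(encoded).decode("utf-8")
```

A watcher would open `subscribe_url(...)` as a streaming HTTP request, feed each line to `message_text`, and print whatever is not `None`.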

Command Reference

All 19 commands with complete flag documentation.

Global Flags

Available before any subcommand:

Flag	Alias	Description
`--help`	`-h`	Show help message and exit
`--version`	`-v`	Show version and build information
`--gpu`	`-g`	GPU accelerator (T4 x2; max 70B params). Mutually exclusive with `--tpu`
`--tpu`	`-t`	TPU v5e-8 accelerator (max 100B params). Mutually exclusive with `--gpu`

`endpoint boot`

Deploy the LLM VPS to Kaggle — interactive model selection, configuration, and deployment.

Flags:

Flag	Description
`--no-watch`	Build and push without streaming status signals
`--p100`	Use P100 GPU accelerator instead of default T4 x2

Use case: Starting a fresh inference session. Run this first after configuration.

Examples:

endpoint boot                  # Interactive boot (recommended)
endpoint boot --gpu            # Boot with T4 x2 GPU accelerator
endpoint boot --tpu            # Boot with TPU v5e-8 accelerator
endpoint boot --no-watch       # Boot without streaming status
endpoint boot --p100           # Boot with P100 GPU accelerator

$ endpoint boot

MODEL SELECTION
  Current default: bartowski/Qwen2.5-Coder-3B-Instruct-abliterated-GGUF
  Enter any HuggingFace GGUF model ID (e.g. Qwen/Qwen2.5-1.5B-Instruct-GGUF)

  Model ID [Enter to keep default]: Qwen/Qwen2.5-1.5B-Instruct-GGUF
  Type:         text
  Context:      32,768 tokens

  Available GGUF files:
    1) Qwen2.5-1.5B-Instruct-Q4_K_M.gguf (0.99 GB)
    2) Qwen2.5-1.5B-Instruct-Q5_K_M.gguf (1.14 GB)
    3) Qwen2.5-1.5B-Instruct-Q6_K.gguf (1.32 GB)
    4) Qwen2.5-1.5B-Instruct-Q8_0.gguf (1.70 GB)

  Enter number or filename [1]: 1

MODEL CONFIGURATION
  max_tokens [2048]:
  temperature [0.7]:
  top_p [0.9]:
  context_length [32768]:

✓ Model and configuration saved.
DEPLOYMENT
  Kernel "shesher/endpoint-llm-vps" is "none"
  Building notebook...
  Pushing to Kaggle...
  ...

`endpoint stop`

Stop the running VPS instance — sends kill signal, deletes kernel, clears cached tunnel URL and API key.

Use case: Tear down the VPS when you are done to free Kaggle resources and stop consuming your weekly accelerator quota.

Flags: None.

Examples:

endpoint stop

✓ Stop signal sent.
✓ Kernel deleted.
✓ Cached credentials cleared.

`endpoint status`

Show VPS kernel state, tunnel URL, and deployed models using cached data with background refresh.

Use case: Quickly check whether your VPS is running, what models are deployed, and get quick-access links.

Flags: None.

Examples:

endpoint status

Kernel: running (shesher/endpoint-llm-vps)
  URL: https://random-1234.trycloudflare.com

Deployed Models:
  • bartowski/Qwen2.5-Coder-3B-Instruct-abliterated-GGUF
  • Qwen/Qwen2.5-1.5B-Instruct-GGUF

Quick links:
  endpoint base-url    — Show API endpoint
  endpoint logs        — Stream engine logs
  endpoint stop        — Stop VPS

`endpoint watch`

Stream boot and engine status signals in real-time via ntfy.sh.

Use case: Monitor deployment progress in a separate terminal while endpoint boot --no-watch runs.

Flags: None.

Examples:

endpoint watch

$ endpoint watch
WATCHING endpoint-shesher...
  → notebook-cell-execution-begins
  → Tunnel: https://random-1234.trycloudflare.com
  → Endpoint IS ONLINE
^C Stopped.

`endpoint kill-all`

Terminate all running Kaggle kernels. Lists active kernels, asks for confirmation, then stops them in parallel.

Use case: Clean slate — stop every running Kaggle kernel when you have multiple stale sessions, or automate cleanup in scripts.

Flags:

Flag	Alias	Description
`--yes`	`-y`	Skip interactive confirmation prompt
`--all`	—	Also delete inactive (completed/error) kernels after stopping active ones

Examples:

endpoint kill-all                    # Interactive mode
endpoint kill-all --yes              # Non-interactive (script-friendly)
endpoint kill-all --yes --all        # Kill everything including completed

$ endpoint kill-all --all
Checking kernels...
  Active: shesher/endpoint-llm-vps, shesher/experiment-01
  Inactive: shesher/old-test

Stop these 2 active kernels? [y/N] y
  ✓ Stopped shesher/endpoint-llm-vps
  ✓ Stopped shesher/experiment-01
  ✓ Deleted shesher/old-test
Done. Stopped 2, errors 0.

`endpoint register`

Register the tunnel URL with the Cloudflare proxy. Fetches the API key from the engine and registers the tunnel so the proxy forwards requests.

Use case: After booting, if proxy registration failed or needs to be re-done (e.g., tunnel URL changed, proxy was restarted).

Flags:

Flag	Alias	Description
`--force`	`-f`	Skip the proxy health check and register anyway

Examples:

endpoint register                    # Normal registration
endpoint register --force            # Force-register even if proxy is unresponsive

✓ Registered tunnel with proxy.
Test with:
  curl https://your-proxy.workers.dev/v1/models \
    -H "Authorization: Bearer sk-xxxxxxxxxxxx"

`endpoint base-url`

Show the API endpoint URL with example curl commands for chat, embeddings, and model listing.

Use case: Get the URL and auth header needed to call the API from curl, Python, or any OpenAI-compatible client.

Flags: None.

Examples:

endpoint base-url

$ endpoint base-url
Endpoint: https://random-1234.trycloudflare.com

Chat Completion:
  curl -X POST https://random-1234.trycloudflare.com/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer <key>" \
    -d '{"model":"default","messages":[{"role":"user","content":"Hello"}]}'

List Models:
  curl https://random-1234.trycloudflare.com/v1/models \
    -H "Authorization: Bearer <key>"

Embeddings:
  curl -X POST https://random-1234.trycloudflare.com/v1/embeddings \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer <key>" \
    -d '{"model":"default","input":"Hello world"}'

`endpoint connect`

Test API connectivity by querying /v1/models. Prints the raw JSON response on success.

Use case: Quick health check — verify the VPS API is responding before sending inference requests.

Flags: None.

Examples:

endpoint connect

$ endpoint connect
CONNECTIVITY TEST
✓ API is responding!
{
  "object": "list",
  "data": [
    {"id": "default", "object": "model", ...}
  ]
}

`endpoint logs`

Stream real-time VPS engine logs via Server-Sent Events (SSE). Press Ctrl+C to stop.

Use case: Debug model loading, monitor inference requests, troubleshoot errors on the VPS in real-time.

Flags: None.

Examples:

endpoint logs

$ endpoint logs
ENGINE LOGS (Ctrl+C to stop)
[2026-05-30 12:00:00] INFO     Starting llama.cpp server...
[2026-05-30 12:00:05] INFO     Loading model default...
[2026-05-30 12:01:00] INFO     POST /v1/chat/completions 200 2.3s
^C Log stream ended.
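For clients that want to consume the log stream directly, Server-Sent Events are plain text: `data:` lines accumulate into an event and a blank line ends it. The parser below is a minimal sketch of that framing; `iter_sse_data` is a hypothetical helper, and the shape of each log payload is whatever the engine sends.

```python
def iter_sse_data(lines):
    """Yield the data payload of each Server-Sent Event from an iterable of lines."""
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        elif line.startswith("data:"):
            value = line[5:]
            # A single leading space after the colon is not part of the value.
            buffer.append(value[1:] if value.startswith(" ") else value)
        # Other fields (event:, id:, retry:) are ignored in this sketch.
```

Any iterable of decoded text lines works as input, for example the line iterator of a streaming HTTP response.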

`endpoint models`

List deployed models on the VPS. Without flags, queries the running VPS for its current model roster.

Use case: See what models are available for inference, verify a model was pulled successfully, or inspect model metadata.

Flags:

Flag	Alias	Description
`--builtin`	`-b`	Show built-in models from local config (no VPS query needed)

Examples:

endpoint models                      # Query VPS for deployed models
endpoint models --builtin            # Show models defined in config

$ endpoint models --builtin
#   Name                                             Size    HuggingFace Repo
1   Qwen2.5-Coder-3B-Instruct-abliterated (default)  1.99 GB bartowski/Qwen2.5-Coder-3B-Instruct-abliterated-GGUF
2   Qwen2.5-1.5B-Instruct                            0.95 GB Qwen/Qwen2.5-1.5B-Instruct-GGUF

Hint:
  endpoint pull <hf-repo>   Deploy a new model
  endpoint boot             Start the VPS with these models

`endpoint pull [model]`

Pull a GGUF model from HuggingFace onto the VPS. The download runs in the background on the VPS.

Use case: Add a new model (e.g., a fine-tune, larger/smaller variant, or different model family) without re-deploying the VPS.

Arguments:

Argument	Required	Description
`model`	No	HuggingFace model ID (e.g. `Qwen/Qwen2.5-1.5B-Instruct-GGUF`). Prompts if omitted

Examples:

endpoint pull Qwen/Qwen2.5-1.5B-Instruct-GGUF
endpoint pull                          # Interactive prompt

`endpoint remove [model]`

Remove a deployed model from the VPS to free disk space.

Use case: Free up storage on the VPS by removing unused models, or replace a model with a different quantization.

Arguments:

Argument	Required	Description
`model`	No	Model ID to remove. Prompts if omitted

Examples:

endpoint remove Qwen/Qwen2.5-1.5B-Instruct-GGUF

Removing model: Qwen/Qwen2.5-1.5B-Instruct-GGUF
✓ Model removed.

`endpoint settings [action] [key] [value]`

View or update engine parameters on the running VPS.

Use case: Tweak inference parameters (temperature, context length, threads) without restarting the VPS, or inspect current configuration.

Arguments:

Action	Required	Description
(none)	—	View all current settings (default when no arguments given)
`update`	Yes	Change a setting. Requires `key` and `value` arguments

Usage:

endpoint settings                          # View all current settings
endpoint settings update <key> <value>     # Update a specific setting

Supported settings keys:

Key	Type	Description
`context_length`	int	Context window size in tokens
`batch_size`	int	Batch size
`threads`	int	CPU inference threads
`temperature`	float	Sampling temperature (0.0–2.0)
`top_p`	float	Nucleus sampling threshold (0.0–1.0)
`top_k`	int	Top-K sampling
`repeat_penalty`	float	Repetition penalty (1.0 = none)
`min_p`	float	Min-P sampling threshold
`typical_p`	float	Typical sampling threshold
`flash_attn`	bool	Enable flash attention
`max_tokens`	int	Maximum tokens to generate
`ngl`	int	GPU layers to offload (0=CPU, 999=max)
`mlock`	bool	Lock model in physical RAM
`no_mmap`	bool	Disable memory-mapped model loading
`cache_size`	int	KV cache size
`ignore_eos`	bool	Ignore end-of-sequence token
`seed`	int	Random seed for reproducibility
`presence_penalty`	float	Presence penalty (-2.0–2.0)
`frequency_penalty`	float	Frequency penalty (-2.0–2.0)

Examples:

endpoint settings update temperature 0.3
endpoint settings update flash_attn true
endpoint settings update context_length 16384
endpoint settings update threads 8

$ endpoint settings
SETTINGS
{
  "context_length": 8192,
  "batch_size": 512,
  "temperature": 0.7,
  "top_p": 0.9,
  ...
}
Use: endpoint settings update <key> <value>
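Because `endpoint settings update` takes its value as a string, something has to convert it to the type listed in the table. The sketch below shows one way to do that using the key types and ranges documented above; `coerce_setting` is a hypothetical helper, and the accepted boolean spellings are an assumption.

```python
SETTING_TYPES = {
    "context_length": int, "batch_size": int, "threads": int, "top_k": int,
    "max_tokens": int, "ngl": int, "cache_size": int, "seed": int,
    "temperature": float, "top_p": float, "repeat_penalty": float,
    "min_p": float, "typical_p": float,
    "presence_penalty": float, "frequency_penalty": float,
    "flash_attn": bool, "mlock": bool, "no_mmap": bool, "ignore_eos": bool,
}

# Only the ranges that the settings table states explicitly.
RANGES = {
    "temperature": (0.0, 2.0),
    "top_p": (0.0, 1.0),
    "presence_penalty": (-2.0, 2.0),
    "frequency_penalty": (-2.0, 2.0),
}


def coerce_setting(key, raw):
    """Convert a command-line string to the documented type for `key`."""
    if key not in SETTING_TYPES:
        raise KeyError(f"unknown setting: {key}")
    kind = SETTING_TYPES[key]
    if kind is bool:
        lowered = raw.strip().lower()
        if lowered in ("true", "1", "yes", "on"):  # spellings are assumed
            return True
        if lowered in ("false", "0", "no", "off"):
            return False
        raise ValueError(f"{key} expects a boolean, got {raw!r}")
    value = kind(raw)  # raises ValueError on malformed numbers
    if key in RANGES:
        low, high = RANGES[key]
        if not low <= value <= high:
            raise ValueError(f"{key} must be between {low} and {high}")
    return value
```

Rejecting bad values locally avoids a round trip to the VPS for an update that would fail anyway.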

`endpoint init`

Interactive setup wizard for first-time configuration. Generates ~/.config/endpoint/endpoint-config.yaml.

Use case: First-time setup — creates the configuration file with your Kaggle username, kernel slug, and default model selection.

Flags: None.

Examples:

endpoint init

$ endpoint init
ENDPOINT SETUP WIZARD

Kaggle username [shesher]:
Kernel slug [endpoint-llm-vps]:

Default model:
  1) Qwen2.5-Coder-3B-Instruct-abliterated (default, 1.99 GB)
  2) Qwen2.5-1.5B-Instruct (0.95 GB)
Enter number or HF repo ID [1]: 2

✓ Configuration saved to ~/.config/endpoint/endpoint-config.yaml

Next steps:
  endpoint doctor       — Check dependencies
  endpoint boot         — Deploy the VPS

`endpoint doctor`

Run system diagnostics. Checks for required CLI tools (kaggle, curl, jq, git), Python modules, Kaggle API token validity, config status, and proxy health.

Use case: Verify your environment is ready before deploying, or diagnose issues when something isn't working.

Flags: None.

Examples:

endpoint doctor

$ endpoint doctor
CLI Tools:
  ✓ kaggle     found
  ✓ curl       found
  ✓ jq         found
  ✓ git        found

Python:
  ✓ yaml       module available

Kaggle API:
  ✓ Token found
  ✓ API working (1 kernel listed)

Config:
  ✓ Username:   shesher
  ✓ Kernel ID:  shesher/endpoint-llm-vps

Proxy:
  ✓ Proxy healthy — 200 OK

✓ All checks passed.

`endpoint provider-config`

Show MYTH Org provider identity, live API key, and deployed models from the VPS with metadata (context length, type, reasoning capability). Optionally filter by model name and see usage examples.

Use case: Inspect provider details for integration with OpenAI-compatible clients, or get ready-to-use curl/Python examples for a specific model.

Flags:

Flag	Alias	Description
`--model`	`-m`	Filter displayed models (exact, prefix, or substring match)

Examples:

endpoint provider-config                         # Show all provider info
endpoint provider-config --model Qwen            # Filter to Qwen models
endpoint provider-config -m "Qwen2.5-Coder"     # Exact match with examples

PROVIDER CONFIGURATION
  Provider ID:     myth
  Provider Name:   MYTH Org
  Base URL:        https://random-1234.trycloudflare.com
  API Key:         sk-xxxxxxxxxxxx

Deployed Models:
  Model ID                        Display Name                      Context    Type    Reasoning
  Qwen2.5-Coder-3B-Instruct-...   Qwen 2.5 Coder 3B Abliterated     32K        text    no

Usage Examples (curl, Python):
  curl -X POST https://random-1234.trycloudflare.com/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer sk-xxxxxxxxxxxx" \
    -d '{"model":"Qwen2.5-Coder-3B-Instruct-...","messages":[{"role":"user","content":"Hello"}]}'

`endpoint autocomplete [shell]`

Install shell completion for bash or zsh. Appends the completion function to ~/.bashrc or ~/.zshrc.

Use case: Enable tab-completion for all endpoint commands, subcommands, and flags in your shell.

Arguments:

Argument	Required	Default	Description
`shell`	No	`bash`	Shell type: `bash` or `zsh`

Examples:

endpoint autocomplete                # Install bash completions (default)
source ~/.bashrc                     # Reload to activate

endpoint autocomplete zsh            # Install zsh completions
source ~/.zshrc                      # Reload to activate

✓ Bash completion installed. Restart your shell or run: source ~/.bashrc

`endpoint help [command]`

Show help for a specific command. Without arguments, prints the general help listing all commands.

Use case: Quick reference for a command's flags and usage without checking the full documentation.

Arguments:

Argument	Required	Description
`command`	No	Command name to get help for

Examples:

endpoint help                        # General help
endpoint help boot                   # Help for boot command
endpoint help kill-all               # Help for kill-all
endpoint help settings               # Help for settings

$ endpoint help boot
usage: endpoint boot [-h] [--no-watch] [--p100]

Deploy LLM VPS to Kaggle and stream status

options:
  -h, --help   show this help message and exit
  --no-watch   Build and push without streaming
  --p100       Use P100 GPU (default T4 x2)

`endpoint version`

Show version and build information.

Use case: Verify which version of Endpoint you have installed, confirm the active config path, and check the kernel name.

Flags: None.

Examples:

endpoint version

$ endpoint version
Endpoint CLI v0.1.0
License: GPLv3+
Engine version: 0.1.0
Config: /home/user/.config/endpoint/endpoint-config.yaml
Kernel: shesher/endpoint-llm-vps

Configuration

Endpoint uses a YAML configuration file loaded in layers. Each layer overrides the one before it:

Bundled defaults — endpoint/data/endpoint-config.yaml
User config — ~/.config/endpoint/endpoint-config.yaml
Local override — ./endpoint-config.yaml (project root, gitignored)
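Layering like this is commonly implemented as a deep merge in which later files win. The sketch below assumes that nested sections are merged key by key rather than replaced wholesale; the README does not say which of the two Endpoint does, and `merge_layers` is a hypothetical name.

```python
import copy


def merge_layers(*layers):
    """Deep-merge dictionaries left to right; later layers win on conflicts."""
    merged = {}
    for layer in layers:
        _merge_into(merged, layer or {})
    return merged


def _merge_into(target, source):
    for key, value in source.items():
        if isinstance(value, dict) and isinstance(target.get(key), dict):
            _merge_into(target[key], value)  # recurse into nested sections
        else:
            target[key] = copy.deepcopy(value)  # copy so inputs stay untouched


defaults = {"llm": {"threads": 4, "temperature": 0.7}, "engine": {"accelerator": "cpu"}}
user = {"llm": {"threads": 8}}
local = {"engine": {"accelerator": "gpu_t4"}}
config = merge_layers(defaults, user, local)
```

With these inputs, `config["llm"]` keeps the bundled `temperature` while taking `threads` from the user file, and `config["engine"]["accelerator"]` comes from the local override.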

Full Schema

# ── Identity ──────────────────────────────────────────────────────
identity:
  kaggle_username: "your-username"         # Kaggle username
  kernel_slug: "endpoint-llm-vps"          # Kaggle kernel slug
  api_key: ""                              # API key (auto-provisioned on boot)

# ── Engine ────────────────────────────────────────────────────────
engine:
  version: "0.1.0"                         # Engine version
  engine_port: 5003                        # FastAPI server port
  accelerator: "cpu"                       # cpu | gpu_t4 | gpu_p100 | tpu

# ── Signal (ntfy.sh) ─────────────────────────────────────────────
signal:
  topic_prefix: "endpoint"                 # ntfy.sh topic prefix

# ── LLM Inference ─────────────────────────────────────────────────
llm:
  port: 8080                               # llama.cpp server port
  context_length: 8192                     # Context window (tokens)
  batch_size: 512                          # Batch size
  threads: 4                               # CPU threads
  flash_attn: true                         # Flash attention
  max_tokens: 2048                         # Max generation tokens
  temperature: 0.7                         # Sampling temperature
  top_p: 0.9                               # Nucleus sampling
  top_k: 40                                # Top-K sampling
  repeat_penalty: 1.1                      # Repetition penalty
  min_p: 0.0                               # Min-P sampling
  typical_p: 0.0                           # Typical sampling
  ngl: 0                                   # GPU layers (0=CPU, 999=max)
  tensor_split: ""                         # Multi-GPU split: "1,1"
  mlock: false                             # Lock model in RAM
  no_mmap: false                           # Disable memory mapping
  cache_size: 0                            # KV cache size (0=auto)
  chat_template: ""                        # Jinja2 template override
  ignore_eos: false                        # Ignore EOS token

# ── Image Generation (sd-server) ─────────────────────────────────
image:
  port: 8081                               # sd-server port
  steps: 20                                # Denoising steps
  cfg_scale: 7.0                           # Guidance scale
  width: 1024                              # Output width
  height: 1024                             # Output height
  sampler: "euler"                         # Sampler type

# ── Video Generation ─────────────────────────────────────────────
video:
  port: 8081                               # sd-server port (shared)
  fps: 12                                  # Frames per second
  frames: 41                               # Total frames
  cfg_scale: 5.0                           # Video CFG scale
  steps: 20                                # Denoising steps

# ── Models ────────────────────────────────────────────────────────
default_model: "bartowski/Qwen2.5-Coder-3B-Instruct-abliterated-GGUF"
default_model_file: "*.Q4_K_M.gguf"
default_model_index: 0

models:
  - name: "Qwen2.5-Coder-3B-Instruct-abliterated"
    hf_repo: "bartowski/Qwen2.5-Coder-3B-Instruct-abliterated-GGUF"
    hf_file: "*.Q4_K_M.gguf"
    size_gb: 1.99
    temperature: 0.7
    top_p: 0.9
    top_k: 40
  - name: "Qwen2.5-1.5B-Instruct"
    hf_repo: "Qwen/Qwen2.5-1.5B-Instruct-GGUF"
    hf_file: "Qwen2.5-1.5B-Instruct-Q4_K_M.gguf"
    size_gb: 0.95
    temperature: 0.7
    top_p: 0.9
    top_k: 40

# ── Proxy ─────────────────────────────────────────────────────────
proxy:
  enabled: true
  domain: "api.endpoint.dpdns.org"
  url: "https://api.endpoint.dpdns.org"
  base_url: "https://api.endpoint.dpdns.org/v1"
  register_url: "https://api.endpoint.dpdns.org/__register"
  unregister_url: "https://api.endpoint.dpdns.org/__unregister"

# ── VPS Packages ─────────────────────────────────────────────────
packages:
  system:
    - curl
    - wget
    - tar
    - jq
    - python3-dev
    - pip
  python:
    - huggingface_hub
    - requests
    - fastapi
    - uvicorn[standard]
    - pydantic

Accelerator Reference

Accelerator	CLI Flag	Kaggle Type	Max Model Size	Quota
CPU	(default)	CPU only	20B params	Unlimited
GPU T4 x2	`--gpu` / `-g`	`NvidiaTeslaT4`	70B params	30h/week
GPU P100	`--p100`	`NvidiaTeslaP100`	40B params	30h/week
TPU v5e-8	`--tpu` / `-t`	`TpuV5E8`	100B params	20h/week
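The table can be turned into a small lookup that picks the least capable accelerator able to hold a model of a given size. This is only a sketch built from the limits listed above; `smallest_accelerator` is a hypothetical helper, it ignores weekly quota, and parameter count alone is a rough proxy because quantization also affects memory use.

```python
# (config value, CLI flag, max model size in billions of parameters),
# ordered from smallest to largest capacity as listed in the table.
ACCELERATORS = [
    ("cpu", None, 20),
    ("gpu_p100", "--p100", 40),
    ("gpu_t4", "--gpu", 70),
    ("tpu", "--tpu", 100),
]


def smallest_accelerator(params_billion):
    """Return (config value, CLI flag) for the first accelerator that fits."""
    for name, flag, max_billion in ACCELERATORS:
        if params_billion <= max_billion:
            return name, flag
    raise ValueError(f"no accelerator listed for {params_billion}B parameters")
```

For example, a 32B model maps to `("gpu_p100", "--p100")`, which corresponds to `endpoint boot --p100`.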

API Reference

Once deployed, the engine exposes a full OpenAI-compatible REST API.

Public Endpoints (no auth required)

Method	Path	Description
`GET`	`/`	Root status: service info, docs link
`GET`	`/health`	Health check: backend status, model loaded, uptime
`GET`	`/tunnel`	Tunnel URL and status
`GET`	`/metrics`	Request count, latency, error rate, latency buckets
`GET`	`/docs`	Swagger UI documentation
`GET`	`/openapi.json`	OpenAPI 3.0 schema
`GET`	`/v1/apikey`	Get the current API key

Authenticated Endpoints (Bearer token required)

Chat & Text

Method	Path	Description
`POST`	`/v1/chat/completions`	Chat completions with streaming SSE support
`POST`	`/v1/completions`	Legacy text completions

Chat requests accept the following OpenAI parameters: model, messages, stream, temperature, top_p, max_tokens, stop, frequency_penalty, presence_penalty, logprobs, top_logprobs, seed, n, user, and reasoning_effort.

Models

Method	Path	Description
`GET`	`/v1/models`	List all models with metadata (type, context, size, reasoning)
`GET`	`/v1/models/{model_id}`	Get specific model details
`POST`	`/v1/models/pull?model=<id>`	Pull model from HuggingFace
`DELETE`	`/v1/models/{model_id}`	Remove a model
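Model IDs contain a slash, so the `model` query parameter of the pull endpoint needs URL encoding. The sketch below builds such a request with the standard library only; `build_pull_request` is a hypothetical helper, and the response format is not documented here, so it is left unparsed.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_pull_request(base_url, api_key, model_id):
    """Build (but do not send) a POST to /v1/models/pull for `model_id`."""
    query = urlencode({"model": model_id})  # encodes "/" as %2F
    return Request(
        f"{base_url.rstrip('/')}/v1/models/pull?{query}",
        method="POST",
        headers={"Authorization": f"Bearer {api_key}"},
    )


request = build_pull_request(
    "https://your-tunnel.trycloudflare.com",
    "sk-xxxx",
    "Qwen/Qwen2.5-1.5B-Instruct-GGUF",
)
```

Passing `request` to `urllib.request.urlopen` would send it to a running VPS.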

Embeddings & Rerank

Method	Path	Description
`POST`	`/v1/embeddings`	Generate text embeddings
`POST`	`/v1/rerank`	Re-rank documents by relevance

Image & Video

Method	Path	Description
`POST`	`/v1/images/generations`	Generate images (txt2img)
`POST`	`/v1/video/generations`	Generate video
`POST`	`/v1/video/edits`	Edit video

Tokenization & Moderation

Method	Path	Description
`POST`	`/v1/tokenize`	Tokenize text
`POST`	`/v1/detokenize`	Detokenize token IDs
`POST`	`/v1/moderations`	Content moderation

Settings & Logs

Method	Path	Description
`GET`	`/v1/settings`	Get current engine settings
`POST`	`/v1/settings`	Update engine settings
`GET`	`/v1/logs`	Get recent log lines
`GET`	`/v1/logs/stream`	SSE stream of live logs

Usage Examples

# Chat with streaming
curl -X POST https://your-tunnel.trycloudflare.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-xxxx" \
  -d '{
    "model": "default",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'

from openai import OpenAI

client = OpenAI(
    base_url="https://your-tunnel.trycloudflare.com/v1",
    api_key="sk-xxxx",
)
response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in response:
    print(chunk.choices[0].delta.content or "", end="")

# Generate embeddings
response = client.embeddings.create(
    model="default",
    input=["Hello world", "How are you?"],
)
print(response.data[0].embedding[:5])  # First 5 dimensions

Environment Variables

CLI Variables

Variable	Required	Default	Description
`KAGGLE_API_TOKEN`	Yes*	—	Kaggle API token (`kgat_...`). Alternative to `kaggle.json`
`ENDPOINT_API_KEY`	Auto-set	—	Engine API key (set by CLI during boot)
`CONFIG`	No	—	Path to YAML config file override

*Required if ~/.kaggle/kaggle.json is not present.

Engine Variables (set on the Kaggle VPS)

Variable	Default	Description
`LLAMA_PORT`	`8080`	llama.cpp server port
`SD_PORT`	`8081`	sd-server port
`ENGINE_PORT`	`5003`	FastAPI engine port
`LLM_SIGNAL_TOPIC`	`""`	ntfy.sh signal topic (auto-derived)
`LLM_CONTROL_TOPIC`	`""`	ntfy.sh control topic (auto-derived)
`SESSION_ID`	`"default"`	Session identifier
`ENGINE_VERSION`	`"0.1.0"`	Engine version string
`LLM_ACCELERATOR`	—	Accelerator type (`cpu`, `gpu_t4`, etc.)
`LLM_IDLE_TIMEOUT`	`3600`	Idle shutdown timeout in seconds (`0` = disable)
`RATE_LIMIT`	`"1"`	Enable rate limiting (`1`/`true`/`yes`)
`LLM_DEBUG`	—	Enable verbose error messages
`LLM_CONTEXT_LEN`	—	Context length override
`LLM_BATCH_SIZE`	—	Batch size override
`LLM_MAX_TOKENS`	—	Max tokens override
`LLM_NGL`	—	GPU layers override
`LLM_TENSOR_SPLIT`	—	Tensor split override
`DEFAULT_MODEL`	—	Default model ID override
`DEFAULT_MODEL_FILE`	—	Default model filename override
`MODEL_TYPE`	—	Model type override
`HF_TOKEN`	—	HuggingFace token for gated model downloads
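
To illustrate how the defaults in the table could be applied, here is a small sketch that reads a few of the engine variables from the environment. It is not the engine's own loader: `EngineEnv` and `load_engine_env` are hypothetical names, and treating the `RATE_LIMIT` values case-insensitively is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Values the table lists as enabling rate limiting.
TRUTHY = {"1", "true", "yes"}


@dataclass(frozen=True)
class EngineEnv:
    llama_port: int
    sd_port: int
    engine_port: int
    idle_timeout: int
    rate_limit: bool

    @property
    def idle_shutdown_enabled(self) -> bool:
        # The table documents 0 as "disable".
        return self.idle_timeout > 0


def load_engine_env(env: Optional[Mapping[str, str]] = None) -> EngineEnv:
    """Read engine variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return EngineEnv(
        llama_port=int(env.get("LLAMA_PORT", "8080")),
        sd_port=int(env.get("SD_PORT", "8081")),
        engine_port=int(env.get("ENGINE_PORT", "5003")),
        idle_timeout=int(env.get("LLM_IDLE_TIMEOUT", "3600")),
        # ASSUMPTION: comparison ignores case and surrounding whitespace.
        rate_limit=env.get("RATE_LIMIT", "1").strip().lower() in TRUTHY,
    )


if __name__ == "__main__":
    print(load_engine_env())
```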

Project Structure

endpoint/
├── endpoint/                        # CLI package (pip-installable)
│   ├── __init__.py                  # Version string
│   ├── __main__.py                  # python -m endpoint support
│   ├── main.py                      # Argparse parser, dispatch
│   ├── commands.py                  # All 19 command implementations
│   ├── core.py                      # Config, VPSClient, console, signals, cache
│   ├── py.typed                     # PEP 561 type marker
│   └── data/                        # Bundled defaults
│       ├── endpoint-config.yaml
│       └── endpoint-config.example.yaml
├── engine/                          # Inference server (runs on Kaggle)
│   ├── engine.py                    # FastAPI app: all API endpoints, lifecycle
│   └── models_config.py             # Model management, GGUF parsing, settings
├── scripts/                         # Build & automation
│   ├── master_build_notebook.py     # Kaggle notebook generator (962 lines)
│   ├── update_and_embed.py          # Engine payload sync into notebook
│   ├── lint.py                      # Code quality pipeline (ruff + mypy + pytest)
│   └── release.py                   # PyPI and GitHub release automation
├── cloudflare/                      # Cloudflare proxy worker
│   ├── deploy.py                    # Wrangler-based deployment orchestration
│   ├── proxy-worker.js              # CF Worker: rate limiting, KV tunnel map
│   ├── wrangler.toml                # Wrangler configuration
│   └── wrangler.example.toml        # Example wrangler config
├── tests/
│   └── sanity_test.py               # 26-test static analysis suite
├── man/
│   └── endpoint.1                   # Man page (roff format)
├── Makefile                         # Build, lint, test, publish targets
├── pyproject.toml                   # Package metadata + tool configuration
├── uv.lock                          # Dependency lock file
├── endpoint-config.yaml             # Local config (gitignored)
└── README.md

Development

Setup

git clone https://github.com/shesher/endpoint.git
cd endpoint
make install      # or: uv sync

Commands

make all          # Full pipeline: lint + test (default target)
make lint         # Run ruff linter (format check + lint)
make test         # Run pytest test suite
make build        # Build PyPI wheel + sdist
make install      # Editable install with uv
make release      # Dry-run release build
make publish      # Build & publish to PyPI
make clean        # Remove build artifacts

Quality Standards

Linting: ruff with full ruleset (E, F, I, N, W, UP, B, SIM, ARG, etc.)
Formatting: ruff formatter (compatible with Black)
Type Checking: mypy for CLI, engine, scripts, and tests
Testing: pytest with 26 static analysis tests covering syntax, security, naming, config schema, base64 roundtrip, version consistency, import resolution, license headers, file sizes, and more
CI: make all runs the full pipeline: lint → test
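
As an illustration of what a static analysis test in this style might look like, the sketch below checks that every Python file under a directory parses and that a byte payload survives a base64 roundtrip. It is not taken from tests/sanity_test.py; the function names and the choice of the `endpoint` directory are assumptions.

```python
import ast
import base64
from pathlib import Path


def find_syntax_errors(root: str) -> list[str]:
    """Parse every .py file under root and collect syntax error messages."""
    errors = []
    for path in sorted(Path(root).rglob("*.py")):
        try:
            ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
        except SyntaxError as exc:
            errors.append(f"{path}: line {exc.lineno}: {exc.msg}")
    return errors


def survives_base64_roundtrip(payload: bytes) -> bool:
    """Check that encoding then decoding returns the original bytes."""
    return base64.b64decode(base64.b64encode(payload)) == payload


# pytest collects functions prefixed with "test_".
def test_sources_parse():
    # ASSUMPTION: run from the repository root, where endpoint/ exists.
    assert find_syntax_errors("endpoint") == []


def test_base64_roundtrip():
    assert survives_base64_roundtrip(b"print('engine payload')\n")
```

Tests like these need no network access or Kaggle session, so they can run on every commit.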

License

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation.

Project details

This version

0.1.0

May 30, 2026

Download files

Source Distribution

endpoint_vps-0.1.0.tar.gz (144.5 kB view details)

Uploaded May 30, 2026 Source

Built Distribution

endpoint_vps-0.1.0-py3-none-any.whl (135.0 kB view details)

Uploaded May 30, 2026 Python 3

File details

Details for the file endpoint_vps-0.1.0.tar.gz.

File metadata

Download URL: endpoint_vps-0.1.0.tar.gz
Upload date: May 30, 2026
Size: 144.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.11.14

File hashes

Hashes for endpoint_vps-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`e59a43ebb84e4a9b679364924c13d8978c4e0105b3c3ea5bba465ccad26bbce0`
MD5	`10c0325e00127432be4d22a309390129`
BLAKE2b-256	`248d3b53639955583315d088a373603db04ea0dcb30c605968527a0f1458ab3b`

File details

Details for the file endpoint_vps-0.1.0-py3-none-any.whl.

File metadata

Download URL: endpoint_vps-0.1.0-py3-none-any.whl
Upload date: May 30, 2026
Size: 135.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.11.14

File hashes

Hashes for endpoint_vps-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`479ce947eefd267638735e1fcdd39a1bbfd9621311b0b5f4e63906815cfab015`
MD5	`d65281a2086b1bc0c51c6729617c0f5b`
BLAKE2b-256	`c956a55ab836b3a7c07d74833fbe504b8c3d1f8da5ebe7fad3e14ef4043ddefe`
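
A downloaded file can be checked against the SHA256 digests listed above before installing it. The following is a minimal sketch using only the standard library; `sha256_of` and `matches_digest` are hypothetical helper names, and the file path in the usage section assumes the wheel was saved to the current directory.

```python
import hashlib
import hmac


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path: str, expected_hex: str) -> bool:
    """Compare the file's digest with the published one."""
    return hmac.compare_digest(sha256_of(path), expected_hex.strip().lower())


if __name__ == "__main__":
    # Digest copied from the wheel's SHA256 row above.
    expected = "479ce947eefd267638735e1fcdd39a1bbfd9621311b0b5f4e63906815cfab015"
    ok = matches_digest("endpoint_vps-0.1.0-py3-none-any.whl", expected)
    print("digest OK" if ok else "digest MISMATCH: do not install")
```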
