
ModelPulse 🚀

End-to-end partial-weight transfer pipeline for edge LLM inference.

ModelPulse enables a unique "Zero-Disk" inference strategy: Device A (Server) serves model shards over the network, while Device B (Client/Bridge) reconstructs the model entirely in RAM and runs inference via llama.cpp without ever writing the full GGUF to physical storage.

Data Flow Diagram

┌─────────────────────────────────────────────────────────────┐
│                     Server (Device A)                       │
│                  FastAPI @ 0.0.0.0:8000                     │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  WebSocket /ws (Control Plane)   HTTP (Data Plane)          │
│  ├─ MODEL_READY                  ├─ GET /manifest           │
│  ├─ PING/PONG                    ├─ GET /shards/*           │
│  ├─ METRICS                      └─ POST /metrics           │
│  └─ ACK/BYE                                                 │
│                                                             │
│  /models/upload (Multipart)                                 │
│  └─ Accept manifest.json + *.shard files                    │
│                                                             │
└─────────────────────────────────────────────────────────────┘
         ↑                              ↑
         │                              │
         │ WS connect                   │ HTTP GET/POST
         │ + MODEL_READY signal         │ + shard stream
         │                              │
┌────────┴──────────────────────────────┴─────────────────────┐
│                   Client (Device B)                         │
│                       Bridge CLI                            │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  1. Connect WebSocket → Send HELLO                          │
│  2. Receive MODEL_READY → Fetch manifest (HTTP)             │
│  3. Download shards (HTTP streaming)                        │
│  4. Assemble GGUF in /dev/shm                               │
│  5. Load with llama.cpp                                     │
│  6. Run inference                                           │
│  7. Send METRICS → Wait for next MODEL_READY signal         │
│     (event-driven — no polling, no restart required)        │
│                                                             │
└─────────────────────────────────────────────────────────────┘
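
The control-plane exchange above needs only a handful of messages. Below is a minimal Python sketch of the bridge side of the handshake; the message names come from the diagram, but the JSON envelope ({"type": ...}) and the websockets/urllib usage are illustrative assumptions, not ModelPulse's actual internals.

import asyncio
import json
import urllib.request
import websockets

SERVER = "http://192.168.1.50:8000"  # example address; any reachable server works

async def bridge_loop():
    ws_url = SERVER.replace("http", "ws", 1) + "/ws"
    async with websockets.connect(ws_url) as ws:
        await ws.send(json.dumps({"type": "HELLO"}))          # step 1: announce ourselves
        while True:
            msg = json.loads(await ws.recv())
            if msg.get("type") == "MODEL_READY":              # step 2: a model was assigned
                with urllib.request.urlopen(SERVER + "/manifest") as r:
                    manifest = json.loads(r.read())           # data plane: fetch manifest
                # ... download shards, assemble GGUF, run inference ...
                await ws.send(json.dumps({"type": "METRICS", "data": {}}))

asyncio.run(bridge_loop())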

✨ Key Features

  • 🛡️ Zero-Disk Strategy: Models are assembled in tmpfs (/dev/shm), ensuring no persistent GGUF footprint on the client's disk.
  • 🔄 Dynamic Model Swapping: Upload new models to the server at runtime; connected clients automatically unload, pull, and reload the new model without a restart.
  • ⚡ Delta Updates (New!): Update only the changed tensors in a model. The bridge patches its in-memory GGUF in real time, downloading only a fraction of the full model size (see the sketch after this list).
  • 📊 Real-time Telemetry: Detailed inference metrics (TTFT, tok/s, RAM delta, CPU temp) are streamed back to the server for centralized monitoring.
  • 🛠️ Integrated Benchmarking: Built-in suite to stress-test edge devices and validate performance across different quantization levels.
  • 🌐 Network Agnostic: Works seamlessly over local networks, Tailscale, or any HTTP/WS-capable connection.
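
The Delta Updates feature boils down to byte-range writes against the assembled GGUF. A minimal sketch, assuming a hypothetical delta mapping of changed tensors to their byte offsets (the real manifest format may differ):

def apply_delta(gguf_path, delta):
    # delta: {"tensor_name": {"offset": int, "data": bytes}} -- hypothetical shape
    with open(gguf_path, "r+b") as f:
        for patch in delta.values():
            f.seek(patch["offset"])   # jump to this tensor's byte range
            f.write(patch["data"])    # overwrite with the updated weights
    # Because the file lives in /dev/shm, this rewrites RAM, not disk.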

📦 Installation

Install ModelPulse from PyPI:

pip install modelpulse

Alternatively, install directly from the repository for the latest dev features:

pip install git+https://github.com/MdSufiyan005/ModelPulse.git

Note: Ensure the system build dependencies for llama-cpp-python are installed (e.g., build-essential and python3-dev on Debian/Ubuntu).


🔄 Workflow

1. Prepare Shards

Convert a monolithic .gguf file into a shard directory:

modelpulse server convert my_model.gguf ./my-shards/

2. Start the Server

Start the control plane on Device A. Use --log-dir to specify where inference metrics are saved.

modelpulse server run --host 0.0.0.0 --port 8000 --log-dir ./results

3. Run the Bridge

Connect your edge device to the server. It will wait for a model to be assigned.

modelpulse bridge run http://<server-ip>:8000

4. Dynamic Upload

Upload your prepared shards to the server. All connected bridges will instantly receive the update.

# Full Baseline Upload
modelpulse server upload "qwen-3.5-2b" "./my-shards/"

# Delta Update (Auto-Diff)
modelpulse server upload "qwen-3.5-2b-v2" "./new-shards/" --base "qwen-3.5-2b" --base-dir "./old-shards/"
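
Conceptually, the auto-diff compares the base and new shard directories and keeps only the shards whose content changed. A sketch of that comparison, assuming one .shard file per tensor and plain content hashing (the CLI's internals may differ):

import hashlib
from pathlib import Path

def sha256(path):
    return hashlib.sha256(path.read_bytes()).hexdigest()

def changed_shards(new_dir, base_dir):
    base = {p.name: sha256(p) for p in Path(base_dir).glob("*.shard")}
    # A shard belongs in the delta if it is new or its hash differs from the base.
    return [p for p in Path(new_dir).glob("*.shard") if base.get(p.name) != sha256(p)]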

📋 Command Reference

modelpulse server run

Start the FastAPI control plane.

Option            Default             Description
--shard-dir, -d   ./models-storage    Root directory for model storage
--host            127.0.0.1           Bind address
--port            8000                Listening port
--log-dir         Current directory   Directory to save metrics.jsonl
--ping-interval   20.0                WebSocket ping interval (seconds)

modelpulse server upload

Upload models or delta patches to the control plane.

Option       Default                 Description
model_id     (Required)              Unique slug for the new model
paths        (Required)              Shard directory or list of .shard files
--base       None                    Base model ID for delta update
--base-dir   None                    Local directory of base model for auto-diff
--server     http://127.0.0.1:8000   Target server URL

modelpulse server convert

Convert a monolithic GGUF file into tensor-level shards.

Argument     Description
gguf_path    Path to the monolithic .gguf file
output_dir   Directory to store the generated shards

modelpulse bridge run

Connect to a server and enter the inference loop.

Option             Default              Description
host               (Required)           Server URL (e.g., http://100.64.0.5:8000)
--prompt           None (listen mode)   Send a single prompt then wait for further updates
--benchmark, -b    false                Run the standard benchmark suite
--max-tokens, -m   256                  Token generation limit
--temperature      0.7                  Sampling temperature
--n-ctx            2048                 Context window size
--perplexity, -p   false                Compute perplexity score during benchmark

modelpulse agent run

Run the iterative quantization + deployment optimizer.

Option                  Default                        Description
hf_model_id             (Required)                     Hugging Face model repo ID containing GGUF files
--base-model-name       (Required)                     Stable model slug prefix used for iterations
--hf-gguf-filename      ""                             Optional specific GGUF filename to pick from the HF snapshot
--gguf-path             None                           Optional local GGUF path (skips the HF download)
--device-name           (Required)                     Target device name
--ram-gb                (Required)                     Device RAM in GB
--cpu                   (Required)                     Target CPU model/name
--gpu                   ""                             Optional GPU model/name
--network               zerotier                       Network type (zerotier, tailscale, cloudflare, lan)
--max-iterations        4                              Number of optimization rounds (1-10)
--blockwise-top-k       64                             Top-K changed shards to send in blockwise delta mode
--prefer-quality        false                          Prefer quality over speed when planning/scoring
--require-llm-planner   false                          Enforce the Groq planner and fail fast instead of the heuristic fallback
--verbose-tool-logs     false                          Print full quantization tool logs (default is compact agent output)
--server                http://127.0.0.1:8000          ModelPulse server URL
--workspace             ./.modelpulse-agent            Agent artifact workspace
--hf-cache-dir          ./.modelpulse-agent/hf-cache   Hugging Face cache directory
--groq-model            llama-3.3-70b-versatile        Groq planner model
--quant-bin             llama-quantize                 Quantization binary path/name
--llama-cpp-dir         None                           Optional llama.cpp source dir; if omitted, the bundled modelpulse/llama.cpp is used when available

📁 Project Layout

modelpulse/
├── modelpulse/
│   ├── main.py                 # Unified CLI entry point: bridge/server/agent
│   ├── server/
│   │   ├── app.py              # FastAPI app (HTTP + WS control/data plane)
│   │   ├── cli.py              # Server commands: run/upload/convert
│   │   ├── connection.py       # WS client manager
│   │   └── helpers.py          # File hash helpers
│   └── agent/
│       ├── cli.py              # Agent command and Rich output
│       ├── orchestrator.py     # Iterative optimization loop
│       ├── planner.py          # Groq planner + heuristic fallback
│       ├── quantization.py     # Quant plan + llama-quantize execution
│       ├── downloader.py       # Hugging Face GGUF fetch + selection
│       ├── toolchain.py        # Auto-detect/build llama-quantize
│       └── models.py           # Agent dataclasses/report model
├── README.md
└── pyproject.toml

💾 The Zero-Disk Strategy

ModelPulse leverages the Linux tmpfs (RAM-backed filesystem) to satisfy llama.cpp's requirement for a file path while keeping the actual data off physical storage:

  1. Pull: Bridge fetches manifest.json.
  2. Stream: Bridge pulls .shard files (tensor by tensor) into memory.
  3. Assemble: Bridge calculates GGUF layout and writes bytes to /dev/shm/sb_<pid>.gguf.
  4. Load: llama-cpp-python loads the model via mmap from the RAM-backed file.
  5. Clean: Once the model is unloaded, the virtual file is unlinked and memory is reclaimed.
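
A condensed sketch of steps 3-5 using llama-cpp-python's standard constructor; assembled_gguf_bytes is a hypothetical stand-in for the output of shard assembly:

import os
from llama_cpp import Llama

path = f"/dev/shm/sb_{os.getpid()}.gguf"          # RAM-backed tmpfs path
with open(path, "wb") as f:
    f.write(assembled_gguf_bytes)                 # hypothetical: bytes from shard assembly

llm = Llama(model_path=path, use_mmap=True, n_ctx=2048)
print(llm("Hello", max_tokens=16)["choices"][0]["text"])

del llm                                           # release the mmap'd model
os.unlink(path)                                   # unlinking frees the RAM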

📡 Networking (Tailscale, ZeroTier, Cloudflare)

For easy cross-device connectivity without port forwarding:

1. ZeroTier (Recommended for Large Models)

ZeroTier creates a virtual LAN with no payload limits, making it ideal for multi-GB model uploads.

# Connect Bridge to Server's ZeroTier IP
modelpulse bridge run http://10.147.17.100:8000 --benchmark

2. Tailscale

Standard virtual private networking:

# Get IP on Server
tailscale ip  # e.g., 100.66.170.100

# Connect Bridge
modelpulse bridge run http://100.66.170.100:8000

3. Cloudflare Tunnel

Good for public access, but note the 100 MB upload limit on the free tier, which may affect server upload commands.

# Connect Bridge to public tunnel URL
modelpulse bridge run https://modelpulse.your-domain.com

🤖 Agentic Quant Optimization

ModelPulse now includes an iterative optimization agent that:

  • accepts model + device requirements,
  • uses a Groq planner to choose quantization strategy (full_quant or tensor_blockwise),
  • quantizes and converts GGUF to shards,
  • deploys each iteration via the same modelpulse server upload flow (full + auto-diff delta),
  • reads benchmark metrics from /results/latest,
  • computes KL-divergence over changed tensor shards for blockwise mode and can send only the top-K shards (sketched after the example run below),
  • uses metric context for the next iteration and recommends the best-fit model.

Example run:

export GROQ_API_KEY="..."
modelpulse agent run "Qwen/Qwen2.5-0.5B-Instruct-GGUF" \
  --base-model-name "qwen-0.5b" \
  --hf-gguf-filename "qwen2.5-0.5b-instruct-f16.gguf" \
  --device-name "edge-rpi-5" \
  --ram-gb 8 \
  --cpu "Cortex-A76" \
  --network zerotier \
  --server http://10.147.17.100:8000 \
  --max-iterations 4 \
  --blockwise-top-k 64 \
  --require-llm-planner \
  --llama-cpp-dir "/home/haider/Coding/build-edgeopt/b-device/agent/Chiseled/llama.cpp"
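
For the tensor_blockwise path, the KL-divergence ranking can be pictured as follows. An illustrative NumPy sketch, assuming shards can be decoded back to float arrays; the agent's actual scoring may differ:

import numpy as np

def kl_divergence(old, new, bins=256):
    # Histogram both weight distributions over a shared range, then compare.
    lo = min(old.min(), new.min())
    hi = max(old.max(), new.max())
    p, _ = np.histogram(old, bins=bins, range=(lo, hi))
    q, _ = np.histogram(new, bins=bins, range=(lo, hi))
    p = (p + 1e-10) / (p + 1e-10).sum()           # smooth to avoid log(0)
    q = (q + 1e-10) / (q + 1e-10).sum()
    return float(np.sum(p * np.log(p / q)))

def top_k_shards(pairs, k=64):
    # pairs: iterable of (shard_name, old_array, new_array) -- illustrative input
    scored = [(name, kl_divergence(o, n)) for name, o, n in pairs]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:k]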

The agent auto-downloads the GGUF from Hugging Face into its cache (unless --gguf-path is provided) and auto-resolves or builds llama-quantize if it is missing.
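
The download step itself is equivalent to a plain huggingface_hub call; something along these lines (illustrative, reusing the repo, filename, and cache path from the example above):

from huggingface_hub import hf_hub_download

gguf_path = hf_hub_download(
    repo_id="Qwen/Qwen2.5-0.5B-Instruct-GGUF",
    filename="qwen2.5-0.5b-instruct-f16.gguf",
    cache_dir="./.modelpulse-agent/hf-cache",     # matches the --hf-cache-dir default
)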

If GROQ_API_KEY is unset or empty, or the Groq SDK is unavailable, planner decisions fall back to an internal heuristic so optimization can still run end to end. Use --require-llm-planner to disable the fallback and force LLM-only planning.

Artifacts are saved in ./.modelpulse-agent/, including optimization-report.json.


Built with ❤️ for Edge AI and Decentralized Inference.
