ModelPulse
End-to-end partial-weight transfer pipeline for edge LLM inference.
ModelPulse enables a unique "Zero-Disk" inference strategy: Device A (Server) serves model shards over the network, while Device B (Client/Bridge) reconstructs the model entirely in RAM and runs inference via llama.cpp without ever writing the full GGUF to physical storage.
Data Flow Diagram
┌──────────────────────────────────────────────────────────────┐
│                      Server (Device A)                       │
│                    FastAPI @ 0.0.0.0:8000                    │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  WebSocket /ws (Control Plane)        HTTP (Data Plane)      │
│   ├─ MODEL_READY                      ├─ GET  /manifest      │
│   ├─ PING/PONG                        ├─ GET  /shards/*      │
│   ├─ METRICS                          └─ POST /metrics       │
│   └─ ACK/BYE                                                 │
│                                                              │
│  /models/upload (Multipart)                                  │
│   └─ Accept manifest.json + *.shard files                    │
│                                                              │
└──────────────────────────────────────────────────────────────┘
          │                                     │
          │ WS connect                          │ HTTP GET/POST
          │ + MODEL_READY signal                │ + shard stream
          │                                     │
┌─────────┴─────────────────────────────────────┴──────────────┐
│                      Client (Device B)                       │
│                          Bridge CLI                          │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  1. Connect WebSocket → Send HELLO                           │
│  2. Receive MODEL_READY → Fetch manifest (HTTP)              │
│  3. Download shards (HTTP streaming)                         │
│  4. Assemble GGUF in /dev/shm                                │
│  5. Load with llama.cpp                                      │
│  6. Run inference                                            │
│  7. Send METRICS → Wait for next MODEL_READY signal          │
│     (event-driven: no polling, no restart required)          │
│                                                              │
└──────────────────────────────────────────────────────────────┘
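For a concrete sense of the protocol, the sketch below walks through the HELLO → MODEL_READY → manifest → shard-download handshake in plain Python. It is illustrative only: the endpoint paths and message names come from the diagram above, but the JSON message fields and the server address are assumptions, not ModelPulse's actual wire format.

```python
# Illustrative bridge handshake (not ModelPulse's actual wire format):
# endpoints and message names follow the diagram above; the JSON field
# names and the server address are assumptions.
import asyncio
import json

import requests
import websockets

SERVER = "http://192.168.1.10:8000"           # hypothetical server address
WS_URL = SERVER.replace("http", "ws") + "/ws"

async def bridge_loop() -> None:
    async with websockets.connect(WS_URL) as ws:
        # 1. Announce ourselves on the control plane.
        await ws.send(json.dumps({"type": "HELLO", "device": "edge-node-1"}))

        while True:
            msg = json.loads(await ws.recv())
            if msg.get("type") != "MODEL_READY":
                continue  # ignore PING/ACK/etc. in this sketch

            # 2. Fetch the manifest over the HTTP data plane.
            manifest = requests.get(f"{SERVER}/manifest").json()

            # 3. Stream each listed shard into memory (no disk writes).
            shards = {}
            for name in manifest.get("shards", []):   # field name assumed
                resp = requests.get(f"{SERVER}/shards/{name}", stream=True)
                shards[name] = b"".join(resp.iter_content(chunk_size=1 << 20))

            # 4. Assemble the GGUF in /dev/shm, run inference, report back.
            requests.post(f"{SERVER}/metrics", json={"type": "METRICS"})

asyncio.run(bridge_loop())
```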
Key Features
- Zero-Disk Strategy: Models are assembled in tmpfs (/dev/shm), ensuring no persistent GGUF footprint on the client's disk.
- Dynamic Model Swapping: Upload new models to the server at runtime; connected clients automatically unload, pull, and reload the new model without a restart.
- Delta Updates (New!): Update only the changed tensors in a model. The bridge patches its in-memory GGUF in real time, downloading only a fraction of the full model size.
- Real-time Telemetry: Detailed inference metrics (TTFT, tok/s, RAM delta, CPU temp) are streamed back to the server for centralized monitoring (see the log-parsing sketch after this list).
- Integrated Benchmarking: Built-in suite to stress-test edge devices and validate performance across different quantization levels.
- Network Agnostic: Works seamlessly over local networks, Tailscale, or any HTTP/WS-capable connection.
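Because telemetry is appended to a JSON-lines file on the server (see --log-dir below), it is easy to post-process. The snippet below is a minimal sketch that assumes hypothetical field names such as ttft_ms and tok_per_s; inspect your own metrics.jsonl for the keys the bridge actually emits.

```python
# Sketch: aggregate streamed telemetry from the server's metrics.jsonl.
# Field names (ttft_ms, tok_per_s) are assumptions, not the real schema.
import json
from pathlib import Path
from statistics import mean

records = [
    json.loads(line)
    for line in Path("results/metrics.jsonl").read_text().splitlines()
    if line.strip()
]

ttfts = [r["ttft_ms"] for r in records if "ttft_ms" in r]
speeds = [r["tok_per_s"] for r in records if "tok_per_s" in r]

print(f"runs: {len(records)}")
if ttfts:
    print(f"mean TTFT: {mean(ttfts):.1f} ms")
if speeds:
    print(f"mean throughput: {mean(speeds):.2f} tok/s")
```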
Installation
Install ModelPulse from PyPI:
pip install modelpulse
Alternatively, install directly from the repository for the latest dev features:
pip install git+https://github.com/MdSufiyan005/ModelPulse.git
Note: Ensure the system build dependencies for llama-cpp-python are installed (e.g., build-essential, python3-dev).
Workflow
1. Prepare Shards
Convert a monolithic .gguf file into a shard directory:
modelpulse server convert my_model.gguf ./my-shards/
2. Start the Server
Start the control plane on Device A. Use --log-dir to specify where inference metrics are saved.
modelpulse server run --host 0.0.0.0 --port 8000 --log-dir ./results
3. Run the Bridge
Connect your edge device to the server. It will wait for a model to be assigned.
modelpulse bridge run http://<server-ip>:8000
4. Dynamic Upload
Upload your prepared shards to the server. All connected bridges will instantly receive the update.
# Full Baseline Upload
modelpulse server upload "qwen-3.5-2b" "./my-shards/"
# Delta Update (Auto-Diff)
modelpulse server upload "qwen-3.5-2b-v2" "./new-shards/" --base "qwen-3.5-2b" --base-dir "./old-shards/"
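Conceptually, the auto-diff compares the base and new shard directories and ships only the shards whose bytes changed. The sketch below illustrates that idea with standard-library hashing; it is not ModelPulse's internal implementation, and the directory names simply reuse the ones from the commands above.

```python
# Sketch: derive a shard-level delta by hashing both directories.
# Mirrors the idea behind --base/--base-dir; not the actual implementation.
import hashlib
from pathlib import Path

def shard_hashes(shard_dir: str) -> dict[str, str]:
    """Map shard filename -> SHA-256 digest of its contents."""
    return {
        p.name: hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(Path(shard_dir).glob("*.shard"))
    }

base = shard_hashes("./old-shards")
new = shard_hashes("./new-shards")

changed = [name for name, digest in new.items() if base.get(name) != digest]
print(f"{len(changed)} of {len(new)} shards changed; only these need uploading")
```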
Command Reference
modelpulse server run
Start the FastAPI control plane.
| Option | Default | Description |
|---|---|---|
| `--shard-dir`, `-d` | `./models-storage` | Root directory for model storage |
| `--host` | `127.0.0.1` | Bind address |
| `--port` | `8000` | Listening port |
| `--log-dir` | Current directory | Directory to save `metrics.jsonl` |
| `--ping-interval` | `20.0` | WebSocket ping interval (seconds) |
modelpulse server upload
Upload models or delta patches to the control plane.
| Option | Default | Description |
|---|---|---|
| `model_id` | (Required) | Unique slug for the new model |
| `paths` | (Required) | Shard directory or list of `.shard` files |
| `--base` | `None` | Base model ID for delta update |
| `--base-dir` | `None` | Local directory of base model for auto-diff |
| `--server` | `http://127.0.0.1:8000` | Target server URL |
modelpulse server convert
Convert a monolithic GGUF file into tensor-level shards.
| Argument | Description |
|---|---|
| `gguf_path` | Path to the monolithic `.gguf` file |
| `output_dir` | Directory to store the generated shards |
modelpulse bridge run
Connect to a server and enter the inference loop.
| Option | Default | Description |
|---|---|---|
| `host` | (Required) | Server URL (e.g., `http://100.64.0.5:8000`) |
| `--prompt` | `None` (listen mode) | Send a single prompt, then wait for further updates |
| `--benchmark`, `-b` | `false` | Run the standard benchmark suite |
| `--max-tokens`, `-m` | `256` | Token generation limit |
| `--temperature` | `0.7` | Sampling temperature |
| `--n-ctx` | `2048` | Context window size |
| `--perplexity`, `-p` | `false` | Compute perplexity score during benchmark |
modelpulse agent run
Run the iterative quantization + deployment optimizer.
| Option | Default | Description |
|---|---|---|
| `hf_model_id` | (Required) | Hugging Face model repo ID containing GGUF files |
| `--base-model-name` | (Required) | Stable model slug prefix used for iterations |
| `--hf-gguf-filename` | `""` | Optional specific GGUF filename to pick from the HF snapshot |
| `--gguf-path` | `None` | Optional local GGUF path (skips the HF download) |
| `--device-name` | (Required) | Target device name |
| `--ram-gb` | (Required) | Device RAM in GB |
| `--cpu` | (Required) | Target CPU model/name |
| `--gpu` | `""` | Optional GPU model/name |
| `--network` | `zerotier` | Network type (`zerotier`, `tailscale`, `cloudflare`, `lan`) |
| `--max-iterations` | `4` | Number of optimization rounds (1-10) |
| `--blockwise-top-k` | `64` | Top-K changed shards to send in blockwise delta mode |
| `--prefer-quality` | `false` | Prefer quality over speed when planning/scoring |
| `--require-llm-planner` | `false` | Enforce the Groq planner and fail fast instead of using the heuristic fallback |
| `--verbose-tool-logs` | `false` | Print full quantization tool logs (default is compact agent output) |
| `--server` | `http://127.0.0.1:8000` | ModelPulse server URL |
| `--workspace` | `./.modelpulse-agent` | Agent artifact workspace |
| `--hf-cache-dir` | `./.modelpulse-agent/hf-cache` | Hugging Face cache directory |
| `--groq-model` | `llama-3.3-70b-versatile` | Groq planner model |
| `--quant-bin` | `llama-quantize` | Quantization binary path/name |
| `--llama-cpp-dir` | `None` | Optional llama.cpp source directory; if omitted, the bundled `modelpulse/llama.cpp` is used when available |
Project Layout
modelpulse/
├── modelpulse/
│   ├── main.py           # Unified CLI entry point: bridge/server/agent
│   ├── server/
│   │   ├── app.py        # FastAPI app (HTTP + WS control/data plane)
│   │   ├── cli.py        # Server commands: run/upload/convert
│   │   ├── connection.py # WS client manager
│   │   └── helpers.py    # File hash helpers
│   └── agent/
│       ├── cli.py            # Agent command and Rich output
│       ├── orchestrator.py   # Iterative optimization loop
│       ├── planner.py        # Groq planner + heuristic fallback
│       ├── quantization.py   # Quant plan + llama-quantize execution
│       ├── downloader.py     # Hugging Face GGUF fetch + selection
│       ├── toolchain.py      # Auto-detect/build llama-quantize
│       └── models.py         # Agent dataclasses/report model
├── README.md
└── pyproject.toml
The Zero-Disk Strategy
ModelPulse leverages Linux tmpfs (a RAM-backed filesystem) to satisfy llama.cpp's requirement for a file path while keeping the actual data off physical storage (a minimal sketch follows the list):
- Pull: The bridge fetches `manifest.json`.
- Stream: The bridge pulls `.shard` files (tensor by tensor) into memory.
- Assemble: The bridge calculates the GGUF layout and writes the bytes to `/dev/shm/sb_<pid>.gguf`.
- Load: `llama-cpp-python` loads the model via `mmap` from the RAM-backed file.
- Clean: Once the model is unloaded, the virtual file is unlinked and the memory is reclaimed.
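Here is a minimal sketch of the assemble/load/clean steps, assuming the shard payloads are already in memory; the assemble_gguf() helper is a placeholder for ModelPulse's real GGUF layout reconstruction.

```python
# Sketch: zero-disk load via tmpfs. assemble_gguf() is a placeholder for
# ModelPulse's real GGUF layout reconstruction from downloaded shards.
import os

from llama_cpp import Llama

def assemble_gguf(shards: dict[str, bytes]) -> bytes:
    # Simplified: the real step rebuilds the GGUF header and tensor layout.
    return b"".join(shards[name] for name in sorted(shards))

shards: dict[str, bytes] = {}                   # filled by the shard download step
ram_path = f"/dev/shm/sb_{os.getpid()}.gguf"    # RAM-backed file, no disk I/O

try:
    with open(ram_path, "wb") as f:             # Assemble
        f.write(assemble_gguf(shards))

    llm = Llama(model_path=ram_path, n_ctx=2048, use_mmap=True)  # Load via mmap
    out = llm("Hello from RAM!", max_tokens=32)                  # Run inference
    print(out["choices"][0]["text"])
finally:
    if os.path.exists(ram_path):
        os.unlink(ram_path)                     # Clean: reclaim the memory
```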
Networking (Tailscale, ZeroTier, Cloudflare)
For easy cross-device connectivity without port forwarding:
1. ZeroTier (Recommended for Large Models)
ZeroTier creates a virtual LAN with no payload limits, making it ideal for multi-GB model uploads.
# Connect Bridge to Server's ZeroTier IP
modelpulse bridge run http://10.147.17.100:8000 --benchmark
2. Tailscale
Standard virtual private networking:
# Get IP on Server
tailscale ip # e.g., 100.66.170.100
# Connect Bridge
modelpulse bridge run http://100.66.170.100:8000
3. Cloudflare Tunnel
Good for public access, but note the free tier's 100 MB upload limit, which may affect server upload commands for larger models.
# Connect Bridge to public tunnel URL
modelpulse bridge run https://modelpulse.your-domain.com
Agentic Quant Optimization
ModelPulse now includes an iterative optimization agent that:
- accepts model + device requirements,
- uses a Groq planner to choose a quantization strategy (`full_quant` or `tensor_blockwise`),
- quantizes and converts the GGUF to shards,
- deploys each iteration via the same `modelpulse server upload` flow (full + auto-diff delta),
- reads benchmark metrics from `/results/latest`,
- computes KL-divergence over changed tensor shards in blockwise mode and can send only the top-K shards (see the sketch at the end of this section),
- uses the metric context for the next iteration and recommends the best-fit model.
export GROQ_API_KEY="..."
modelpulse agent run "Qwen/Qwen2.5-0.5B-Instruct-GGUF" \
--base-model-name "qwen-0.5b" \
--hf-gguf-filename "qwen2.5-0.5b-instruct-f16.gguf" \
--device-name "edge-rpi-5" \
--ram-gb 8 \
--cpu "Cortex-A76" \
--network zerotier \
--server http://10.147.17.100:8000 \
--max-iterations 4 \
--blockwise-top-k 64 \
--require-llm-planner \
--llama-cpp-dir "/home/haider/Coding/build-edgeopt/b-device/agent/Chiseled/llama.cpp"
The agent auto-downloads the GGUF from Hugging Face into its cache (unless --gguf-path is provided)
and auto-resolves or builds llama-quantize if it is missing.
If GROQ_API_KEY is unset, set to an empty string, or the Groq SDK is unavailable, planner decisions fall back
to an internal heuristic so optimization can still run end-to-end.
Use --require-llm-planner to disable this fallback and force LLM-only planning.
Artifacts are saved in ./.modelpulse-agent/, including optimization-report.json.
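As a rough illustration of the blockwise top-K selection mentioned above, the sketch below scores each changed tensor shard by the KL-divergence between its old and new value histograms and keeps the K most-changed shards. The function and variable names are illustrative, not the agent's internal API.

```python
# Sketch: rank changed tensor shards by KL-divergence and keep the top-K.
# Names are illustrative; this is not the agent's internal API.
import numpy as np

def kl_divergence(p_vals: np.ndarray, q_vals: np.ndarray, bins: int = 64) -> float:
    """KL(P || Q) between the value histograms of two versions of a tensor."""
    lo = float(min(p_vals.min(), q_vals.min()))
    hi = float(max(p_vals.max(), q_vals.max()))
    p, _ = np.histogram(p_vals, bins=bins, range=(lo, hi))
    q, _ = np.histogram(q_vals, bins=bins, range=(lo, hi))
    p = p.astype(np.float64) + 1e-10   # smooth to avoid division by zero
    q = q.astype(np.float64) + 1e-10
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

def top_k_shards(old: dict[str, np.ndarray],
                 new: dict[str, np.ndarray],
                 k: int = 64) -> list[str]:
    """Return the k shard names whose value distributions drifted the most."""
    scores = {name: kl_divergence(old[name], new[name])
              for name in new if name in old}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```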
Built with ❤️ for Edge AI and Decentralized Inference.