# ModelPulse
End-to-end partial-weight transfer pipeline for edge LLM inference.
ModelPulse enables a unique "Zero-Disk" inference strategy: Device A (Server) serves model shards over the network, while Device B (Client/Bridge) reconstructs the model entirely in RAM and runs inference via llama.cpp without ever writing the full GGUF to physical storage.
## Data Flow Diagram
```
┌──────────────────────────────────────────────────────────────┐
│                      Server (Device A)                       │
│                    FastAPI @ 0.0.0.0:8000                    │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  WebSocket /ws (Control Plane)       HTTP (Data Plane)       │
│   ├─ MODEL_READY                     ├─ GET /manifest        │
│   ├─ PING/PONG                       ├─ GET /shards/*        │
│   ├─ METRICS                         └─ POST /metrics        │
│   └─ ACK/BYE                                                 │
│                                                              │
│  /models/upload (Multipart)                                  │
│   └─ Accepts manifest.json + *.shard files                   │
│                                                              │
└──────────────────────────────────────────────────────────────┘
          │                                     │
          │ WS connect                          │ HTTP GET/POST
          │ + MODEL_READY signal                │ + shard stream
          │                                     │
┌─────────┴─────────────────────────────────────┴──────────────┐
│                      Client (Device B)                       │
│                          Bridge CLI                          │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  1. Connect WebSocket → Send HELLO                           │
│  2. Receive MODEL_READY → Fetch manifest (HTTP)              │
│  3. Download shards (HTTP streaming)                         │
│  4. Assemble GGUF in /dev/shm                                │
│  5. Load with llama.cpp                                      │
│  6. Run inference                                            │
│  7. Send METRICS → Loop back to step 2                       │
│     (no restart, listen for next model)                      │
│                                                              │
└──────────────────────────────────────────────────────────────┘
```
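For orientation, here is a minimal sketch of the bridge's event loop in Python. It is illustrative only, not ModelPulse's actual implementation (that lives in `modelpulse/client/bridge.py`); it assumes the `websockets` and `httpx` packages, and the JSON message framing is a guess based on the message names in the diagram.

```python
# Minimal sketch of the bridge's event loop -- illustrative only; the real
# client lives in modelpulse/client/bridge.py. Assumes the `websockets` and
# `httpx` packages; the JSON framing below is a guess, not the wire format.
import asyncio
import json

import httpx
import websockets

SERVER = "http://100.64.0.5:8000"                   # hypothetical server address
WS_URL = SERVER.replace("http://", "ws://") + "/ws"

async def bridge_loop() -> None:
    async with websockets.connect(WS_URL) as ws:
        await ws.send(json.dumps({"type": "HELLO"}))       # announce ourselves
        async with httpx.AsyncClient(base_url=SERVER) as http:
            async for raw in ws:                           # control plane
                msg = json.loads(raw)
                if msg.get("type") != "MODEL_READY":
                    continue                               # e.g. PING/ACK
                manifest = (await http.get("/manifest")).json()
                buf = bytearray()
                for name in manifest["shards"]:            # assumed field name
                    async with http.stream("GET", f"/shards/{name}") as r:
                        async for chunk in r.aiter_bytes():
                            buf += chunk                   # assemble in RAM
                # ... write buf to /dev/shm, load with llama.cpp, infer ...
                await http.post("/metrics", json={"type": "METRICS"})

asyncio.run(bridge_loop())
```

The key design point the diagram encodes is the split between a persistent WebSocket control plane and stateless HTTP data-plane fetches: the server can push `MODEL_READY` at any time without renegotiating the large transfers.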
## ✨ Key Features
- 🛡️ Zero-Disk Strategy: Models are assembled in tmpfs (`/dev/shm`), ensuring no persistent GGUF footprint on the client's disk.
- Dynamic Model Swapping: Upload new models to the server at runtime; connected clients automatically unload, pull, and reload the new model without a restart.
- Real-time Telemetry: Detailed inference metrics (TTFT, tok/s, RAM delta, CPU temp) are streamed back to the server for centralized monitoring (an illustrative record is sketched after this list).
- 🛠️ Integrated Benchmarking: Built-in suite to stress-test edge devices and validate performance across different quantization levels.
- Network Agnostic: Works seamlessly over local networks, Tailscale, or any HTTP/WS-capable connection.
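As a concrete illustration of the telemetry bullet above, the snippet below writes one metrics record in JSON-lines form, matching the server's `metrics.jsonl` log. The field names are assumptions inferred from the metric names listed (TTFT, tok/s, RAM delta, CPU temp), not the actual `InferenceMetrics` schema.

```python
# Hypothetical shape of one metrics.jsonl record. Field names are inferred
# from the feature list above, not the actual InferenceMetrics schema.
import json
import time

record = {
    "ts": time.time(),        # wall-clock timestamp
    "model": "qwen-3.5-2b",
    "ttft_ms": 412.0,         # time to first token
    "tok_per_s": 9.8,         # generation throughput
    "ram_delta_mb": 1610.0,   # resident-memory growth during model load
    "cpu_temp_c": 61.5,       # edge-device thermal reading
}

with open("metrics.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")   # one JSON object per line
```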
## 📦 Installation
Install ModelPulse from PyPI:
```bash
pip install modelpulse
```
Alternatively, install directly from the repository for the latest dev features:
```bash
pip install git+https://github.com/MdSufiyan005/ModelPulse.git
```
Note: Ensure you have the `llama-cpp-python` system dependencies installed (e.g., `build-essential`, `python3-dev`).
## Workflow
### 1. Prepare Shards
Convert a monolithic `.gguf` file into a shard directory using the companion tool:

```bash
python tools/gguf_to_shards.py convert my_model.gguf ./my-shards/
```
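The converter works at tensor granularity: it splits the GGUF payload into `.shard` files and writes a `manifest.json` describing how to reassemble them. The layout below is a hedged illustration of what such a manifest might contain; the real fields are defined by `ShardManifest` in `modelpulse/shared/models.py` and may differ.

```python
# Illustrative ShardManifest contents -- the real fields are defined in
# modelpulse/shared/models.py and may differ from this guess.
manifest = {
    "model_name": "my_model",
    "total_size": 1_234_567_890,   # bytes in the reassembled GGUF
    "shards": [
        {"file": "00000.shard", "offset": 0,          "size": 16_777_216},
        {"file": "00001.shard", "offset": 16_777_216, "size": 16_777_216},
        # ... one entry per tensor-level shard ...
    ],
}
```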
### 2. Start the Server
Start the control plane on Device A. It defaults to `./models-storage` for storing model data.

```bash
modelpulse server run --host 0.0.0.0 --port 8000
```
### 3. Run the Bridge
Connect your edge device to the server. It will wait for a model to be assigned.
```bash
modelpulse bridge run http://<server-ip>:8000
```
### 4. Dynamic Upload
Upload your prepared shards to the server. All connected bridges will instantly receive the update.
```bash
./upload_model.sh "qwen-3.5-2b" "./my-shards/"
```
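Under the hood, `upload_model.sh` amounts to a multipart POST against the server's `/models/upload` endpoint (see the data-flow diagram). Below is a rough Python equivalent, assuming the `requests` package; only the endpoint comes from the docs, and the form-field names are illustrative guesses.

```python
# Rough Python equivalent of upload_model.sh -- illustrative. Only the
# /models/upload endpoint comes from the docs; the field names below are
# guesses. Assumes the `requests` package.
from pathlib import Path

import requests

server = "http://100.64.0.5:8000"       # hypothetical server address
model_name = "qwen-3.5-2b"
shard_dir = Path("./my-shards")

# Attach the manifest plus every shard file as multipart parts.
files = [("files", ("manifest.json", (shard_dir / "manifest.json").read_bytes()))]
for shard in sorted(shard_dir.glob("*.shard")):
    files.append(("files", (shard.name, shard.read_bytes())))

resp = requests.post(
    f"{server}/models/upload",
    data={"model_name": model_name},    # assumed form field
    files=files,
)
resp.raise_for_status()
```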
## Command Reference
### `modelpulse server run`
Start the FastAPI control plane.
| Option | Default | Description |
|---|---|---|
| `--shard-dir`, `-d` | `./models-storage` | Root directory for model storage |
| `--host` | `127.0.0.1` | Bind address |
| `--port` | `8000` | Listening port |
| `--metrics-log` | `metrics.jsonl` | File to append received telemetry |
### `modelpulse bridge run`
Connect to a server and enter the inference loop.
| Option | Default | Description |
|---|---|---|
| `host` | (Required) | Server URL (e.g., `http://100.64.0.5:8000`) |
| `--prompt`, `-p` | (Interactive) | Run a single prompt and exit |
| `--benchmark`, `-b` | `false` | Run the standard benchmark suite |
| `--max-tokens`, `-m` | `256` | Token generation limit |
| `--temperature` | `0.7` | Sampling temperature |
| `--n-ctx` | `2048` | Context window size |
## Project Layout
```
modelpulse/
├── modelpulse/              # Core package
│   ├── server/
│   │   └── server.py        # FastAPI + WebSocket control plane
│   ├── client/              # Bridge (Device B) logic
│   │   ├── cli.py           # Claude-inspired terminal UI
│   │   ├── bridge.py        # RAM GGUF assembly & llama.cpp loading
│   │   ├── shard_client.py  # Async HTTP downloader for shards
│   │   └── benchmarks.py    # Built-in performance testing suite
│   ├── shared/              # Cross-component protocol definitions
│   │   ├── ws_protocol.py   # WebSocket message schemas
│   │   └── models.py        # ShardManifest & InferenceMetrics models
│   └── main.py              # Unified CLI entry point
├── tools/                   # Model preparation utilities
│   ├── gguf_to_shards.py    # GGUF → Shard converter (tensor-level)
│   └── gguf_parser.py       # Low-level GGUF format metadata reader
├── upload_model.sh          # Script for dynamic model assignment
├── TEST_WORKFLOW.md         # Step-by-step end-to-end testing guide
├── pyproject.toml           # Project metadata & dependencies
└── metrics.jsonl            # Appended log of inference telemetry
```
## 💾 The Zero-Disk Strategy
ModelPulse leverages Linux tmpfs (a RAM-backed filesystem) to satisfy llama.cpp's requirement for a file path while keeping the actual data off physical storage; a minimal sketch follows the list:

- Pull: Bridge fetches `manifest.json`.
- Stream: Bridge pulls `.shard` files (tensor by tensor) into memory.
- Assemble: Bridge calculates the GGUF layout and writes bytes to `/dev/shm/sb_<pid>.gguf`.
- Load: `llama-cpp-python` loads the model via `mmap` from the RAM-backed file.
- Clean: Once the model is unloaded, the virtual file is unlinked and memory is reclaimed.
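The sketch below condenses the five steps into a few lines of Python. It is a simplified stand-in for `bridge.py`, assuming `llama-cpp-python` is installed; shard download and assembly are reduced to receiving the final byte string.

```python
# Simplified stand-in for bridge.py's Zero-Disk path -- illustrative only.
# Assumes llama-cpp-python; gguf_bytes is the GGUF assembled from shards.
import os

from llama_cpp import Llama

def run_from_ram(gguf_bytes: bytes, prompt: str, n_ctx: int = 2048) -> str:
    path = f"/dev/shm/sb_{os.getpid()}.gguf"   # tmpfs: RAM-backed, not disk
    with open(path, "wb") as f:
        f.write(gguf_bytes)                    # "Assemble" (simplified)
    try:
        llm = Llama(model_path=path, n_ctx=n_ctx, use_mmap=True)  # "Load"
        out = llm(prompt, max_tokens=64)       # run inference
        return out["choices"][0]["text"]
    finally:
        os.unlink(path)                        # "Clean": reclaim tmpfs memory
```

Because `/dev/shm` is RAM-backed, the `mmap` in the load step maps those same pages directly rather than copying from disk, and unlinking the file releases the memory once the mapping is closed.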
## 📡 Networking (Tailscale)
For easy cross-device connectivity without port forwarding, Tailscale is highly recommended:
```bash
# Get IP on Server
tailscale ip    # e.g., 100.66.170.100

# Connect Bridge
modelpulse bridge run http://100.66.170.100:8000
```
Built with ❤️ for Edge AI and Decentralized Inference.