
ModelPulse ๐Ÿš€

End-to-end partial-weight transfer pipeline for edge LLM inference.

ModelPulse enables a unique "Zero-Disk" inference strategy: Device A (Server) serves model shards over the network, while Device B (Client/Bridge) reconstructs the model entirely in RAM and runs inference via llama.cpp without ever writing the full GGUF to physical storage.

Data Flow Diagram

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                     Server (Device A)                       โ”‚
โ”‚                  FastAPI @ 0.0.0.0:8000                     โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚                                                             โ”‚
โ”‚  WebSocket /ws (Control Plane)   HTTP (Data Plane)          โ”‚
โ”‚  โ”œโ”€ MODEL_READY                  โ”œโ”€ GET /manifest           โ”‚
โ”‚  โ”œโ”€ PING/PONG                    โ”œโ”€ GET /shards/*           โ”‚
โ”‚  โ”œโ”€ METRICS                      โ””โ”€ POST /metrics           โ”‚
โ”‚  โ””โ”€ ACK/BYE                                                 โ”‚
โ”‚                                                             โ”‚
โ”‚  /models/upload (Multipart)                                 โ”‚
โ”‚  โ””โ”€ Accept manifest.json + *.shard files                    โ”‚
โ”‚                                                             โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
         โ†‘                              โ†‘
         โ”‚                              โ”‚
         โ”‚ WS connect                   โ”‚ HTTP GET/POST
         โ”‚ + MODEL_READY signal         โ”‚ + shard stream
         โ”‚                              โ”‚
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                   Client (Device B)                         โ”‚
โ”‚                       Bridge CLI                            โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚                                                             โ”‚
โ”‚  1. Connect WebSocket โ†’ Send HELLO                          โ”‚
โ”‚  2. Receive MODEL_READY โ†’ Fetch manifest (HTTP)             โ”‚
โ”‚  3. Download shards (HTTP streaming)                        โ”‚
โ”‚  4. Assemble GGUF in /dev/shm                               โ”‚
โ”‚  5. Load with llama.cpp                                     โ”‚
โ”‚  6. Run inference                                           โ”‚
โ”‚  7. Send METRICS โ†’ Loop back to step 2                      โ”‚
โ”‚     (no restart, listen for next model)                     โ”‚ 
โ”‚                                                             โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

โœจ Key Features

  • ๐Ÿ›ก๏ธ Zero-Disk Strategy: Models are assembled in tmpfs (/dev/shm), ensuring no persistent GGUF footprint on the client's disk.
  • ๐Ÿ”„ Dynamic Model Swapping: Upload new models to the server at runtime; connected clients automatically unload, pull, and reload the new model without a restart.
  • ๐Ÿ“Š Real-time Telemetry: Detailed inference metrics (TTFT, tok/s, RAM delta, CPU temp) are streamed back to the server for centralized monitoring.
  • ๐Ÿ› ๏ธ Integrated Benchmarking: Built-in suite to stress-test edge devices and validate performance across different quantization levels.
  • ๐ŸŒ Network Agnostic: Works seamlessly over local networks, Tailscale, or any HTTP/WS-capable connection.

๐Ÿ“ฆ Installation

Install ModelPulse from PyPI:

pip install modelpulse

Alternatively, install directly from the repository for the latest dev features:

pip install git+https://github.com/MdSufiyan005/ModelPulse.git

Note: Ensure you have llama-cpp-python dependencies installed on your system (e.g., build-essential, python3-dev).


๐Ÿ”„ Workflow

1. Prepare Shards

Convert a monolithic .gguf file into a shard directory using the companion tool:

python tools/gguf_to_shards.py convert my_model.gguf ./my-shards/
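Conceptually, the converter produces a set of `.shard` files plus a `manifest.json` that records sizes and checksums so the bridge can verify and reassemble them. The sketch below illustrates only that manifest/shard relationship with fixed-size byte splitting; the real `tools/gguf_to_shards.py` splits at tensor boundaries, and the manifest fields shown are assumptions:

```python
import hashlib

# Illustrative byte-level sharding. The real converter is tensor-level;
# the manifest field names here are illustrative assumptions.
def split_into_shards(data: bytes, shard_size: int):
    shards = [data[i:i + shard_size] for i in range(0, len(data), shard_size)]
    manifest = {
        "total_size": len(data),
        "shards": [
            {"index": i, "size": len(s), "sha256": hashlib.sha256(s).hexdigest()}
            for i, s in enumerate(shards)
        ],
    }
    return manifest, shards

def reassemble(manifest, shards) -> bytes:
    """Verify each shard's checksum, then concatenate in index order."""
    out = b""
    for entry, shard in zip(manifest["shards"], shards):
        assert hashlib.sha256(shard).hexdigest() == entry["sha256"]
        out += shard
    assert len(out) == manifest["total_size"]
    return out
```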

2. Start the Server

Start the control plane on Device A. By default it stores model data in ./models-storage.

modelpulse server run --host 0.0.0.0 --port 8000

3. Run the Bridge

Connect your edge device to the server. It will wait for a model to be assigned.

modelpulse bridge run http://<server-ip>:8000

4. Dynamic Upload

Upload your prepared shards to the server. All connected bridges will instantly receive the update.

./upload_model.sh "qwen-3.5-2b" "./my-shards/"
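Under the hood, the upload is an ordinary multipart/form-data POST to `/models/upload` carrying `manifest.json` and the `*.shard` files. The helper below sketches how such a body is framed; the form field name `"files"` is an illustrative assumption, not necessarily what the server expects:

```python
import io
import uuid

# Hypothetical sketch of the multipart/form-data body that upload_model.sh
# ultimately sends to POST /models/upload. The "files" field name is an
# illustrative assumption.
def build_multipart(files: dict[str, bytes]) -> tuple[bytes, str]:
    """Return (body, content_type) for a multipart upload of the given files."""
    boundary = uuid.uuid4().hex
    buf = io.BytesIO()
    for name, payload in files.items():
        buf.write(f"--{boundary}\r\n".encode())
        buf.write(
            f'Content-Disposition: form-data; name="files"; '
            f'filename="{name}"\r\n\r\n'.encode()
        )
        buf.write(payload)
        buf.write(b"\r\n")
    buf.write(f"--{boundary}--\r\n".encode())
    return buf.getvalue(), f"multipart/form-data; boundary={boundary}"
```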

๐Ÿ“‹ Command Reference

modelpulse server run

Start the FastAPI control plane.

Option            Default           Description
--shard-dir, -d   ./models-storage  Root directory for model storage
--host            127.0.0.1         Bind address
--port            8000              Listening port
--metrics-log     metrics.jsonl     File to append received telemetry

modelpulse bridge run

Connect to a server and enter the inference loop.

Option             Default         Description
host               (Required)      Server URL (e.g., http://100.64.0.5:8000)
--prompt, -p       (Interactive)   Run a single prompt and exit
--benchmark, -b    false           Run the standard benchmark suite
--max-tokens, -m   256             Token generation limit
--temperature      0.7             Sampling temperature
--n-ctx            2048            Context window size

๐Ÿ“ Project Layout

modelpulse/
โ”œโ”€โ”€ modelpulse/             # Core package
โ”‚   โ”œโ”€โ”€ server/             
โ”‚   โ”‚   โ””โ”€โ”€ server.py       # FastAPI + WebSocket control plane
โ”‚   โ”œโ”€โ”€ client/             # Bridge (Device B) logic
โ”‚   โ”‚   โ”œโ”€โ”€ cli.py          # Claude-inspired terminal UI
โ”‚   โ”‚   โ”œโ”€โ”€ bridge.py       # RAM GGUF assembly & llama.cpp loading
โ”‚   โ”‚   โ”œโ”€โ”€ shard_client.py # Async HTTP downloader for shards
โ”‚   โ”‚   โ””โ”€โ”€ benchmarks.py   # Built-in performance testing suite
โ”‚   โ”œโ”€โ”€ shared/             # Cross-component protocol definitions
โ”‚   โ”‚   โ”œโ”€โ”€ ws_protocol.py  # WebSocket message schemas
โ”‚   โ”‚   โ””โ”€โ”€ models.py       # ShardManifest & InferenceMetrics models
โ”‚   โ””โ”€โ”€ main.py             # Unified CLI entry point
โ”œโ”€โ”€ tools/                  # Model preparation utilities
โ”‚   โ”œโ”€โ”€ gguf_to_shards.py   # GGUF โ†’ Shard converter (tensor-level)
โ”‚   โ””โ”€โ”€ gguf_parser.py      # Low-level GGUF format metadata reader
โ”œโ”€โ”€ upload_model.sh         # Script for dynamic model assignment
โ”œโ”€โ”€ TEST_WORKFLOW.md        # Step-by-step end-to-end testing guide
โ”œโ”€โ”€ pyproject.toml          # Project metadata & dependencies
โ””โ”€โ”€ metrics.jsonl           # Appends log for inference telemetry

๐Ÿ’พ The Zero-Disk Strategy

ModelPulse leverages Linux tmpfs (a RAM-backed filesystem) to satisfy llama.cpp's requirement for a file path while keeping the actual data off physical storage:

  1. Pull: Bridge fetches manifest.json.
  2. Stream: Bridge pulls .shard files (tensor by tensor) into memory.
  3. Assemble: Bridge calculates GGUF layout and writes bytes to /dev/shm/sb_<pid>.gguf.
  4. Load: llama-cpp-python loads the model via mmap from the RAM-backed file.
  5. Clean: Once the model is unloaded, the virtual file is unlinked and memory is reclaimed.
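The assemble/load/clean steps reduce to a small amount of code. A minimal sketch, assuming Linux with `/dev/shm` mounted (with a fallback to the default temp directory elsewhere); the `sb_<pid>.gguf` name follows step 3 above:

```python
import os
import tempfile

# Minimal sketch of the tmpfs trick: give llama.cpp a real file path
# while the bytes stay in RAM. Falls back to the default temp dir on
# systems without /dev/shm.
def write_ram_backed(data: bytes) -> str:
    """Write assembled GGUF bytes to a RAM-backed file; return its path."""
    shm = "/dev/shm" if os.path.isdir("/dev/shm") else tempfile.gettempdir()
    path = os.path.join(shm, f"sb_{os.getpid()}.gguf")
    with open(path, "wb") as f:
        f.write(data)
    return path

def cleanup(path: str) -> None:
    """Unlink after the model is unloaded; the memory is then reclaimed."""
    os.unlink(path)
```

Because tmpfs files support mmap like any other file, `llama-cpp-python` can load the returned path without knowing it never touched a disk.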

๐Ÿ“ก Networking (Tailscale)

For easy cross-device connectivity without port forwarding, Tailscale is highly recommended:

# Get IP on Server
tailscale ip  # e.g., 100.66.170.100

# Connect Bridge
modelpulse bridge run http://100.66.170.100:8000

Built with โค๏ธ for Edge AI and Decentralized Inference.

Project details


Download files

Download the file for your platform.

Source Distribution

modelpulse-0.2.0.tar.gz (34.8 kB)

Uploaded Source

Built Distribution


modelpulse-0.2.0-py3-none-any.whl (32.1 kB)

Uploaded Python 3

File details

Details for the file modelpulse-0.2.0.tar.gz.

File metadata

  • Download URL: modelpulse-0.2.0.tar.gz
  • Upload date:
  • Size: 34.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for modelpulse-0.2.0.tar.gz
Algorithm Hash digest
SHA256 4d6ed8c2c479f1914abd3715e87bebfc6a21046d2dae42b50a25cb7d7e9aa6c0
MD5 8906eba4f2c0fb67b48687bde9bef941
BLAKE2b-256 dbaeb4bb1eedda2139b6a1dc4f1e1f475fb6dfe98f2b4ad6302b06e61fcd52c5


File details

Details for the file modelpulse-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: modelpulse-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 32.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for modelpulse-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 ca1634756cbb6911f09cdd8d51f48b7253d261a5e29dedb865aca9ebad406aef
MD5 b417bd46488acda73a3f4ceb87f3b8b9
BLAKE2b-256 86fb2688e870827e7386326fdb08270186174389c73943708109e797a1e38d72

