ModelPulse 🚀

End-to-end partial-weight transfer pipeline for edge LLM inference.

ModelPulse enables a unique "Zero-Disk" inference strategy: Device A (Server) serves model shards over the network, while Device B (Client/Bridge) reconstructs the model entirely in RAM and runs inference via llama.cpp without ever writing the full GGUF to physical storage.

Data Flow Diagram

┌─────────────────────────────────────────────────────────────┐
│                     Server (Device A)                       │
│                  FastAPI @ 0.0.0.0:8000                     │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  WebSocket /ws (Control Plane)   HTTP (Data Plane)          │
│  ├─ MODEL_READY                  ├─ GET /manifest           │
│  ├─ PING/PONG                    ├─ GET /shards/*           │
│  ├─ METRICS                      └─ POST /metrics           │
│  └─ ACK/BYE                                                 │
│                                                             │
│  /models/upload (Multipart)                                 │
│  └─ Accept manifest.json + *.shard files                    │
│                                                             │
└─────────────────────────────────────────────────────────────┘
         ↑                              ↑
         │                              │
         │ WS connect                   │ HTTP GET/POST
         │ + MODEL_READY signal         │ + shard stream
         │                              │
┌────────┴──────────────────────────────┴─────────────────────┐
│                   Client (Device B)                         │
│                       Bridge CLI                            │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  1. Connect WebSocket → Send HELLO                          │
│  2. Receive MODEL_READY → Fetch manifest (HTTP)             │
│  3. Download shards (HTTP streaming)                        │
│  4. Assemble GGUF in /dev/shm                               │
│  5. Load with llama.cpp                                     │
│  6. Run inference                                           │
│  7. Send METRICS → Wait for next MODEL_READY signal         │
│     (event-driven — no polling, no restart required)        │ 
│                                                             │
└─────────────────────────────────────────────────────────────┘
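The bridge side of the control plane in the diagram can be sketched as a small state machine. This is a minimal illustration, not ModelPulse's actual implementation: the message names (HELLO, MODEL_READY, PING/PONG, METRICS, BYE) come from the diagram, but the `Bridge` class, the `State` enum, and the JSON field names `type`, `model_id`, and `payload` are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class State(Enum):
    CONNECTING = "connecting"
    WAITING = "waiting"    # HELLO sent, no model assigned yet
    SERVING = "serving"    # a model is loaded and ready for inference
    CLOSED = "closed"


@dataclass
class Bridge:
    """Hypothetical bridge-side handler; messages are plain dicts."""
    state: State = State.CONNECTING
    model_id: Optional[str] = None
    outbox: list = field(default_factory=list)  # messages queued for the server

    def connect(self) -> None:
        # Step 1: announce the bridge over the WebSocket.
        self.outbox.append({"type": "HELLO"})
        self.state = State.WAITING

    def handle(self, msg: dict) -> None:
        kind = msg.get("type")
        if kind == "PING":
            self.outbox.append({"type": "PONG"})
        elif kind == "MODEL_READY":
            # Steps 2-5 (manifest fetch, shard download, assembly, load)
            # would run here; a new signal replaces the current model.
            self.model_id = msg["model_id"]
            self.state = State.SERVING
        elif kind == "BYE":
            self.state = State.CLOSED

    def report(self, metrics: dict) -> None:
        # Step 7: send metrics, then keep waiting for the next signal.
        self.outbox.append({"type": "METRICS", "payload": metrics})
```

Because every transition is driven by an incoming message, a second MODEL_READY simply swaps the model while the connection stays open, which matches the "no polling, no restart" note in the diagram.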

✨ Key Features

🛡️ Zero-Disk Strategy: Models are assembled in tmpfs (/dev/shm), ensuring no persistent GGUF footprint on the client's disk.
🔄 Dynamic Model Swapping: Upload new models to the server at runtime; connected clients automatically unload, pull, and reload the new model without a restart.
⚡ Delta Updates (New!): Update only the tensors that changed between model versions. The bridge patches its in-memory GGUF in real time, downloading only the changed shards instead of the full model.
📊 Real-time Telemetry: Detailed inference metrics (TTFT, tok/s, RAM delta, CPU temp) are streamed back to the server for centralized monitoring.
🛠️ Integrated Benchmarking: Built-in suite to stress-test edge devices and validate performance across different quantization levels.
🌐 Network Agnostic: Works seamlessly over local networks, Tailscale, or any HTTP/WS-capable connection.

📦 Installation

Install ModelPulse from PyPI:

pip install modelpulse

Alternatively, install directly from the repository for the latest dev features:

pip install git+https://github.com/MdSufiyan005/ModelPulse.git

Note: llama-cpp-python may need to compile native code during installation, so make sure its build dependencies are installed on your system (e.g., build-essential, python3-dev).

🔄 Workflow

1. Prepare Shards

Convert a monolithic .gguf file into a shard directory:

modelpulse server convert my_model.gguf ./my-shards/

2. Start the Server

Start the control plane on Device A. Use --log-dir to specify where inference metrics are saved.

modelpulse server run --host 0.0.0.0 --port 8000 --log-dir ./results

3. Run the Bridge

Connect your edge device to the server. It will wait for a model to be assigned.

modelpulse bridge run http://<server-ip>:8000

4. Dynamic Upload

Upload your prepared shards to the server. All connected bridges receive a MODEL_READY signal and pull the update without a restart.

# Full Baseline Upload
modelpulse server upload "qwen-3.5-2b" "./my-shards/"

# Delta Update (Auto-Diff)
modelpulse server upload "qwen-3.5-2b-v2" "./new-shards/" --base "qwen-3.5-2b" --base-dir "./old-shards/"
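The auto-diff step behind `--base-dir` is not documented here, but the idea of a delta update can be illustrated by comparing content hashes of two shard directories. The sketch below is an assumption about how such a diff could work, not ModelPulse's actual algorithm; `shard_digests` and `diff_shards` are hypothetical helpers, and it assumes shards keep the same file names between versions.

```python
import hashlib
from pathlib import Path


def shard_digests(shard_dir: Path) -> dict[str, str]:
    """Map each *.shard file name to the SHA-256 of its contents."""
    return {
        p.name: hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(shard_dir.glob("*.shard"))
    }


def diff_shards(old_dir: Path, new_dir: Path) -> dict[str, list[str]]:
    """Classify shards as changed, added, or removed between two versions."""
    old, new = shard_digests(old_dir), shard_digests(new_dir)
    return {
        "changed": sorted(n for n in new if n in old and new[n] != old[n]),
        "added": sorted(n for n in new if n not in old),
        "removed": sorted(n for n in old if n not in new),
    }
```

Only the shards listed under `changed` and `added` would need to travel over the network. For very large shards, hashing in fixed-size chunks would avoid reading each file fully into memory.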

📋 Command Reference

`modelpulse server run`

Start the FastAPI control plane.

| Option | Default | Description |
| --- | --- | --- |
| `--shard-dir`, `-d` | `./models-storage` | Root directory for model storage |
| `--host` | `127.0.0.1` | Bind address |
| `--port` | `8000` | Listening port |
| `--log-dir` | Current directory | Directory to save `metrics.jsonl` |
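To make the `metrics.jsonl` log concrete, here is a minimal sketch of deriving TTFT and tok/s from timestamps and appending one JSON object per line. The field names (`ttft_s`, `tok_per_s`, `n_tokens`) and both helper functions are hypothetical; the real record schema lives in ModelPulse's `InferenceMetrics` model and may differ. Throughput is computed here as tokens after the first divided by the time since the first token, which is one common convention among several.

```python
import json
from pathlib import Path


def summarize(start: float, first_token: float, end: float, n_tokens: int) -> dict:
    """Derive TTFT and decode throughput from three timestamps in seconds."""
    decode_time = end - first_token
    return {
        "ttft_s": round(first_token - start, 4),
        # Assumed convention: exclude the first token from throughput.
        "tok_per_s": round((n_tokens - 1) / decode_time, 2) if decode_time > 0 else None,
        "n_tokens": n_tokens,
    }


def append_metrics(log_dir: Path, record: dict) -> Path:
    """Append one record as a single JSON line to <log_dir>/metrics.jsonl."""
    log_dir.mkdir(parents=True, exist_ok=True)
    path = log_dir / "metrics.jsonl"
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return path
```

The JSON Lines layout keeps each inference run on its own line, so the file can be appended to safely and read back one record at a time.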

`modelpulse server upload`

Upload models or delta patches to the control plane.

| Option | Default | Description |
| --- | --- | --- |
| `model_id` | (Required) | Unique slug for the new model |
| `paths` | (Required) | Shard directory or list of .shard files |
| `--base` | `None` | Base model ID for delta update |
| `--base-dir` | `None` | Local directory of base model for auto-diff |
| `--server` | `http://127.0.0.1:8000` | Target server URL |

`modelpulse server convert`

Convert a monolithic GGUF file into tensor-level shards.

| Argument | Description |
| --- | --- |
| `gguf_path` | Path to the monolithic .gguf file |
| `output_dir` | Directory to store the generated shards |
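Any tensor-level conversion has to start by reading the GGUF header. The sketch below parses only the fixed header fields as defined by the GGUF format (magic, version, tensor count, metadata key/value count) and assumes a little-endian file. It is an illustration, not the code in ModelPulse's `parser.py`; `read_gguf_header` is a hypothetical name.

```python
import struct

GGUF_MAGIC = b"GGUF"


def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size GGUF header from the start of a file's bytes."""
    if data[:4] != GGUF_MAGIC:
        raise ValueError("not a GGUF file: bad magic")
    (version,) = struct.unpack_from("<I", data, 4)
    if version == 1:
        # GGUF v1 stored both counts as 32-bit integers.
        tensor_count, kv_count = struct.unpack_from("<II", data, 8)
    elif version in (2, 3):
        # v2 and later widened the counts to 64 bits.
        tensor_count, kv_count = struct.unpack_from("<QQ", data, 8)
    else:
        raise ValueError(f"unsupported GGUF version: {version}")
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }
```

The metadata key/value pairs and the tensor info table follow this header; a sharder would walk those to find each tensor's offset and size before splitting the file.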

`modelpulse bridge run`

Connect to a server and enter the inference loop.

| Option | Default | Description |
| --- | --- | --- |
| `host` | (Required) | Server URL (e.g., `http://100.64.0.5:8000`) |
| `--prompt` | `None` (listen mode) | Send a single prompt then wait for further updates |
| `--benchmark`, `-b` | `false` | Run the standard benchmark suite |
| `--max-tokens`, `-m` | `256` | Token generation limit |
| `--temperature` | `0.7` | Sampling temperature |
| `--n-ctx` | `2048` | Context window size |
| `--perplexity`, `-p` | `false` | Compute perplexity score during benchmark |
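How the benchmark computes its perplexity score is not described here. For reference, the standard definition is the exponential of the negative mean log-probability the model assigns to each token. The helper below is a hypothetical sketch of that formula and assumes natural-log probabilities.

```python
import math


def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp(-(1/N) * sum of per-token natural-log probabilities)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))
```

A model that assigns probability 0.25 to every token has a perplexity of 4: it is as uncertain as a uniform choice among four options. Lower values indicate a better fit, which makes the score useful for comparing quantization levels on the same text.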

📁 Project Layout

modelpulse/
├── modelpulse/                 # Core package
│   ├── server/
│   │   ├── app.py              # FastAPI application factory & routes
│   │   ├── cli.py              # Typer CLI commands (run, upload, convert)
│   │   ├── connection.py       # WebSocket connection management
│   │   ├── helpers.py          # SHA-256 & Fast-ID utilities
│   │   ├── server.py           # Compatibility shim (re-exports CLI)
│   │   └── sharder/            # GGUF processing utilities
│   │       ├── converter.py    # GGUF → Shard converter (tensor-level)
│   │       └── parser.py       # Low-level GGUF binary reader (v1/2/3)
│   ├── client/                 # Bridge (Device B) logic
│   │   ├── cli.py              # Terminal UI & inference loop
│   │   ├── bridge.py           # RAM GGUF assembly & llama.cpp loading
│   │   ├── shard_client.py     # Async HTTP downloader for shards
│   │   └── benchmarks.py       # Built-in performance testing suite
│   ├── shared/                 # Cross-component protocol definitions
│   │   ├── ws_protocol.py      # WebSocket message schemas
│   │   └── models.py           # ShardManifest & InferenceMetrics models
│   └── main.py                 # Unified CLI entry point
├── tools/                      # Legacy pre-refactor scripts (not used by the package)
├── TEST_WORKFLOW.md            # Step-by-step end-to-end testing guide
└── pyproject.toml              # Project metadata & dependencies

💾 The Zero-Disk Strategy

ModelPulse leverages the Linux tmpfs (RAM-backed filesystem) to satisfy llama.cpp's requirement for a file path while keeping the actual data off physical storage:

1. Pull: Bridge fetches manifest.json.
2. Stream: Bridge pulls .shard files (tensor by tensor) into memory.
3. Assemble: Bridge calculates GGUF layout and writes bytes to /dev/shm/sb_<pid>.gguf.
4. Load: llama-cpp-python loads the model via mmap from the RAM-backed file.
5. Clean: Once the model is unloaded, the virtual file is unlinked and memory is reclaimed.
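The assemble and clean steps can be sketched with the standard library alone. This is a simplified illustration under stated assumptions: it concatenates chunks in order and ignores the real GGUF layout calculation (header, metadata, tensor alignment). The `ram_dir`, `assemble`, and `release` helpers are hypothetical; only the `/dev/shm/sb_<pid>.gguf` naming comes from the description above.

```python
import os
import tempfile
from pathlib import Path
from typing import Iterable, Optional


def ram_dir() -> Path:
    """Prefer /dev/shm (tmpfs on Linux); fall back to the temp dir elsewhere."""
    shm = Path("/dev/shm")
    if shm.is_dir() and os.access(shm, os.W_OK):
        return shm
    # Note: the fallback is usually disk-backed, so it is not zero-disk.
    return Path(tempfile.gettempdir())


def assemble(chunks: Iterable[bytes], directory: Optional[Path] = None) -> Path:
    """Write chunks in order to sb_<pid>.gguf and return the file path."""
    target = (directory or ram_dir()) / f"sb_{os.getpid()}.gguf"
    with target.open("wb") as fh:
        for chunk in chunks:
            fh.write(chunk)
    return target


def release(path: Path) -> None:
    """Unlink the file so the kernel can reclaim the memory behind it."""
    path.unlink(missing_ok=True)
```

The returned path is what would be handed to llama-cpp-python as the model path. Calling `release` in a `finally` block helps ensure the RAM-backed file does not outlive a crashed inference loop.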

📡 Networking (Tailscale, ZeroTier, Cloudflare)

For easy cross-device connectivity without port forwarding:

1. ZeroTier (Recommended for Large Models)

ZeroTier creates a virtual LAN with no payload limits, making it ideal for multi-GB model uploads.

# Connect Bridge to Server's ZeroTier IP
modelpulse bridge run http://10.147.17.100:8000 --benchmark

2. Tailscale

Standard virtual private networking:

# Get IP on Server
tailscale ip -4  # e.g., 100.66.170.100

# Connect Bridge
modelpulse bridge run http://100.66.170.100:8000

3. Cloudflare Tunnel

Good for public access, but note the 100 MB upload limit on the free tier, which may affect `modelpulse server upload` for large models.

# Connect Bridge to public tunnel URL
modelpulse bridge run https://modelpulse.your-domain.com
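Before uploading through a tunnel with a size cap, it can help to check which shard files exceed it. The sketch below is a hypothetical pre-flight helper, not part of ModelPulse. It assumes the limit applies per file, which depends on how the upload batches files into requests, and it uses a conservative decimal 100 MB.

```python
from pathlib import Path

# Assumed limit: 100 MB in decimal bytes, taken from the note above.
LIMIT_BYTES = 100 * 1000 * 1000


def oversized(shard_dir: Path, limit: int = LIMIT_BYTES) -> list[tuple[str, int]]:
    """Return (file name, size in bytes) for every file larger than the limit."""
    return sorted(
        (p.name, p.stat().st_size)
        for p in shard_dir.iterdir()
        if p.is_file() and p.stat().st_size > limit
    )
```

An empty result means no single file crosses the cap. If the upload sends several files in one multipart request, the combined size matters instead, so treat this as a lower bound on the check.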

Built with ❤️ for Edge AI and Decentralized Inference.

Release history (PyPI distribution testing-modelpulse)

0.3.5 (this version): Apr 26, 2026
0.3.4: Apr 26, 2026
