tsugiai-mend-sdk. Maximum-uplift cross-rack distributed-training reducer built on Decoupled DiLoCo + DES-LOC + async tensor parallelism + FALCON fail-slow mitigation. Patent-independent by deliberate construction; see LICENSE preamble.

Maintainers

tsugi

tsugi-mend

Maximum-uplift cross-rack distributed-training reducer for PyTorch.

A software-only drop-in that replaces the cross-rack data-parallel all-reduce with an integration of public-art techniques designed to maximize measured tokens-per-second uplift on realistic cross-rack topology:

Decoupled DiLoCo (Douillard et al., arXiv:2604.21428, April 2026): independent asynchronous learners, minimum quorum, adaptive grace window, token-weighted merge.
DES-LOC / Local Adam (Iacob et al., arXiv:2505.22549, May 2025; ICLR 2026): desynchronized synchronization periods for adaptive-optimizer momenta.
Async tensor parallelism (PyTorch / TorchTitan, September 2024): intra-node overlap of all-gather / reduce-scatter with matmul.
FALCON fail-slow mitigation (arXiv:2410.12588, October 2024): sliding-window z-score detection of slow nodes to exclude them from the current quorum round.
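The sliding-window z-score check named in the FALCON item can be sketched as below. This is a minimal illustration, not the SDK's implementation: the class name `FailSlowDetector`, the `observe` method, the `min_sigma_ms` floor, and the choice to score each node against its own recent step times are all assumptions made for the example. The defaults mirror the `failslow_zscore_threshold` and `failslow_window_steps` values shown in the Quickstart.

```python
from collections import deque
from statistics import mean, pstdev


class FailSlowDetector:
    """Hypothetical sliding-window z-score detector (not the tsugi_mend API)."""

    def __init__(self, window_steps=50, zscore_threshold=3.0, min_sigma_ms=1.0):
        self.window_steps = window_steps
        self.zscore_threshold = zscore_threshold
        # Assumed floor on the standard deviation so that a perfectly flat
        # history does not turn tiny jitter into an enormous z-score.
        self.min_sigma_ms = min_sigma_ms
        self._history = {}

    def observe(self, node, step_time_ms):
        """Record one step time; return True if it is a slow outlier."""
        window = self._history.setdefault(node, deque(maxlen=self.window_steps))
        is_slow = False
        if len(window) >= 2:  # need some history before scoring
            mu = mean(window)
            sigma = max(pstdev(window), self.min_sigma_ms)
            is_slow = (step_time_ms - mu) / sigma > self.zscore_threshold
        # The sample is appended even when flagged, so a persistent slowdown
        # eventually becomes the new baseline for that node.
        window.append(step_time_ms)
        return is_slow


detector = FailSlowDetector(window_steps=10)
for t in [100, 102, 98, 101, 99, 100]:
    detector.observe("node-0", t)
slow_now = detector.observe("node-0", 500)  # far above the recent window
```

A node flagged this way would be left out of the current quorum round; how the SDK re-admits it afterwards is not described here.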

The SDK keeps intra-rack TP / CP / PP / FSDP collectives unchanged (vanilla NCCL inside the NVLink domain) and replaces only the off-rack DP boundary with the four-paper stack above.
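To make the replaced off-rack step more concrete, here is a minimal sketch of a token-weighted merge gated by a minimum quorum. It is an illustration under assumptions, not the SDK's code: the function name `token_weighted_merge`, the `(tokens, delta)` input shape, and the use of plain Python lists instead of tensors are all hypothetical.

```python
def token_weighted_merge(updates, quorum_min_learners):
    """Average learner deltas, weighting each by the tokens it processed.

    updates: list of (tokens_processed, delta_vector) pairs from the learners
             that reported before the grace window closed.
    """
    if len(updates) < quorum_min_learners:
        raise RuntimeError(
            f"quorum not met: {len(updates)} < {quorum_min_learners}"
        )
    total_tokens = sum(tokens for tokens, _ in updates)
    if total_tokens <= 0:
        raise ValueError("total token count must be positive")

    dim = len(updates[0][1])
    merged = [0.0] * dim
    for tokens, delta in updates:
        weight = tokens / total_tokens  # learners that saw more data count more
        for i, value in enumerate(delta):
            merged[i] += weight * value
    return merged


# Two learners: the second processed three times as many tokens.
merged = token_weighted_merge(
    [(100, [1.0, 0.0]), (300, [0.0, 4.0])],
    quorum_min_learners=2,
)
```

Weighting by tokens rather than by learner count keeps a learner that completed fewer inner steps from pulling the merged update as hard as one that processed more data.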

Install

pip install tsugi-mend

Or install the unified surface that bundles this SDK with the companion patent-aligned SDK:

pip install tsugi   # exposes tsugi.mend and tsugi.kpool

For local development:

pip install -e ".[dev]"

License and IP posture

This SDK is licensed under Apache-2.0 with its full automatic patent grant. The SDK is patent-independent by deliberate construction: it does not exercise the K-Pool LoRA (US App. 64/060,315) or Infinity (US App. 64/055,093) patent estates that the companion SDK (tsugi-kpool, also Apache-2.0) does. Read the preamble at the top of the LICENSE file for the full posture explanation.

The companion patent-aligned SDK at github.com/tsugiai/tsugi-kpool is the software embodiment of those two TsugiCinema patent estates. The two SDKs share zero code.

Measurements

The numbers below are first-party internal benchmark measurements taken under the reproduction contract in docs/benchmark_protocol.md (same workload / checkpoint / hardware, baseline vs SDK, paired runs, bootstrap 95% CI). The raw per-run results logs are internal; the protocol document is the public reproduction pointer, and the headline cells below can be re-derived by anyone who runs the protocol on the stated hardware. We report point estimates with their 95% CI where available and flag single-seed (n=1) cells explicitly.
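As a rough illustration of how a paired-run bootstrap 95% CI can be computed, the sketch below resamples per-pair uplifts with replacement. It assumes a percentile bootstrap over per-seed uplift percentages; the actual procedure is defined in docs/benchmark_protocol.md and may differ. The function name and the throughput values are hypothetical.

```python
import random


def paired_uplift_bootstrap_ci(baseline_tps, sdk_tps, n_resamples=10_000, seed=0):
    """Return (point_estimate, ci_low, ci_high) for the mean uplift in percent."""
    if len(baseline_tps) != len(sdk_tps) or not baseline_tps:
        raise ValueError("need equally sized, non-empty paired samples")

    # One uplift per paired run: same workload and hardware, baseline vs SDK.
    uplifts = [100.0 * (s / b - 1.0) for b, s in zip(baseline_tps, sdk_tps)]
    n = len(uplifts)
    rng = random.Random(seed)

    # Percentile bootstrap: resample the pairs with replacement, take the mean.
    means = sorted(
        sum(rng.choices(uplifts, k=n)) / n for _ in range(n_resamples)
    )
    ci_low = means[int(0.025 * (n_resamples - 1))]
    ci_high = means[int(0.975 * (n_resamples - 1))]
    return sum(uplifts) / n, ci_low, ci_high


# Hypothetical tokens-per-second values for three paired seeds.
point, ci_low, ci_high = paired_uplift_bootstrap_ci(
    [100.0, 102.0, 98.0], [106.0, 109.0, 104.0]
)
```

With only three seeds the resampled means take few distinct values, so an interval computed this way is coarse.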

Production-grounded results

The most defensible headline is the real 2-node cross-network V100 measurement: actual off-rack traffic, not an injected delay, with bit-exact loss. The adjacent multi-GPU FSDP result is the realistic floor to keep beside that headline.

Workload	Hardware	Measurement
Real cross-network 2-node 8xV100 (synchronous reducer)	Lambda Labs commodity Ethernet, SmolLM-135M, 500 paired steps	+28.58% tokens-per-second uplift vs vanilla FSDP, bit-exact-identical loss
Production-realistic multi-GPU FSDP + 7B model (realistic floor, 3-seed CI)	Modal 8xH100 FSDP FULL_SHARD, Qwen-2.5-7B + simulated 2-rack, 4 delays × 3 seeds	+6.37% ± 1.31% at 2000ms (n=3)
H100 Hopper single-instance (synchronous reducer baseline)	Modal 8x H100 SXM5, Llama-3-8B, 2000 paired steps × 3 seeds	-0.97% ± 1.5% (predicted null; Hopper NVLink absorbs the synchronous-path cross-rack tax)
Real cross-network 2-pod 8xH100 (production-fabric floor; n=1)	RunPod 2x 8x H100 SXM5 over real InfiniBand / RoCE v2 3.2 Tbps, Llama-3-8B, 500 paired steps × 1 seed	+1.42% tokens-per-second uplift with a 0.18% loss delta (near-identical loss, not strictly bit-exact); the n=1 caveat is load-bearing because the uplift is of the same order of magnitude as baseline-only seed variance; n=3 CI pending

How to read the production-grounded numbers honestly:

The +28.58% real cross-network V100 result is the headline production-grounded number: it is a real 2-node cross-Ethernet measurement with bit-exact loss.
Production-realistic multi-GPU FSDP yields a smaller honest floor (+6.37% ± 1.31%, n=3) because 8-rank NCCL pipelining absorbs some of the simulated delay.
The real-fabric Hopper 2-pod InfiniBand / RoCE result remains a point estimate: n=1 caveat is load-bearing and the n=3 CI is pending.

Ceiling-case / simulated-delay results

Every cell in this subsection uses an injected simulated grace-window delay on a single instance or simulated two-rack setup, not a real cross-network measurement. These are ceiling-case stress tests for the overlap mechanism rather than production numbers.

Workload	Hardware	Measurement
Statistical-confidence ceiling case (Hopper 3-seed CI)	Modal H100:1, Qwen-2.5-1.5B, 200 steps × 3 seeds at 2000ms grace window	+71.49% ± 2.83% (95% CI, n=3) throughput uplift
Cross-rack grace-window overlap on Hopper at 7B scale	Modal H100:1, Qwen-2.5-7B, 200 steps × 5 delays	+76.58% at 2000ms; +39.72% at 1000ms; +19.20% at 500ms
Intermediate model-scale (3B) confirmation (Hopper 3-seed CI)	Modal H100:1, Qwen-2.5-3B, 200 steps × 4 delays × 3 seeds	+41.31% ± 0.29% at 2000ms (n=3); +20.40% ± 0.03% at 1000ms; +9.75% ± 0.04% at 500ms
Cross-rack grace-window overlap at 1.5B scale	Modal H100:1, Qwen-2.5-1.5B, 200 steps × 7 delays	+70.64% at 2000ms; +34.73% at 1000ms; +16.86% at 500ms
Cross-rack grace-window overlap on A10G	Modal A10G, SmolLM-135M, 200 steps × 7 delays	+52.75% at 2000ms (constant); +11.61% at 500ms; -0.06% overhead at 0ms; bit-exact loss preserved across all cells

How to read the ceiling-case numbers honestly:

The +71.49% ± 2.83% (n=3) Hopper result is a single-instance measurement with an injected simulated grace-window delay, not a real cross-network result. Read it as a ceiling case for the overlap mechanism.
The orchestrator's uplift is governed by the ratio N · T_step / G, where N is the sync period in steps, T_step is the per-step compute time, and G is the grace-window delay in ms. The apparent non-monotonicity with model size (Qwen-3B at +41.31% sits below both Qwen-1.5B and Qwen-7B) arises because the Qwen-7B measurement used 1/8 the tokens per step (seq_len 1024 / mbs 1 vs 2048 / 4); at fixed tokens per step, uplift decreases monotonically with model size.
Constant-delay headlines (e.g. +52.75% A10G at 2000ms) are ceiling-case stress tests. The FALCON paper documents cross-rack inter-node RDMA variance (CoV=0.29) but does not characterize the per-iteration latency distribution shape; the delay sweep is a stress test, not a literal FALCON replay.

At every scale, the concurrent path's throughput stays nearly constant across delays (Qwen-7B single-process: 4,300 ± 80 tok/s; Qwen-3B Hopper: 18,153 ± 4 tok/s; Qwen-1.5B Hopper: 30,840 ± 35 tok/s; SmolLM-135M A10G: 23,610 ± 80 tok/s), while the synchronous baseline's throughput degrades as the delay increases.
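One way to read the N · T_step / G relationship is a simple overlap model, assumed here for illustration rather than taken from the SDK. Per sync period, the synchronous path pays the compute time plus the delay, N · T_step + G, while the concurrent path pays only the larger of the two, max(N · T_step, G). Under that assumption the uplift is G / (N · T_step) whenever the delay fits inside the compute window, which matches the roughly proportional scaling with delay seen in the tables. The function name and the input values below are hypothetical.

```python
def modeled_uplift_pct(sync_period_steps, step_time_ms, grace_window_ms):
    """Throughput uplift (percent) of full overlap vs a blocking wait.

    Assumed model, not the SDK's: the same tokens are processed per sync
    period on both paths, so uplift is the ratio of wall times minus one.
    """
    compute_ms = sync_period_steps * step_time_ms      # N * T_step
    sync_ms = compute_ms + grace_window_ms             # blocking baseline
    concurrent_ms = max(compute_ms, grace_window_ms)   # wait hidden behind compute
    return 100.0 * (sync_ms / concurrent_ms - 1.0)


# Hypothetical inputs: 128 steps of 31.25 ms each gives N * T_step = 4000 ms.
for delay_ms in (0, 500, 1000, 2000):
    print(delay_ms, modeled_uplift_pct(128, 31.25, delay_ms))
```

In this model, a larger model at fixed tokens per step raises T_step and therefore lowers the uplift for the same delay, which is the monotonic trend described above.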

Run it multi-node

See docs/multinode.md for the multi-node launch walkthrough.

Status

Pre-Alpha (0.1.1). APIs are stabilizing and may change before v1.0. Published to PyPI as tsugi-mend; also reachable through the unified tsugi meta-package as tsugi.mend. All validation stages (from Stage A unit/integration tests through the cross-network production-fabric runs) passed under the protocol above. The real-fabric Hopper cross-network result is so far a single-seed point estimate (n=1); an n=3 CI is pending.

Quickstart

from tsugi_mend import MendConfig, mend_init, mend_shutdown

config = MendConfig(
    quorum_min_learners=4,
    grace_window_ms=2000,
    token_weighted_merge=True,
    sync_period_steps=128,
    momentum_sync_period_steps=512,
    async_tp_enabled=True,
    # Orchestrator overlaps the cross-rack outer-step wait with inner-step
    # async-TP compute. Default True.
    concurrent_outer_step=True,
    failslow_zscore_threshold=3.0,
    failslow_window_steps=50,
    rack_aware=True,
    sideband_addr="tcp://0.0.0.0:51900",
    sideband_peers=("tcp://peer1:51900", "tcp://peer2:51900"),
    sideband_heartbeat_ms=100,
    diagnostics_dir="./results/mend_diag",
)

# `model` is your existing torch.nn.Module, constructed before this call.
mend_init(model, config)
# ... train normally ...
mend_shutdown(model)
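If the training loop can raise, it may be worth pairing the init and shutdown calls so that shutdown always runs. The wrapper below is a hypothetical convenience, not part of tsugi_mend: `mend_session` is an invented name, and it takes the init and shutdown functions as arguments so the sketch does not assume anything about the SDK beyond the two calls shown in the Quickstart. Whether `mend_shutdown` is safe to call after a failed step is not stated here, so treat this as a pattern to verify.

```python
from contextlib import contextmanager


@contextmanager
def mend_session(model, config, init_fn, shutdown_fn):
    """Call init_fn(model, config) on entry and shutdown_fn(model) on exit."""
    init_fn(model, config)
    try:
        yield model
    finally:
        # Runs on normal exit and when the training loop raises.
        shutdown_fn(model)
```

Usage would look like `with mend_session(model, config, mend_init, mend_shutdown): ...` with the training loop inside the `with` block.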

Two runnable, CPU-only integration examples (no GPU or multi-node required):

examples/minimal_single_process.py - smallest end-to-end use on a toy nn.Module.
examples/concurrent_orchestrator.py - wiring the ConcurrentOuterStep orchestrator into a training loop with a synthetic single-rank fragment provider.

python examples/minimal_single_process.py
python examples/concurrent_orchestrator.py

Layout

src/tsugi_mend/   SDK source
tests/            unit and integration tests (CPU-only)
docs/             architecture, benchmark protocol, convergence-equivalence sketch
examples/         minimal CPU-only training-loop integration examples

Companion SDK

For LoRA-adapter-granularity productization that exercises the K-Pool LoRA and Infinity patent estates, see tsugi-kpool. The two SDKs share zero code and can be installed and used independently, or together via the unified tsugi meta-package.

Release history

0.1.2 (May 29, 2026)
0.1.1 (May 27, 2026), this version
0.1.0 (May 27, 2026)

Download files

Source Distribution

tsugi_mend-0.1.1.tar.gz (87.2 kB)

Uploaded May 27, 2026 Source

Built Distribution

tsugi_mend-0.1.1-py3-none-any.whl (55.5 kB)

Uploaded May 27, 2026 Python 3

File details

Details for the file tsugi_mend-0.1.1.tar.gz.

File metadata

Download URL: tsugi_mend-0.1.1.tar.gz
Upload date: May 27, 2026
Size: 87.2 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.13

File hashes

Hashes for tsugi_mend-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`d5c710da5774a96775f91ab3209bc5266f8e4d1b1847d042aded9e6102aacefc`
MD5	`761acd0bd8c4caef0a54933053fa5b47`
BLAKE2b-256	`765fc3516a7925558c95384f202b1f62cca35ab8fcaf65ff0ae796d05d44ff08`
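A downloaded file can be checked against the SHA256 digest above with the standard library. The helper name `sha256_of_file` and the local file path are assumptions for the example; the expected digest is the one listed in the table.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


EXPECTED_SDIST_SHA256 = (
    "d5c710da5774a96775f91ab3209bc5266f8e4d1b1847d042aded9e6102aacefc"
)

# Example (assumes the sdist was downloaded to the current directory):
# assert sha256_of_file("tsugi_mend-0.1.1.tar.gz") == EXPECTED_SDIST_SHA256
```

pip can enforce the same check at install time through a requirements file with `--hash=sha256:...` entries and the `--require-hashes` option.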

Provenance

The following attestation bundles were made for tsugi_mend-0.1.1.tar.gz:

Publisher: release.yml on tsugiai/tsugi-mend

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: tsugi_mend-0.1.1.tar.gz
- Subject digest: d5c710da5774a96775f91ab3209bc5266f8e4d1b1847d042aded9e6102aacefc
- Sigstore transparency entry: 1647070003
- Sigstore integration time: May 27, 2026
Source repository:
- Permalink: tsugiai/tsugi-mend@caf5221d895777b25dbf399fb23844f8162c44bf
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/tsugiai
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@caf5221d895777b25dbf399fb23844f8162c44bf
- Trigger Event: release

File details

Details for the file tsugi_mend-0.1.1-py3-none-any.whl.

File metadata

Download URL: tsugi_mend-0.1.1-py3-none-any.whl
Upload date: May 27, 2026
Size: 55.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.13

File hashes

Hashes for tsugi_mend-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`5fd612beb0a9eb280b26e4146db661bd693f033883878a9f43d39fb461c23bb1`
MD5	`4ad2dae4d34e31b79c95d267682424d3`
BLAKE2b-256	`b00392a52dcffeeb221cad557171784c8309defc94f08750ff186c0aa8bc5e8a`

Provenance

The following attestation bundles were made for tsugi_mend-0.1.1-py3-none-any.whl:

Publisher: release.yml on tsugiai/tsugi-mend

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: tsugi_mend-0.1.1-py3-none-any.whl
- Subject digest: 5fd612beb0a9eb280b26e4146db661bd693f033883878a9f43d39fb461c23bb1
- Sigstore transparency entry: 1647070101
- Sigstore integration time: May 27, 2026
Source repository:
- Permalink: tsugiai/tsugi-mend@caf5221d895777b25dbf399fb23844f8162c44bf
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/tsugiai
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@caf5221d895777b25dbf399fb23844f8162c44bf
- Trigger Event: release
