tsugiai-mend-sdk. Maximum-uplift cross-rack distributed-training reducer built on Decoupled DiLoCo + DES-LOC + async tensor parallelism + FALCON fail-slow mitigation. Patent-independent by deliberate construction; see LICENSE preamble.
tsugiai-mend-sdk
Maximum-uplift cross-rack distributed-training reducer for PyTorch.
A software-only drop-in that replaces the cross-rack data-parallel all-reduce with an integration of public-art techniques designed to maximize measured tokens-per-second uplift on realistic cross-rack topology:
- Decoupled DiLoCo (Douillard et al., arXiv:2604.21428, April 2026): independent asynchronous learners, minimum quorum, adaptive grace window, token-weighted merge.
- DES-LOC / Local Adam (Iacob et al., arXiv:2505.22549, May 2025; ICLR 2026): desynchronized synchronization periods for adaptive-optimizer momenta.
- Async tensor parallelism (PyTorch / TorchTitan, September 2024): intra-node overlap of all-gather / reduce-scatter with matmul.
- FALCON fail-slow mitigation (arXiv:2410.12588, October 2024): sliding-window z-score detection of slow nodes to exclude them from the current quorum round.
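The sliding-window z-score idea in the last bullet can be sketched in a few lines. This is a minimal illustration, not the FALCON algorithm verbatim and not the SDK's internal detector: the class name `SlowNodeDetector`, the `min_samples` warm-up, and the choice to keep flagged samples out of the baseline window are all assumptions made for the example. The defaults mirror the `failslow_window_steps` and `failslow_zscore_threshold` values shown in the Quickstart.

```python
from collections import deque
from statistics import mean, pstdev


class SlowNodeDetector:
    """Hypothetical per-node detector: flags a step time that sits far above
    the node's own recent history (sliding-window z-score)."""

    def __init__(self, window_steps=50, zscore_threshold=3.0, min_samples=5):
        self.window = deque(maxlen=window_steps)  # recent healthy step times (ms)
        self.zscore_threshold = zscore_threshold
        self.min_samples = min_samples            # assumed warm-up before flagging

    def observe(self, step_time_ms):
        """Record one step time; return True if it looks like a fail-slow outlier."""
        is_slow = False
        if len(self.window) >= self.min_samples:
            mu = mean(self.window)
            sigma = pstdev(self.window)
            # With zero variance the z-score is undefined, so nothing is flagged.
            if sigma > 0:
                is_slow = (step_time_ms - mu) / sigma > self.zscore_threshold
        # Assumption: outliers are kept out of the window so that a slow
        # episode does not inflate the baseline it is compared against.
        if not is_slow:
            self.window.append(step_time_ms)
        return is_slow


detector = SlowNodeDetector(window_steps=50, zscore_threshold=3.0)
for t in [100, 102, 98, 101, 99, 100, 103, 97]:
    detector.observe(t)          # normal jitter, nothing flagged
print(detector.observe(150.0))   # a 150 ms step against a ~100 ms baseline
```

A node flagged this way would simply be left out of the current quorum round; how the SDK re-admits it afterwards is not covered by this sketch.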
The SDK keeps intra-rack TP / CP / PP / FSDP collectives unchanged (vanilla NCCL inside the NVLink domain) and replaces only the off-rack DP boundary with the four-technique stack above.
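To make the Decoupled DiLoCo ingredients listed above (minimum quorum, grace window, token-weighted merge) concrete, here is a small self-contained sketch. It is an illustration under stated assumptions, not the SDK's reducer: `LearnerUpdate`, `select_quorum`, and `token_weighted_merge` are hypothetical names, updates are plain Python lists rather than tensors, and the fallback rule (if too few learners arrive inside the grace window, wait for the earliest arrivals until the quorum is met) is one plausible reading rather than the documented behavior.

```python
from dataclasses import dataclass


@dataclass
class LearnerUpdate:
    learner_id: str
    arrival_ms: float   # when the update reached the reducer, relative to round start
    tokens: int         # tokens the learner processed since the last merge
    delta: list         # flattened parameter delta (plain floats for illustration)


def select_quorum(updates, quorum_min, grace_window_ms):
    """Keep updates that arrived within the grace window; if that is fewer
    than quorum_min, fall back to the quorum_min earliest arrivals."""
    ordered = sorted(updates, key=lambda u: u.arrival_ms)
    if len(ordered) < quorum_min:
        raise ValueError("not enough learners to form a quorum")
    on_time = [u for u in ordered if u.arrival_ms <= grace_window_ms]
    return on_time if len(on_time) >= quorum_min else ordered[:quorum_min]


def token_weighted_merge(updates):
    """Average deltas, weighting each learner by its share of processed tokens."""
    total_tokens = sum(u.tokens for u in updates)
    if total_tokens <= 0:
        raise ValueError("total token count must be positive")
    merged = [0.0] * len(updates[0].delta)
    for u in updates:
        weight = u.tokens / total_tokens
        for i, value in enumerate(u.delta):
            merged[i] += weight * value
    return merged


updates = [
    LearnerUpdate("rack-a", 100.0, 3000, [1.0, 0.0]),
    LearnerUpdate("rack-b", 500.0, 1000, [0.0, 4.0]),
    LearnerUpdate("rack-c", 5000.0, 4000, [9.0, 9.0]),  # misses the 2000 ms window
]
quorum = select_quorum(updates, quorum_min=2, grace_window_ms=2000.0)
print(token_weighted_merge(quorum))  # rack-c is excluded from this round
```

Weighting by tokens rather than by learner count means a learner that completed fewer inner steps before the merge contributes proportionally less.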
License and IP posture
This SDK is licensed under Apache-2.0 with its full automatic patent grant. The SDK is patent-independent by deliberate construction: it does not exercise the K-Pool LoRA (US App. 64/060,315) or Infinity (US App. 64/055,093) patent estates that the companion SDK (tsugiai-kpool-sdk, also Apache-2.0) does. Read the preamble at the top of the LICENSE file for the full posture explanation.
The companion patent-aligned SDK at github.com/tsugiai/tsugi-kpool is the reduction-to-practice artifact for those two TsugiCinema patent estates. The two SDKs share zero code.
Headline measurement
| Workload | Hardware | Measurement | Reference |
|---|---|---|---|
| Statistical-confidence headline (Hopper 3-seed CI) | Modal H100:1, Qwen-2.5-1.5B, 200 steps × 3 seeds at 2000ms | +71.49% ± 2.83% (95% CI) throughput uplift | docs/phase2_multi_seed_ci_results.md |
| Cross-rack grace-window overlap on Hopper at 7B production scale | Modal H100:1, Qwen-2.5-7B, 200 steps × 5 delays | +76.58% at 2000ms; +39.72% at 1000ms; +19.20% at 500ms | docs/phase2_delay_sweep_qwen7b_results.md |
| Intermediate model-scale (3B) confirmation (Hopper 3-seed CI) | Modal H100:1, Qwen-2.5-3B, 200 steps × 4 delays × 3 seeds | +41.31% ± 0.29% at 2000ms (n=3); +20.40% ± 0.03% at 1000ms; +9.75% ± 0.04% at 500ms | Continuation doc 2026-05-23 |
| Cross-rack grace-window overlap on Hopper at 1.5B model scale | Modal H100:1, Qwen-2.5-1.5B, 200 steps × 7 delays | +70.64% at 2000ms; +34.73% at 1000ms; +16.86% at 500ms; concurrent path constant ~30,840 tok/s across all delays | docs/phase2_delay_sweep_results.md |
| Cross-rack grace-window overlap on A10G | Modal A10G, SmolLM-135M, 200 steps × 7 delays | +52.75% at 2000ms (constant); +11.61% at 500ms; -0.06% overhead at 0ms; bit-exact loss preserved across all cells | docs/phase2_delay_sweep_results.md |
| A10G delay-distribution stress-test sweep | Modal A10G, SmolLM-135M, 200 steps × 4 delays × 3 distributions | constant +52.53% / bimodal-stress +8.29% / long-tail-stress +19.44% at delay=2000ms (parameters NOT FALCON-anchored; see briefs 06/07/08 in MasterVision) | Continuation doc 2026-05-23 |
| Production-realistic multi-GPU FSDP + 7B model (3-seed CI) | Modal 8xH100 FSDP FULL_SHARD, Qwen-2.5-7B + simulated 2-rack, 4 delays × 3 seeds | +6.37% ± 1.31% at delay=2000ms (n=3; replaces the prior +7.36% single-seed) | Continuation doc 2026-05-23 |
| Multi-GPU FSDP smaller model | Modal 8xH100 FSDP, Qwen-2.5-1.5B + simulated 2-rack, 7 delays | +3.08% at 2000ms (synthetic 8xGPU floor) | docs/stage_c_phase2_delay_sweep_results.md |
| Real cross-network 2-node 8xV100 (synchronous reducer; preserved baseline) | Lambda Labs commodity Ethernet, SmolLM-135M, 500 paired steps | +28.58% tokens-per-second uplift vs vanilla FSDP, bit-exact-identical loss | docs/stage_d_proper_results.md |
| H100 Hopper single-instance (synchronous reducer baseline) | Modal 8x H100 SXM5, Llama-3-8B, 2000 paired steps × 3 seeds | -0.97% ± 1.5% (predicted null; Hopper NVLink absorbs the synchronous-path cross-rack tax) | docs/stage_e_results.md |
| H100 Hopper single-instance + concurrent orchestrator (Track A, 3-seed CI) | Modal 8x H100 SXM5, Llama-3-8B, 200 paired steps × 3 seeds at delay=2000ms | +2.88% ± 0.43% (n=3) throughput uplift; loss equivalence preserved (sync 0.288, conc 0.278) | Continuation doc 2026-05-23 |
| Item E DeepSpeed ZeRO-3 head-to-head measured orthogonality | Modal 8x H100 SXM5, Qwen-2.5-7B + DeepSpeed ZeRO-3 (stage=3, overlap_comm=True), 200 paired steps × 3 seeds at delay=2000ms | +29.6% concurrent vs sync (n=3 paired; per-seed jitter <1pp); concurrent +0.44% vs no-delay baseline. DeepSpeed's intra-iteration overlap cannot recover the outer-step wait; Mend's outer-step concurrent_outer_step recovers essentially all of it | (internal benchmark report) |
| Real cross-network 2-pod 8xH100 (Stage E-prime; production-fabric floor; n=1 paired) | RunPod 2x 8x H100 SXM5 over real InfiniBand or RoCE v2 3.2 Tbps (AP-IN-1), Llama-3-8B, 500 paired steps × 1 seed | +1.42% tokens-per-second uplift with a 0.18% loss delta (loss-equivalent, not bit-exact); production-fabric floor anchor; the n=1 caveat is load-bearing (the uplift is the same order of magnitude as the +1.40% baseline-only seed variance); n=3 CI pending | docs/stage_e_prime_results.md |
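For readers who want to reproduce the style of the "± (95% CI)" entries in the table, the sketch below shows one common way to turn paired per-seed throughputs into an uplift with a confidence half-width. It assumes a two-sided Student's t interval over per-seed uplifts; the README does not state which method the benchmark reports use, so treat this as an assumption. The throughput pairs are hypothetical placeholders, not measured data, and `uplift_pct` / `ci95` are illustrative names.

```python
import math
from statistics import mean, stdev

# Two-sided 95% Student's t critical values, keyed by sample count n (df = n - 1).
T_CRIT_95 = {2: 12.706, 3: 4.303, 4: 3.182, 5: 2.776}


def uplift_pct(sync_tps, concurrent_tps):
    """Percentage throughput uplift of the concurrent path over the sync path."""
    if sync_tps <= 0:
        raise ValueError("sync throughput must be positive")
    return (concurrent_tps / sync_tps - 1.0) * 100.0


def ci95(samples):
    """Return (mean, half_width) of a 95% t-interval for a small sample."""
    n = len(samples)
    if n not in T_CRIT_95:
        raise ValueError("sketch only tabulates n = 2..5")
    half_width = T_CRIT_95[n] * stdev(samples) / math.sqrt(n)
    return mean(samples), half_width


# Hypothetical (sync tok/s, concurrent tok/s) pairs, one per seed.
paired_tps = [(18000.0, 30600.0), (18000.0, 30780.0), (18000.0, 30960.0)]
uplifts = [uplift_pct(s, c) for s, c in paired_tps]
center, half = ci95(uplifts)
print(f"{center:+.2f}% ± {half:.2f}% (95% CI, n={len(uplifts)})")
```

With only three seeds the t critical value is large (4.303), so the half-width is roughly 2.5 times the per-seed standard deviation; this is why n=1 rows in the table carry no interval at all.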
The +71.49% ± 2.83% (95% CI; n=3) Hopper headline is the canonical enterprise cross-rack DD-grade measurement. The orchestrator's uplift is governed by the ratio N * T_step / G, where N is sync_period_steps, T_step is the per-step compute time, and G is grace_window_ms. The apparent "non-monotonicity with model size" at H100:1 (Qwen-3B at +41.31% sits below both Qwen-1.5B at +71.49% and Qwen-7B at +76.58%) is fully explained by the Qwen-7B measurement using 1/8 the tokens-per-step (seq_len=1024, mbs=1 vs seq_len=2048, mbs=4). At fixed tokens-per-step, uplift is monotonically decreasing in model size. The analytical model and the proposed N* = ceil(G / T_step) auto-tuner spec are documented in an internal companion brief on uplift-surface characterization.
Production-realistic multi-GPU FSDP yields a smaller honest floor (+6.37% ± 1.31%, n=3 paired, at delay=2000ms; replaces the prior +7.36% single-seed) because 8-rank NCCL pipelining absorbs some of the synthetic delay; a real Stage D-proper measurement over an actual cross-rack network is expected to lift this number back up.
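The N * T_step / G relationship and the proposed N* = ceil(G / T_step) rule can be explored with a toy model. The sketch below is an idealization written for this README, not the analytical model from the internal brief: it assumes a synchronous round costs N * T_step + G (compute, then wait) while a concurrent round costs max(N * T_step, G) (the wait is hidden behind compute). Function names and the 250 ms step time are hypothetical.

```python
import math


def optimal_sync_period(grace_window_ms, step_time_ms):
    """N* = ceil(G / T_step): the smallest number of inner steps whose
    compute time covers the grace window."""
    if grace_window_ms < 0 or step_time_ms <= 0:
        raise ValueError("need G >= 0 and T_step > 0")
    return max(1, math.ceil(grace_window_ms / step_time_ms))


def idealized_uplift(n_steps, step_time_ms, grace_window_ms):
    """Fractional throughput uplift of the concurrent path under the
    assumed round-time model described above."""
    compute_ms = n_steps * step_time_ms
    sync_round_ms = compute_ms + grace_window_ms            # wait after compute
    concurrent_round_ms = max(compute_ms, grace_window_ms)  # wait overlapped
    return sync_round_ms / concurrent_round_ms - 1.0


def tokens_per_step(seq_len, micro_batch_size):
    return seq_len * micro_batch_size


G_MS, T_STEP_MS = 2000.0, 250.0                # hypothetical step time
n_star = optimal_sync_period(G_MS, T_STEP_MS)
print(n_star)                                   # 8
print(idealized_uplift(n_star, T_STEP_MS, G_MS))      # 1.0 (wait exactly hidden)
print(idealized_uplift(2 * n_star, T_STEP_MS, G_MS))  # 0.5 (uplift = G / (N * T_step))

# The tokens-per-step ratio quoted for the Qwen-7B vs Qwen-1.5B/3B runs:
print(tokens_per_step(1024, 1) / tokens_per_step(2048, 4))  # 0.125
```

In this model, once N * T_step is at least G the uplift reduces to G / (N * T_step), so halving the delay halves the uplift. The single-GPU rows in the table (for example +70.64%, +34.73%, +16.86% at 2000, 1000, 500 ms) follow roughly that pattern, although the sketch makes no claim about how the measured runs were configured.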
Delay-distribution stress-test disclosure (Track D 2026-05-23): the constant-delay headlines (e.g., +52.53% A10G at delay=2000ms) are ceiling-case stress tests. The bimodal and log-normal variants are alternative stress-test shapes whose parameters are NOT directly FALCON-anchored: FALCON Table 2 reports only inter-node RDMA CoV=0.29 as the quantitative variance number; it does NOT publish per-iteration percentile breakdowns or bimodal characterizations. The current bimodal (80/20 at 50ms/base) delivers +8.29% on the same A10G + SmolLM-135M workload; the current long_tail (sigma=1.0) delivers +19.44%. A FALCON-CoV-anchored re-tune (95/5 bimodal, sigma~0.285 log-normal) is proposed in an internal FALCON-distribution-verification brief. Until that re-measurement lands, the +28.58% Stage D-proper Lambda V100 cross-network result remains the most defensible production-grounded headline.
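The link between the log-normal sigma values and the FALCON CoV figure mentioned above can be checked directly, because a log-normal distribution's coefficient of variation depends only on sigma: CoV = sqrt(exp(sigma^2) - 1). The sketch below computes it, adds a generic two-point (bimodal) CoV helper, and includes a seeded sampler. The function names are illustrative, and the two-point example values are hypothetical; the README does not spell out the exact parameterization of the bimodal sweep.

```python
import math
import random


def lognormal_cov(sigma):
    """Coefficient of variation of a log-normal distribution (independent of mu)."""
    return math.sqrt(math.exp(sigma ** 2) - 1.0)


def two_point_cov(p_slow, fast_ms, slow_ms):
    """CoV of a bimodal delay that is slow_ms with probability p_slow, else fast_ms."""
    mean_ms = p_slow * slow_ms + (1.0 - p_slow) * fast_ms
    var = p_slow * (slow_ms - mean_ms) ** 2 + (1.0 - p_slow) * (fast_ms - mean_ms) ** 2
    return math.sqrt(var) / mean_ms


def sample_lognormal_delays(median_ms, sigma, n, seed=0):
    """Draw n delays from a log-normal whose median is median_ms."""
    rng = random.Random(seed)
    return [rng.lognormvariate(math.log(median_ms), sigma) for _ in range(n)]


print(round(lognormal_cov(0.285), 3))  # 0.291, close to the CoV=0.29 cited from FALCON
print(round(lognormal_cov(1.0), 3))    # 1.311, far heavier-tailed than CoV=0.29

# Hypothetical two-point example: 20% of rounds at 2000 ms, 80% at 50 ms.
print(round(two_point_cov(0.2, 50.0, 2000.0), 3))  # 1.773

delays = sample_lognormal_delays(median_ms=2000.0, sigma=0.285, n=5)
```

Under these formulas the sigma=1.0 long-tail setting has a CoV of about 1.31, more than four times the 0.29 figure, which illustrates why the current sweep is described as a stress test rather than a FALCON replay.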
At every scale, the concurrent path's throughput stays nearly constant across all delays (Qwen-7B single-process: 4,300 ± 80 tok/s; Qwen-3B Hopper: 18,153 ± 4 tok/s; Qwen-1.5B Hopper: 30,840 ± 35 tok/s; SmolLM-135M A10G: 23,610 ± 80 tok/s), while the synchronous baseline degrades as the injected delay grows. The FALCON paper (arXiv:2410.12588) documents cross-rack inter-node RDMA variance (CoV=0.29) but does not characterize the per-iteration latency distribution shape that this sweep parameterizes; the delay sweep is a stress test, not a literal FALCON replay.
Status
Stage A through Stage E-prime all PASS. Phase 2 Week 1 (concurrent async-TP overlap with cross-rack reducer) shipped with the ConcurrentOuterStep orchestrator integrated. Stage D-proper for Hopper-cross-network real-fabric is point-estimate closed via Stage E-prime (RunPod 2x 8x H100 InfiniBand 3.2 Tbps, n=1 paired, +1.42% production-fabric floor); n=3 CI pending. Item E head-to-head against DeepSpeed ZeRO-3 confirms orthogonality at +29.6% concurrent vs sync on a Tier-1 hyperscaler stack. See docs/60_day_plan.md for the Phase 2 nine-week sprint.
Quickstart
from tsugi_mend import MendConfig, mend_init, mend_shutdown

config = MendConfig(
    quorum_min_learners=4,
    grace_window_ms=2000,
    token_weighted_merge=True,
    sync_period_steps=128,
    momentum_sync_period_steps=512,
    async_tp_enabled=True,
    # Phase 2 Week 1 (2026-05-22): orchestrator overlaps the cross-rack
    # outer-step wait with inner-step async-TP compute. Default True.
    # See docs/phase2_delay_sweep_results.md for the +52.75% measurement.
    concurrent_outer_step=True,
    failslow_zscore_threshold=3.0,
    failslow_window_steps=50,
    rack_aware=True,
    sideband_addr="tcp://0.0.0.0:51900",
    sideband_peers=("tcp://peer1:51900", "tcp://peer2:51900"),
    sideband_heartbeat_ms=100,
    diagnostics_dir="./results/mend_diag",
)

# `model` is your existing PyTorch model.
mend_init(model, config)
# ... train normally ...
mend_shutdown(model)
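The two period settings in the config above (`sync_period_steps=128`, `momentum_sync_period_steps=512`) reflect the DES-LOC idea of synchronizing parameters and optimizer momenta on different schedules. The sketch below shows what such a schedule looks like step by step. It is a hypothetical illustration: `sync_actions` is not an SDK function, and the simple modulo rule is an assumption about how the two periods interact.

```python
def sync_actions(step, sync_period_steps=128, momentum_sync_period_steps=512):
    """Return which cross-rack synchronizations would fire at a given step
    under a plain modulo schedule (assumed, not taken from the SDK)."""
    actions = []
    if step > 0 and step % sync_period_steps == 0:
        actions.append("params")
    if step > 0 and step % momentum_sync_period_steps == 0:
        actions.append("momenta")
    return actions


# Count cross-rack synchronizations over the first 1024 steps.
param_syncs = sum("params" in sync_actions(s) for s in range(1, 1025))
momentum_syncs = sum("momenta" in sync_actions(s) for s in range(1, 1025))
print(param_syncs, momentum_syncs)  # 8 2
```

With these values the momenta cross the rack boundary a quarter as often as the parameters, which is where the communication saving for adaptive-optimizer state comes from.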
Layout
src/tsugi_mend/ SDK source
tests/ Stage A unit and integration tests (CPU-only)
benchmarks/ Stage B/C/D/E launch scripts (cloud-gated)
docs/ architecture, benchmark protocol, convergence-equivalence sketch
examples/ minimal training-loop integration examples
scripts/ utility scripts (env audit, cost estimator)
Companion SDK
For LoRA-adapter-granularity productization that exercises the K-Pool LoRA and Infinity patent estates, see tsugiai-kpool-sdk. The two SDKs serve different acquirer-due-diligence legs:
- Patent moat leg (kpool): the IP that goes into the Definitive Agreement's assignment schedule.
- Operational uplift leg (mend, this SDK): the engineering artifact a partner can run Monday morning on their cluster.
Both legs are independent. Either can stand on its own.