tsugiai-mend-sdk. Maximum-uplift cross-rack distributed-training reducer built on Decoupled DiLoCo + DES-LOC + async tensor parallelism + FALCON fail-slow mitigation. Patent-independent by deliberate construction; see LICENSE preamble.

tsugiai-mend-sdk

Maximum-uplift cross-rack distributed-training reducer for PyTorch.

A software-only drop-in that replaces the cross-rack data-parallel all-reduce with an integration of public-art techniques designed to maximize measured tokens-per-second uplift on realistic cross-rack topology:

  • Decoupled DiLoCo (Douillard et al., arXiv:2604.21428, April 2026): independent asynchronous learners, minimum quorum, adaptive grace window, token-weighted merge.
  • DES-LOC / Local Adam (Iacob et al., arXiv:2505.22549, May 2025; ICLR 2026): desynchronized synchronization periods for adaptive-optimizer momenta.
  • Async tensor parallelism (PyTorch / TorchTitan, September 2024): intra-node overlap of all-gather / reduce-scatter with matmul.
  • FALCON fail-slow mitigation (arXiv:2410.12588, October 2024): sliding-window z-score detection of slow nodes to exclude them from the current quorum round.
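As an illustration of the fail-slow bullet above, a sliding-window z-score check can be sketched in a few lines. This is a minimal sketch, not the FALCON paper's exact algorithm and not the SDK's internal code: the class name `SlowNodeDetector`, its method, and the choice to keep flagged samples out of the window are all assumptions made for the example. The defaults mirror the `failslow_window_steps=50` and `failslow_zscore_threshold=3.0` values shown in the Quickstart.

```python
from collections import deque
from statistics import mean, pstdev


class SlowNodeDetector:
    """Hypothetical sliding-window z-score check on one node's step times."""

    def __init__(self, window_steps: int = 50, zscore_threshold: float = 3.0):
        # Bounded history: the oldest sample drops out once the window is full.
        self.window = deque(maxlen=window_steps)
        self.threshold = zscore_threshold

    def observe(self, step_ms: float) -> bool:
        """Record one step duration; return True if it looks anomalously slow."""
        slow = False
        if len(self.window) >= 2:
            mu = mean(self.window)
            sigma = pstdev(self.window)
            # With zero spread the z-score is undefined, so nothing is flagged.
            if sigma > 0:
                slow = (step_ms - mu) / sigma > self.threshold
        # Assumption: flagged samples are kept out of the window so that a
        # persistent slowdown does not become the new baseline.
        if not slow:
            self.window.append(step_ms)
        return slow
```

A caller would keep one detector per node and drop any node whose latest `observe(...)` returned True from the current quorum round.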

The SDK keeps intra-rack TP / CP / PP / FSDP collectives unchanged (vanilla NCCL inside the NVLink domain) and replaces only the off-rack DP boundary with the four-paper stack above.
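To make the off-rack boundary more concrete, the sketch below shows one plausible reading of "minimum quorum" plus "token-weighted merge" from the Decoupled DiLoCo bullet: learners that reported in time contribute an update, and each update is weighted by the number of tokens that learner processed. The function name, the flat-list representation of an update, and the choice to return `None` below quorum are assumptions for illustration; the SDK's actual merge operates on model tensors and may differ.

```python
from typing import Optional, Sequence


def token_weighted_merge(
    deltas: Sequence[Sequence[float]],
    tokens: Sequence[int],
    quorum_min_learners: int,
) -> Optional[list[float]]:
    """Average learner updates, weighting each by the tokens it processed.

    Returns None when fewer than quorum_min_learners learners reported.
    """
    if len(deltas) != len(tokens):
        raise ValueError("one token count per delta is required")
    if len(deltas) < quorum_min_learners:
        return None  # quorum not met: skip the merge for this round
    total = sum(tokens)
    if total <= 0:
        raise ValueError("total token count must be positive")
    dim = len(deltas[0])
    merged = [0.0] * dim
    for delta, n_tok in zip(deltas, tokens):
        weight = n_tok / total  # weights sum to 1 across reporting learners
        for i in range(dim):
            merged[i] += weight * delta[i]
    return merged
```

For example, two learners with updates `[1.0, 0.0]` and `[3.0, 4.0]` and token counts 1 and 3 merge with weights 0.25 and 0.75.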

License and IP posture

This SDK is licensed under Apache-2.0 with its full automatic patent grant. The SDK is patent-independent by deliberate construction: it does not exercise the K-Pool LoRA (US App. 64/060,315) or Infinity (US App. 64/055,093) patent estates that the companion SDK (tsugiai-kpool-sdk, also Apache-2.0) does. Read the preamble at the top of the LICENSE file for the full posture explanation.

The companion patent-aligned SDK at github.com/tsugiai/tsugi-kpool is the reduction-to-practice artifact for those two TsugiCinema patent estates. The two SDKs share zero code.

Headline measurement

| Workload | Hardware | Measurement | Reference |
| --- | --- | --- | --- |
| Statistical-confidence headline (Hopper 3-seed CI) | Modal H100:1, Qwen-2.5-1.5B, 200 steps × 3 seeds at 2000ms | +71.49% ± 2.83% (95% CI) throughput uplift | docs/phase2_multi_seed_ci_results.md |
| Cross-rack grace-window overlap on Hopper at 7B production scale | Modal H100:1, Qwen-2.5-7B, 200 steps × 5 delays | +76.58% at 2000ms; +39.72% at 1000ms; +19.20% at 500ms | docs/phase2_delay_sweep_qwen7b_results.md |
| Intermediate model-scale (3B) confirmation (Hopper 3-seed CI) | Modal H100:1, Qwen-2.5-3B, 200 steps × 4 delays × 3 seeds | +41.31% ± 0.29% at 2000ms (n=3); +20.40% ± 0.03% at 1000ms; +9.75% ± 0.04% at 500ms | Continuation doc 2026-05-23 |
| Cross-rack grace-window overlap on Hopper at 1.5B model scale | Modal H100:1, Qwen-2.5-1.5B, 200 steps × 7 delays | +70.64% at 2000ms; +34.73% at 1000ms; +16.86% at 500ms; concurrent path constant ~30,840 tok/s across all delays | docs/phase2_delay_sweep_results.md |
| Cross-rack grace-window overlap on A10G | Modal A10G, SmolLM-135M, 200 steps × 7 delays | +52.75% at 2000ms (constant); +11.61% at 500ms; -0.06% overhead at 0ms; bit-exact loss preserved across all cells | docs/phase2_delay_sweep_results.md |
| A10G delay-distribution stress-test sweep | Modal A10G, SmolLM-135M, 200 steps × 4 delays × 3 distributions | constant +52.53% / bimodal-stress +8.29% / long-tail-stress +19.44% at delay=2000ms (parameters NOT FALCON-anchored; see briefs 06/07/08 in MasterVision) | Continuation doc 2026-05-23 |
| Production-realistic multi-GPU FSDP + 7B model (3-seed CI) | Modal 8xH100 FSDP FULL_SHARD, Qwen-2.5-7B + simulated 2-rack, 4 delays × 3 seeds | +6.37% ± 1.31% at delay=2000ms (n=3; replaces the prior +7.36% single-seed) | Continuation doc 2026-05-23 |
| Multi-GPU FSDP smaller model | Modal 8xH100 FSDP, Qwen-2.5-1.5B + simulated 2-rack, 7 delays | +3.08% at 2000ms (synthetic 8xGPU floor) | docs/stage_c_phase2_delay_sweep_results.md |
| Real cross-network 2-node 8xV100 (synchronous reducer; preserved baseline) | Lambda Labs commodity Ethernet, SmolLM-135M, 500 paired steps | +28.58% tokens-per-second uplift vs vanilla FSDP, bit-exact-identical loss | docs/stage_d_proper_results.md |
| H100 Hopper single-instance (synchronous reducer baseline) | Modal 8x H100 SXM5, Llama-3-8B, 2000 paired steps × 3 seeds | -0.97% ± 1.5% (predicted null; Hopper NVLink absorbs the synchronous-path cross-rack tax) | docs/stage_e_results.md |
| H100 Hopper single-instance + concurrent orchestrator (Track A, 3-seed CI) | Modal 8x H100 SXM5, Llama-3-8B, 200 paired steps × 3 seeds at delay=2000ms | +2.88% ± 0.43% (n=3) throughput uplift; loss equivalence preserved (sync 0.288, conc 0.278) | Continuation doc 2026-05-23 |
| Item E DeepSpeed ZeRO-3 head-to-head measured orthogonality | Modal 8x H100 SXM5, Qwen-2.5-7B + DeepSpeed ZeRO-3 (stage=3, overlap_comm=True), 200 paired steps × 3 seeds at delay=2000ms | +29.6% concurrent vs sync (n=3 paired; per-seed jitter <1pp); concurrent +0.44% vs no-delay baseline. DeepSpeed's intra-iteration overlap cannot recover the outer-step wait; Mend's outer-step concurrent_outer_step recovers essentially all of it | internal benchmark report |
| Real cross-network 2-pod 8xH100 (Stage E-prime; production-fabric floor; n=1 paired) | RunPod 2x 8x H100 SXM5 over real InfiniBand or RoCE v2 3.2 Tbps (AP-IN-1), Llama-3-8B, 500 paired steps × 1 seed | +1.42% tps + 0.18% loss delta (effectively bit-exact); production-fabric floor anchor; n=1 caveat is load-bearing (same order of magnitude as +1.40% baseline-only seed variance); n=3 CI pending | docs/stage_e_prime_results.md |
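For readers who want to reproduce the "uplift ± 95% CI" style of figure used in the table, the sketch below shows one common way to compute it: per-seed uplift as the relative throughput gain of the concurrent path over the synchronous path, then a Student-t interval over the seeds. The README does not state which interval method was used, so the t-based interval is an assumption, and the throughput numbers in the example are hypothetical placeholders, not measurements from this project.

```python
from math import sqrt
from statistics import mean, stdev

# Two-sided 95% Student-t critical values, keyed by degrees of freedom (n - 1).
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776}


def uplift_pct(concurrent_tps: float, sync_tps: float) -> float:
    """Relative throughput gain of the concurrent path, in percent."""
    return (concurrent_tps / sync_tps - 1.0) * 100.0


def mean_ci95(samples: list[float]) -> tuple[float, float]:
    """Return (mean, half-width of the 95% t-interval) for 2 to 5 samples."""
    n = len(samples)
    half_width = T_CRIT_95[n - 1] * stdev(samples) / sqrt(n)
    return mean(samples), half_width


# Hypothetical (concurrent, sync) tok/s pairs for three seeds; NOT real data.
pairs = [(30000.0, 20000.0), (30300.0, 20000.0), (29700.0, 20000.0)]
uplifts = [uplift_pct(c, s) for c, s in pairs]
center, half = mean_ci95(uplifts)
print(f"{center:+.2f}% ± {half:.2f}% (95% CI, n={len(uplifts)})")
```

With only three seeds the t critical value (4.303) is large, which is why small-n intervals are wide and why the n=1 rows in the table carry no interval at all.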

The +71.49% ± 2.83% (95% CI; n=3) Hopper headline is the canonical enterprise cross-rack DD-grade measurement. The orchestrator's uplift is governed by the ratio N * T_step / G, where N is sync_period_steps, T_step is the per-step compute time, and G is grace_window_ms.

The apparent non-monotonicity with model size at H100:1 (Qwen-3B at +41.31% sits below both Qwen-1.5B at +71.49% and Qwen-7B at +76.58%) is fully explained by the Qwen-7B measurement using 1/8 the tokens per step (seq_len=1024, mbs=1 versus seq_len=2048, mbs=4). At fixed tokens per step, uplift decreases monotonically with model size. The analytical model and the proposed N* = ceil(G / T_step) auto-tuner spec are documented in an internal companion brief on uplift-surface characterization.

Production-realistic multi-GPU FSDP yields a smaller honest floor (+6.37% ± 1.31%, n=3 paired, at delay=2000ms; replaces the prior +7.36% single-seed) because 8-rank NCCL pipelining absorbs some of the synthetic delay. A real Stage D-proper measurement on an actual cross-network fabric is expected to lift this number back up.
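The quantities in this discussion can be worked through with a small sketch. Only the auto-tuner formula N* = ceil(G / T_step) and the seq_len / mbs values come from the text above. The `idealized_uplift` function is a toy overlap model introduced here as an assumption (the synchronous path pays compute plus delay each round, the concurrent path pays whichever is larger); it is not the internal brief's analytical model. Reading `mbs` as micro-batch size is also an assumption.

```python
import math


def auto_tune_sync_period(grace_window_ms: float, t_step_ms: float) -> int:
    """Proposed auto-tuner from the text: N* = ceil(G / T_step)."""
    return math.ceil(grace_window_ms / t_step_ms)


def idealized_uplift(sync_period_steps: int, t_step_ms: float,
                     delay_ms: float) -> float:
    """Toy model (assumption): fractional uplift from perfect overlap."""
    compute_ms = sync_period_steps * t_step_ms
    sync_round_ms = compute_ms + delay_ms            # wait, then compute
    concurrent_round_ms = max(compute_ms, delay_ms)  # wait hidden by compute
    return sync_round_ms / concurrent_round_ms - 1.0


def tokens_per_step(seq_len: int, micro_batch_size: int) -> int:
    """Tokens processed per step, assuming mbs means micro-batch size."""
    return seq_len * micro_batch_size


# The 1/8 tokens-per-step ratio quoted for Qwen-7B versus the other runs.
ratio = tokens_per_step(1024, 1) / tokens_per_step(2048, 4)
print(ratio)
# Hypothetical timing: a 2000 ms grace window with 150 ms steps.
print(auto_tune_sync_period(2000, 150))
```

In this toy model, while the delay is shorter than one round of compute, the uplift equals delay / (N * T_step), so halving the delay halves the uplift; the delay-sweep rows in the table show roughly that shape.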

Delay-distribution stress-test disclosure (Track D 2026-05-23): the constant-delay headlines (e.g., +52.53% A10G at delay=2000ms) are ceiling-case stress tests. The bimodal and log-normal variants are alternative stress-test shapes whose parameters are NOT directly FALCON-anchored: FALCON Table 2 reports only inter-node RDMA CoV=0.29 as the quantitative variance number; it does NOT publish per-iteration percentile breakdowns or bimodal characterizations. The current bimodal (80/20 at 50ms/base) delivers +8.29% on the same A10G + SmolLM-135M workload; the current long_tail (sigma=1.0) delivers +19.44%. A FALCON-CoV-anchored re-tune (95/5 bimodal, sigma~0.285 log-normal) is proposed in an internal FALCON-distribution-verification brief. Until that re-measurement lands, the +28.58% Stage D-proper Lambda V100 cross-network result remains the most defensible production-grounded headline.
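The relationship between a log-normal sigma and a coefficient of variation (CoV) can be checked directly, which shows why sigma=1.0 is a much heavier tail than a CoV of 0.29 implies. For a log-normal distribution, CoV = sqrt(exp(sigma^2) - 1), independent of the location parameter. The sketch below uses that standard identity; the function names are illustrative, and the two-point helper is a generic formula for a two-valued mixture whose mapping onto the "80/20 at 50ms/base" setting is left open because the text does not spell out the parameterization.

```python
import math


def lognormal_cov(sigma: float) -> float:
    """CoV of a log-normal distribution with shape parameter sigma."""
    return math.sqrt(math.exp(sigma ** 2) - 1.0)


def lognormal_sigma_for_cov(cov: float) -> float:
    """Inverse of lognormal_cov: the sigma that yields a target CoV."""
    return math.sqrt(math.log(1.0 + cov ** 2))


def two_point_cov(p_a: float, a_ms: float, b_ms: float) -> float:
    """CoV of a delay that is a_ms with probability p_a, else b_ms."""
    mean = p_a * a_ms + (1.0 - p_a) * b_ms
    std = math.sqrt(p_a * (1.0 - p_a)) * abs(a_ms - b_ms)
    return std / mean


print(round(lognormal_cov(1.0), 3))             # current long-tail setting
print(round(lognormal_cov(0.285), 3))           # proposed re-tune
print(round(lognormal_sigma_for_cov(0.29), 3))  # sigma implied by CoV=0.29
```

Under this identity, sigma=1.0 corresponds to a CoV of about 1.31, while sigma of roughly 0.285 corresponds to a CoV of about 0.29, consistent with the proposed re-tune.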

At every scale, the concurrent path's throughput is essentially constant across all delays (Qwen-7B single-process: 4,300 ± 80 tok/s; Qwen-3B Hopper: 18,153 ± 4 tok/s; Qwen-1.5B Hopper: 30,840 ± 35 tok/s; SmolLM-135M A10G: 23,610 ± 80 tok/s), while the synchronous baseline collapses linearly as the delay grows. The FALCON paper (arXiv:2410.12588) documents cross-rack inter-node RDMA variance (CoV=0.29) but does not characterize the per-iteration latency distribution shape that this sweep parameterizes; the delay sweep is a stress test, not a literal FALCON replay.

Status

Stages A through E-prime all pass. Phase 2 Week 1 (concurrent async-TP overlap with the cross-rack reducer) has shipped with the ConcurrentOuterStep orchestrator integrated. The Stage D-proper real-fabric measurement for Hopper cross-network is closed as a point estimate via Stage E-prime (RunPod 2x 8x H100 over InfiniBand or RoCE v2 at 3.2 Tbps, n=1 paired, +1.42% production-fabric floor); the n=3 CI is pending. The Item E head-to-head against DeepSpeed ZeRO-3 confirms orthogonality at +29.6% concurrent vs sync on a Tier-1 hyperscaler stack. See docs/60_day_plan.md for the Phase 2 nine-week sprint.

Quickstart

from tsugi_mend import MendConfig, mend_init, mend_shutdown

config = MendConfig(
    quorum_min_learners=4,
    grace_window_ms=2000,
    token_weighted_merge=True,
    sync_period_steps=128,
    momentum_sync_period_steps=512,
    async_tp_enabled=True,
    # Phase 2 Week 1 (2026-05-22): orchestrator overlaps the cross-rack
    # outer-step wait with inner-step async-TP compute. Default True.
    # See docs/phase2_delay_sweep_results.md for the +52.75% measurement.
    concurrent_outer_step=True,
    failslow_zscore_threshold=3.0,
    failslow_window_steps=50,
    rack_aware=True,
    sideband_addr="tcp://0.0.0.0:51900",
    sideband_peers=("tcp://peer1:51900", "tcp://peer2:51900"),
    sideband_heartbeat_ms=100,
    diagnostics_dir="./results/mend_diag",
)

# `model` is your existing PyTorch model, constructed before this call.
mend_init(model, config)
# ... train normally ...
mend_shutdown(model)
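The two period settings in the config above (`sync_period_steps=128` and `momentum_sync_period_steps=512`) suggest that parameters and optimizer momenta are synchronized on different cadences, in the spirit of the DES-LOC bullet. The sketch below shows one plain reading of that schedule. The function `sync_actions` is hypothetical and written only for illustration; how the SDK actually schedules these synchronizations is not documented here.

```python
def sync_actions(step: int, sync_period_steps: int = 128,
                 momentum_sync_period_steps: int = 512) -> list[str]:
    """Return which state would be synchronized after a given step.

    Assumption: each state syncs whenever the step count is a positive
    multiple of its own period, independently of the other.
    """
    actions = []
    if step > 0 and step % sync_period_steps == 0:
        actions.append("parameters")
    if step > 0 and step % momentum_sync_period_steps == 0:
        actions.append("momenta")
    return actions


# Count cross-rack syncs of each kind over the first 1024 steps.
schedule = [sync_actions(s) for s in range(1, 1025)]
n_param = sum("parameters" in a for a in schedule)
n_momenta = sum("momenta" in a for a in schedule)
print(n_param, n_momenta)
```

Under this reading, momenta cross the rack boundary a quarter as often as parameters, which is where the reduction in cross-rack traffic for optimizer state would come from.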

Layout

src/tsugi_mend/        SDK source
tests/                Stage A unit and integration tests (CPU-only)
benchmarks/           Stage B/C/D/E launch scripts (cloud-gated)
docs/                 architecture, benchmark protocol, convergence-equivalence sketch
examples/             minimal training-loop integration examples
scripts/              utility scripts (env audit, cost estimator)

Companion SDK

For LoRA-adapter-granularity productization that exercises the K-Pool LoRA and Infinity patent estates, see tsugiai-kpool-sdk. The two SDKs serve different acquirer-due-diligence legs:

  • Patent moat leg (kpool): the IP that goes into the Definitive Agreement's assignment schedule.
  • Operational uplift leg (max): the engineering artifact a partner can run Monday morning on their cluster.

Both legs are independent. Either can stand on its own.
