tsugiai-mend-sdk. Maximum-uplift cross-rack distributed-training reducer built on Decoupled DiLoCo + DES-LOC + async tensor parallelism + FALCON fail-slow mitigation. Patent-independent by deliberate construction; see LICENSE preamble.

tsugiai-mend-sdk

Maximum-uplift cross-rack distributed-training reducer for PyTorch.

A software-only drop-in that replaces the cross-rack data-parallel all-reduce with a combination of four publicly documented (prior-art) techniques, chosen to maximize measured tokens-per-second uplift on realistic cross-rack topologies:

Decoupled DiLoCo (Douillard et al., arXiv:2604.21428, April 2026): independent asynchronous learners, minimum quorum, adaptive grace window, token-weighted merge.
DES-LOC / Local Adam (Iacob et al., arXiv:2505.22549, May 2025; ICLR 2026): desynchronized synchronization periods for adaptive-optimizer momenta.
Async tensor parallelism (PyTorch / TorchTitan, September 2024): intra-node overlap of all-gather / reduce-scatter with matmul.
FALCON fail-slow mitigation (arXiv:2410.12588, October 2024): sliding-window z-score detection of slow nodes to exclude them from the current quorum round.
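
The quorum, grace-window, and token-weighted-merge ideas in the Decoupled DiLoCo item above can be illustrated with a minimal sketch. This is not the SDK's implementation: `LearnerUpdate` and `merge_round` are hypothetical names, the flattened-delta layout is an assumption, and the grace window here is fixed rather than adaptive. Only the parameter names `quorum_min_learners` and `grace_window_ms` come from the Quickstart.

```python
from dataclasses import dataclass


@dataclass
class LearnerUpdate:
    """Hypothetical container for one learner's contribution to a merge round."""
    learner_id: str
    tokens: int          # tokens processed since the previous merge
    delta: list          # flattened parameter delta (assumed layout)
    arrival_ms: float    # arrival time relative to the start of the round


def merge_round(updates, quorum_min_learners, grace_window_ms):
    """Token-weighted mean of on-time deltas, or None if quorum is not met."""
    # Updates that miss the grace window are left out of this round.
    on_time = [u for u in updates if u.arrival_ms <= grace_window_ms]
    if len(on_time) < quorum_min_learners:
        return None
    total_tokens = sum(u.tokens for u in on_time)
    if total_tokens == 0:
        return None
    merged = [0.0] * len(on_time[0].delta)
    for u in on_time:
        weight = u.tokens / total_tokens  # learners that saw more tokens count more
        for i, d in enumerate(u.delta):
            merged[i] += weight * d
    return merged


updates = [
    LearnerUpdate("a", tokens=100, delta=[1.0, 0.0], arrival_ms=10.0),
    LearnerUpdate("b", tokens=300, delta=[0.0, 1.0], arrival_ms=20.0),
    LearnerUpdate("c", tokens=1000, delta=[9.0, 9.0], arrival_ms=5000.0),  # late
]
merged = merge_round(updates, quorum_min_learners=2, grace_window_ms=2000)
```

In this illustration learner `c` misses the window, so the merge weights `a` and `b` by 100 and 300 tokens respectively. Raising `quorum_min_learners` to 3 would make the round return `None` instead.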

The SDK keeps intra-rack TP / CP / PP / FSDP collectives unchanged (vanilla NCCL inside the NVLink domain) and replaces only the off-rack DP boundary with the four-technique stack above.
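
As an illustration of the sliding-window z-score idea named in the fail-slow item, here is a minimal sketch. It is an assumption about how such a detector could look, not FALCON's algorithm or the SDK's code: `FailSlowDetector`, `observe`, and `min_samples` are hypothetical, and the defaults simply mirror `failslow_window_steps=50` and `failslow_zscore_threshold=3.0` from the Quickstart.

```python
from collections import deque
from statistics import mean, pstdev


class FailSlowDetector:
    """Hypothetical per-node detector over a sliding window of step times."""

    def __init__(self, window_steps=50, zscore_threshold=3.0, min_samples=5):
        self.window = deque(maxlen=window_steps)  # oldest samples fall off
        self.threshold = zscore_threshold
        self.min_samples = min_samples

    def observe(self, step_ms):
        """Return True if step_ms is a slow outlier relative to the window."""
        slow = False
        if len(self.window) >= self.min_samples:
            mu = mean(self.window)
            sigma = pstdev(self.window)
            if sigma > 0:
                # One-sided test: only unusually slow steps are flagged.
                slow = (step_ms - mu) / sigma > self.threshold
        if not slow:
            # Flagged samples are kept out of the baseline so that a
            # persistently slow node cannot normalize its own slowness.
            self.window.append(step_ms)
        return slow


detector = FailSlowDetector()
for t in [100, 101, 99, 100, 102, 98, 100, 101]:
    detector.observe(t)
normal = detector.observe(100.5)
outlier = detector.observe(500.0)
```

A node flagged this way would be excluded from the current quorum round. Keeping flagged samples out of the window is one design choice among several; a production detector might instead decay them in slowly so a recovered node can rejoin.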

License and IP posture

This SDK is licensed under Apache-2.0 with its full automatic patent grant. The SDK is patent-independent by deliberate construction: it does not exercise the K-Pool LoRA (US App. 64/060,315) or Infinity (US App. 64/055,093) patent estates that the companion SDK (tsugiai-kpool-sdk, also Apache-2.0) does. Read the preamble at the top of the LICENSE file for the full posture explanation.

The companion patent-aligned SDK at github.com/tsugiai/tsugi-kpool is the reduction-to-practice artifact for those two TsugiCinema patent estates. The two SDKs share zero code.

Headline measurement

Workload	Hardware	Measurement	Reference
Statistical-confidence headline (Hopper 3-seed CI)	Modal H100:1, Qwen-2.5-1.5B, 200 steps × 3 seeds at 2000ms	+71.49% ± 2.83% (95% CI) throughput uplift	`docs/phase2_multi_seed_ci_results.md`
Cross-rack grace-window overlap on Hopper at 7B production scale	Modal H100:1, Qwen-2.5-7B, 200 steps × 5 delays	+76.58% at 2000ms; +39.72% at 1000ms; +19.20% at 500ms	`docs/phase2_delay_sweep_qwen7b_results.md`
Intermediate model-scale (3B) confirmation (Hopper 3-seed CI)	Modal H100:1, Qwen-2.5-3B, 200 steps × 4 delays × 3 seeds	+41.31% ± 0.29% at 2000ms (n=3); +20.40% ± 0.03% at 1000ms; +9.75% ± 0.04% at 500ms	Continuation doc 2026-05-23
Cross-rack grace-window overlap on Hopper at 1.5B model scale	Modal H100:1, Qwen-2.5-1.5B, 200 steps × 7 delays	+70.64% at 2000ms; +34.73% at 1000ms; +16.86% at 500ms; concurrent path constant ~30,840 tok/s across all delays	`docs/phase2_delay_sweep_results.md`
Cross-rack grace-window overlap on A10G	Modal A10G, SmolLM-135M, 200 steps × 7 delays	+52.75% at 2000ms (constant); +11.61% at 500ms; -0.06% overhead at 0ms; bit-exact loss preserved across all cells	`docs/phase2_delay_sweep_results.md`
A10G delay-distribution stress-test sweep	Modal A10G, SmolLM-135M, 200 steps × 4 delays × 3 distributions	constant +52.53% / bimodal-stress +8.29% / long-tail-stress +19.44% at delay=2000ms (parameters NOT FALCON-anchored; see briefs 06/07/08 in MasterVision)	Continuation doc 2026-05-23
Production-realistic multi-GPU FSDP + 7B model (3-seed CI)	Modal 8xH100 FSDP FULL_SHARD, Qwen-2.5-7B + simulated 2-rack, 4 delays × 3 seeds	+6.37% ± 1.31% at delay=2000ms (n=3; replaces the prior +7.36% single-seed)	Continuation doc 2026-05-23
Multi-GPU FSDP smaller model	Modal 8xH100 FSDP, Qwen-2.5-1.5B + simulated 2-rack, 7 delays	+3.08% at 2000ms (synthetic 8xGPU floor)	`docs/stage_c_phase2_delay_sweep_results.md`
Real cross-network 2-node 8xV100 (synchronous reducer; preserved baseline)	Lambda Labs commodity Ethernet, SmolLM-135M, 500 paired steps	+28.58% tokens-per-second uplift vs vanilla FSDP, bit-exact-identical loss	`docs/stage_d_proper_results.md`
H100 Hopper single-instance (synchronous reducer baseline)	Modal 8x H100 SXM5, Llama-3-8B, 2000 paired steps × 3 seeds	-0.97% ± 1.5% (predicted null; Hopper NVLink absorbs the synchronous-path cross-rack tax)	`docs/stage_e_results.md`
H100 Hopper single-instance + concurrent orchestrator (Track A, 3-seed CI)	Modal 8x H100 SXM5, Llama-3-8B, 200 paired steps × 3 seeds at delay=2000ms	+2.88% ± 0.43% (n=3) throughput uplift; loss equivalence preserved (sync 0.288, conc 0.278)	Continuation doc 2026-05-23
Item E DeepSpeed ZeRO-3 head-to-head measured orthogonality	Modal 8x H100 SXM5, Qwen-2.5-7B + DeepSpeed ZeRO-3 (stage=3, overlap_comm=True), 200 paired steps × 3 seeds at delay=2000ms	+29.6% concurrent vs sync (n=3 paired; per-seed jitter <1pp); concurrent +0.44% vs no-delay baseline. DeepSpeed's intra-iteration overlap cannot recover the outer-step wait; Mend's outer-step concurrent_outer_step recovers essentially all of it	(internal benchmark report)
Real cross-network 2-pod 8xH100 (Stage E-prime; production-fabric floor; n=1 paired)	RunPod 2x 8x H100 SXM5 over real InfiniBand or RoCE v2 3.2 Tbps (AP-IN-1), Llama-3-8B, 500 paired steps × 1 seed	+1.42% tps + 0.18% loss delta (effectively bit-exact); production-fabric floor anchor; n=1 caveat is load-bearing (same order of magnitude as +1.40% baseline-only seed variance); n=3 CI pending	`docs/stage_e_prime_results.md`

The +71.49% ± 2.83% (95% CI; n=3) Hopper headline is the canonical enterprise cross-rack due-diligence-grade (DD-grade) measurement. The orchestrator's uplift is governed by the ratio N * T_step / G, where N is sync_period_steps, T_step is the per-step compute time, and G is grace_window_ms. The apparent non-monotonicity with model size at H100:1 (Qwen-3B at +41.31% sits below both Qwen-1.5B at +71.49% and Qwen-7B at +76.58%) is explained by the Qwen-7B measurement using 1/8 the tokens per step (seq_len=1024, mbs=1 versus seq_len=2048, mbs=4). At fixed tokens per step, uplift decreases monotonically with model size. The analytical model and the proposed N* = ceil(G / T_step) auto-tuner spec are documented in an internal companion brief on uplift-surface characterization.

Production-realistic multi-GPU FSDP yields a smaller, more conservative floor (+6.37% ± 1.31%, n=3 paired, at delay=2000ms; this replaces the prior +7.36% single-seed figure) because 8-rank NCCL pipelining absorbs some of the synthetic delay. A Stage D-proper measurement on a real cross-network fabric is expected to raise this number.
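
One plausible reading of the N * T_step / G relationship can be sketched as follows. The model below is an idealization assumed for illustration, not the internal brief's analytical model: it supposes the synchronous path serializes compute and wait (N * T_step + G) while the concurrent path overlaps them (max(N * T_step, G)). The function names and the example numbers are hypothetical.

```python
import math


def predicted_uplift(sync_period_steps, step_ms, grace_window_ms):
    """Idealized throughput uplift of the concurrent path over the sync path."""
    compute_ms = sync_period_steps * step_ms           # N * T_step
    sync_ms = compute_ms + grace_window_ms             # wait is serialized
    concurrent_ms = max(compute_ms, grace_window_ms)   # wait is overlapped
    return sync_ms / concurrent_ms - 1.0


def autotune_sync_period(grace_window_ms, step_ms):
    """N* = ceil(G / T_step): the smallest N whose compute hides the full wait."""
    return math.ceil(grace_window_ms / step_ms)


# Illustrative numbers only: N=100 steps of 25 ms each (2500 ms of compute).
uplift_2000 = predicted_uplift(100, 25.0, 2000.0)   # about 0.80
uplift_1000 = predicted_uplift(100, 25.0, 1000.0)   # about 0.40
uplift_500 = predicted_uplift(100, 25.0, 500.0)     # about 0.20
n_star = autotune_sync_period(2000.0, 25.0)         # 80
```

Under this idealization, uplift equals G / (N * T_step) whenever compute is at least as long as the wait, so halving the delay halves the uplift. The delay sweeps in the table show roughly that pattern, which is why the sketch is offered as a reading aid rather than as a fitted model.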

Delay-distribution stress-test disclosure (Track D 2026-05-23): the constant-delay headlines (e.g., +52.53% A10G at delay=2000ms) are ceiling-case stress tests. The bimodal and log-normal variants are alternative stress-test shapes whose parameters are NOT directly FALCON-anchored: FALCON Table 2 reports only inter-node RDMA CoV=0.29 as the quantitative variance number; it does NOT publish per-iteration percentile breakdowns or bimodal characterizations. The current bimodal (80/20 at 50ms/base) delivers +8.29% on the same A10G + SmolLM-135M workload; the current long_tail (sigma=1.0) delivers +19.44%. A FALCON-CoV-anchored re-tune (95/5 bimodal, sigma~0.285 log-normal) is proposed in an internal FALCON-distribution-verification brief. Until that re-measurement lands, the +28.58% Stage D-proper Lambda V100 cross-network result remains the most defensible production-grounded headline.
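
The link between FALCON's CoV=0.29 and the proposed sigma~0.285 follows from the standard log-normal identity CoV = sqrt(exp(sigma^2) - 1). A short sketch, with hypothetical function names and an assumed mean-preserving sampler (the SDK's actual delay injector is not shown in this README):

```python
import math
import random


def lognormal_cov(sigma):
    """Coefficient of variation of a log-normal with shape parameter sigma."""
    return math.sqrt(math.exp(sigma ** 2) - 1.0)


def sigma_for_cov(cov):
    """Inverse of lognormal_cov: the sigma that yields a target CoV."""
    return math.sqrt(math.log(1.0 + cov ** 2))


def sample_delays_ms(mean_ms, sigma, n, seed=0):
    """Draw n log-normal delays whose expected value is mean_ms."""
    rng = random.Random(seed)
    # E[X] = exp(mu + sigma^2 / 2), so shift mu to hold the mean fixed.
    mu = math.log(mean_ms) - sigma ** 2 / 2.0
    return [rng.lognormvariate(mu, sigma) for _ in range(n)]


sigma_anchored = sigma_for_cov(0.29)   # about 0.284
cov_current = lognormal_cov(1.0)       # about 1.31
delays = sample_delays_ms(2000.0, sigma_anchored, 20000)
```

This makes the disclosure concrete: the current long-tail setting (sigma=1.0) implies a CoV of roughly 1.31, several times the 0.29 that FALCON reports, whereas a sigma near 0.285 reproduces it.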

At every scale, the concurrent path's throughput stays nearly constant across all delays (Qwen-7B single-process: 4,300 ± 80 tok/s; Qwen-3B Hopper: 18,153 ± 4 tok/s; Qwen-1.5B Hopper: 30,840 ± 35 tok/s; SmolLM-135M A10G: 23,610 ± 80 tok/s), while the synchronous baseline slows in proportion to the injected delay. The FALCON paper (arXiv:2410.12588) documents cross-rack inter-node RDMA variance (CoV=0.29) but does not characterize the per-iteration latency distribution shape that this sweep parameterizes; the delay sweep is a stress test, not a literal FALCON replay.
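
The timing behavior described above (concurrent throughput flat, synchronous throughput falling with delay) can be reproduced with a toy overlap experiment. This is only a timing illustration using sleeps as stand-ins; `inner_steps`, `outer_wait`, `run_sync`, and `run_concurrent` are hypothetical names, and nothing here reflects how the ConcurrentOuterStep orchestrator is actually implemented.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def inner_steps(n, step_s):
    for _ in range(n):
        time.sleep(step_s)   # stand-in for one local training step


def outer_wait(delay_s):
    time.sleep(delay_s)      # stand-in for the cross-rack outer-step wait


def run_sync(n, step_s, delay_s):
    """Compute first, then block on the outer step: roughly n*step + delay."""
    start = time.perf_counter()
    inner_steps(n, step_s)
    outer_wait(delay_s)
    return time.perf_counter() - start


def run_concurrent(n, step_s, delay_s):
    """Start the outer wait in the background: roughly max(n*step, delay)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(outer_wait, delay_s)
        inner_steps(n, step_s)   # compute proceeds while the wait is in flight
        pending.result()         # join before the next sync period
    return time.perf_counter() - start


sync_s = run_sync(5, 0.01, 0.1)
concurrent_s = run_concurrent(5, 0.01, 0.1)
```

With these toy values the synchronous run takes at least 0.15 s, while the concurrent run takes about as long as the wait alone. Increasing the number of inner steps until compute exceeds the wait would make the concurrent time independent of the delay.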

Status

Stage A through Stage E-prime all pass. Phase 2 Week 1 (concurrent async-TP overlap with cross-rack reducer) shipped with the ConcurrentOuterStep orchestrator integrated. Stage D-proper for Hopper cross-network real fabric is closed as a point estimate via Stage E-prime (RunPod 2x 8x H100, InfiniBand or RoCE v2 at 3.2 Tbps, n=1 paired, +1.42% production-fabric floor); the n=3 CI is pending. The Item E head-to-head against DeepSpeed ZeRO-3 (Modal 8x H100 SXM5, Qwen-2.5-7B) confirms orthogonality at +29.6% concurrent vs sync. See docs/60_day_plan.md for the Phase 2 nine-week sprint.

Quickstart

from tsugi_mend import MendConfig, mend_init, mend_shutdown

config = MendConfig(
    quorum_min_learners=4,
    grace_window_ms=2000,
    token_weighted_merge=True,
    sync_period_steps=128,
    momentum_sync_period_steps=512,
    async_tp_enabled=True,
    # Phase 2 Week 1 (2026-05-22): orchestrator overlaps the cross-rack
    # outer-step wait with inner-step async-TP compute. Default True.
    # See docs/phase2_delay_sweep_results.md for the +52.75% measurement.
    concurrent_outer_step=True,
    failslow_zscore_threshold=3.0,
    failslow_window_steps=50,
    rack_aware=True,
    sideband_addr="tcp://0.0.0.0:51900",
    sideband_peers=("tcp://peer1:51900", "tcp://peer2:51900"),
    sideband_heartbeat_ms=100,
    diagnostics_dir="./results/mend_diag",
)

# `model` is your existing PyTorch model, constructed and wrapped as usual.
mend_init(model, config)
# ... train normally ...
mend_shutdown(model)
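
To show how the two periods in the config above could interact, here is a minimal sketch of a desynchronized schedule in the spirit of DES-LOC, where parameters and optimizer momenta are synchronized on independent periods. `sync_actions` is a hypothetical helper, and the assumption that syncs fire on exact multiples of a 1-based step counter is mine, not something this README states about the SDK.

```python
def sync_actions(step, sync_period_steps=128, momentum_sync_period_steps=512):
    """Return which state would be synchronized after the given 1-based step."""
    actions = []
    if step % sync_period_steps == 0:
        actions.append("parameters")
    if step % momentum_sync_period_steps == 0:
        actions.append("momenta")
    return actions


# Over the first 1024 steps, count how often each kind of sync would fire.
schedule = [sync_actions(s) for s in range(1, 1025)]
param_syncs = sum("parameters" in a for a in schedule)
momentum_syncs = sum("momenta" in a for a in schedule)
```

With the Quickstart values, parameters would sync eight times and momenta twice in 1024 steps, so momentum traffic crosses the rack boundary a quarter as often as parameter traffic.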

Layout

src/tsugi_mend/        SDK source
tests/                Stage A unit and integration tests (CPU-only)
benchmarks/           Stage B/C/D/E launch scripts (cloud-gated)
docs/                 architecture, benchmark protocol, convergence-equivalence sketch
examples/             minimal training-loop integration examples
scripts/              utility scripts (env audit, cost estimator)

Companion SDK

For LoRA-adapter-granularity productization that exercises the K-Pool LoRA and Infinity patent estates, see tsugiai-kpool-sdk. The two SDKs serve different acquirer-due-diligence legs:

Patent moat leg (kpool): the IP that goes into the Definitive Agreement's assignment schedule.
Operational uplift leg (max): the engineering artifact a partner can run Monday morning on their cluster.

Both legs are independent. Either can stand on its own.

Release history

0.1.2 (May 29, 2026)
0.1.1 (May 27, 2026)
0.1.0 (May 27, 2026; the version described on this page)

Download files

Source Distribution

tsugi_mend-0.1.0.tar.gz (72.9 kB)

Uploaded May 27, 2026 Source

Built Distribution

tsugi_mend-0.1.0-py3-none-any.whl (50.2 kB)

Uploaded May 27, 2026 Python 3

File details

Details for the file tsugi_mend-0.1.0.tar.gz.

File metadata

Download URL: tsugi_mend-0.1.0.tar.gz
Upload date: May 27, 2026
Size: 72.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for tsugi_mend-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`63fe138c22a77b47820e9b769af08ad9bb03f343a1c5863070514773738fb381`
MD5	`aaf3f15727c33177b8e768ee4cf572c8`
BLAKE2b-256	`71a98f9b33bbb2415a0bbd558cc651a808a0d8624293a245b4df5b0a5749effa`

File details

Details for the file tsugi_mend-0.1.0-py3-none-any.whl.

File metadata

Download URL: tsugi_mend-0.1.0-py3-none-any.whl
Upload date: May 27, 2026
Size: 50.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for tsugi_mend-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`31ba17bda0dad80365e406d2fe7b5ec41806525e2fa625503048b9aae274ef8e`
MD5	`641d78b08ce529d5a24bafb1a6af72d2`
BLAKE2b-256	`75ac4ae53a49440755e968146eb1ed5abaf51776cfed6b6a200d0cee04a73ddc`
