CUDA implementation of Manifold-Constrained Hyper-Connections
mHC.cu
Unofficial CUDA implementation of mHC: Manifold-Constrained Hyper-Connections by DeepSeek-AI.
Running on Modal
The image is built on the first run and then cached, so later runs do not rebuild it.
Supported GPUs
--gpu h100 # H100 80GB HBM3 SXM5 model
--gpu b200 # B200
Benchmark
Run benchmark suite
# python bench
modal run runmodal.py --gpu h100 --mode bench --scope python
# c++ / cuda bench
modal run runmodal.py --gpu h100 --mode bench --scope native
# run all benches
modal run runmodal.py --gpu h100 --mode bench --scope all
Generate benchmark files (the benchmark commands above run this step automatically)
# generate the benchmark files
make benchgen
# check status of benchmark files
make benchgen-check
Test
# python tests
modal run runmodal.py --gpu h100 --mode test --scope python
# c++ / cuda tests
modal run runmodal.py --gpu h100 --mode test --scope native
# run all tests
modal run runmodal.py --gpu h100 --mode test --scope all
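The test and benchmark invocations above differ only in `--gpu`, `--mode`, and `--scope`, so they can be generated from a script. The following is a minimal sketch that uses only the flag values documented in this README; the helper name `modal_command` is hypothetical and is not part of the project.

```python
import shlex

# Values taken from the commands documented above; anything else is rejected.
GPUS = {"h100", "b200"}
MODES = {"bench", "test"}
SCOPES = {"python", "native", "all"}


def modal_command(gpu: str, mode: str, scope: str) -> list[str]:
    """Return the argv list for one documented `modal run` invocation."""
    if gpu not in GPUS:
        raise ValueError(f"unsupported gpu: {gpu!r}")
    if mode not in MODES:
        raise ValueError(f"unsupported mode: {mode!r}")
    if scope not in SCOPES:
        raise ValueError(f"unsupported scope: {scope!r}")
    return ["modal", "run", "runmodal.py",
            "--gpu", gpu, "--mode", mode, "--scope", scope]


# Print the two "run everything" commands for an H100.
for mode in ("test", "bench"):
    print(shlex.join(modal_command("h100", mode, "all")))
```

The list form can be passed directly to `subprocess.run`, which avoids shell quoting issues. The `train` mode is left out because it takes `--train-args` rather than `--scope`.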
Training
This trainer approximates the paper’s small-model scaling with a dense Transformer and mHC residual mixing (no MoE/MLA). It uses the fused CUDA path for mHC dynamic H computation when available and streams loss to the Modal logs.
modal run runmodal.py --gpu b200 --mode train --train-args "\
--preset 3b \
--scale 0.25 \
--seq-len 1024 \
--batch-size 2 \
--grad-clip 1.0 \
--grad-accum 4 \
--max-steps 10 \
--sdp-kernel flash \
--logits-chunk-size 512 \
--recompute-ratio 0.9 \
--run-name train-3b \
--log-memory" \
--download
Download checkpoints and metrics after the run:
modal volume get mhc-runs /train-3b ./runs
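To judge how much data the example training command processes, the flags can be multiplied out. This sketch assumes a single data-parallel rank and that `--batch-size` counts sequences per micro-batch; neither is confirmed by this README. The `TrainArgs` class is a hypothetical stand-in for the trainer's argument parser.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainArgs:
    """Subset of the --train-args flags that determine token throughput."""
    seq_len: int
    batch_size: int
    grad_accum: int
    max_steps: int

    @property
    def tokens_per_step(self) -> int:
        # One optimizer step accumulates grad_accum micro-batches
        # (assumption: one data-parallel rank).
        return self.seq_len * self.batch_size * self.grad_accum

    @property
    def total_tokens(self) -> int:
        return self.tokens_per_step * self.max_steps


# Values from the example command above.
args = TrainArgs(seq_len=1024, batch_size=2, grad_accum=4, max_steps=10)
print(args.tokens_per_step)  # 8192
print(args.total_tokens)     # 81920
```

Under these assumptions the example is a smoke test of roughly 82k tokens, not a meaningful training run.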
Local
Installation
make install # install PyTorch extension
make install-dev # install with dev dependencies
Build
make # build C++ / CUDA source for all architectures
make CUDA_ARCH=90 # build for specific arch (H100)
make clean # clean build
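The `CUDA_ARCH=90` value corresponds to the H100's compute capability 9.0. A small sketch of deriving the value from a `(major, minor)` capability tuple follows. The helper names are hypothetical, and whether the Makefile accepts values other than `90` (for example `100` for a B200, compute capability 10.0) is an assumption that this README does not confirm.

```python
def cuda_arch(capability: tuple[int, int]) -> str:
    """Format a (major, minor) compute capability, e.g. (9, 0) -> "90"."""
    major, minor = capability
    return f"{major}{minor}"


def make_command(capability: tuple[int, int]) -> list[str]:
    """Build the argv list for an architecture-specific build."""
    return ["make", f"CUDA_ARCH={cuda_arch(capability)}"]


# On a machine with PyTorch and a visible GPU, the tuple can be obtained
# from torch.cuda.get_device_capability(); it is hard-coded here so the
# sketch runs without a GPU.
H100 = (9, 0)
B200 = (10, 0)  # assumption: the Makefile follows the same naming pattern

print(" ".join(make_command(H100)))
print(" ".join(make_command(B200)))
```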
Test
make test # C++ / CUDA tests
make test-python # Python tests
Benchmark
make bench # run all C++ / CUDA benchmarks
make bench-python # run all Python benchmarks
PyTorch Benchmark Results (measured on H100 SXM5)
Speedup of the fused mHC implementation over a naive PyTorch mHC implementation (configurations from Appendix A.1 of the paper):
Static H Path (shared H across batch):
| Batch | Hidden | n | Forward | Backward |
|---|---|---|---|---|
| 320 | 1280 | 4 | 15.20x | 10.07x |
| 512 | 1920 | 4 | 10.52x | 9.20x |
| 1280 | 2560 | 4 | 5.66x | 4.34x |
| 2560 | 1280 | 4 | 5.66x | 4.21x |
Dynamic H Path (per-batch H values computed via Equations 7-9 from paper):
| Batch | Hidden | n | Forward | Backward |
|---|---|---|---|---|
| 320 | 1280 | 4 | 7.39x | 3.35x |
| 512 | 1920 | 4 | 7.38x | 3.47x |
| 1280 | 2560 | 4 | 5.33x | 3.07x |
| 2560 | 1280 | 4 | 5.21x | 3.02x |
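The ratios in the tables above can be read as baseline time divided by fused time. The sketch below shows one way to compute such a ratio, assuming a warmup phase and a median over repeated runs; the repository's own benchmark harness may measure differently. It uses wall-clock timing with CPU stand-ins so that it runs anywhere. For GPU kernels, each timed call would also need a device synchronization (for example `torch.cuda.synchronize()`) before the clock is read.

```python
import statistics
import time
from typing import Callable


def median_ms(fn: Callable[[], object], warmup: int = 3, iters: int = 20) -> float:
    """Median wall-clock time of fn() in milliseconds, after warmup runs."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)


def speedup(baseline_ms: float, fused_ms: float) -> float:
    """How many times faster the fused path is than the baseline."""
    return baseline_ms / fused_ms


def format_speedup(ratio: float) -> str:
    """Format a ratio in the style used by the tables, e.g. '15.20x'."""
    return f"{ratio:.2f}x"


# CPU stand-ins for a "naive" and a "fused" implementation (illustrative only).
def naive() -> int:
    return sum(i * i for i in range(20_000))


def fused() -> int:
    return sum(range(2_000))


print(format_speedup(speedup(median_ms(naive), median_ms(fused))))
```

The median is used instead of the mean so that occasional slow iterations have less influence on the reported ratio.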
Format
make format # clang-format + python black formatting
Usage
import torch
from mhc import MHCLayer
# Dynamic H path (default, matches paper architecture)
# H values are computed from x via learned projections
layer = MHCLayer(hidden_dim=4096, expansion_rate=4).cuda()
x = torch.randn(8, 4, 4096, device="cuda") # [B, n, C]
y = layer(x) # [B, n, C]
# Static H path (shared H across batch, faster for inference)
layer_static = MHCLayer(hidden_dim=4096, expansion_rate=4, use_dynamic_h=False).cuda()
y = layer_static(x)
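The difference between the two paths can be illustrated at the shape level without the extension. The sketch below is plain Python and is not the layer's actual computation: it only mixes the `n` streams of each sample with an `n x n` matrix, sharing one matrix across the batch in the static case and using one matrix per sample in the dynamic case. How `MHCLayer` derives and constrains `H` is not modelled, and the names `mix_streams`, `static_path`, and `dynamic_path` are hypothetical.

```python
Matrix = list[list[float]]  # [n, n] mixing matrix or [n, C] sample


def mix_streams(h: Matrix, sample: Matrix) -> Matrix:
    """y[i][c] = sum_j h[i][j] * sample[j][c] for one [n, C] sample."""
    n, channels = len(sample), len(sample[0])
    return [
        [sum(h[i][j] * sample[j][c] for j in range(n)) for c in range(channels)]
        for i in range(n)
    ]


def static_path(h: Matrix, x: list[Matrix]) -> list[Matrix]:
    """One H shared by every sample in the batch."""
    return [mix_streams(h, sample) for sample in x]


def dynamic_path(hs: list[Matrix], x: list[Matrix]) -> list[Matrix]:
    """One H per sample; in the real layer these are computed from x."""
    return [mix_streams(h, sample) for h, sample in zip(hs, x)]


# Batch of B=2 samples, n=2 streams, C=3 channels.
x = [
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    [[7.0, 8.0, 9.0], [10.0, 11.0, 12.0]],
]
identity = [[1.0, 0.0], [0.0, 1.0]]
swap = [[0.0, 1.0], [1.0, 0.0]]

y_static = static_path(identity, x)            # identity H leaves x unchanged
y_dynamic = dynamic_path([identity, swap], x)  # second sample has its streams swapped
print(y_static == x)
print(y_dynamic[1])
```

Because the static path reuses a single `H`, it needs no per-sample computation of `H`, which is consistent with the note above that it is the faster option for inference.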
Contributing
See CONTRIBUTING.md for directions on how to contribute, including testing, formatting, and code style requirements.
Paper
mHC: Manifold-Constrained Hyper-Connections
https://arxiv.org/abs/2512.24880
DeepSeek-AI
Citation
@article{xie2025mhc,
title={mHC: Manifold-Constrained Hyper-Connections},
author={Xie, Zhenda and Wei, Yixuan and Cao, Huanqi and Zhao, Chenggang and Deng, Chengqi and Li, Jiashi and Dai, Damai and Gao, Huazuo and Chang, Jiang and Zhao, Liang and Zhou, Shangyan and Xu, Zhean and Zhang, Zhengyan and Zeng, Wangding and Hu, Shengding and Wang, Yuqing and Yuan, Jingyang and Wang, Lean and Liang, Wenfeng},
journal={arXiv preprint arXiv:2512.24880},
year={2025}
}
Source Distribution
File details
Details for the file mhc_cuda-0.1.0.tar.gz.
File metadata
- Download URL: mhc_cuda-0.1.0.tar.gz
- Upload date:
- Size: 59.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6ad7786700260252a07f99f1577c242d4ec79bb141771f744288fa1be50c304c |
| MD5 | eac67f3d10dbea267914676ecc251319 |
| BLAKE2b-256 | f40d052e58a8f591e945f6614c1060ef4782f1897f826f353ee7ef274b9a1bac |
Provenance
The following attestation bundles were made for mhc_cuda-0.1.0.tar.gz:
Publisher: publish.yml on AndreSlavescu/mHC.cu
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: mhc_cuda-0.1.0.tar.gz
- Subject digest: 6ad7786700260252a07f99f1577c242d4ec79bb141771f744288fa1be50c304c
- Sigstore transparency entry: 1064069011
- Sigstore integration time:
- Permalink: AndreSlavescu/mHC.cu@a426939c2dbc11c443db041bcff12b65d1b6482a
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/AndreSlavescu
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@a426939c2dbc11c443db041bcff12b65d1b6482a
- Trigger Event: release
-
Statement type: