Bit-exact deterministic deep learning training across heterogeneous hardware
CARBON
Exact copy, every time, any hardware
Bit-Exact Deterministic Training Across Heterogeneous Infrastructure
Why "Carbon"? A carbon copy is an exact duplicate — zero deviation, every detail preserved. Train on 8 GPUs, train on 64. A100s, H100s, B200s. Carbon: same weights, same gradients, same loss. The name is the guarantee.
The Problem
Training is not reproducible. Same model, same data, different GPU count — different results. Three sources:
- Floating-point non-associativity: (a+b)+c ≠ a+(b+c). Different parallelism = different summation order = different bits.
- Non-deterministic CUDA kernels: cuBLAS picks algorithms at runtime. Atomics race.
- NCCL collectives: allreduce arrival order varies run to run.
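The first source can be reproduced on a CPU in a few lines. The sketch below is illustrative only and is not Carbon code: `chunked_sum` is a hypothetical helper that mimics a data-parallel reduction by summing contiguous chunks (one per simulated worker) and then adding the partial sums. It uses explicit loops rather than the built-in `sum`, because recent CPython versions apply compensated summation to floats.

```python
import math

def naive_sum(values):
    """Left-to-right float accumulation with no compensation."""
    total = 0.0
    for v in values:
        total += v
    return total

def chunked_sum(values, n_workers):
    """Hypothetical helper: sum contiguous chunks, then add the partials in order."""
    size = math.ceil(len(values) / n_workers)
    partials = [naive_sum(values[i:i + size]) for i in range(0, len(values), size)]
    return naive_sum(partials)

values = [1e16, 1.0, -1e16, 1.0]

print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))  # False
print(chunked_sum(values, 1))  # 1.0
print(chunked_sum(values, 2))  # 0.0
```

The same four numbers sum to 1.0 on one simulated worker and to 0.0 on two, purely because the grouping changed.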
Alignment teams can't study what a training change did vs. what was floating-point noise. This blocks interpretability research.
How Carbon Works
Every non-deterministic op is replaced with a deterministic equivalent:
| Operation | Standard | Carbon |
|---|---|---|
| Summation | Non-associative accumulation | Kahan compensated, canonical sorted order |
| MatMul | cuBLAS (algorithm varies) | Tiled with fixed reduction order + Kahan |
| AllReduce | NCCL (arrival order varies) | AllGather + local reduce in rank order |
| Scatter | Atomic race conditions | Sorted index operations |
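Two rows of the table can be illustrated without a GPU. The following is a minimal sketch, not Carbon's implementation: `kahan_sum` is textbook Kahan compensated summation, and `rank_ordered_allreduce` is a hypothetical stand-in for "AllGather + local reduce in rank order" that accepts per-rank contributions in whatever order they arrive and reduces them in ascending rank order.

```python
import random

def kahan_sum(values):
    """Kahan compensated summation over values in the given order."""
    total = 0.0
    comp = 0.0  # running estimate of the low-order bits lost so far
    for v in values:
        y = v - comp
        t = total + y
        comp = (t - total) - y
        total = t
    return total

def rank_ordered_allreduce(arrivals):
    """Hypothetical helper. arrivals: list of (rank, gradient_list) in arrival order."""
    gathered = [grad for _, grad in sorted(arrivals, key=lambda item: item[0])]
    # Reduce each gradient element across ranks, always in ascending rank order.
    return [kahan_sum(column) for column in zip(*gathered)]

rng = random.Random(0)
arrivals = [(rank, [rng.uniform(-1e8, 1e8) for _ in range(4)]) for rank in range(8)]

reference = rank_ordered_allreduce(arrivals)
for _ in range(5):
    rng.shuffle(arrivals)  # simulate a different arrival order
    assert rank_ordered_allreduce(arrivals) == reference
```

Kahan summation narrows the rounding error but is still order-dependent, so in this sketch it is the fixed rank order that makes the result identical from run to run.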
Install
pip install -e ".[dev]"
Quick Start
import carbon
carbon.enable(seed=42)
# training is now bit-exact deterministic
for batch in dataloader:
    optimizer.zero_grad()  # clear gradients left over from the previous step
    loss = model(batch)
    loss.backward()
    optimizer.step()
Full Wrapper
from carbon import DeterministicTrainer

def loss_fn(m, b):
    return m(b).loss

trainer = DeterministicTrainer(model, optimizer, seed=42)
for batch in dataloader:
    loss = trainer.step(batch, loss_fn=loss_fn)
# prove it
assert trainer.verify_determinism(batch, loss_fn)
Overhead
| Scale | Overhead | Notes |
|---|---|---|
| 500K param toy model | 1.07x | Matmul is small, overhead negligible |
| GPT-2 124M (60M trainable) | 10.1x | fp64 tiled matmul dominates |
The overhead is the price of bit-exact cross-architecture determinism. The fp64 Kahan-compensated tiled matmul is slower than cuBLAS because it trades speed for reproducibility. For alignment research, debugging, and auditing where reproducibility is non-negotiable — worth it. For production training — wait for v0.2 with CUDA-native deterministic kernels.
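To make the cost concrete, here is a minimal pure-Python sketch of a tiled matmul with a fixed reduction order and Kahan compensation across tiles. It is an illustration under assumptions, not Carbon's kernel: the name `tiled_matmul`, the tile size, and the choice to compensate across tiles rather than within them are all hypothetical. Python floats are fp64, which matches the precision named above.

```python
def tiled_matmul(a, b, tile=2):
    """Hypothetical sketch: C = A @ B with a fixed, ascending k-tile order."""
    n, k, m = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions do not match")
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            total = 0.0
            comp = 0.0  # Kahan compensation term carried across tiles
            for k0 in range(0, k, tile):  # tiles are always visited in the same order
                partial = 0.0
                for kk in range(k0, min(k0 + tile, k)):
                    partial += a[i][kk] * b[kk][j]
                y = partial - comp
                t = total + y
                comp = (t - total) - y
                total = t
            out[i][j] = total
    return out

a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
b = [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]]
print(tiled_matmul(a, b))  # [[58.0, 64.0], [139.0, 154.0]]
```

Each accumulation step carries extra subtractions, and the k-loop is strictly serial so that the order never changes. Both hint at why such a kernel gives up throughput relative to a library that is free to reorder the reduction.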
Proven Results — RTX 4090 vs RTX 5090
Different GPU architectures (Ada Lovelace vs Blackwell). Same seed. Same data. 50 and 200 training steps.
Standard PyTorch
| GPU | Weight Hash | Loss |
|---|---|---|
| RTX 4090 | 6a6e2bc1b29e831b... | 0.069844 |
| RTX 5090 | 46681ef8c8420252... | 0.069844 |
Same loss, DIFFERENT weights. The models converged to different local configurations because cuBLAS picked different internal algorithms on different silicon. You can't reproduce this run.
Carbon
| GPU | Weight Hash | Optimizer Hash | Loss |
|---|---|---|---|
| RTX 4090 (50 steps) | 62118e9c641a0150... | n/a | 0.070026 |
| RTX 5090 (50 steps) | 62118e9c641a0150... | n/a | 0.070026 |
| RTX 4090 (200 steps) | e2aa1052f4a9dcf3... | 39dc99f0f803efe7... | 0.045067 |
| RTX 5090 (200 steps) | e2aa1052f4a9dcf3... | 39dc99f0f803efe7... | 0.045067 |
Identical weights. Identical optimizer state. Identical loss. Different silicon.
Carbon achieved what PyTorch could not: bit-exact reproducible training across heterogeneous GPU architectures.
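A comparison like the tables above hinges on hashing the raw bits of every parameter. The sketch below shows one way to do that with the standard library. It is an assumption about the general approach, not the hashing scheme Carbon uses. `weight_hash` is a hypothetical helper that packs each value as a little-endian fp64, walks parameters in sorted-name order, and truncates the SHA-256 digest to 16 hex characters to mirror the shortened hashes shown above.

```python
import hashlib
import math
import struct

def weight_hash(params):
    """Hypothetical helper. params: dict mapping parameter name -> list of floats."""
    digest = hashlib.sha256()
    for name in sorted(params):  # fixed iteration order
        digest.update(name.encode("utf-8"))
        for value in params[name]:
            digest.update(struct.pack("<d", value))  # exact fp64 bit pattern
    return digest.hexdigest()[:16]

run_a = {"layer.weight": [0.25, -1.5, 3.0], "layer.bias": [0.1]}
run_b = {"layer.bias": [0.1], "layer.weight": [0.25, -1.5, 3.0]}
# Same values except one weight nudged by a single unit in the last place.
run_c = {"layer.weight": [0.25, -1.5, math.nextafter(3.0, 4.0)], "layer.bias": [0.1]}

print(weight_hash(run_a) == weight_hash(run_b))  # True
print(weight_hash(run_a) == weight_hash(run_c))  # False
```

Because the hash covers exact bit patterns, a difference of one unit in the last place in a single weight is enough to change it, which is what makes matching hashes a meaningful check of bit-exactness.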
Citation
@article{sharma2026carbon,
  title={Carbon: Bit-Exact Deterministic Training Across Heterogeneous Hardware},
  author={Sharma, Tushar},
  year={2026},
  url={https://github.com/TxsharDev/carbon}
}
License
Apache-2.0 — Alia Labs