GPU exact arithmetic - 512-bit precision, zero accumulation error

SimGen VLA - Zero-Error GPU Arithmetic

Drop-in PyTorch replacement with exact arithmetic. 512-bit precision (configurable to 16,384-bit). No accumulation error. Ever.

License Required - Internal demonstration only. Contact kyle@simgen.dev for licensing.

Support development: ko-fi.com/kyleclouthier


The Problem: Floating-Point Lies

GPU floating-point arithmetic rounds the result of nearly every operation. Those rounding errors can compound silently until your results are wrong.

import torch

# Classic floating-point failure (default dtype is float32)
x = torch.tensor([1e16, 1.0, -1e16])
print(x.sum())  # tensor(0.)  <- WRONG! Should be 1.0

# 10 million additions - rounding error accumulates
values = torch.ones(10_000_000) * 0.1
print(values.sum())  # should be 1000000.0; the actual digits depend on dtype, device, and reduction order

This affects: financial calculations, scientific simulations, physics engines, signal processing, cryptography, and any computation requiring precision.
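
The first failure above is absorption followed by cancellation, and it can be reproduced without a GPU. The sketch below uses only Python's standard library (float64 rather than PyTorch's default float32) and compares a left-to-right float sum with two exact references, math.fsum and fractions.Fraction. It illustrates the problem only and makes no claim about how VLA works internally.

```python
import math
from fractions import Fraction

values = [1e16, 1.0, -1e16]

# Left-to-right float64 accumulation: 1e16 + 1.0 rounds back to 1e16,
# so the 1.0 is lost before the two large terms cancel.
naive = 0.0
for v in values:
    naive += v

# math.fsum tracks the lost low-order bits and returns the correctly rounded sum.
compensated = math.fsum(values)

# Fractions represent each float exactly, so this sum involves no rounding.
exact = sum(Fraction(v) for v in values)

print(naive, compensated, exact)  # 0.0 1.0 1
```

An explicit loop is used for the naive sum because the built-in sum() applies its own compensation to floats in recent Python versions.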


The Solution: SimGen VLA

from simgen import vla

# Exact arithmetic - mathematically correct
x = vla.tensor([1e16, 1.0, -1e16])
print(x.sum())  # 1.0  <- CORRECT!

# 10 million additions - still exact
values = vla.ones(10_000_000) * 0.1
print(values.sum())  # 1000000.0  <- EXACTLY correct

Same PyTorch-style API: import vla and call vla.* wherever you would call torch.*. No other code changes.


Installation

pip install simgen-vla

Requirements:

  • Python 3.10, 3.11, or 3.12
  • PyTorch 2.0+ with CUDA
  • CuPy (matching your CUDA version: pip install cupy-cuda11x or cupy-cuda12x)
  • NVIDIA GPU (Pascal through Hopper: sm_60 to sm_90)

Platforms: Windows, Linux


What's New in v6.2

  • Cross-GPU Reproducibility: manual_seed() produces bit-identical results across all supported GPU architectures (Pascal through Hopper)
  • Deterministic Random: randn() and rand() use numpy-based RNG for cross-platform consistency
  • 94 GPU Operations: Complete arithmetic, linear algebra, and activation suite
  • 512-bit Precision: 8-limb fixed-point architecture (configurable up to 16,384 bits)
  • Custom CUDA Kernels: Every operation has a dedicated kernel - no library dependencies
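
To make the "8-limb" wording concrete: a 512-bit fixed-point number can be stored as one large integer (the value multiplied by a fixed power of two) split into eight 64-bit words, and adding two such numbers is integer addition with a carry passed from limb to limb. The following is a minimal sketch under those assumptions. The limb width, the little-endian order, and the helper names are illustrative choices, not SimGen's actual layout or kernels.

```python
LIMB_BITS = 64
NUM_LIMBS = 8                      # 8 limbs x 64 bits = 512 bits
LIMB_MASK = (1 << LIMB_BITS) - 1


def to_limbs(n: int) -> list[int]:
    """Split a non-negative integer below 2**512 into little-endian limbs."""
    return [(n >> (LIMB_BITS * i)) & LIMB_MASK for i in range(NUM_LIMBS)]


def from_limbs(limbs: list[int]) -> int:
    """Reassemble the integer that a list of limbs represents."""
    return sum(limb << (LIMB_BITS * i) for i, limb in enumerate(limbs))


def add_limbs(a: list[int], b: list[int]) -> list[int]:
    """Add limb by limb, carrying overflow from each limb into the next."""
    out, carry = [], 0
    for x, y in zip(a, b):
        s = x + y + carry
        out.append(s & LIMB_MASK)
        carry = s >> LIMB_BITS
    return out  # a carry out of the top limb is dropped (wraps modulo 2**512)


# A carry ripples from limb 0 into limb 1.
print(add_limbs(to_limbs(LIMB_MASK), to_limbs(1))[:2])  # [0, 1]
```

Because every limb addition is an integer operation, the result is exact as long as the true sum fits in 512 bits.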

Proprietary Technology

SimGen VLA is deep tech. This is not a wrapper around existing libraries.

  • Novel Algorithms: Proprietary error-free arithmetic developed from first principles
  • 94 Custom CUDA Kernels: Each operation (sum, matmul, exp, softmax, etc.) has its own handwritten kernel
  • Multi-Limb Architecture: Extends precision beyond hardware limits using proprietary accumulation methods
  • Precompiled Binaries: Optimized for 6 GPU architectures (sm_60 through sm_90)

No other library provides true zero-error GPU arithmetic at this scale.


Why This Matters

Standard FP64 arithmetic accumulates errors silently. VLA eliminates this entirely.

Domain                | Problem                                        | VLA Solution
Financial             | Rounding errors compound across transactions   | Exact to the penny
Scientific Simulation | Results drift over long runs                   | Deterministic, reversible
Quantum Computing     | Unitarity degrades with operations             | Preserved exactly
ML Training           | Gradient accumulation noise                    | Clean gradients

Reported: a Lorenz attractor integrated 10,000 steps forward and then 10,000 steps backward returns to its initial state exactly, whereas standard FP64 diverges completely.


Use Cases

Financial Computing

Mixed-magnitude calculations where every cent matters:

from simgen import vla

# Portfolio with massive range - standard FP loses the pennies
positions = vla.tensor([
    1_000_000_000.00,   # $1 billion position
    0.01,                # 1 cent transaction fee
    -999_999_999.99,     # Large short position
    50_000.50,           # Medium holding
])

total = positions.sum()
print(f"Portfolio: ${float(total):,.2f}")  # $50,000.52 - exact!
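
The expected total can be cross-checked without VLA. The sketch below computes the exact sum with Python's decimal module and, for contrast, emulates single-precision accumulation by rounding to float32 after every addition. The f32 helper is a hypothetical name introduced here for illustration.

```python
import struct
from decimal import Decimal


def f32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


amounts = ["1000000000.00", "0.01", "-999999999.99", "50000.50"]

# Exact reference: Decimal keeps every cent.
exact_total = sum(Decimal(a) for a in amounts)

# Single-precision accumulation, rounding after every addition.
f32_total = 0.0
for a in amounts:
    f32_total = f32(f32_total + f32(float(a)))

print(exact_total)  # 50000.52
print(f32_total)    # 50000.5
```

Near one billion, adjacent float32 values are 64 apart, so both the one-cent fee and the 0.01 difference between the two large positions vanish.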

Scientific Simulation

Physics simulations that don't drift over time:

from simgen import vla

# Chaotic system (Lorenz attractor)
def lorenz_step(state, dt=0.01):
    x, y, z = state[0], state[1], state[2]
    sigma, rho, beta = 10.0, 28.0, 8.0/3.0

    dx = sigma * (y - x)
    dy = x * (rho - z) - y
    dz = x * y - beta * z

    return vla.tensor([x + dx * dt, y + dy * dt, z + dz * dt])

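# Run forward, then step back by negating dt.
# Caveat: an explicit Euler step with -dt is not the algebraic inverse of the
# step with +dt, so exact arithmetic removes rounding error but not the
# O(dt^2) mismatch of the scheme itself.
state = vla.tensor([1.0, 1.0, 1.0])
initial = state.clone()

for _ in range(10000):
    state = lorenz_step(state, dt=0.01)
for _ in range(10000):
    state = lorenz_step(state, dt=-0.01)

error = (state - initial).abs().sum()
print(f"Reversal error: {float(error)}")  # reported as 0.0 with VLA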
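
When checking a reversal experiment like this one, it helps to separate rounding error from the error of the integration scheme. The sketch below repeats a single forward and backward Euler step in exact rational arithmetic using Python's fractions module, so any residual it reports cannot come from rounding. It is an independent illustration on one step and does not use or model VLA.

```python
from fractions import Fraction


def lorenz_step(state, dt):
    """One explicit Euler step of the Lorenz system in exact rational arithmetic."""
    x, y, z = state
    sigma, rho, beta = Fraction(10), Fraction(28), Fraction(8, 3)
    dx = sigma * (y - x)
    dy = x * (rho - z) - y
    dz = x * y - beta * z
    return (x + dx * dt, y + dy * dt, z + dz * dt)


dt = Fraction(1, 100)
initial = (Fraction(1), Fraction(1), Fraction(1))

forward = lorenz_step(initial, dt)
back = lorenz_step(forward, -dt)

# No rounding occurred, so any residual comes from the Euler scheme itself.
residual = [b - a for a, b in zip(initial, back)]
print(residual[0])  # -13/500
```

A time-symmetric integrator would be needed for the backward pass to undo the forward pass exactly.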

Linear Algebra

Exact matrix decompositions and solvers:

from simgen import vla

# Matrix operations
A = vla.randn((100, 100))
B = vla.randn((100, 100))
C = vla.matmul(A, B)  # Exact matrix multiply

# LU Decomposition
L, U = vla.lu(A)

# QR Decomposition
Q, R = vla.qr(A)

# Eigenvalues (power iteration)
eigenvalue, eigenvector = vla.eig(A)

# Matrix inverse and determinant
A_inv = vla.inv(A)
det = vla.det(A)

# Solve linear system: Ax = b
b = vla.randn((100,))  # right-hand side (example values)
x = vla.solve(A, b)
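
For intuition about what an exact solve means, here is a small Gauss-Jordan elimination over Python's fractions.Fraction, where every intermediate value is an exact rational. This is a CPU-only sketch for tiny systems. The function name solve_exact is hypothetical, and the method is not claimed to be the one vla.solve uses.

```python
from fractions import Fraction


def solve_exact(A, b):
    """Solve A x = b exactly by Gauss-Jordan elimination over the rationals."""
    n = len(A)
    # Augmented matrix [A | b] with exact rational entries.
    M = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        # Pick any row with a non-zero pivot (raises StopIteration if A is singular).
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


A = [[2, 1], [1, 3]]
b = [3, 5]
print(solve_exact(A, b))  # [Fraction(4, 5), Fraction(7, 5)]
```

Substituting the result back into A x reproduces b exactly, which is the property a floating-point solver can only approximate.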

Signal Processing

Convolutions with exact arithmetic:

from simgen import vla

# 2D Convolution
signal = vla.randn((1, 3, 64, 64))
kernel = vla.randn((16, 3, 3, 3))
output = vla.conv2d(signal, kernel)

Complete API Reference

Tensor Creation

from simgen import vla

x = vla.tensor([1.0, 2.0, 3.0])       # From list
z = vla.zeros((3, 3))                  # Zeros
o = vla.ones((100,))                   # Ones
r = vla.randn((10, 10))                # Random normal
u = vla.rand((5, 5))                   # Random uniform [0,1]
a = vla.arange(0, 10)                  # Range [0,1,2,...,9]
l = vla.linspace(0, 1, 100)            # 100 points from 0 to 1
I = vla.eye(5)                         # 5x5 identity matrix

# Cross-GPU reproducibility
vla.manual_seed(42)                    # Set seed for deterministic results
r = vla.randn((1024, 1024))            # Same result on ANY GPU
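
One plausible way to get device-independent random tensors is to draw the samples on the host from a seeded generator and then copy them to the GPU, so the GPU model never influences the values. The sketch below shows only the host-side part, using Python's standard random module as a stand-in for the numpy-based generator mentioned above. The helper name seeded_normals is hypothetical.

```python
import random


def seeded_normals(seed: int, count: int) -> list[float]:
    """Draw `count` standard-normal samples from a private, seeded generator."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(count)]


run_a = seeded_normals(42, 5)
run_b = seeded_normals(42, 5)
print(run_a == run_b)  # True
```

Using a private generator object rather than the module-level functions keeps the sequence independent of any other code that draws random numbers.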

Arithmetic Operations

c = a + b          # Exact addition
c = a - b          # Exact subtraction
c = a * b          # Exact multiplication
c = a / b          # Exact division
c = -a             # Negation
c = a ** 2         # Power

Reductions (Zero Drift)

total = vla.sum(x)         # Exact sum
avg = vla.mean(x)          # Exact mean
product = vla.prod(x)      # Exact product
minimum = vla.min(x)       # Minimum
maximum = vla.max(x)       # Maximum
std_dev = vla.std(x)       # Standard deviation
variance = vla.var(x)      # Variance
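
Variance is a reduction where floating-point drift is easy to provoke. The sketch below, which is independent of VLA, evaluates the one-pass formula in float64 on data with a large common offset and compares it with an exact computation over fractions.Fraction. It assumes the sample variance (division by n - 1); whether vla.var uses the same convention is not stated here.

```python
from fractions import Fraction

data = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]
n = len(data)

# One-pass float64 formula: subtracts two nearly equal numbers around 4e18.
naive_var = (sum(v * v for v in data) - n * (sum(data) / n) ** 2) / (n - 1)

# Two-pass formula over Fractions: no rounding at any step.
exact = [Fraction(v) for v in data]
mean = sum(exact) / n
exact_var = sum((v - mean) ** 2 for v in exact) / (n - 1)

print(exact_var)  # 30
print(naive_var)  # differs from 30 because each square is already rounded
```

The squares are near 1e18, where adjacent float64 values are 128 apart, so the true numerator of 90 cannot survive the subtraction.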

Linear Algebra

C = vla.matmul(A, B)       # Matrix multiplication
C = vla.mm(A, B)           # Matrix-matrix multiply
y = vla.mv(A, x)           # Matrix-vector multiply
d = vla.dot(a, b)          # Dot product
C = vla.bmm(A, B)          # Batched matrix multiply
L, U = vla.lu(A)           # LU decomposition
Q, R = vla.qr(A)           # QR decomposition
e, v = vla.eig(A)          # Eigenvalue (power iteration)
det = vla.det(A)           # Determinant
inv = vla.inv(A)           # Matrix inverse
x = vla.solve(A, b)        # Solve Ax = b

Math Functions

y = vla.exp(x)             # Exponential
y = vla.log(x)             # Natural log
y = vla.sqrt(x)            # Square root
y = vla.abs(x)             # Absolute value
y = vla.sin(x)             # Sine
y = vla.cos(x)             # Cosine
y = vla.tan(x)             # Tangent
y = vla.tanh(x)            # Hyperbolic tangent
y = vla.sigmoid(x)         # Sigmoid

Activations

y = vla.relu(x)            # ReLU
y = vla.gelu(x)            # GELU
y = vla.silu(x)            # SiLU/Swish
y = vla.softmax(x)         # Softmax

Shape Operations

y = vla.reshape(x, (2, 3))       # Reshape
y = vla.transpose(x, 0, 1)       # Transpose dims
y = vla.squeeze(x)               # Remove size-1 dims
y = vla.unsqueeze(x, 0)          # Add dimension
y = vla.stack([a, b, c])         # Stack tensors
y = vla.cat([a, b])              # Concatenate

Exact Output

# Get TRUE exact value as Python Decimal
result = x.sum()
exact_value = result.to_decimal()  # Decimal('1.0') - mathematically exact

# SHA256 checksum for verification
hash_val = result.checksum()       # Verify across systems
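
A checksum is only useful across systems if both sides hash the same canonical representation of the value. As a minimal sketch of that idea, the code below hashes a normalized fixed-point rendering of a Decimal with SHA-256. The canonical form and the function name decimal_checksum are assumptions made for illustration; the format that checksum() actually hashes is not documented here.

```python
import hashlib
from decimal import Decimal


def decimal_checksum(value: Decimal) -> str:
    """SHA-256 of a canonical fixed-point rendering of `value` (hypothetical scheme)."""
    # normalize() strips trailing zeros; the "f" format avoids exponent notation.
    canonical = format(value.normalize(), "f")
    return hashlib.sha256(canonical.encode("ascii")).hexdigest()


print(decimal_checksum(Decimal("1.0")) == decimal_checksum(Decimal("1.00")))  # True
```

Normalizing first means numerically equal values such as 1.0 and 1.00 produce the same digest.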

Supported GPUs

Architecture | Example GPUs                | Compute Capability
Pascal       | GTX 1080, P100, P40         | sm_60, sm_61
Volta        | V100, Titan V               | sm_70
Turing       | RTX 2080, T4, Quadro RTX    | sm_75
Ampere       | RTX 3090, A100, A10         | sm_80, sm_86
Ada Lovelace | RTX 4090, 4080, 4070, L40   | sm_89
Hopper       | H100, H200                  | sm_90

Cloud Support: AWS (P3, P4, G4, G5), GCP (T4, A100, L4), Azure (NC, ND series), Kaggle (T4 x2 free), Colab


Benchmarks

Operation       | Elements        | PyTorch Error  | VLA Error
Sum             | 10M             | 10^-7 relative | 0.0
Dot Product     | 1M              | 10^-8 relative | 0.0
Matrix Multiply | 1000x1000       | 10^-6 relative | 0.0
Chained Ops     | 1000 iterations | Diverges       | Exact
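
The "relative" figures above are relative errors, |computed - exact| / |exact|. The sketch below shows one way to measure such an error against an exact rational reference on a much smaller problem, using float64 on the CPU. It is not the benchmark harness behind the table, and the helper name relative_error is hypothetical.

```python
from fractions import Fraction


def relative_error(computed: float, exact: Fraction) -> float:
    """|computed - exact| / |exact|, evaluated in rational arithmetic."""
    return float(abs(Fraction(computed) - exact) / abs(exact))


n = 10_000
term = 0.1

# Left-to-right float64 accumulation.
total = 0.0
for _ in range(n):
    total += term

# Exact sum of the same float64 inputs (0.1 is itself not exactly 1/10).
reference = n * Fraction(term)

print(relative_error(total, reference))
```

Taking the reference from the float inputs, rather than from the decimal value 1/10, isolates accumulation error from the error of representing 0.1 in binary.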

FAQ

Q: Is this slower than PyTorch?
A: Yes. The overhead is typically 2-5x, a cost that is usually acceptable for applications where correctness matters.

Q: What about CPU?
A: GPU required. VLA's exact arithmetic relies on native CUDA kernels, so there is no CPU support.

Q: Can I verify results across systems?
A: Yes. Use to_decimal() for exact values or checksum() for verification.

Q: Are random numbers reproducible across different GPUs?
A: Yes. Call vla.manual_seed(42) before generating random tensors. The same seed produces bit-identical results on an RTX 4070, Tesla T4, A100, H100, or any other supported GPU.


Support & Contact

Website: simgen.dev

Support Development: ko-fi.com/kyleclouthier

Email: kyle@simgen.dev

GitHub: github.com/DigitalMax321/simgen


License

Proprietary. License required for all use. Contact kyle@simgen.dev for licensing.

(c) 2025-2026 Clouthier Simulation Labs. All rights reserved.


Download files

Source Distributions

No source distribution files available for this release.

Built Distributions

simgen_vla-6.2.4-cp312-cp312-win_amd64.whl (5.1 MB)
CPython 3.12, Windows x86-64

simgen_vla-6.2.4-cp312-cp312-manylinux_2_17_x86_64.whl (7.1 MB)
CPython 3.12, manylinux (glibc 2.17+), x86-64

simgen_vla-6.2.4-cp311-cp311-win_amd64.whl (5.1 MB)
CPython 3.11, Windows x86-64

File details

Details for the file simgen_vla-6.2.4-cp312-cp312-win_amd64.whl.

File metadata

  • Download URL: simgen_vla-6.2.4-cp312-cp312-win_amd64.whl
  • Upload date:
  • Size: 5.1 MB
  • Tags: CPython 3.12, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for simgen_vla-6.2.4-cp312-cp312-win_amd64.whl
Algorithm Hash digest
SHA256 4dde8f31b856cf90c6e75bc24b169f09cf5ca39579e49f5fd6e86f66d90719f8
MD5 33de43ccfc138963f0994cd42421cc21
BLAKE2b-256 00048b5b90c994bafe0ec858476c054dec33b85cae60aaa837a83bf68da8ba0b


File details

Details for the file simgen_vla-6.2.4-cp312-cp312-manylinux_2_17_x86_64.whl.

File metadata

File hashes

Hashes for simgen_vla-6.2.4-cp312-cp312-manylinux_2_17_x86_64.whl
Algorithm Hash digest
SHA256 063e4523a159ce55754fc2856d834b05b09c302fbc7a104917693c5823d86bd3
MD5 8d725240379f8b04d3c96f0404975017
BLAKE2b-256 4fdc0090e93d879d2b7c7aba0bfe8bfb34f93fec59df1c1acccf35d9f70ab07d


File details

Details for the file simgen_vla-6.2.4-cp311-cp311-win_amd64.whl.

File metadata

  • Download URL: simgen_vla-6.2.4-cp311-cp311-win_amd64.whl
  • Upload date:
  • Size: 5.1 MB
  • Tags: CPython 3.11, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for simgen_vla-6.2.4-cp311-cp311-win_amd64.whl
Algorithm Hash digest
SHA256 dc597235b9323c7c6be718772e0db9e138c5eda675ef41aabc708774e1d4fd5c
MD5 3490c545338ff1364f5827da912be772
BLAKE2b-256 38524dd40c362f52b3cf715e8883bfa10436f866e67a64f0a07f501518843921

