Pure-Rust GPU compute substrate with Python bindings. cuda-oxide-compiled Stockham FFT.

ferrum-gpu

Pure-Rust GPU compute substrate with Python bindings. FFT kernels run on NVIDIA GPUs today via cuda-oxide (Rust source compiled to PTX, no CUDA C). Cross-vendor support via spirv-oxide → Vulkan is on the v0.2 roadmap.

This is v0.1.0. The workspace ships:

ferrum-gpu-core: Backend trait, KernelArtifact, errors. no_std + alloc.
ferrum-gpu-cuda: impl Backend for Cuda over cudarc 0.19.
ferrum-gpu: facade with Device<B> and Buffer<T, B>.
ferrum-gpu-fft: 1D + 2D radix-2 power-of-2 C2C FFT host scaffolding + CPU Stockham reference.
ferrum-gpu-py: Python bindings via PyO3 + maturin. Exposes a persistent ferrum_gpu.cuda.Device(0) handle plus ferrum_gpu.fft.fft_1d_c2c_pow2 and ferrum_gpu.fft.fft_2d_c2c_pow2.
ferrum-gpu-bench: cuFFT comparison binary (1D, batched).
examples/vector-add: end-to-end demo using hand-written PTX through the substrate.
examples/vector-add-cuda-oxide: same kernel in Rust, compiled to PTX by cuda-oxide.
examples/fft-1d-c2c: 1D Stockham FFT in Rust, GPU-vs-CPU on 8 cases (N from 4 to 4096, batched, forward + inverse).

29 GPU pytest cases are verified end-to-end against numpy.fft.fft / numpy.fft.fft2 (1D: 16 cases, 2D: 13 cases), with relative-error tolerances between 1e-4 and 1e-3.

Requirements

Linux x86_64
CUDA Toolkit 13.x
NVIDIA driver compatible with the installed Toolkit
Rust nightly 2026-04-03 (pinned via rust-toolchain.toml)
cargo-oxide: cargo install --git https://github.com/NVlabs/cuda-oxide.git cargo-oxide
For the Python bindings: Python 3.10+ with maturin + numpy + pytest

Quick start: vector-add via hand-written PTX

git clone https://github.com/alejandro-soto-franco/ferrum-gpu
cd ferrum-gpu
make example-vector-add

Expected:

vector_add: 1048576 elements verified

Quick start: vector-add via Rust source + cuda-oxide

cargo install --git https://github.com/NVlabs/cuda-oxide.git cargo-oxide
cargo oxide doctor       # one-time codegen-backend bootstrap
make example-vector-add-oxide

Expected:

vector_add (cuda-oxide): 1048576 elements verified

Quick start: 1D Stockham FFT

make example-fft

Runs 8 cases (N=4 through N=4096, batched, forward + inverse), each verified against a CPU Stockham reference within 1e-4 relative error.
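For readers unfamiliar with the Stockham formulation, here is a minimal CPU sketch of a radix-2 Stockham autosort FFT in pure Python. It is not the in-tree Rust reference: the function name stockham_fft is hypothetical, and the 1/N scaling on the inverse follows the numpy convention, which is an assumption about what ferrum-gpu does. The point is only to show the ping-pong between two buffers, which avoids a separate bit-reversal pass.

```python
import cmath


def stockham_fft(x, inverse=False):
    """Radix-2 Stockham autosort FFT of a power-of-two-length sequence.

    Hypothetical illustration, not the ferrum-gpu implementation.
    The inverse is scaled by 1/N (numpy convention; an assumption here).
    """
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    sign = 1.0 if inverse else -1.0
    src = [complex(v) for v in x]
    dst = [0j] * n
    length = n   # length of each sub-transform at this stage
    stride = 1   # number of interleaved sub-transforms
    while length > 1:
        half = length // 2
        for p in range(half):
            # Twiddle factor for butterfly index p at this stage.
            w = cmath.exp(sign * 2j * cmath.pi * p / length)
            for q in range(stride):
                a = src[q + stride * p]
                b = src[q + stride * (p + half)]
                dst[q + stride * (2 * p)] = a + b
                dst[q + stride * (2 * p + 1)] = (a - b) * w
        # Ping-pong: the output of this stage is the input of the next.
        src, dst = dst, src
        length = half
        stride *= 2
    if inverse:
        src = [v / n for v in src]
    return src
```

Each stage reads from one buffer and writes to the other in an order that leaves the final result naturally sorted, which is why Stockham variants are a common starting point for GPU kernels.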

Quick start: Python

uv is the recommended Python package manager; the Makefile targets and the wheel install path work the same way with pip for users who prefer it.

uv venv ~/.venvs/ferrum-gpu
source ~/.venvs/ferrum-gpu/bin/activate
uv pip install maturin pytest numpy
make develop                       # builds the cdylib + installs into the venv
python3 -c "
import numpy as np, ferrum_gpu as fg
arr = np.array([1+0j, 2+0j, 3+0j, 4+0j], dtype=np.complex64)
print(fg.fft.fft_1d_c2c_pow2(arr, log_n=2))
"

Pip equivalent:

python3 -m venv ~/.venvs/ferrum-gpu
source ~/.venvs/ferrum-gpu/bin/activate
pip install maturin pytest numpy
make develop
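The Python snippet in this quick start passes log_n=2 for a 4-element array, which suggests that log_n is the base-2 logarithm of the transform length. Under that assumption, a small helper can derive it and reject non-power-of-two lengths up front. The name log_n_for is hypothetical and not part of ferrum_gpu.

```python
def log_n_for(length: int) -> int:
    """Return log2(length), or raise if length is not a power of two.

    Hypothetical helper; assumes log_n means log2 of the transform length.
    """
    if length < 1 or length & (length - 1):
        raise ValueError(f"{length} is not a power of two")
    return length.bit_length() - 1


# Assumed usage, mirroring the call shown in the quick start:
#   fg.fft.fft_1d_c2c_pow2(arr, log_n=log_n_for(arr.size))
```

For the input [1, 2, 3, 4], numpy.fft.fft returns [10, -2+2j, -2, -2-2j], so the quick-start output should match those values up to single-precision rounding.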

Run the pytest matrix:

make pytest

29 cases (16 1D + 13 2D), each compared against numpy.fft with a relative-error tolerance between 1e-4 and 1e-3.
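The exact error metric used by the test suite is not stated here. One common choice is the relative L2 error, sketched below in pure Python; the function name relative_error is hypothetical, and the metric itself is an assumption rather than a description of the actual tests.

```python
import math


def relative_error(actual, expected):
    """Relative L2 error ||actual - expected|| / ||expected||.

    Hypothetical metric; the project's tests may use a different one.
    Falls back to the absolute error when expected is all zeros.
    """
    if len(actual) != len(expected):
        raise ValueError("sequences must have the same length")
    num = math.sqrt(sum(abs(a - e) ** 2 for a, e in zip(actual, expected)))
    den = math.sqrt(sum(abs(e) ** 2 for e in expected))
    return num / den if den else num
```

Since complex64 carries roughly seven significant decimal digits and rounding error accumulates across FFT stages, a tolerance in the 1e-4 to 1e-3 range is plausible for single-precision transforms.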

Performance

make bench runs ferrum-gpu-bench, which times the in-tree, cuda-oxide-compiled Stockham radix-2 power-of-2 C2C kernel against cuFFT (via the cufft feature of cudarc 0.19). It covers batched 1D transforms at N in {256, 1024, 4096} with batch = 256, using 100 timed trials per size after a 10-trial warmup. Times are per-batch microseconds, measured on an RTX 5060 Laptop (sm_120):

N	ferrum_us	cufft_us	ratio (ferrum / cuFFT)
256	0.089	0.016	5.52
1024	0.162	0.059	2.72
4096	0.548	0.080	6.86

The Stockham kernel is a single-block-per-FFT reference implementation with no radix-4 or warp-specialised stages, so cuFFT's vendor-tuned plan wins outright at these sizes. Closing the gap is on the v0.2 roadmap.
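The warmup-then-measure methodology described above can be sketched in a few lines of Python. This is not the ferrum-gpu-bench source: the names mean_time_us and slowdown are hypothetical, and the callable passed in is assumed to block until the GPU work has finished, otherwise only the launch overhead would be timed.

```python
import time


def mean_time_us(run_once, trials=100, warmup=10):
    """Mean wall-clock microseconds per call of run_once.

    Hypothetical harness. run_once is assumed to synchronise with the
    device before returning. Warmup calls are executed but not timed.
    """
    for _ in range(warmup):
        run_once()
    start = time.perf_counter()
    for _ in range(trials):
        run_once()
    elapsed = time.perf_counter() - start
    return elapsed / trials * 1e6


def slowdown(ferrum_us, cufft_us):
    """Ratio of the two timings; values above 1 mean cuFFT is faster."""
    return ferrum_us / cufft_us
```

The ratio column in the table is consistent with ferrum_us divided by cufft_us, up to rounding of the displayed timings (for example, 0.548 / 0.080 = 6.85 against the listed 6.86).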

Testing

CPU-only tests: make test.

GPU tests + all examples + pytest (requires CUDA + NVIDIA GPU): make verify-all.

Publishing (PyPI wheel)

The public wheel is built inside a manylinux_2_28_x86_64 Docker image that ships CUDA Toolkit 13.x, the cuda-oxide-pinned Rust nightly, and maturin. The container is ~6-8 GB and takes ~15-25 minutes to build the first time.

make wheel-manylinux        # builds dist/ferrum_gpu-*-manylinux_2_28_x86_64.whl
auditwheel show dist/*.whl  # verify the manylinux tag

Publishing to PyPI is operator-driven (no CI):

# TestPyPI first
twine upload --repository testpypi dist/*.whl

# PyPI (requires a token in ~/.pypirc)
twine upload dist/*.whl

The local-build path (make develop + make wheel) produces a wheel tagged linux_x86_64 rather than manylinux. It is useful for local testing only.
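Besides auditwheel show, the platform tag can be read straight from the wheel filename, since wheel names follow the pattern distribution-version(-build)-python-abi-platform.whl. The sketch below splits a filename into those fields and checks for a manylinux tag. The function names are hypothetical; the example filename is the one listed for this release.

```python
def wheel_tags(filename):
    """Split a wheel filename into its name, version, and tag fields.

    Hypothetical helper based on the standard wheel filename layout.
    """
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel filename")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) not in (5, 6):  # 6 when an optional build tag is present
        raise ValueError("unexpected number of fields")
    python_tag, abi_tag, platform_tag = parts[-3:]
    return {
        "distribution": parts[0],
        "version": parts[1],
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }


def is_manylinux(filename):
    """True if any (dot-separated) platform tag starts with 'manylinux'."""
    platform = wheel_tags(filename)["platform"]
    return any(tag.startswith("manylinux") for tag in platform.split("."))


tags = wheel_tags("ferrum_gpu-0.1.0-cp310-abi3-manylinux_2_34_x86_64.whl")
```

A locally built wheel with a linux_x86_64 platform tag would fail the is_manylinux check, which makes this a cheap guard to run before uploading.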

License

Apache-2.0.

Project details

Maintainers

alejandrosotofranco

Release history

This version

0.1.0

May 27, 2026

Download files

Source Distributions

No source distribution files available for this release.

Built Distribution

ferrum_gpu-0.1.0-cp310-abi3-manylinux_2_34_x86_64.whl (360.6 kB)

Uploaded May 27, 2026. CPython 3.10+, manylinux: glibc 2.34+ x86-64

File details

Details for the file ferrum_gpu-0.1.0-cp310-abi3-manylinux_2_34_x86_64.whl.

File metadata

Download URL: ferrum_gpu-0.1.0-cp310-abi3-manylinux_2_34_x86_64.whl
Upload date: May 27, 2026
Size: 360.6 kB
Tags: CPython 3.10+, manylinux: glibc 2.34+ x86-64
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for ferrum_gpu-0.1.0-cp310-abi3-manylinux_2_34_x86_64.whl
Algorithm	Hash digest
SHA256	`3e55e80f7dccadf26ce6d1fa0565a5165590ba0d7185d285b1c2cefa50377ce5`
MD5	`7002c72f19bd754a82fe662e34f2f584`
BLAKE2b-256	`6d07e320ccf59d145569f6c9fb354d700b2e8eb0ac4def7e7cf3f6099d6ad776`
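A downloaded wheel can be checked against the SHA256 digest above with the standard library. This is a minimal sketch: the helper names and the local file path are hypothetical, while the digest is the one published in the table.

```python
import hashlib

# SHA256 digest published for the 0.1.0 wheel.
EXPECTED_SHA256 = "3e55e80f7dccadf26ce6d1fa0565a5165590ba0d7185d285b1c2cefa50377ce5"


def sha256_hex(chunks):
    """Hex SHA256 digest of an iterable of byte chunks."""
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def file_chunks(path, size=1 << 20):
    """Yield a file's contents in fixed-size chunks to bound memory use."""
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(size)
            if not chunk:
                return
            yield chunk


def wheel_matches(path, expected=EXPECTED_SHA256):
    """True if the file at path hashes to the expected digest."""
    return sha256_hex(file_chunks(path)) == expected


# Assumed usage, with a hypothetical download location:
#   wheel_matches("dist/ferrum_gpu-0.1.0-cp310-abi3-manylinux_2_34_x86_64.whl")
```

pip can perform the same check at install time when the requirement is pinned with a --hash option in a requirements file.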

Provenance

The following attestation bundles were made for ferrum_gpu-0.1.0-cp310-abi3-manylinux_2_34_x86_64.whl:

Publisher: release.yml on alejandro-soto-franco/ferrum-gpu

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: ferrum_gpu-0.1.0-cp310-abi3-manylinux_2_34_x86_64.whl
- Subject digest: 3e55e80f7dccadf26ce6d1fa0565a5165590ba0d7185d285b1c2cefa50377ce5
- Sigstore transparency entry: 1648646812
- Sigstore integration time: May 27, 2026
Source repository:
- Permalink: alejandro-soto-franco/ferrum-gpu@41625bee94c5178bb18d58b98d1975b539ef905a
- Branch / Tag: refs/heads/main
- Owner: https://github.com/alejandro-soto-franco
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@41625bee94c5178bb18d58b98d1975b539ef905a
- Trigger Event: workflow_dispatch
