
A high-performance tensor library with CUDA acceleration

Project description


⚡ Tensorax

A from-scratch tensor library with hand-written CUDA kernels.

No PyTorch. No NumPy. Pure C++/CUDA + Python.




Usage Guide · Architecture · Contributing · Examples




🔩   Zero heavy dependencies

Only pybind11 — no PyTorch, NumPy, or cuBLAS at runtime.

⚡   Hand-written CUDA kernels

6 matmul variants, 4 attention kernels, 14 element-wise ops — all from scratch.

🧠   Full autograd engine

Reverse-mode autodiff with gradient tracking through 18+ operations.

🎯   PyTorch-like API

Familiar Tensor, nn.Module, optim.Adam interface — minimal learning curve.

🧱   Batteries included

Linear, ReLU, LayerNorm, BatchNorm, Dropout, GQA, Flash Attention — ready to train.

📚   Built to learn from

Clean, readable implementation of a DL framework from first principles.
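
Reverse-mode autodiff, as in the engine above, boils down to each node remembering its inputs and their local derivatives, then sweeping the graph in reverse topological order. A toy scalar version of the idea in pure Python (an illustration of the technique, not Tensorax's implementation):

```python
# Toy scalar autograd: each Scalar stores (input, local_gradient) pairs,
# and backward() walks the graph in reverse topological order.

class Scalar:
    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0
        self._parents = parents  # tuple of (input Scalar, local gradient)

    def __add__(self, other):
        other = other if isinstance(other, Scalar) else Scalar(other)
        # d(a+b)/da = 1, d(a+b)/db = 1
        return Scalar(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        other = other if isinstance(other, Scalar) else Scalar(other)
        # d(a*b)/da = b, d(a*b)/db = a
        return Scalar(self.value * other.value,
                      ((self, other.value), (other, self.value)))

    def backward(self):
        # Topologically order the graph, then accumulate gradients backwards.
        order, seen = [], set()
        def visit(node):
            if node not in seen:
                seen.add(node)
                for parent, _ in node._parents:
                    visit(parent)
                order.append(node)
        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, local in node._parents:
                parent.grad += local * node.grad

x = Scalar(3.0)
y = Scalar(4.0)
z = x * y + x          # dz/dx = y + 1 = 5, dz/dy = x = 3
z.backward()
print(x.grad, y.grad)  # 5.0 3.0
```

The tensor case adds shapes and broadcasting, but the backward sweep has exactly this structure.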




Get Started

pip install tensorax
from tensorax import Tensor, nn, optim, functional as F

# Build
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.LayerNorm(8), nn.Linear(8, 3))
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Train (x_train, y_train: input and target Tensors; see docs/USAGE.md)
for epoch in range(100):
    loss = F.mse_loss(model(x_train), y_train)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Full usage guide with all APIs, code examples, and details: docs/USAGE.md




What's Inside


Core

  • Tensor with CPU ↔ CUDA
  • Broadcasting arithmetic
  • sum, mean with keepdim
  • reshape, transpose
  • exp, log, sqrt, pow
  • 13 dtype constants
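
"Broadcasting arithmetic" above follows the standard NumPy-style rule: align shapes from the right, and each dimension pair must be equal or contain a 1. The shape check can be sketched in a few lines (illustrative, not Tensorax's code):

```python
def broadcast_shape(a, b):
    """NumPy-style broadcast: pad the shorter shape with leading 1s,
    then each aligned dimension pair must match or contain a 1."""
    a = (1,) * max(0, len(b) - len(a)) + a
    b = (1,) * max(0, len(a) - len(b)) + b
    result = []
    for da, db in zip(a, b):
        if da != db and da != 1 and db != 1:
            raise ValueError(f"cannot broadcast {a} with {b}")
        result.append(max(da, db))
    return tuple(result)

print(broadcast_shape((3, 1), (1, 4)))     # (3, 4)
print(broadcast_shape((8, 1, 6), (7, 1)))  # (8, 7, 6)
```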

Neural Networks

  • Linear, Sequential
  • ReLU, Sigmoid, Tanh, Softmax
  • LayerNorm, RMSNorm, BatchNorm
  • Dropout
  • Module base class
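
The Module base class pattern (modules own parameters and sub-modules, and `parameters()` collects them recursively for the optimizer) can be sketched as follows; the class bodies here are simplified placeholders, not the library's actual definitions:

```python
class Parameter:
    """A trainable value with a gradient slot."""
    def __init__(self, data):
        self.data = data
        self.grad = None

class Module:
    """Minimal Module: parameters() recursively collects Parameters
    from this module's attributes and from any child Modules."""
    def parameters(self):
        params = []
        for value in self.__dict__.values():
            if isinstance(value, Parameter):
                params.append(value)
            elif isinstance(value, Module):
                params.extend(value.parameters())
        return params

class Linear(Module):
    def __init__(self, in_features, out_features):
        self.weight = Parameter([[0.0] * in_features for _ in range(out_features)])
        self.bias = Parameter([0.0] * out_features)

class MLP(Module):
    def __init__(self):
        self.fc1 = Linear(4, 8)
        self.fc2 = Linear(8, 3)

print(len(MLP().parameters()))  # 4: two weights + two biases
```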

Training

  • SGD with momentum
  • Adam with bias correction
  • MSE, cross-entropy, cross-entropy from logits
  • Autograd through 18+ ops
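
"Adam with bias correction" refers to the standard Adam update (Kingma & Ba): exponential moving averages of the gradient and squared gradient, each divided by `1 - beta^t` to correct the zero-initialization bias. For a single scalar parameter the step looks like this (a sketch of the algorithm, not Tensorax's optimizer code):

```python
import math

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter. t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # first moment (EMA of grad)
    v = beta2 * v + (1 - beta2) * grad * grad   # second moment (EMA of grad^2)
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Minimize f(p) = p^2 (gradient 2p) for a few steps.
p, m, v = 1.0, 0.0, 0.0
for t in range(1, 4):
    p, m, v = adam_step(p, grad=2.0 * p, m=m, v=v, t=t)
print(round(p, 6))
```

Without the `1 - beta^t` correction, early steps would be scaled down by the zero-initialized moments.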

Attention

  • Scaled dot-product attention
  • 4 CUDA kernels (naive → flash)
  • Grouped Query Attention
  • Causal & padding masks
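
Scaled dot-product attention computes `softmax(Q Kᵀ / √d) V`; a causal mask simply sets scores for future positions to −∞ before the softmax. A single-head, pure-Python sketch of the math (not the CUDA kernels themselves):

```python
import math

def sdpa(Q, K, V, causal=False):
    """softmax(Q K^T / sqrt(d)) V on plain lists; one head, one sequence."""
    d = len(Q[0])
    out = []
    for i, q in enumerate(Q):
        scores = [sum(qa * ka for qa, ka in zip(q, k)) / math.sqrt(d)
                  for k in K]
        if causal:  # query i may only attend to keys 0..i
            scores = [s if j <= i else float("-inf")
                      for j, s in enumerate(scores)]
        mx = max(scores)                       # subtract max for stability
        exps = [math.exp(s - mx) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v[c] for w, v in zip(weights, V))
                    for c in range(len(V[0]))])
    return out

Q = K = V = [[1.0, 0.0], [0.0, 1.0]]
print(sdpa(Q, K, V, causal=True)[0])  # row 0 attends only to itself -> [1.0, 0.0]
```

A flash-style kernel computes the same result but streams over K/V tiles, keeping a running max and sum instead of materializing the full score matrix.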

CUDA Kernels

  • 6 matmul implementations
  • 14 element-wise ops
  • Parallel reductions
  • Tiled + coalesced access
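
The tiling idea behind the matmul kernels above is the same whether the fast memory is CUDA shared memory or a CPU cache: compute the output in small blocks, so each loaded tile of A and B is reused many times. The loop structure, sketched in Python for readability (illustrative only, not the kernels themselves):

```python
def tiled_matmul(A, B, tile=2):
    """C = A @ B computed block by block: the k-loop is tiled so each
    (tile x tile) slab of A and B is reused for a whole output block,
    mirroring one shared-memory load per step in a CUDA tiled kernel."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):        # one tile load per iteration
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        for kk in range(k0, min(k0 + tile, k)):
                            C[i][j] += A[i][kk] * B[kk][j]
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(tiled_matmul(A, B))  # [[19.0, 22.0], [43.0, 50.0]]
```

In the CUDA versions, the inner block additionally maps one output element (or a small register tile) to each thread, and loads are arranged so consecutive threads touch consecutive addresses (coalescing).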

Infra

  • 400 tests, 98% coverage
  • CI/CD with GitHub Actions
  • pybind11 bindings
  • Automatic CUDA fallback



Performance

Matrix Multiplication — fp32, 3×1024×1024, 100 runs:

PyTorch CUDA (ref)         ████████████████████████████████████████████  0.41s  (4.51×)
Tensorax 1D Block Tiling   ██████████████████████████████████████████    0.95s  (2.31×)  ← best
Tensorax Tiled             ████████████████████████████████              1.22s  (1.80×)
NumPy CPU (baseline)       █████████████████████████                    1.85s  (1.00×)

2.31× faster than NumPy · 43% of PyTorch's cuBLAS kernels · all hand-written, zero library calls

Attention Kernels — 4 implementations from naive to flash, supporting arbitrary batch/heads, asymmetric sequence lengths, and optional masks.




Project Structure

csrc/                           C++ / CUDA backend
  cuda/kernels/                   elementwise · matmul (×6) · reduction · attention (×4)
  cpu/                            CPU fallback for all ops
  tensor_ops.{cpp,h}             pybind11 bindings

tensorax/                       Python package
  tensor.py                       Tensor class + autograd (1100 lines)
  functional.py                   F.relu, F.softmax, F.sdpa, ...
  nn/                             Linear, norms, dropout, attention, GQA
  optim.py                        SGD, Adam



Roadmap

✅ Core ops · autograd · NN layers · norms · optimizers · losses · attention (4 CUDA kernels) · GQA · matmul (6 variants)
🚧 Multi-head attention with projections · expanded benchmarking
🔮 Conv2D · MaxPool2D · GELU/Swish · AdamW · LR schedulers · indexing/slicing · serialization · multi-GPU · mixed precision · DDP



Documentation

Usage Guide: API reference, code examples, training patterns
Architecture: system design, kernel strategy, autograd internals
Development: build, test, contribute
Examples: runnable scripts for common tasks



Contributing

Fork → Branch → Commit → PR

See DEVELOPMENT.md for build instructions and guidelines.




Citation

@software{tensorax2025,
  title  = {Tensorax: Pure C++/CUDA Tensor Library},
  author = {Shrirang Mahajan},
  year   = {2025},
  url    = {https://github.com/NotShrirang/tensorax}
}



GitHub  ·  Issues  ·  Discussions

Built with ❤️ by @NotShrirang

⭐ Star if you find this useful

