
A high-performance tensor library with CUDA acceleration

Project description


⚡ Tensorax

A from-scratch tensor library with hand-written CUDA kernels.

No PyTorch. No NumPy. Pure C++/CUDA + Python.




Usage Guide · Architecture · Contributing · Examples




🔩   Zero heavy dependencies

Only pybind11 — no PyTorch, NumPy, or cuBLAS at runtime.

⚡   Hand-written CUDA kernels

6 matmul variants, 4 attention kernels, 14 element-wise ops — all from scratch.

🧠   Full autograd engine

Reverse-mode autodiff with gradient tracking through 18+ operations.

🎯   PyTorch-like API

Familiar Tensor, nn.Module, optim.Adam interface — minimal learning curve.

🧱   Batteries included

Linear, ReLU, LayerNorm, BatchNorm, Dropout, GQA, Flash Attention — ready to train.

📚   Built to learn from

Clean, readable implementation of a DL framework from first principles.




Get Started

pip install tensorax

from tensorax import Tensor, nn, optim, functional as F

# Build
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.LayerNorm(8), nn.Linear(8, 3))
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Train (x_train, y_train: input and target Tensors prepared beforehand)
for epoch in range(100):
    loss = F.mse_loss(model(x_train), y_train)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Full usage guide with all APIs, code examples, and details: docs/USAGE.md




What's Inside


Core

  • Tensor with CPU ↔ CUDA transfers
  • Broadcasting arithmetic
  • sum, mean with keepdim
  • reshape, transpose
  • exp, log, sqrt, pow
  • 13 dtype constants
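
A minimal sketch of these core ops, assuming the PyTorch-like conventions described above. The list-based constructor and the exact method signatures are illustrative, not verbatim API; see docs/USAGE.md for the real signatures.

from tensorax import Tensor

a = Tensor([[1.0, 2.0, 3.0],
            [4.0, 5.0, 6.0]])        # shape (2, 3); constructor shown is illustrative
b = Tensor([[10.0, 20.0, 30.0]])     # shape (1, 3), broadcasts across rows

c = a + b                            # broadcasting arithmetic
m = c.mean(dim=1, keepdim=True)      # reduction that keeps the reduced dim
r = c.reshape(3, 2).transpose(0, 1)  # shape manipulation
e = a.exp().log()                    # chained element-wise ops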

Neural Networks

  • Linear, Sequential
  • ReLU, Sigmoid, Tanh, Softmax
  • LayerNorm, RMSNorm, BatchNorm
  • Dropout
  • Module base class
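
A sketch of a custom layer built on the Module base class, assuming the usual PyTorch-like forward convention; names and structure here are illustrative.

from tensorax import nn, functional as F

class MLP(nn.Module):
    def __init__(self, d_in, d_hidden, d_out):
        super().__init__()
        self.fc1 = nn.Linear(d_in, d_hidden)
        self.norm = nn.LayerNorm(d_hidden)
        self.drop = nn.Dropout(0.1)
        self.fc2 = nn.Linear(d_hidden, d_out)

    def forward(self, x):
        h = F.relu(self.fc1(x))                   # activation via the functional API
        return self.fc2(self.drop(self.norm(h)))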

Training

  • SGD with momentum
  • Adam with bias correction
  • MSE, CrossEntropy, CE from logits
  • Autograd through 18+ ops
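
A sketch of one training step with SGD. F.mse_loss is verbatim from Get Started; the list-based Tensor constructor and SGD's momentum keyword are assumptions.

from tensorax import Tensor, nn, optim, functional as F

x_batch = Tensor([[0.0] * 16])                # toy input; list constructor assumed
y_batch = Tensor([[1.0, 0.0, 0.0, 0.0]])      # toy target

model = nn.Sequential(nn.Linear(16, 4))
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)  # momentum kwarg assumed

loss = F.mse_loss(model(x_batch), y_batch)    # loss from Get Started
optimizer.zero_grad()                         # clear accumulated gradients
loss.backward()                               # reverse-mode autodiff through the graph
optimizer.step()                              # SGD update with momentum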

Attention

  • Scaled dot-product attention
  • 4 CUDA kernels (naive → flash)
  • Grouped Query Attention
  • Causal & padding masks
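
An illustrative call to scaled dot-product attention via F.sdpa (listed under functional.py below). The nested-list constructor, the (batch, heads, seq, head_dim) layout, and the causal-mask argument name are all assumptions.

from tensorax import Tensor, functional as F

# minimal (batch=1, heads=1, seq=1, head_dim=2) tensors
q = Tensor([[[[1.0, 0.0]]]])
k = Tensor([[[[0.0, 1.0]]]])
v = Tensor([[[[2.0, 3.0]]]])

out = F.sdpa(q, k, v, is_causal=True)  # on CUDA this presumably hits one of the 4 kernels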

CUDA Kernels

  • 6 matmul implementations
  • 14 element-wise ops
  • Parallel reductions
  • Tiled + coalesced access

Infra

  • 400 tests, 98% coverage
  • CI/CD with GitHub Actions
  • pybind11 bindings
  • Automatic CUDA fallback



Performance

Matrix Multiplication — fp32, 3×1024×1024, 100 runs:

PyTorch CUDA (ref)         ████████████████████████████████████████████  0.41s  (4.51×)
Tensorax 1D Block Tiling   ██████████████████████████████████████████    0.95s  (2.31×)  ← best
Tensorax Tiled             ████████████████████████████████              1.22s  (1.80×)
NumPy CPU (baseline)       █████████████████████████                    1.85s  (1.00×)

2.31× faster than NumPy · 43% the speed of PyTorch's cuBLAS-backed matmul · all hand-written, zero library calls

Attention Kernels — 4 implementations from naive to flash, supporting arbitrary batch/heads, asymmetric sequence lengths, and optional masks.
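
A sketch of how a timing like the one above can be reproduced. The harness is illustrative (no CUDA synchronization shown), and the `a @ b` matmul spelling is an assumption; see docs/USAGE.md.

import time

def bench(fn, runs=100):
    fn()                                   # warm-up iteration
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return time.perf_counter() - start     # total wall-clock seconds over all runs

# a, b: 1024x1024 fp32 CUDA Tensors built with the Tensor API
# elapsed = bench(lambda: a @ b)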




Project Structure

csrc/                           C++ / CUDA backend
  cuda/kernels/                   elementwise · matmul (×6) · reduction · attention (×4)
  cpu/                            CPU fallback for all ops
  tensor_ops.{cpp,h}              pybind11 bindings

tensorax/                       Python package
  tensor.py                       Tensor class + autograd (1100 lines)
  functional.py                   F.relu, F.softmax, F.sdpa, ...
  nn/                             Linear, norms, dropout, attention, GQA
  optim.py                        SGD, Adam



Roadmap

✅ Core ops · autograd · NN layers · norms · optimizers · losses · attention (4 CUDA kernels) · GQA · matmul (6 variants)
🚧 Multi-head attention with projections · expanded benchmarking
🔮 Conv2D · MaxPool2D · GELU/Swish · AdamW · LR schedulers · indexing/slicing · serialization · multi-GPU · mixed precision · DDP



Documentation

Usage Guide: API reference, code examples, training patterns
Architecture: System design, kernel strategy, autograd internals
Development: Build, test, contribute
Examples: Runnable scripts for common tasks



Contributing

Fork → Branch → Commit → PR

See DEVELOPMENT.md for build instructions and guidelines.




Citation

@software{tensorax2025,
  title  = {Tensorax: Pure C++/CUDA Tensor Library},
  author = {Shrirang Mahajan},
  year   = {2025},
  url    = {https://github.com/NotShrirang/tensorax}
}



GitHub  ·  Issues  ·  Discussions

Built with ❤️ by @NotShrirang

⭐ Star if you find this useful



Download files


Source Distribution

tensorax-0.1.7.tar.gz (43.7 kB)


File details

  • Download URL: tensorax-0.1.7.tar.gz
  • Upload date:
  • Size: 43.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for tensorax-0.1.7.tar.gz:

  • SHA256: 6ed21a10a8e3ae9e4b2e65ee2461378b9ea56647f01280bad366739f04a9c6c0
  • MD5: 5f331e73a803d90e815753edeabdc68e
  • BLAKE2b-256: 723f33b1aeaa2ee6e61b9b3e0b3d128be63b68e6ffd419334039bdcc4cfbdb13

