
A high-performance tensor library with CUDA acceleration

Project description


⚡ Tensorax

A from-scratch tensor library with hand-written CUDA kernels.

No PyTorch. No NumPy. Pure C++/CUDA + Python.




Usage Guide · Architecture · Contributing · Examples




🔩   Zero heavy dependencies

Only pybind11 — no PyTorch, NumPy, or cuBLAS at runtime.

⚡   Hand-written CUDA kernels

6 matmul variants, 4 attention kernels, 14 element-wise ops — all from scratch.

🧠   Full autograd engine

Reverse-mode autodiff with gradient tracking through 18+ operations.

🎯   PyTorch-like API

Familiar Tensor, nn.Module, optim.Adam interface — minimal learning curve.

🧱   Batteries included

Linear, ReLU, LayerNorm, BatchNorm, Dropout, GQA, Flash Attention — ready to train.

📚   Built to learn from

Clean, readable implementation of a DL framework from first principles.
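
Reverse-mode autodiff, as in the engine above, boils down to each node remembering its inputs and their local derivatives, then sweeping the graph in reverse topological order. A toy scalar version of the idea in pure Python (an illustration of the technique, not Tensorax's implementation):

```python
# Toy scalar autograd: each Scalar stores (input, local_gradient) pairs,
# and backward() walks the graph in reverse topological order.

class Scalar:
    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0
        self._parents = parents  # tuple of (input Scalar, local gradient)

    def __add__(self, other):
        other = other if isinstance(other, Scalar) else Scalar(other)
        # d(a+b)/da = 1, d(a+b)/db = 1
        return Scalar(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        other = other if isinstance(other, Scalar) else Scalar(other)
        # d(a*b)/da = b, d(a*b)/db = a
        return Scalar(self.value * other.value,
                      ((self, other.value), (other, self.value)))

    def backward(self):
        # Topologically order the graph, then accumulate gradients backwards.
        order, seen = [], set()
        def visit(node):
            if node not in seen:
                seen.add(node)
                for parent, _ in node._parents:
                    visit(parent)
                order.append(node)
        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, local in node._parents:
                parent.grad += local * node.grad

x = Scalar(3.0)
y = Scalar(4.0)
z = x * y + x          # dz/dx = y + 1 = 5, dz/dy = x = 3
z.backward()
print(x.grad, y.grad)  # 5.0 3.0
```

The tensor case adds shapes and broadcasting, but the backward sweep has exactly this structure.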




Get Started

pip install tensorax
from tensorax import Tensor, nn, optim, functional as F

# Build
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.LayerNorm(8), nn.Linear(8, 3))
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Train (x_train, y_train: input and target Tensors; see docs/USAGE.md)
for epoch in range(100):
    loss = F.mse_loss(model(x_train), y_train)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Full usage guide with all APIs, code examples, and details: docs/USAGE.md




What's Inside


Core

  • Tensor with CPU ↔ CUDA
  • Broadcasting arithmetic
  • sum, mean with keepdim
  • reshape, transpose
  • exp, log, sqrt, pow
  • 13 dtype constants
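
"Broadcasting arithmetic" above follows the standard NumPy-style rule: align shapes from the right, and each dimension pair must be equal or contain a 1. The shape check can be sketched in a few lines (illustrative, not Tensorax's code):

```python
def broadcast_shape(a, b):
    """NumPy-style broadcast: pad the shorter shape with leading 1s,
    then each aligned dimension pair must match or contain a 1."""
    a = (1,) * max(0, len(b) - len(a)) + a
    b = (1,) * max(0, len(a) - len(b)) + b
    result = []
    for da, db in zip(a, b):
        if da != db and da != 1 and db != 1:
            raise ValueError(f"cannot broadcast {a} with {b}")
        result.append(max(da, db))
    return tuple(result)

print(broadcast_shape((3, 1), (1, 4)))     # (3, 4)
print(broadcast_shape((8, 1, 6), (7, 1)))  # (8, 7, 6)
```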

Neural Networks

  • Linear, Sequential
  • ReLU, Sigmoid, Tanh, Softmax
  • LayerNorm, RMSNorm, BatchNorm
  • Dropout
  • Module base class
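
The Module base class pattern (modules own parameters and sub-modules, and `parameters()` collects them recursively for the optimizer) can be sketched as follows; the class bodies here are simplified placeholders, not the library's actual definitions:

```python
class Parameter:
    """A trainable value with a gradient slot."""
    def __init__(self, data):
        self.data = data
        self.grad = None

class Module:
    """Minimal Module: parameters() recursively collects Parameters
    from this module's attributes and from any child Modules."""
    def parameters(self):
        params = []
        for value in self.__dict__.values():
            if isinstance(value, Parameter):
                params.append(value)
            elif isinstance(value, Module):
                params.extend(value.parameters())
        return params

class Linear(Module):
    def __init__(self, in_features, out_features):
        self.weight = Parameter([[0.0] * in_features for _ in range(out_features)])
        self.bias = Parameter([0.0] * out_features)

class MLP(Module):
    def __init__(self):
        self.fc1 = Linear(4, 8)
        self.fc2 = Linear(8, 3)

print(len(MLP().parameters()))  # 4: two weights + two biases
```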

Training

  • SGD with momentum
  • Adam with bias correction
  • MSE, cross-entropy, cross-entropy from logits
  • Autograd through 18+ ops
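
"Adam with bias correction" refers to the standard Adam update (Kingma & Ba): exponential moving averages of the gradient and squared gradient, each divided by `1 - beta^t` to correct the zero-initialization bias. For a single scalar parameter the step looks like this (a sketch of the algorithm, not Tensorax's optimizer code):

```python
import math

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter. t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # first moment (EMA of grad)
    v = beta2 * v + (1 - beta2) * grad * grad   # second moment (EMA of grad^2)
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Minimize f(p) = p^2 (gradient 2p) for a few steps.
p, m, v = 1.0, 0.0, 0.0
for t in range(1, 4):
    p, m, v = adam_step(p, grad=2.0 * p, m=m, v=v, t=t)
print(round(p, 6))
```

Without the `1 - beta^t` correction, early steps would be scaled down by the zero-initialized moments.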

Attention

  • Scaled dot-product attention
  • 4 CUDA kernels (naive → flash)
  • Grouped Query Attention
  • Causal & padding masks
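
Scaled dot-product attention computes `softmax(Q Kᵀ / √d) V`; a causal mask simply sets scores for future positions to −∞ before the softmax. A single-head, pure-Python sketch of the math (not the CUDA kernels themselves):

```python
import math

def sdpa(Q, K, V, causal=False):
    """softmax(Q K^T / sqrt(d)) V on plain lists; one head, one sequence."""
    d = len(Q[0])
    out = []
    for i, q in enumerate(Q):
        scores = [sum(qa * ka for qa, ka in zip(q, k)) / math.sqrt(d)
                  for k in K]
        if causal:  # query i may only attend to keys 0..i
            scores = [s if j <= i else float("-inf")
                      for j, s in enumerate(scores)]
        mx = max(scores)                       # subtract max for stability
        exps = [math.exp(s - mx) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v[c] for w, v in zip(weights, V))
                    for c in range(len(V[0]))])
    return out

Q = K = V = [[1.0, 0.0], [0.0, 1.0]]
print(sdpa(Q, K, V, causal=True)[0])  # row 0 attends only to itself -> [1.0, 0.0]
```

A flash-style kernel computes the same result but streams over K/V tiles, keeping a running max and sum instead of materializing the full score matrix.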

CUDA Kernels

  • 6 matmul implementations
  • 14 element-wise ops
  • Parallel reductions
  • Tiled + coalesced access
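
The tiling idea behind the matmul kernels above is the same whether the fast memory is CUDA shared memory or a CPU cache: compute the output in small blocks, so each loaded tile of A and B is reused many times. The loop structure, sketched in Python for readability (illustrative only, not the kernels themselves):

```python
def tiled_matmul(A, B, tile=2):
    """C = A @ B computed block by block: the k-loop is tiled so each
    (tile x tile) slab of A and B is reused for a whole output block,
    mirroring one shared-memory load per step in a CUDA tiled kernel."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):        # one tile load per iteration
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        for kk in range(k0, min(k0 + tile, k)):
                            C[i][j] += A[i][kk] * B[kk][j]
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(tiled_matmul(A, B))  # [[19.0, 22.0], [43.0, 50.0]]
```

In the CUDA versions, the inner block additionally maps one output element (or a small register tile) to each thread, and loads are arranged so consecutive threads touch consecutive addresses (coalescing).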

Infra

  • 400 tests, 98% coverage
  • CI/CD with GitHub Actions
  • pybind11 bindings
  • Automatic CUDA fallback



Performance

Matrix Multiplication — fp32, 3×1024×1024, 100 runs:

PyTorch CUDA (ref)         ████████████████████████████████████████████  0.41s  (4.51×)
Tensorax 1D Block Tiling   ██████████████████████████████████████████    0.95s  (2.31×)  ← best
Tensorax Tiled             ████████████████████████████████              1.22s  (1.80×)
NumPy CPU (baseline)       █████████████████████████                    1.85s  (1.00×)

2.31× faster than NumPy · 43% of PyTorch's cuBLAS kernels · all hand-written, zero library calls

Attention Kernels — 4 implementations from naive to flash, supporting arbitrary batch/heads, asymmetric sequence lengths, and optional masks.




Project Structure

csrc/                           C++ / CUDA backend
  cuda/kernels/                   elementwise · matmul (×6) · reduction · attention (×4)
  cpu/                            CPU fallback for all ops
  tensor_ops.{cpp,h}             pybind11 bindings

tensorax/                       Python package
  tensor.py                       Tensor class + autograd (1100 lines)
  functional.py                   F.relu, F.softmax, F.sdpa, ...
  nn/                             Linear, norms, dropout, attention, GQA
  optim.py                        SGD, Adam



Roadmap

✅ Core ops · autograd · NN layers · norms · optimizers · losses · attention (4 CUDA kernels) · GQA · matmul (6 variants)
🚧 Multi-head attention with projections · expanded benchmarking
🔮 Conv2D · MaxPool2D · GELU/Swish · AdamW · LR schedulers · indexing/slicing · serialization · multi-GPU · mixed precision · DDP



Documentation

Usage Guide: API reference, code examples, training patterns
Architecture: system design, kernel strategy, autograd internals
Development: build, test, contribute
Examples: runnable scripts for common tasks



Contributing

Fork → Branch → Commit → PR

See DEVELOPMENT.md for build instructions and guidelines.




Citation

@software{tensorax2025,
  title  = {Tensorax: Pure C++/CUDA Tensor Library},
  author = {Shrirang Mahajan},
  year   = {2025},
  url    = {https://github.com/NotShrirang/tensorax}
}



GitHub  ·  Issues  ·  Discussions

Built with ❤️ by @NotShrirang

⭐ Star if you find this useful

