A high-performance tensor library with CUDA acceleration
⚡ Tensorax
A from-scratch tensor library with hand-written CUDA kernels.
No PyTorch. No NumPy. Pure C++/CUDA + Python.
Usage Guide · Architecture · Contributing · Examples
- 🔩 **Zero heavy dependencies**: pure C++/CUDA with a thin Python layer
- ⚡ **Hand-written CUDA kernels**: 6 matmul variants, 4 attention kernels, 14 element-wise ops, all from scratch
- 🧠 **Full autograd engine**: reverse-mode autodiff with gradient tracking through 18+ operations (see the autograd sketch under Get Started)
- 🎯 **PyTorch-like API**: familiar interface for tensors, layers, and optimizers
- 🧱 **Batteries included**: Linear, ReLU, LayerNorm, BatchNorm, Dropout, GQA, Flash Attention, ready to train
- 📚 **Built to learn from**: clean, readable implementation of a DL framework from first principles
Get Started
```bash
pip install tensorax
```

```python
from tensorax import Tensor, nn, optim, functional as F

# Build
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.LayerNorm(8), nn.Linear(8, 3))
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Train (x_train: (N, 4) inputs, y_train: (N, 3) targets, both Tensors)
for epoch in range(100):
    loss = F.mse_loss(model(x_train), y_train)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```
→ Full usage guide with all APIs, code examples, and details: docs/USAGE.md
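The autograd engine behind `loss.backward()` can also be exercised directly. A minimal sketch, assuming the `Tensor` constructor accepts nested lists and a PyTorch-style `requires_grad` flag with gradients exposed via `.grad` (check docs/USAGE.md for the exact signature):

```python
from tensorax import Tensor

# Assumed constructor and flags below; the exact API may differ from this sketch.
x = Tensor([[1.0, 2.0], [3.0, 4.0]], requires_grad=True)
y = (x * x).sum()   # element-wise square, then reduction

y.backward()        # reverse-mode autodiff through both ops
print(x.grad)       # expected: 2 * x
```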
What's Inside
Core · Neural Networks · Training · Attention · CUDA Kernels · Infra
Performance
Matrix Multiplication — fp32, 3×1024×1024, 100 runs:
```
PyTorch CUDA (ref)        ████████████████████████████████████████████ 0.41s (4.51×)
Tensorax 1D Block Tiling  ██████████████████████████████████████████   0.95s (2.31×) ← best
Tensorax Tiled            ████████████████████████████████             1.22s (1.80×)
NumPy CPU (baseline)      █████████████████████████                    1.85s (1.00×)
```
2.31× faster than NumPy · 43% of the speed of PyTorch's cuBLAS-backed matmul · all hand-written, zero library calls
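A rough harness for reproducing this kind of comparison; the timing loop is plain Python, and the two matmul callables you plug in (NumPy vs. Tensorax) are left as hypothetical placeholders since the exact Tensorax matmul call is not shown here:

```python
import time

def bench(fn, runs=100):
    # Time `fn` over `runs` iterations and return the total elapsed seconds.
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return time.perf_counter() - start

# Hypothetical callables: each should perform one fp32 1024x1024 matmul
# with its respective backend (NumPy on CPU, Tensorax on CUDA).
# baseline = bench(numpy_matmul)
# candidate = bench(tensorax_matmul)
# print(f"{baseline / candidate:.2f}x faster than the baseline")
```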
Attention Kernels — 4 implementations from naive to flash, supporting arbitrary batch/heads, asymmetric sequence lengths, and optional masks.
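Since `F.sdpa` is exposed in `functional.py`, a call might look like the sketch below; the `(batch, heads, seq_len, head_dim)` layout, the `Tensor.randn` helper, and the positional argument order are assumptions, not confirmed API:

```python
from tensorax import Tensor, functional as F

# Assumed shape convention: (batch, heads, seq_len, head_dim).
# Asymmetric query/key lengths (16 queries attending over 32 keys) exercise
# the kernels' support for different sequence lengths.
q = Tensor.randn(2, 4, 16, 64)   # Tensor.randn is assumed to exist
k = Tensor.randn(2, 4, 32, 64)
v = Tensor.randn(2, 4, 32, 64)

out = F.sdpa(q, k, v)            # scaled dot-product attention, no mask
```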
Project Structure
```
csrc/                   C++ / CUDA backend
  cuda/kernels/         elementwise · matmul (×6) · reduction · attention (×4)
  cpu/                  CPU fallback for all ops
  tensor_ops.{cpp,h}    pybind11 bindings
tensorax/               Python package
  tensor.py             Tensor class + autograd (1100 lines)
  functional.py         F.relu, F.softmax, F.sdpa, ...
  nn/                   Linear, norms, dropout, attention, GQA
  optim.py              SGD, Adam
```
Roadmap
| Status | Features |
|---|---|
| ✅ | Core ops · autograd · NN layers · norms · optimizers · losses · attention (4 CUDA kernels) · GQA · matmul (6 variants) |
| 🚧 | Multi-head attention with projections · expanded benchmarking |
| 🔮 | Conv2D · MaxPool2D · GELU/Swish · AdamW · LR schedulers · indexing/slicing · serialization · multi-GPU · mixed precision · DDP |
Documentation
| Document | Contents |
|---|---|
| Usage Guide | API reference, code examples, training patterns |
| Architecture | System design, kernel strategy, autograd internals |
| Development | Build, test, contribute |
| Examples | Runnable scripts for common tasks |
Contributing
Fork → Branch → Commit → PR
See DEVELOPMENT.md for build instructions and guidelines.
Citation
```bibtex
@software{tensorax2025,
  title  = {Tensorax: Pure C++/CUDA Tensor Library},
  author = {Shrirang Mahajan},
  year   = {2025},
  url    = {https://github.com/NotShrirang/tensorax}
}
```