Tensorax
A high-performance tensor computation library with CUDA acceleration, built from scratch for deep learning and numerical computing.
✨ Features
- Pure C++/CUDA Backend: No PyTorch or NumPy dependencies - truly standalone
- Extreme Performance: Up to 448x speedup on GPU operations (1024×1024 matmul)
- Complete Autograd: Full automatic differentiation with computational graph
- PyTorch-like API: Familiar interface for easy adoption
- Flexible Deployment: Works with or without CUDA - automatic fallback to CPU
Why Tensorax?
Unlike other libraries that wrap PyTorch or depend on NumPy, Tensorax is built completely from scratch:
- ✅ Zero heavy dependencies - Only requires pybind11 for Python bindings
- ✅ Production ready - Complete training pipeline with optimizers and backprop
- ✅ True CUDA acceleration - Hand-written kernels, not wrappers
- ✅ Educational - Clean, readable codebase perfect for learning DL internals
Installation
Platform Support
Currently supported:
- ✅ Linux (Ubuntu, Debian, Fedora, etc.)
- ✅ macOS (Intel and Apple Silicon)
Not yet supported:
- ❌ Windows (coming soon - contributions welcome!)
Prerequisites
- Python 3.8+
- C++17 compatible compiler (g++, clang++)
- CUDA Toolkit 11.0+ (optional, for GPU support)
- pybind11 (automatically installed)
Quick Install
From PyPI:
pip install tensorax
From Source:
git clone https://github.com/NotShrirang/tensorax.git
cd tensorax
bash build.sh # Automatically detects CUDA
pip install -e .
Manual Build
# CPU only
python setup.py build_ext --inplace
# With CUDA
CUDA_HOME=/usr/local/cuda python setup.py build_ext --inplace
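After building, a quick sanity check (a minimal sketch using only the Tensor API shown in the Quick Start below) is to import the package and report whether CUDA was detected:

from tensorax import Tensor

# Confirms the compiled extension imports and reports GPU availability.
t = Tensor([[1.0, 2.0], [3.0, 4.0]])
print("Tensor created on:", t.device)
print("CUDA available:", Tensor.cuda_is_available())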
Quick Start
Run the Demo
python demo.py # Comprehensive showcase of all features
Basic Tensor Operations
from tensorax import Tensor
# Create tensors
a = Tensor([[1.0, 2.0], [3.0, 4.0]])
b = Tensor([[5.0, 6.0], [7.0, 8.0]])
# Arithmetic operations
c = a + b # Addition
d = a - b # Subtraction
e = a * b # Element-wise multiplication
f = a / b # Division
g = a @ b # Matrix multiplication
# Tensor properties
print(a.shape) # (2, 2)
print(a.T) # Transpose
print(a.device) # 'cpu' or 'cuda'
# Factory methods
zeros = Tensor.zeros((3, 3))
ones = Tensor.ones((2, 4))
rand = Tensor.randn((5, 5))
# GPU acceleration
if Tensor.cuda_is_available():
    a_gpu = a.cuda()
    b_gpu = b.cuda()
    c_gpu = a_gpu @ b_gpu  # 448x faster on 1024×1024!
    result = c_gpu.cpu()
Automatic Differentiation
from tensorax import Tensor
# Create tensors with gradient tracking
x = Tensor([[2.0]], requires_grad=True)
w = Tensor([[3.0]], requires_grad=True)
b = Tensor([[1.0]], requires_grad=True)
# Forward pass
y = w * x + b # y = 3*2 + 1 = 7
# Backward pass
y.backward()
# Gradients
print(x.grad) # dy/dx = 3
print(w.grad) # dy/dw = 2
print(b.grad) # dy/db = 1
Neural Networks & Training
from tensorax import nn, Tensor, optim, functional as F
# Define a model
model = nn.Sequential(
nn.Linear(4, 8),
nn.ReLU(),
nn.Linear(8, 3),
nn.Sigmoid()
)
# Create optimizer
optimizer = optim.Adam(model.parameters(), lr=0.001)
# Toy training data (so the example runs end to end): 16 samples, 4 features in, 3 targets out
x_train = Tensor.randn((16, 4))
y_train = Tensor.randn((16, 3))
# Training loop
for epoch in range(100):
    # Forward pass
    output = model(x_train)
    loss = F.mse_loss(output, y_train)
    # Backward pass
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if epoch % 10 == 0:
        print(f'Epoch {epoch}: Loss = {loss.tolist()[0]:.4f}')
Scaled Dot-Product Attention
from tensorax import Tensor, functional as F
from tensorax.nn.attention import ScaledDotProductAttention, create_causal_mask
batch, heads, seq_len, d_k = 2, 8, 64, 64
Q = Tensor.randn((batch, heads, seq_len, d_k))
K = Tensor.randn((batch, heads, seq_len, d_k))
V = Tensor.randn((batch, heads, seq_len, d_k))
# Basic attention
out = F.scaled_dot_product_attention(Q, K, V)
# Causal (autoregressive) attention
mask = create_causal_mask(seq_len, batch_size=batch, num_heads=heads)
out = F.scaled_dot_product_attention(Q, K, V, mask=mask)
# Layer-based usage
attn = ScaledDotProductAttention()
out = attn(Q, K, V, mask=mask)
# GPU acceleration
if Tensor.cuda_is_available():
    out = F.scaled_dot_product_attention(Q.cuda(), K.cuda(), V.cuda())
Functional API
from tensorax import functional as F, Tensor
x = Tensor([[-2.0, -1.0, 0.0, 1.0, 2.0]])
# Activation functions
y1 = F.relu(x) # [0.0, 0.0, 0.0, 1.0, 2.0]
y2 = F.sigmoid(x) # [0.119, 0.269, 0.5, 0.731, 0.881]
y3 = F.tanh(x) # [-0.964, -0.762, 0.0, 0.762, 0.964]
y4 = F.softmax(x, dim=-1) # Normalized probabilities
# Loss functions
pred = Tensor([[2.0, 1.5, 3.0]])
target = Tensor([[2.5, 2.0, 2.5]])
loss = F.mse_loss(pred, target) # Mean squared error
Project Structure
tensorax/
├── csrc/                   # C++ and CUDA source code
│   ├── cuda/kernels/       # CUDA kernel implementations
│   │   ├── elementwise.cu  # Element-wise operations
│   │   ├── reduction.cu    # Sum, mean, max reductions
│   │   ├── matmul.cu       # Matrix multiplication (6 variants)
│   │   └── attn.cu         # Attention kernels (naive, tiled, flash)
│   ├── cpu/                # CPU implementations
│   └── tensor_ops.*        # Core operations and pybind11 bindings
├── tensorax/               # Python package
│   ├── tensor.py           # Tensor class
│   ├── functional.py       # Functional API (relu, softmax, sdpa, ...)
│   ├── nn/                 # Neural network modules
│   │   └── attention/      # Attention layers and utilities
│   └── optim.py            # Optimizers
├── tests/                  # Test suite
├── examples/               # Usage examples
└── docs/                   # Documentation
⚡ Performance
Tensorax uses hand-optimized CUDA kernels for maximum performance. Here are some benchmark results for matrix multiplication (fp32, 3×1024×1024):
Matrix Multiplication Benchmark (100 runs)
Comparison of different CUDA kernel implementations vs NumPy and PyTorch:
| Implementation | Time (seconds) | Speedup vs. NumPy (CPU) |
|---|---|---|
| 1D Block Tiling (Best) | 0.95 | 2.31x faster |
| Tiled Matrix Multiply | 1.22 | 1.80x faster |
| NumPy (CPU) | 1.85 | Baseline (CPU) |
| Shared Memory Cache Blocking | 2.18 | 0.85x |
| Default CUDA | 3.37 | 0.55x |
| Shared Memory Coalescing | 3.44 | 0.54x |
| PyTorch CUDA (Reference) | 0.41 | 4.51x faster |
Key Insights:
- Our 1D block tiling implementation achieves 2.31x faster performance than NumPy
- Performance is 43% of PyTorch's highly optimized CUDA kernels (room for improvement)
- Tiled approaches consistently outperform naive implementations by 1.5-3x
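For a rough reproduction of this comparison on your own hardware, the sketch below is illustrative only: it uses Python-level timing, and without explicit device synchronization the GPU numbers may under-report kernel time. It times repeated matmuls on CPU and, when available, on the GPU.

import time
from tensorax import Tensor

def bench(fn, runs=100):
    # Warm up once, then return mean seconds per call over `runs` repetitions.
    fn()
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) / runs

a = Tensor.randn((1024, 1024))
b = Tensor.randn((1024, 1024))
print(f"CPU matmul:  {bench(lambda: a @ b):.4f} s/run")

if Tensor.cuda_is_available():
    a_gpu, b_gpu = a.cuda(), b.cuda()
    print(f"CUDA matmul: {bench(lambda: a_gpu @ b_gpu):.4f} s/run")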
Attention Kernels
Tensorax includes three hand-written CUDA attention kernels with no cuBLAS or library dependencies:
| Kernel | Technique | Best For |
|---|---|---|
| Naive | One thread per output element, three-pass softmax | Small sequences, correctness baseline |
| Tiled | Shared memory K/V tiles, online softmax | Medium sequences |
| Flash | Block Q/K/V tiling, online softmax with rescaling | Long sequences, memory efficiency |
All kernels support arbitrary batch size, head count, asymmetric sequence lengths (seq_q != seq_k), separate d_k/d_v, and optional additive attention masks.
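The tiled and flash kernels rely on the online-softmax trick: scores are consumed in blocks while a running maximum and running normalizer are maintained, and previously accumulated results are rescaled, so the full seq_q × seq_k score matrix never has to be materialized. Below is a minimal pure-Python sketch of that rescaling for a single query row with scalar values; the function is illustrative only and not part of the Tensorax API.

import math

def online_softmax_attention(scores, values):
    # Streaming softmax(scores)-weighted sum of values, one (score, value) pair at a time.
    m = float("-inf")  # running maximum of scores seen so far
    l = 0.0            # running softmax normalizer, relative to m
    acc = 0.0          # running weighted sum of values, relative to m
    for s, v in zip(scores, values):
        m_new = max(m, s)
        scale = math.exp(m - m_new) if m != float("-inf") else 0.0
        l = l * scale + math.exp(s - m_new)       # rescale old normalizer, add new term
        acc = acc * scale + math.exp(s - m_new) * v
        m = m_new
    return acc / l

# Matches the usual two-pass result: softmax([0.1, 2.0, -1.0]) . [1.0, 2.0, 3.0]
print(online_softmax_attention([0.1, 2.0, -1.0], [1.0, 2.0, 3.0]))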
Optimization Techniques
- ✅ Coalesced memory access for elementwise operations
- ✅ Tiled matrix multiplication with shared memory
- ✅ Efficient parallel reductions for sum/max operations
- ✅ Kernel fusion to minimize memory transfers
- ✅ Flash Attention with online softmax (no O(n²) attention matrix is materialized); see the tiling sketch below
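To see why tiling helps, here is the same blocking idea written as plain Python loops. This is a conceptual sketch only; the actual kernels implement it with CUDA thread blocks and shared memory, not Python.

def blocked_matmul(A, B, tile=32):
    # C = A @ B computed tile by tile; each loaded tile of A and B is reused
    # across a whole tile of C, which is what shared-memory caching exploits.
    n, k, m = len(A), len(A[0]), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        s = C[i][j]
                        for kk in range(k0, min(k0 + tile, k)):
                            s += A[i][kk] * B[kk][j]
                        C[i][j] = s
    return C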
Documentation
- Development Guide - How to contribute and develop
- Architecture Overview - System design and internals
- CI/CD Documentation - GitHub Actions workflows and automation
- Examples - Code examples and tutorials
Development
Setup development environment
# Clone repository
git clone https://github.com/NotShrirang/tensorax.git
cd tensorax
# Create virtual environment
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
pip install -r requirements-dev.txt
# Install in development mode
pip install -e .
Build the Extension
# Quick build (automatically detects CUDA)
bash build.sh
# Manual build (CPU only)
python setup.py build_ext --inplace
# Manual build (with CUDA)
CUDA_HOME=/usr/local/cuda python setup.py build_ext --inplace
Run Tests
# Run all tests
pytest tests/
# Run with verbose output
pytest tests/ -v
# Run with coverage report
pytest tests/ --cov=tensorax --cov-report=html --cov-report=term
# Run specific test file
pytest tests/test_tensor.py -v
Test Status
Current Status (December 9, 2025):
- 229 tests passing (98.9% success rate)
- 5 tests skipped (CUDA tests - require GPU)
- 0 tests failing
- 87% code coverage
Test Breakdown:
- Core tensor operations: 100% passing
- Neural network layers: 100% passing
- Optimizers: 100% passing
- Integration tests: 100% passing
- Functional API: 100% passing
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
1. Fork the repository
2. Create your feature branch (git checkout -b feature/amazing-feature)
3. Commit your changes (git commit -m 'Add amazing feature')
4. Push to the branch (git push origin feature/amazing-feature)
5. Open a Pull Request
Implemented Features
Core Tensor Operations ✅
- Element-wise operations: add, subtract, multiply, divide, power, sqrt, abs
- Matrix operations: matmul (2D/3D batched), transpose
- Reduction operations: sum, mean, max, min, argmax, argmin
- Mathematical functions: exp, log, pow, clamp
- Shape operations: reshape, view, squeeze, unsqueeze
- Tensor creation: zeros, ones, full, randn
- Device management: CPU ↔ CUDA transfers with automatic fallback
- Indexing & slicing: Advanced tensor indexing and slicing
- Comparison operators: eq, lt, gt with broadcasting
- Automatic differentiation: Complete backpropagation with gradient tracking
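A short tour of these operations. This is a sketch that assumes PyTorch-style method names (sum, mean, reshape, indexing); check the tests/ directory for the exact signatures.

from tensorax import Tensor

x = Tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])

total = x.sum()            # reduction over all elements
avg = x.mean()
flat = x.reshape((1, 6))   # shape operation: (2, 3) -> (1, 6)
row = x[0]                 # indexing / slicing

print(x.shape, flat.shape)
print(total, avg, row)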
Neural Network Layers ✅
- Linear: Fully connected layer with optional bias
- Activation layers: ReLU, Sigmoid, Tanh, Softmax (with custom dim)
- Dropout: Training/eval mode with configurable drop probability
- Sequential: Container with recursive parameter collection
- Module system: Base class with parameter management, device transfer, and train/eval modes
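A small sketch of the Module API, assuming the train()/eval() switches implied by the feature list above (the names follow PyTorch conventions and have not been verified against the source):

from tensorax import nn, Tensor

model = nn.Sequential(
    nn.Linear(4, 8),
    nn.ReLU(),
    nn.Dropout(0.5),   # only active in training mode
    nn.Linear(8, 2),
)

x = Tensor.randn((10, 4))

model.train()          # dropout randomly zeroes activations
out_train = model(x)

model.eval()           # dropout is a no-op during evaluation
out_eval = model(x)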
Optimizers ✅
- SGD: Stochastic Gradient Descent with momentum support
- Adam: Adaptive moment estimation with bias correction
- Learning rate: Configurable with validation
- Gradient management: zero_grad() and parameter updates
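A minimal optimizer step with SGD plus momentum (a sketch; the momentum keyword is assumed from the feature list above and mirrors PyTorch's API):

from tensorax import nn, optim, Tensor, functional as F

model = nn.Linear(4, 1)
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

x = Tensor.randn((8, 4))
y = Tensor.randn((8, 1))

loss = F.mse_loss(model(x), y)
optimizer.zero_grad()  # clear gradients from the previous step
loss.backward()        # backpropagate through the computational graph
optimizer.step()       # update parameters (using momentum buffers)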
Loss Functions ✅
- Mean Squared Error (MSE): For regression tasks
- Cross Entropy Loss: From probabilities or logits
- Backward pass: All loss functions support gradient computation
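A sketch of the logit-based cross entropy mentioned above; the one-hot target encoding here is an assumption, so adjust it to the actual signature in functional.py.

from tensorax import Tensor, functional as F

logits = Tensor([[2.0, 0.5, -1.0],
                 [0.1, 1.5, 0.3]], requires_grad=True)   # raw, unnormalized scores
targets = Tensor([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])                       # one-hot targets (assumed encoding)

loss = F.cross_entropy_from_logits(logits, targets)
loss.backward()            # gradients flow back to the logits
print(logits.grad)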
Attention ✅
- Scaled Dot-Product Attention: softmax(Q @ K^T / sqrt(d_k)) @ V
- Three CUDA kernels: Naive, tiled (shared memory), flash (online softmax)
- CPU reference: Pure C implementation for validation and CPU-only builds
- Attention masks: Causal masks, padding masks, and custom additive masks
- Cross-attention: Supports seq_len_q != seq_len_k and d_k != d_v, as sketched below
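Because seq_len_q != seq_len_k is supported, cross-attention (e.g. a decoder attending over encoder states) can be written directly with the same SDPA call shown in the Quick Start; the shapes below are illustrative.

from tensorax import Tensor, functional as F

batch, heads, d_k = 2, 4, 32
seq_q, seq_k = 16, 128    # query length differs from key/value length

Q = Tensor.randn((batch, heads, seq_q, d_k))
K = Tensor.randn((batch, heads, seq_k, d_k))
V = Tensor.randn((batch, heads, seq_k, d_k))

out = F.scaled_dot_product_attention(Q, K, V)
print(out.shape)          # one output row per query position: (2, 4, 16, 32)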
Functional API ✅
- Activations: relu, sigmoid, tanh, softmax (multi-dimensional)
- Loss functions: mse_loss, cross_entropy_loss, cross_entropy_from_logits
- Attention: scaled_dot_product_attention with optional mask
- Linear transformation: Functional linear with optional bias
- Gradient support: All functions support backpropagation
Roadmap
Completed ✅
- Core tensor operations (element-wise, reduction, mathematical)
- Automatic differentiation (complete autograd system)
- Neural network layers (Linear, activations, Dropout)
- Optimizers (SGD with momentum, Adam)
- Loss functions (MSE, Cross Entropy)
- Sequential container
- Device management (CPU/CUDA)
- Comprehensive test suite (229 tests passing)
- Tensor serialization (save/load)
- Scaled dot-product attention (naive, tiled, flash CUDA kernels)
In Progress
- Multi-head attention layer with linear projections
- CUDA kernel optimization for all operations
- Documentation improvements
- Performance benchmarking suite
Future Features
- Transformer encoder/decoder blocks
- Convolution and pooling layers (Conv2D, MaxPool2D)
- Batch normalization and Layer normalization
- More activation functions (LeakyReLU, GELU, Swish, ELU)
- Additional optimizers (RMSprop, AdamW, Adagrad)
- Learning rate schedulers (StepLR, ExponentialLR, CosineAnnealing)
- Multi-GPU support with data parallelism
- Mixed precision training (FP16/BF16)
- Distributed training (DDP)
- Graph optimization and fusion
- JIT compilation for custom operations
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
- Inspired by PyTorch's design and API
- CUDA optimization techniques from various deep learning frameworks
- Community contributions and feedback
Learning Resource
Tensorax is an excellent educational tool for understanding:
- Deep learning internals (how PyTorch/TensorFlow work under the hood)
- CUDA programming and GPU optimization
- Automatic differentiation implementation
- Building ML frameworks from scratch
- C++/Python interoperability with pybind11
Check out the examples/ directory for tutorials!
Citation
If you use Tensorax in your research or project, please cite:
@software{tensorax2025,
title = {Tensorax: Pure C++/CUDA Tensor Library},
author = {NotShrirang},
year = {2025},
url = {https://github.com/NotShrirang/tensorax}
}
Contact & Support
- GitHub: @NotShrirang
- Issues: Report bugs or request features
- Discussions: Ask questions
⭐ Star History
If you find Tensorax useful, please consider giving it a star! ⭐
Built with ❤️ by @NotShrirang
Download files
Source Distribution
File details
Details for the file tensorax-0.1.5.tar.gz.
File metadata
- Download URL: tensorax-0.1.5.tar.gz
- Upload date:
- Size: 47.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2b7f803426d6a110391add3cf07acfc61a3097283981d21eb72b37f5034e388d |
| MD5 | 25ea6d595f6fde488a457c543165f881 |
| BLAKE2b-256 | 2fbaf22ab98e10657b4c9126879bdbe93fb4d4299c059b7759564ffa2850bcb7 |