Tensorax
A high-performance tensor computation library with CUDA acceleration, built from scratch for deep learning and numerical computing.
✨ Features
- Pure C++/CUDA Backend: No PyTorch or NumPy dependencies - truly standalone
- Extreme Performance: Up to 448x speedup on GPU operations (1024×1024 matmul)
- Complete Autograd: Full automatic differentiation with computational graph
- PyTorch-like API: Familiar interface for easy adoption
- Flexible Deployment: Works with or without CUDA - automatic fallback to CPU
Why Tensorax?
Unlike other libraries that wrap PyTorch or depend on NumPy, Tensorax is built completely from scratch:
- ✅ Zero heavy dependencies - Only requires pybind11 for Python bindings
- ✅ Production ready - Complete training pipeline with optimizers and backprop
- ✅ True CUDA acceleration - Hand-written kernels, not wrappers
- ✅ Educational - Clean, readable codebase perfect for learning DL internals
Installation
Platform Support
Currently supported:
- ✅ Linux (Ubuntu, Debian, Fedora, etc.)
- ✅ macOS (Intel and Apple Silicon)
Not yet supported:
- ❌ Windows (coming soon - contributions welcome!)
Prerequisites
- Python 3.8+
- C++17 compatible compiler (g++, clang++)
- CUDA Toolkit 11.0+ (optional, for GPU support)
- pybind11 (automatically installed)
Quick Install
From PyPI:
pip install tensorax
From Source:
git clone https://github.com/NotShrirang/tensorax.git
cd tensorax
bash build.sh # Automatically detects CUDA
pip install -e .
Manual Build
# CPU only
python setup.py build_ext --inplace
# With CUDA
CUDA_HOME=/usr/local/cuda python setup.py build_ext --inplace
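After building, a quick sanity check (a minimal sketch using only the Tensor API shown in the Quick Start below) is to import the package and report whether CUDA was detected:

from tensorax import Tensor

# Confirms the compiled extension imports and reports GPU availability.
t = Tensor([[1.0, 2.0], [3.0, 4.0]])
print("Tensor created on:", t.device)
print("CUDA available:", Tensor.cuda_is_available())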
Quick Start
Run the Demo
python demo.py # Comprehensive showcase of all features
Basic Tensor Operations
from tensorax import Tensor
# Create tensors
a = Tensor([[1.0, 2.0], [3.0, 4.0]])
b = Tensor([[5.0, 6.0], [7.0, 8.0]])
# Arithmetic operations
c = a + b # Addition
d = a - b # Subtraction
e = a * b # Element-wise multiplication
f = a / b # Division
g = a @ b # Matrix multiplication
# Tensor properties
print(a.shape) # (2, 2)
print(a.T) # Transpose
print(a.device) # 'cpu' or 'cuda'
# Factory methods
zeros = Tensor.zeros((3, 3))
ones = Tensor.ones((2, 4))
rand = Tensor.randn((5, 5))
# GPU acceleration
if Tensor.cuda_is_available():
    a_gpu = a.cuda()
    b_gpu = b.cuda()
    c_gpu = a_gpu @ b_gpu  # 448x faster on 1024×1024!
    result = c_gpu.cpu()
Automatic Differentiation
from tensorax import Tensor
# Create tensors with gradient tracking
x = Tensor([[2.0]], requires_grad=True)
w = Tensor([[3.0]], requires_grad=True)
b = Tensor([[1.0]], requires_grad=True)
# Forward pass
y = w * x + b # y = 3*2 + 1 = 7
# Backward pass
y.backward()
# Gradients
print(x.grad) # dy/dx = 3
print(w.grad) # dy/dw = 2
print(b.grad) # dy/db = 1
Neural Networks & Training
from tensorax import nn, Tensor, optim, functional as F
# Define a model
model = nn.Sequential(
nn.Linear(4, 8),
nn.ReLU(),
nn.Linear(8, 3),
nn.Sigmoid()
)
# Create optimizer
optimizer = optim.Adam(model.parameters(), lr=0.001)
# Toy training data (so the example runs end to end): 16 samples, 4 features in, 3 targets out
x_train = Tensor.randn((16, 4))
y_train = Tensor.randn((16, 3))
# Training loop
for epoch in range(100):
    # Forward pass
    output = model(x_train)
    loss = F.mse_loss(output, y_train)
    # Backward pass
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if epoch % 10 == 0:
        print(f'Epoch {epoch}: Loss = {loss.tolist()[0]:.4f}')
Scaled Dot-Product Attention
from tensorax import Tensor, functional as F
from tensorax.nn.attention import ScaledDotProductAttention, create_causal_mask
batch, heads, seq_len, d_k = 2, 8, 64, 64
Q = Tensor.randn((batch, heads, seq_len, d_k))
K = Tensor.randn((batch, heads, seq_len, d_k))
V = Tensor.randn((batch, heads, seq_len, d_k))
# Basic attention
out = F.scaled_dot_product_attention(Q, K, V)
# Causal (autoregressive) attention
mask = create_causal_mask(seq_len, batch_size=batch, num_heads=heads)
out = F.scaled_dot_product_attention(Q, K, V, mask=mask)
# Layer-based usage
attn = ScaledDotProductAttention()
out = attn(Q, K, V, mask=mask)
# GPU acceleration
if Tensor.cuda_is_available():
    out = F.scaled_dot_product_attention(Q.cuda(), K.cuda(), V.cuda())
Functional API
from tensorax import functional as F, Tensor
x = Tensor([[-2.0, -1.0, 0.0, 1.0, 2.0]])
# Activation functions
y1 = F.relu(x) # [0.0, 0.0, 0.0, 1.0, 2.0]
y2 = F.sigmoid(x) # [0.119, 0.269, 0.5, 0.731, 0.881]
y3 = F.tanh(x) # [-0.964, -0.762, 0.0, 0.762, 0.964]
y4 = F.softmax(x, dim=-1) # Normalized probabilities
# Loss functions
pred = Tensor([[2.0, 1.5, 3.0]])
target = Tensor([[2.5, 2.0, 2.5]])
loss = F.mse_loss(pred, target) # Mean squared error
Project Structure
tensorax/
├── csrc/                   # C++ and CUDA source code
│   ├── cuda/kernels/       # CUDA kernel implementations
│   │   ├── elementwise.cu  # Element-wise operations
│   │   ├── reduction.cu    # Sum, mean, max reductions
│   │   ├── matmul.cu       # Matrix multiplication (6 variants)
│   │   └── attn.cu         # Attention kernels (naive, tiled, flash)
│   ├── cpu/                # CPU implementations
│   └── tensor_ops.*        # Core operations and pybind11 bindings
├── tensorax/               # Python package
│   ├── tensor.py           # Tensor class
│   ├── functional.py       # Functional API (relu, softmax, sdpa, ...)
│   ├── nn/                 # Neural network modules
│   │   └── attention/      # Attention layers and utilities
│   └── optim.py            # Optimizers
├── tests/                  # Test suite
├── examples/               # Usage examples
└── docs/                   # Documentation
⚡ Performance
Tensorax uses hand-optimized CUDA kernels for maximum performance. Here are some benchmark results for matrix multiplication (fp32, 3×1024×1024):
Matrix Multiplication Benchmark (100 runs)
Comparison of different CUDA kernel implementations vs NumPy and PyTorch:
| Implementation | Time (seconds) | Speedup vs. NumPy (CPU) |
|---|---|---|
| 1D Block Tiling (Best) | 0.95 | 2.31x faster |
| Tiled Matrix Multiply | 1.22 | 1.80x faster |
| NumPy (CPU) | 1.85 | Baseline (CPU) |
| Shared Memory Cache Blocking | 2.18 | 0.85x |
| Default CUDA | 3.37 | 0.55x |
| Shared Memory Coalescing | 3.44 | 0.54x |
| PyTorch CUDA (Reference) | 0.41 | 4.51x faster |
Key Insights:
- Our 1D block tiling implementation achieves 2.31x faster performance than NumPy
- Performance is 43% of PyTorch's highly optimized CUDA kernels (room for improvement)
- Tiled approaches consistently outperform naive implementations by 1.5-3x
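For a rough reproduction of this comparison on your own hardware, the sketch below is illustrative only: it uses Python-level timing, and without explicit device synchronization the GPU numbers may under-report kernel time. It times repeated matmuls on CPU and, when available, on the GPU.

import time
from tensorax import Tensor

def bench(fn, runs=100):
    # Warm up once, then return mean seconds per call over `runs` repetitions.
    fn()
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) / runs

a = Tensor.randn((1024, 1024))
b = Tensor.randn((1024, 1024))
print(f"CPU matmul:  {bench(lambda: a @ b):.4f} s/run")

if Tensor.cuda_is_available():
    a_gpu, b_gpu = a.cuda(), b.cuda()
    print(f"CUDA matmul: {bench(lambda: a_gpu @ b_gpu):.4f} s/run")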
Attention Kernels
Tensorax includes three hand-written CUDA attention kernels with no cuBLAS or library dependencies:
| Kernel | Technique | Best For |
|---|---|---|
| Naive | One thread per output element, three-pass softmax | Small sequences, correctness baseline |
| Tiled | Shared memory K/V tiles, online softmax | Medium sequences |
| Flash | Block Q/K/V tiling, online softmax with rescaling | Long sequences, memory efficiency |
All kernels support arbitrary batch size, head count, asymmetric sequence lengths (seq_q != seq_k), separate d_k/d_v, and optional additive attention masks.
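The tiled and flash kernels rely on the online-softmax trick: scores are consumed in blocks while a running maximum and running normalizer are maintained, and previously accumulated results are rescaled, so the full seq_q × seq_k score matrix never has to be materialized. Below is a minimal pure-Python sketch of that rescaling for a single query row with scalar values; the function is illustrative only and not part of the Tensorax API.

import math

def online_softmax_attention(scores, values):
    # Streaming softmax(scores)-weighted sum of values, one (score, value) pair at a time.
    m = float("-inf")  # running maximum of scores seen so far
    l = 0.0            # running softmax normalizer, relative to m
    acc = 0.0          # running weighted sum of values, relative to m
    for s, v in zip(scores, values):
        m_new = max(m, s)
        scale = math.exp(m - m_new) if m != float("-inf") else 0.0
        l = l * scale + math.exp(s - m_new)       # rescale old normalizer, add new term
        acc = acc * scale + math.exp(s - m_new) * v
        m = m_new
    return acc / l

# Matches the usual two-pass result: softmax([0.1, 2.0, -1.0]) . [1.0, 2.0, 3.0]
print(online_softmax_attention([0.1, 2.0, -1.0], [1.0, 2.0, 3.0]))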
Optimization Techniques
- ✅ Coalesced memory access for elementwise operations
- ✅ Tiled matrix multiplication with shared memory
- ✅ Efficient parallel reductions for sum/max operations
- ✅ Kernel fusion to minimize memory transfers
- ✅ Flash Attention with online softmax (no O(n²) attention matrix is materialized); see the tiling sketch below
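To see why tiling helps, here is the same blocking idea written as plain Python loops. This is a conceptual sketch only; the actual kernels implement it with CUDA thread blocks and shared memory, not Python.

def blocked_matmul(A, B, tile=32):
    # C = A @ B computed tile by tile; each loaded tile of A and B is reused
    # across a whole tile of C, which is what shared-memory caching exploits.
    n, k, m = len(A), len(A[0]), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        s = C[i][j]
                        for kk in range(k0, min(k0 + tile, k)):
                            s += A[i][kk] * B[kk][j]
                        C[i][j] = s
    return C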
Documentation
- Development Guide - How to contribute and develop
- Architecture Overview - System design and internals
- CI/CD Documentation - GitHub Actions workflows and automation
- Examples - Code examples and tutorials
Development
Setup development environment
# Clone repository
git clone https://github.com/NotShrirang/tensorax.git
cd tensorax
# Create virtual environment
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
pip install -r requirements-dev.txt
# Install in development mode
pip install -e .
Build the Extension
# Quick build (automatically detects CUDA)
bash build.sh
# Manual build (CPU only)
python setup.py build_ext --inplace
# Manual build (with CUDA)
CUDA_HOME=/usr/local/cuda python setup.py build_ext --inplace
Run Tests
# Run all tests
pytest tests/
# Run with verbose output
pytest tests/ -v
# Run with coverage report
pytest tests/ --cov=tensorax --cov-report=html --cov-report=term
# Run specific test file
pytest tests/test_tensor.py -v
Test Status
Current Status (December 9, 2025):
- 229 tests passing (98.9% success rate)
- 5 tests skipped (CUDA tests - require GPU)
- 0 tests failing
- 87% code coverage
Test Breakdown:
- Core tensor operations: 100% passing
- Neural network layers: 100% passing
- Optimizers: 100% passing
- Integration tests: 100% passing
- Functional API: 100% passing
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
1. Fork the repository
2. Create your feature branch (git checkout -b feature/amazing-feature)
3. Commit your changes (git commit -m 'Add amazing feature')
4. Push to the branch (git push origin feature/amazing-feature)
5. Open a Pull Request
Implemented Features
Core Tensor Operations ✅
- Element-wise operations: add, subtract, multiply, divide, power, sqrt, abs
- Matrix operations: matmul (2D/3D batched), transpose
- Reduction operations: sum, mean, max, min, argmax, argmin
- Mathematical functions: exp, log, pow, clamp
- Shape operations: reshape, view, squeeze, unsqueeze
- Tensor creation: zeros, ones, full, randn
- Device management: CPU ↔ CUDA transfers with automatic fallback
- Indexing & slicing: Advanced tensor indexing and slicing
- Comparison operators: eq, lt, gt with broadcasting
- Automatic differentiation: Complete backpropagation with gradient tracking
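A short tour of these operations. This is a sketch that assumes PyTorch-style method names (sum, mean, reshape, indexing); check the tests/ directory for the exact signatures.

from tensorax import Tensor

x = Tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])

total = x.sum()            # reduction over all elements
avg = x.mean()
flat = x.reshape((1, 6))   # shape operation: (2, 3) -> (1, 6)
row = x[0]                 # indexing / slicing

print(x.shape, flat.shape)
print(total, avg, row)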
Neural Network Layers ✅
- Linear: Fully connected layer with optional bias
- Activation layers: ReLU, Sigmoid, Tanh, Softmax (with custom dim)
- Dropout: Training/eval mode with configurable drop probability
- Sequential: Container with recursive parameter collection
- Module system: Base class with parameter management, device transfer, and train/eval modes
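A small sketch of the Module API, assuming the train()/eval() switches implied by the feature list above (the names follow PyTorch conventions and have not been verified against the source):

from tensorax import nn, Tensor

model = nn.Sequential(
    nn.Linear(4, 8),
    nn.ReLU(),
    nn.Dropout(0.5),   # only active in training mode
    nn.Linear(8, 2),
)

x = Tensor.randn((10, 4))

model.train()          # dropout randomly zeroes activations
out_train = model(x)

model.eval()           # dropout is a no-op during evaluation
out_eval = model(x)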
Optimizers ✅
- SGD: Stochastic Gradient Descent with momentum support
- Adam: Adaptive moment estimation with bias correction
- Learning rate: Configurable with validation
- Gradient management: zero_grad() and parameter updates
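A minimal optimizer step with SGD plus momentum (a sketch; the momentum keyword is assumed from the feature list above and mirrors PyTorch's API):

from tensorax import nn, optim, Tensor, functional as F

model = nn.Linear(4, 1)
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

x = Tensor.randn((8, 4))
y = Tensor.randn((8, 1))

loss = F.mse_loss(model(x), y)
optimizer.zero_grad()  # clear gradients from the previous step
loss.backward()        # backpropagate through the computational graph
optimizer.step()       # update parameters (using momentum buffers)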
Loss Functions ✅
- Mean Squared Error (MSE): For regression tasks
- Cross Entropy Loss: From probabilities or logits
- Backward pass: All loss functions support gradient computation
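A sketch of the logit-based cross entropy mentioned above; the one-hot target encoding here is an assumption, so adjust it to the actual signature in functional.py.

from tensorax import Tensor, functional as F

logits = Tensor([[2.0, 0.5, -1.0],
                 [0.1, 1.5, 0.3]], requires_grad=True)   # raw, unnormalized scores
targets = Tensor([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])                       # one-hot targets (assumed encoding)

loss = F.cross_entropy_from_logits(logits, targets)
loss.backward()            # gradients flow back to the logits
print(logits.grad)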
Attention ✅
- Scaled Dot-Product Attention: softmax(Q @ K^T / sqrt(d_k)) @ V
- Three CUDA kernels: Naive, tiled (shared memory), flash (online softmax)
- CPU reference: Pure C implementation for validation and CPU-only builds
- Attention masks: Causal masks, padding masks, and custom additive masks
- Cross-attention: Supports seq_len_q != seq_len_k and d_k != d_v, as sketched below
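Because seq_len_q != seq_len_k is supported, cross-attention (e.g. a decoder attending over encoder states) can be written directly with the same SDPA call shown in the Quick Start; the shapes below are illustrative.

from tensorax import Tensor, functional as F

batch, heads, d_k = 2, 4, 32
seq_q, seq_k = 16, 128    # query length differs from key/value length

Q = Tensor.randn((batch, heads, seq_q, d_k))
K = Tensor.randn((batch, heads, seq_k, d_k))
V = Tensor.randn((batch, heads, seq_k, d_k))

out = F.scaled_dot_product_attention(Q, K, V)
print(out.shape)          # one output row per query position: (2, 4, 16, 32)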
Functional API ✅
- Activations: relu, sigmoid, tanh, softmax (multi-dimensional)
- Loss functions: mse_loss, cross_entropy_loss, cross_entropy_from_logits
- Attention: scaled_dot_product_attention with optional mask
- Linear transformation: Functional linear with optional bias
- Gradient support: All functions support backpropagation
Roadmap
Completed ✅
- Core tensor operations (element-wise, reduction, mathematical)
- Automatic differentiation (complete autograd system)
- Neural network layers (Linear, activations, Dropout)
- Optimizers (SGD with momentum, Adam)
- Loss functions (MSE, Cross Entropy)
- Sequential container
- Device management (CPU/CUDA)
- Comprehensive test suite (229 tests passing)
- Tensor serialization (save/load)
- Scaled dot-product attention (naive, tiled, flash CUDA kernels)
In Progress
- Multi-head attention layer with linear projections
- CUDA kernel optimization for all operations
- Documentation improvements
- Performance benchmarking suite
Future Features
- Transformer encoder/decoder blocks
- Convolution and pooling layers (Conv2D, MaxPool2D)
- Batch normalization and Layer normalization
- More activation functions (LeakyReLU, GELU, Swish, ELU)
- Additional optimizers (RMSprop, AdamW, Adagrad)
- Learning rate schedulers (StepLR, ExponentialLR, CosineAnnealing)
- Multi-GPU support with data parallelism
- Mixed precision training (FP16/BF16)
- Distributed training (DDP)
- Graph optimization and fusion
- JIT compilation for custom operations
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
- Inspired by PyTorch's design and API
- CUDA optimization techniques from various deep learning frameworks
- Community contributions and feedback
Learning Resource
Tensorax is an excellent educational tool for understanding:
- Deep learning internals (how PyTorch/TensorFlow work under the hood)
- CUDA programming and GPU optimization
- Automatic differentiation implementation
- Building ML frameworks from scratch
- C++/Python interoperability with pybind11
Check out the examples/ directory for tutorials!
Citation
If you use Tensorax in your research or project, please cite:
@software{tensorax2025,
title = {Tensorax: Pure C++/CUDA Tensor Library},
author = {NotShrirang},
year = {2025},
url = {https://github.com/NotShrirang/tensorax}
}
Contact & Support
- GitHub: @NotShrirang
- Issues: Report bugs or request features
- Discussions: Ask questions
⭐ Star History
If you find Tensorax useful, please consider giving it a star! ⭐
Built with ❤️ by @NotShrirang
Download files
Source Distribution
File details
Details for the file tensorax-0.1.5.tar.gz.
File metadata
- Download URL: tensorax-0.1.5.tar.gz
- Upload date:
- Size: 47.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2b7f803426d6a110391add3cf07acfc61a3097283981d21eb72b37f5034e388d |
| MD5 | 25ea6d595f6fde488a457c543165f881 |
| BLAKE2b-256 | 2fbaf22ab98e10657b4c9126879bdbe93fb4d4299c059b7759564ffa2850bcb7 |