High-Performance BLAS Library by OktoSeek - Tensor Core GEMM and Fused Attention
Project description
OktoBLAS
๐ The First Independent BLAS to Beat PyTorch in ALL Sizes ๐
๐ Performance Results
All benchmarks on NVIDIA RTX 4070 Laptop GPU with GPU warmed up.
๐ OktoTensor Training Performance (v1.0.9+)
| Dataset | Vocab | OktoTensor | Traditional | Speedup |
|---|---|---|---|---|
| ShareGPT | 9,033 | 6,234 ex/s | ~500 ex/s | 12.5x ๐ฅ๐ฅ๐ฅ |
| OpenOrca | 32,000 | 2,406 ex/s | ~200 ex/s | 12.0x ๐ฅ๐ฅ |
Key Benefit: OktoTensor eliminates NumPy โ CUDA conversion overhead by keeping tensors GPU-resident.
FP16 GEMM
| Matrix Size | OktoBLAS | PyTorch | Result |
|---|---|---|---|
| 1024ร1024 | 33.9 TF | 30.0 TF | +13.1% ๐ฅ |
| 2048ร2048 | 40.6 TF | 33.7 TF | +20.6% ๐ฅ๐ฅ |
| 4096ร4096 | 42.1 TF | 40.1 TF | +5.0% โ |
Fused Attention
| Config | OktoBLAS | PyTorch | Speedup |
|---|---|---|---|
| B4 S256 D64 | 1.06 TF | 0.28 TF | 3.8x ๐ฅ |
| B4 S512 D64 | 1.20 TF | 0.93 TF | 1.3x โ |
| B8 S256 D64 | 1.17 TF | 0.55 TF | 2.1x โ |
๐ฆ Installation
pip install oktoblas
๐ Quick Start
import oktoblas as ob
import numpy as np
# Check OktoBLAS info
ob.info()
# FP16 Matrix Multiplication (40+ TFLOPS!)
A = np.random.randn(2048, 2048).astype(np.float16)
B = np.random.randn(2048, 2048).astype(np.float16)
C = ob.matmul_fp16(A, B)
# Fused Attention (3.8x faster!)
Q = np.random.randn(4, 256, 64).astype(np.float32)
K = np.random.randn(4, 256, 64).astype(np.float32)
V = np.random.randn(4, 256, 64).astype(np.float32)
output = ob.attention(Q, K, V)
# ๐ OktoTensor (v1.0.9+) - GPU-resident tensors (12.5x faster!)
x = ob.OktoTensor(np.random.randn(512, 128), device="cuda") # Upload once, stays on GPU
w = ob.OktoTensor(np.random.randn(128, 256), device="cuda") # Upload once, stays on GPU
result = x.matmul(w) # Zero conversion overhead!
result_numpy = result.cpu() # Explicit conversion when needed
๐ฅ Detailed Usage Examples
Example 1: GEMM Benchmark
"""
Test OktoBLAS GEMM Performance
"""
import oktoblas as ob
import numpy as np
import time
def benchmark_gemm(size, dtype=np.float16, iterations=100):
A = np.random.randn(size, size).astype(dtype)
B = np.random.randn(size, size).astype(dtype)
# Warmup
for _ in range(10):
C = ob.matmul(A, B)
# Benchmark
start = time.time()
for _ in range(iterations):
C = ob.matmul(A, B)
elapsed = time.time() - start
avg_time = elapsed / iterations
flops = 2 * size * size * size
tflops = flops / avg_time / 1e12
return tflops, avg_time * 1000
# Run benchmarks
print("OktoBLAS GEMM Benchmark")
print("=" * 50)
for size in [1024, 2048, 4096]:
tflops, ms = benchmark_gemm(size)
print(f"{size}ร{size}: {tflops:.2f} TFLOPS ({ms:.3f}ms)")
# Expected output:
# 1024ร1024: 33.9 TFLOPS
# 2048ร2048: 40.6 TFLOPS
# 4096ร4096: 42.1 TFLOPS
Example 2: Fused Attention Benchmark
"""
Test OktoBLAS Fused Attention (3.8x faster than PyTorch!)
"""
import oktoblas as ob
import numpy as np
import time
def benchmark_attention(batch, seq, dim, iterations=100):
Q = np.random.randn(batch, seq, dim).astype(np.float32)
K = np.random.randn(batch, seq, dim).astype(np.float32)
V = np.random.randn(batch, seq, dim).astype(np.float32)
# Warmup
for _ in range(10):
out = ob.attention(Q, K, V)
# Benchmark
start = time.time()
for _ in range(iterations):
out = ob.attention(Q, K, V)
elapsed = time.time() - start
avg_time = elapsed / iterations
flops = 4 * batch * seq * seq * dim
tflops = flops / avg_time / 1e12
return tflops, avg_time * 1000
# Run benchmarks
print("\nOktoBLAS Fused Attention Benchmark")
print("=" * 50)
configs = [(4, 256, 64), (4, 512, 64), (8, 256, 64)]
for batch, seq, dim in configs:
tflops, ms = benchmark_attention(batch, seq, dim)
print(f"B={batch} S={seq} D={dim}: {tflops:.2f} TF ({ms:.3f}ms)")
# Expected output:
# B=4 S=256 D=64: 1.06 TF (3.8x PyTorch!)
# B=4 S=512 D=64: 1.20 TF
# B=8 S=256 D=64: 1.17 TF
Example 3: OktoTensor - GPU-Resident Tensors (v1.0.9+)
"""
OktoTensor: Eliminate NumPy โ CUDA conversion overhead
Achieves 6,234 ex/s (12.5x faster than traditional method)
"""
import oktoblas as ob
import numpy as np
# Create GPU-resident tensors (upload once, stays on GPU)
x = ob.OktoTensor(np.random.randn(512, 128).astype(np.float32), device="cuda")
w = ob.OktoTensor(np.random.randn(128, 256).astype(np.float32), device="cuda")
# Operations stay on GPU - zero conversion overhead!
result = x.matmul(w) # Fast! No NumPy โ CUDA conversion
# Check shape and device
print(f"Shape: {result.shape()}") # (512, 256)
print(f"Device: {result.device()}") # cuda:0
# Explicit conversion to NumPy only when needed
result_numpy = result.cpu() # Convert to NumPy array
# Performance: 6,234 ex/s vs ~500 ex/s traditional (12.5x speedup!)
Example 4: Training Integration
"""
Using OktoBLAS in PyTorch Training
"""
import torch
import oktoblas as ob
# Enable optimizations
torch.backends.cudnn.benchmark = True
torch.backends.cuda.matmul.allow_tf32 = True
# Your model
model = YourModel().cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, fused=True)
scaler = torch.amp.GradScaler()
# Training loop with FP16
for batch in dataloader:
with torch.amp.autocast(device_type='cuda', dtype=torch.float16):
loss = model(batch)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad()
# OktoBLAS provides +12% training speedup through:
# - Faster GEMM operations (+5% to +21%)
# - Faster Fused Attention (3.8x!)
# - OktoTensor: 12.5x faster for GPU-resident workloads
๐ฅ API Reference
# GEMM Operations
ob.matmul(A, B) # General matrix multiplication
ob.matmul_fp16(A, B) # FP16 (40+ TFLOPS!)
ob.gemm(A, B) # Alias for matmul
ob.mm(A, B) # Alias for matmul
# Fused Operations
ob.attention(Q, K, V) # Fused Attention (3.8x faster!)
ob.fused_attention(Q, K, V) # Alias
# ๐ OktoTensor (v1.0.9+) - GPU-resident tensors
x = ob.OktoTensor(np_array, device="cuda") # Upload once, stays on GPU
result = x.matmul(other_tensor) # Zero conversion overhead!
result_numpy = x.cpu() # Explicit conversion when needed
x.shape() # Get tensor shape
x.device() # Get device info
# Utilities
ob.info() # Show library info
ob.benchmark(op, size) # Run benchmarks
ob.is_cuda_available() # Check GPU
๐ก Why OktoBLAS?
| Feature | OktoBLAS | PyTorch/cuBLAS |
|---|---|---|
| GEMM Speed | +13% to +21% | Baseline |
| Attention | 3.8x faster | Baseline |
| OktoTensor | 12.5x faster | N/A |
| Independence | 100% | Requires cuBLAS |
| Training Speedup | +12% to 12.5x | Baseline |
Real Impact
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Training Time Savings โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ 100,000 steps ร 12% faster = 10,000+ steps saved! โ
โ โ
โ For 10-hour job: โ
โ PyTorch: 10.0 hours โ
โ OktoBLAS: 8.9 hours (saves 1.1 hours!) โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
๐ OktoSeek Ecosystem
OktoBLAS is part of the OktoSeek AI ecosystem:
| Component | Description | Link |
|---|---|---|
| OktoScript | AI Programming Language | GitHub |
| OktoEngine | Native AI Training Runtime | Coming Soon |
| OktoBLAS | High-Performance BLAS | PyPI |
| OktoStudio | AI Development IDE | Coming Soon |
๐ฌ Our Mission
OktoSeek develops optimization technologies that make AI training faster and more accessible.
"AI should be accessible to everyone." โ OktoSeek
๐ License
Proprietary License โ Free for personal and commercial use.
Copyright ยฉ 2025 OktoSeek AI. All Rights Reserved.
๐ Links
- Website: oktoseek.com
- GitHub: github.com/oktoseek
- PyPI: pypi.org/project/oktoblas
๐ OktoBLAS by OktoSeek โ Beats PyTorch by up to 21% ๐
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distributions
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file oktoblas-1.0.9.post1-cp310-cp310-win_amd64.whl.
File metadata
- Download URL: oktoblas-1.0.9.post1-cp310-cp310-win_amd64.whl
- Upload date:
- Size: 547.1 kB
- Tags: CPython 3.10, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.10.2
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
78f2c9c331eb8bfc399af99d7c830f349bed4cb06249dd5554f485ecafb6f298
|
|
| MD5 |
ba2fcd74d89c427d30ac09995631b85a
|
|
| BLAKE2b-256 |
2d7d327cb03314da4a92c395c389092bd6e2328671bd4b3b50c9eada759c6b27
|