High-Performance BLAS Library by OktoSeek - Tensor Core GEMM and Fused Attention

These details have not been verified by PyPI

Project links

Project description

OktoBLAS

🏆 The First Independent BLAS to Beat PyTorch in ALL Sizes 🏆

🏆 Performance Results

All benchmarks on NVIDIA RTX 4070 Laptop GPU with GPU warmed up.

🚀 OktoTensor Training Performance (v1.0.9+)

Dataset	Vocab	OktoTensor	Traditional	Speedup
ShareGPT	9,033	6,234 ex/s	~500 ex/s	12.5x 🔥🔥🔥
OpenOrca	32,000	2,406 ex/s	~200 ex/s	12.0x 🔥🔥

Key Benefit: OktoTensor eliminates NumPy ↔ CUDA conversion overhead by keeping tensors GPU-resident.

FP16 GEMM

Matrix Size	OktoBLAS	PyTorch	Result
1024×1024	33.9 TF	30.0 TF	+13.1% 🔥
2048×2048	40.6 TF	33.7 TF	+20.6% 🔥🔥
4096×4096	42.1 TF	40.1 TF	+5.0% ✅

Fused Attention

Config	OktoBLAS	PyTorch	Speedup
B4 S256 D64	1.06 TF	0.28 TF	3.8x 🔥
B4 S512 D64	1.20 TF	0.93 TF	1.3x ✅
B8 S256 D64	1.17 TF	0.55 TF	2.1x ✅

📦 Installation

pip install oktoblas

📖 Quick Start

import oktoblas as ob
import numpy as np

# Check OktoBLAS info
ob.info()

# FP16 Matrix Multiplication (40+ TFLOPS!)
A = np.random.randn(2048, 2048).astype(np.float16)
B = np.random.randn(2048, 2048).astype(np.float16)
C = ob.matmul_fp16(A, B)

# Fused Attention (3.8x faster!)
Q = np.random.randn(4, 256, 64).astype(np.float32)
K = np.random.randn(4, 256, 64).astype(np.float32)
V = np.random.randn(4, 256, 64).astype(np.float32)
output = ob.attention(Q, K, V)

# 🚀 OktoTensor (v1.0.9+) - GPU-resident tensors (12.5x faster!)
x = ob.OktoTensor(np.random.randn(512, 128), device="cuda")  # Upload once, stays on GPU
w = ob.OktoTensor(np.random.randn(128, 256), device="cuda")  # Upload once, stays on GPU
result = x.matmul(w)  # Zero conversion overhead!
result_numpy = result.cpu()  # Explicit conversion when needed

🔥 Detailed Usage Examples

Example 1: GEMM Benchmark

"""
Test OktoBLAS GEMM Performance
"""
import oktoblas as ob
import numpy as np
import time

def benchmark_gemm(size, dtype=np.float16, iterations=100):
    A = np.random.randn(size, size).astype(dtype)
    B = np.random.randn(size, size).astype(dtype)
    
    # Warmup
    for _ in range(10):
        C = ob.matmul(A, B)
    
    # Benchmark
    start = time.time()
    for _ in range(iterations):
        C = ob.matmul(A, B)
    elapsed = time.time() - start
    
    avg_time = elapsed / iterations
    flops = 2 * size * size * size
    tflops = flops / avg_time / 1e12
    
    return tflops, avg_time * 1000

# Run benchmarks
print("OktoBLAS GEMM Benchmark")
print("=" * 50)

for size in [1024, 2048, 4096]:
    tflops, ms = benchmark_gemm(size)
    print(f"{size}×{size}: {tflops:.2f} TFLOPS ({ms:.3f}ms)")

# Expected output:
# 1024×1024: 33.9 TFLOPS
# 2048×2048: 40.6 TFLOPS  
# 4096×4096: 42.1 TFLOPS

Example 2: Fused Attention Benchmark

"""
Test OktoBLAS Fused Attention (3.8x faster than PyTorch!)
"""
import oktoblas as ob
import numpy as np
import time

def benchmark_attention(batch, seq, dim, iterations=100):
    Q = np.random.randn(batch, seq, dim).astype(np.float32)
    K = np.random.randn(batch, seq, dim).astype(np.float32)
    V = np.random.randn(batch, seq, dim).astype(np.float32)
    
    # Warmup
    for _ in range(10):
        out = ob.attention(Q, K, V)
    
    # Benchmark
    start = time.time()
    for _ in range(iterations):
        out = ob.attention(Q, K, V)
    elapsed = time.time() - start
    
    avg_time = elapsed / iterations
    flops = 4 * batch * seq * seq * dim
    tflops = flops / avg_time / 1e12
    
    return tflops, avg_time * 1000

# Run benchmarks
print("\nOktoBLAS Fused Attention Benchmark")
print("=" * 50)

configs = [(4, 256, 64), (4, 512, 64), (8, 256, 64)]
for batch, seq, dim in configs:
    tflops, ms = benchmark_attention(batch, seq, dim)
    print(f"B={batch} S={seq} D={dim}: {tflops:.2f} TF ({ms:.3f}ms)")

# Expected output:
# B=4 S=256 D=64: 1.06 TF (3.8x PyTorch!)
# B=4 S=512 D=64: 1.20 TF
# B=8 S=256 D=64: 1.17 TF

Example 3: OktoTensor - GPU-Resident Tensors (v1.0.9+)

"""
OktoTensor: Eliminate NumPy ↔ CUDA conversion overhead
Achieves 6,234 ex/s (12.5x faster than traditional method)
"""
import oktoblas as ob
import numpy as np

# Create GPU-resident tensors (upload once, stays on GPU)
x = ob.OktoTensor(np.random.randn(512, 128).astype(np.float32), device="cuda")
w = ob.OktoTensor(np.random.randn(128, 256).astype(np.float32), device="cuda")

# Operations stay on GPU - zero conversion overhead!
result = x.matmul(w)  # Fast! No NumPy ↔ CUDA conversion

# Check shape and device
print(f"Shape: {result.shape()}")  # (512, 256)
print(f"Device: {result.device()}")  # cuda:0

# Explicit conversion to NumPy only when needed
result_numpy = result.cpu()  # Convert to NumPy array

# Performance: 6,234 ex/s vs ~500 ex/s traditional (12.5x speedup!)

Example 4: Training Integration

"""
Using OktoBLAS in PyTorch Training
"""
import torch
import oktoblas as ob

# Enable optimizations
torch.backends.cudnn.benchmark = True
torch.backends.cuda.matmul.allow_tf32 = True

# Your model
model = YourModel().cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, fused=True)
scaler = torch.amp.GradScaler()

# Training loop with FP16
for batch in dataloader:
    with torch.amp.autocast(device_type='cuda', dtype=torch.float16):
        loss = model(batch)
    
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad()

# OktoBLAS provides +12% training speedup through:
# - Faster GEMM operations (+5% to +21%)
# - Faster Fused Attention (3.8x!)
# - OktoTensor: 12.5x faster for GPU-resident workloads

🔥 API Reference

# GEMM Operations
ob.matmul(A, B)           # General matrix multiplication
ob.matmul_fp16(A, B)      # FP16 (40+ TFLOPS!)
ob.gemm(A, B)             # Alias for matmul
ob.mm(A, B)               # Alias for matmul

# Fused Operations  
ob.attention(Q, K, V)     # Fused Attention (3.8x faster!)
ob.fused_attention(Q, K, V)  # Alias

# 🚀 OktoTensor (v1.0.9+) - GPU-resident tensors
x = ob.OktoTensor(np_array, device="cuda")  # Upload once, stays on GPU
result = x.matmul(other_tensor)  # Zero conversion overhead!
result_numpy = x.cpu()  # Explicit conversion when needed
x.shape()  # Get tensor shape
x.device()  # Get device info

# Utilities
ob.info()                 # Show library info
ob.benchmark(op, size)    # Run benchmarks
ob.is_cuda_available()    # Check GPU

💡 Why OktoBLAS?

Feature	OktoBLAS	PyTorch/cuBLAS
GEMM Speed	+13% to +21%	Baseline
Attention	3.8x faster	Baseline
OktoTensor	12.5x faster	N/A
Independence	100%	Requires cuBLAS
Training Speedup	+12% to 12.5x	Baseline

Real Impact

┌─────────────────────────────────────────────────────────────────────┐
│                    Training Time Savings                            │
├─────────────────────────────────────────────────────────────────────┤
│   100,000 steps × 12% faster = 10,000+ steps saved!                 │
│                                                                     │
│   For 10-hour job:                                                  │
│   PyTorch:     10.0 hours                                           │
│   OktoBLAS:    8.9 hours (saves 1.1 hours!)                         │
└─────────────────────────────────────────────────────────────────────┘

🌐 OktoSeek Ecosystem

OktoBLAS is part of the OktoSeek AI ecosystem:

Component	Description	Link
OktoScript	AI Programming Language	GitHub
OktoEngine	Native AI Training Runtime	Coming Soon
OktoBLAS	High-Performance BLAS	PyPI
OktoStudio	AI Development IDE	Coming Soon

🔬 Our Mission

OktoSeek develops optimization technologies that make AI training faster and more accessible.

"AI should be accessible to everyone." — OktoSeek

📜 License

Proprietary License — Free for personal and commercial use.

🔗 Links

Website: oktoseek.com
GitHub: github.com/oktoseek
PyPI: pypi.org/project/oktoblas

🏆 OktoBLAS by OktoSeek — Beats PyTorch by up to 21% 🏆

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

1.0.9.post1

Dec 6, 2025

1.0.9

Dec 6, 2025

1.0.8

Dec 6, 2025

1.0.7

Dec 6, 2025

1.0.6

Dec 5, 2025

1.0.5

Dec 5, 2025

1.0.4

Dec 5, 2025

1.0.3

Dec 5, 2025

1.0.2

Dec 5, 2025

1.0.1

Dec 5, 2025

1.0.0

Dec 5, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

oktoblas-1.0.9.post1-cp310-cp310-win_amd64.whl (547.1 kB view details)

Uploaded Dec 6, 2025 CPython 3.10Windows x86-64

File details

Details for the file oktoblas-1.0.9.post1-cp310-cp310-win_amd64.whl.

File metadata

Download URL: oktoblas-1.0.9.post1-cp310-cp310-win_amd64.whl
Upload date: Dec 6, 2025
Size: 547.1 kB
Tags: CPython 3.10, Windows x86-64
Uploaded using Trusted Publishing? No
Uploaded via: maturin/1.10.2

File hashes

Hashes for oktoblas-1.0.9.post1-cp310-cp310-win_amd64.whl
Algorithm	Hash digest
SHA256	`78f2c9c331eb8bfc399af99d7c830f349bed4cb06249dd5554f485ecafb6f298`
MD5	`ba2fcd74d89c427d30ac09995631b85a`
BLAKE2b-256	`2d7d327cb03314da4a92c395c389092bd6e2328671bd4b3b50c9eada759c6b27`

See more details on using hashes here.

oktoblas 1.0.9.post1

Navigation

Verified details

Project links

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

OktoBLAS

🏆 Performance Results

🚀 OktoTensor Training Performance (v1.0.9+)

FP16 GEMM

Fused Attention

📦 Installation

📖 Quick Start

🔥 Detailed Usage Examples

Example 1: GEMM Benchmark

Example 2: Fused Attention Benchmark

Example 3: OktoTensor - GPU-Resident Tensors (v1.0.9+)

Example 4: Training Integration

🔥 API Reference

💡 Why OktoBLAS?

Real Impact

🌐 OktoSeek Ecosystem

🔬 Our Mission

📜 License

🔗 Links

Project details

Verified details

Project links

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distributions

Built Distribution

File details

File metadata

File hashes