# OktoBLAS by OktoSeek

🚀 High-Performance BLAS Library | ⚡ Tensor Core Acceleration | 🔥 100% Independent

OktoBLAS is a high-performance, fully independent BLAS library built from scratch in Rust + CUDA PTX, with no cuBLAS dependency.
## 🔧 Installation

```shell
pip install oktoblas
```
## 📖 Quick Start

```python
import oktoblas as ob
import numpy as np

# Matrix multiplication
A = np.random.randn(2048, 2048).astype(np.float32)
B = np.random.randn(2048, 2048).astype(np.float32)
C = ob.matmul(A, B)

# FP16 with Tensor Cores
A16 = np.random.randn(2048, 2048).astype(np.float16)
B16 = np.random.randn(2048, 2048).astype(np.float16)
C16 = ob.matmul_fp16(A16, B16)

# Fused Attention
batch, seq_len, head_dim = 4, 512, 64
Q = np.random.randn(batch, seq_len, head_dim).astype(np.float32)
K = np.random.randn(batch, seq_len, head_dim).astype(np.float32)
V = np.random.randn(batch, seq_len, head_dim).astype(np.float32)
output = ob.attention(Q, K, V)

# Show info
ob.info()
```
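The exact semantics of `ob.attention` are not spelled out above; a minimal NumPy sketch, assuming it computes standard scaled dot-product attention (the `attention_reference` helper below is hypothetical, not part of OktoBLAS), can serve as a correctness reference when validating OktoBLAS output:

```python
import numpy as np

def attention_reference(Q, K, V):
    """NumPy reference for scaled dot-product attention:
    softmax(Q @ K^T / sqrt(head_dim)) @ V, applied per batch."""
    d = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d)   # (batch, seq, seq)
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # rows sum to 1
    return weights @ V                               # (batch, seq, head_dim)

rng = np.random.default_rng(0)
batch, seq_len, head_dim = 2, 8, 4
Q = rng.standard_normal((batch, seq_len, head_dim)).astype(np.float32)
K = rng.standard_normal((batch, seq_len, head_dim)).astype(np.float32)
V = rng.standard_normal((batch, seq_len, head_dim)).astype(np.float32)
ref = attention_reference(Q, K, V)
print(ref.shape)
```

Comparing `ob.attention(Q, K, V)` against `ref` with a small tolerance is a quick sanity check for any fused kernel.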
## 🔥 PyTorch Integration

```python
import torch
import oktoblas as ob

# OktoBLAS operates on NumPy arrays, so CUDA tensors are copied to
# host memory before the call and the result comes back as an ndarray
A = torch.randn(2048, 2048, device='cuda', dtype=torch.float16)
B = torch.randn(2048, 2048, device='cuda', dtype=torch.float16)
C = ob.matmul_fp16(A.cpu().numpy(), B.cpu().numpy())
```
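Tensor Core FP16 GEMMs typically take FP16 inputs but accumulate in FP32; whether OktoBLAS does the same is an assumption here, though it is standard for WMMA-based kernels. A NumPy sketch of that numeric behavior (the `matmul_fp16_emulated` helper is hypothetical), useful when checking `matmul_fp16` results against a full-precision baseline:

```python
import numpy as np

def matmul_fp16_emulated(A16, B16):
    """Emulate Tensor-Core-style FP16 GEMM: FP16 inputs, FP32
    accumulation, FP16 result (assumed OktoBLAS numerics)."""
    acc = A16.astype(np.float32) @ B16.astype(np.float32)
    return acc.astype(np.float16)

rng = np.random.default_rng(1)
A16 = rng.standard_normal((64, 64)).astype(np.float16)
B16 = rng.standard_normal((64, 64)).astype(np.float16)
C16 = matmul_fp16_emulated(A16, B16)
print(C16.dtype)
```

Accumulating in FP32 keeps the error bounded by the final FP16 rounding step, rather than growing with the inner-product length as pure FP16 accumulation would.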
## 🎯 Features
| Feature | Description |
|---|---|
| FP16/FP32 GEMM | Tensor Core acceleration |
| Fused Attention | Single kernel Q×K×V |
| 100% Independent | No cuBLAS dependency |
| Hand-Tuned PTX | Optimized CUDA kernels |
## 📊 Benchmark Results (RTX 4070 Laptop)

All benchmarks were timed with CUDA events.

### FP16 GEMM (Tensor Cores)

| Matrix Size | OktoBLAS (TFLOPS) | PyTorch (TFLOPS) | Ratio |
|---|---|---|---|
| 1024×1024 | 29.1 | 23.3 | 125% |
| 2048×2048 | 35.1 | 34.6 | 101% |
| 4096×4096 | 36.5 | 38.9 | 94% |
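These throughput figures follow the usual GEMM FLOP model: an N×N×N multiply performs 2·N³ floating-point operations (one multiply and one add per inner-product term). A quick sketch of the conversion (the `gemm_tflops` helper is illustrative, not an OktoBLAS API), using the 2048×2048 row as a worked example:

```python
def gemm_tflops(n, seconds):
    """Convert a timed N×N×N GEMM into TFLOPS."""
    flops = 2 * n**3          # one multiply + one add per inner-product term
    return flops / seconds / 1e12

# 35.1 TFLOPS at 2048×2048 implies roughly half a millisecond per GEMM
t = 2 * 2048**3 / 35.1e12
print(round(t * 1e3, 2), "ms")
print(round(gemm_tflops(2048, t), 1), "TFLOPS")
```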
### Fused Attention

| Config | OktoBLAS (TFLOPS) | PyTorch (TFLOPS) | Ratio |
|---|---|---|---|
| B4 S256 D64 | 0.96 | 0.28 | 346% |
| B4 S512 D64 | 1.22 | 0.93 | 131% |
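A common FLOP model for attention counts only the two batched matmuls, Q·Kᵀ and P·V, each costing 2·B·S²·D operations; that this is the model behind the table is an assumption. Under it, the B4 S512 D64 configuration works out as:

```python
def attention_flops(batch, seq_len, head_dim):
    """FLOPs for the two batched matmuls in scaled dot-product
    attention: 2*B*S*S*D each for Q@K^T and P@V (softmax ignored)."""
    return 4 * batch * seq_len**2 * head_dim

flops = attention_flops(4, 512, 64)
print(flops)                                   # ~0.27 GFLOP total
print(round(flops / 1.22e12 * 1e6, 1), "us")   # implied kernel time at 1.22 TFLOPS
```

The absolute FLOP count is small, which is why attention throughput in TFLOPS sits far below GEMM throughput: the kernel is bandwidth- and launch-bound at these sizes.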
## 🚀 Roadmap
- FP16/FP32 GEMM with Tensor Cores
- Fused Attention kernel
- PyPI package release
- ROCm (AMD) support
- Metal (Apple) support
- Full PyTorch autograd integration
## 📚 Part of OktoSeek Ecosystem
OktoBLAS is part of the OktoSeek ecosystem:
| Project | Description | Link |
|---|---|---|
| OktoScript | AI programming language | GitHub |
| OktoEngine | Native ML inference engine | Coming soon |
| OktoStudio | AI Development IDE | Coming soon |
| OktoBLAS | High-performance BLAS | GitHub |
| OkTensor | GPU tensor library | Part of OktoEngine |
## 📜 License
Proprietary License - Free for personal and commercial use.
Copyright (c) 2025 OktoSeek AI. All Rights Reserved.
See LICENSE.txt for details.
## 🙏 Credits
Built with ❤️ by OktoSeek AI.
- Website: https://www.oktoseek.com
- GitHub: https://github.com/oktoseek
- Twitter: https://x.com/oktoseek
⭐ Star us on GitHub!
## File details

Details for the file `oktoblas-1.0.1-cp310-cp310-win_amd64.whl`.

- Download URL: oktoblas-1.0.1-cp310-cp310-win_amd64.whl
- Upload date:
- Size: 198.4 kB
- Tags: CPython 3.10, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.10.2
### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `269799967bf1df3539c06d143d5749250d6a32f80949a63d1c94ca48f5838f0c` |
| MD5 | `9a5968c8e621af04d6b6567830a5f260` |
| BLAKE2b-256 | `b5bdb6eaf6bf22f5ba5b6fd4f29efd84fecb47cdac868a93588171b13d1df770` |