

Project description

OktoBLAS by OktoSeek

🚀 High-Performance BLAS Library | ⚡ Tensor Core Acceleration | 🔥 100% Independent

OktoBLAS is a high-performance, fully independent BLAS library built from scratch in Rust + CUDA PTX, with no cuBLAS dependency.


🔧 Installation

pip install oktoblas

📖 Quick Start

import oktoblas as ob
import numpy as np

# Matrix multiplication
A = np.random.randn(2048, 2048).astype(np.float32)
B = np.random.randn(2048, 2048).astype(np.float32)
C = ob.matmul(A, B)

# FP16 with Tensor Cores
A16 = np.random.randn(2048, 2048).astype(np.float16)
B16 = np.random.randn(2048, 2048).astype(np.float16)
C16 = ob.matmul_fp16(A16, B16)

# Fused Attention
batch, seq_len, head_dim = 4, 512, 64
Q = np.random.randn(batch, seq_len, head_dim).astype(np.float32)
K = np.random.randn(batch, seq_len, head_dim).astype(np.float32)
V = np.random.randn(batch, seq_len, head_dim).astype(np.float32)
output = ob.attention(Q, K, V)

# Show info
ob.info()
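The fused kernel computes scaled dot-product attention, softmax(Q·Kᵀ/√d)·V, in a single launch. The exact scaling convention used by ob.attention is an assumption here, but a plain-NumPy reference like the following is handy for sanity-checking its output on small inputs:

```python
import numpy as np

def attention_reference(Q, K, V):
    """Plain-NumPy scaled dot-product attention:
    softmax(Q @ K^T / sqrt(d)) @ V, computed per batch."""
    d = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # rows sum to 1
    return weights @ V

batch, seq_len, head_dim = 2, 8, 4
rng = np.random.default_rng(0)
Q = rng.standard_normal((batch, seq_len, head_dim)).astype(np.float32)
K = rng.standard_normal((batch, seq_len, head_dim)).astype(np.float32)
V = rng.standard_normal((batch, seq_len, head_dim)).astype(np.float32)
out = attention_reference(Q, K, V)
print(out.shape)  # (2, 8, 4)
```

Comparing this reference against the fused kernel's output (within FP32 tolerance) is a quick way to confirm the scaling convention matches.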

🔥 PyTorch Integration

import torch
import oktoblas as ob

# Use OktoBLAS with PyTorch tensors
A = torch.randn(2048, 2048, device='cuda', dtype=torch.float16)
B = torch.randn(2048, 2048, device='cuda', dtype=torch.float16)

# matmul_fp16 takes NumPy arrays, so move the tensors to host memory first
C = ob.matmul_fp16(A.cpu().numpy(), B.cpu().numpy())

🎯 Features

Feature            Description
FP16/FP32 GEMM     Tensor Core acceleration
Fused Attention    Single-kernel Q×K×V
100% Independent   No cuBLAS dependency
Hand-Tuned PTX     Optimized CUDA kernels
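Tensor Core MMA instructions take FP16 inputs but typically accumulate in FP32. A small NumPy sketch (an illustration of the numerics, not of OktoBLAS internals) shows why the accumulator width matters:

```python
import numpy as np

def matmul_fp16_accum(A, B):
    """Matrix product that rounds the running sum to float16 after
    every add, mimicking an FP16-only accumulator."""
    n, k = A.shape
    m = B.shape[1]
    C = np.empty((n, m), dtype=np.float16)
    for i in range(n):
        for j in range(m):
            s = np.float16(0.0)
            for p in range(k):
                s = np.float16(s + np.float16(A[i, p] * B[p, j]))
            C[i, j] = s
    return C

rng = np.random.default_rng(0)
n = 32
A = rng.standard_normal((n, n)).astype(np.float16)
B = rng.standard_normal((n, n)).astype(np.float16)

ref = A.astype(np.float64) @ B.astype(np.float64)    # high-precision reference
acc32 = A.astype(np.float32) @ B.astype(np.float32)  # FP16 inputs, FP32 accumulator
acc16 = matmul_fp16_accum(A, B)                      # FP16 accumulator throughout

print("fp32-accum max error:", np.abs(acc32 - ref).max())
print("fp16-accum max error:", np.abs(acc16.astype(np.float64) - ref).max())
```

The per-add rounding of the FP16 accumulator produces visibly larger error than accumulating in FP32, and the gap grows with the inner dimension.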

📊 Benchmark Results (RTX 4070 Laptop)

All benchmarks were timed with CUDA events.

FP16 GEMM (Tensor Cores)

Matrix Size   OktoBLAS   PyTorch   Ratio
1024×1024     29.1 TF    23.3 TF   125%
2048×2048     35.1 TF    34.6 TF   101%
4096×4096     36.5 TF    38.9 TF    94%

Fused Attention

Config        OktoBLAS   PyTorch   Ratio
B4 S256 D64   0.96 TF    0.28 TF   346%
B4 S512 D64   1.22 TF    0.93 TF   131%
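The TF columns are TFLOPS derived from the standard operation counts: 2·n³ FLOPs for an n×n×n GEMM, and roughly 4·B·S²·D for attention (one GEMM each for Q·Kᵀ and P·V). A short sketch of the arithmetic, using a hypothetical 0.49 ms timing chosen to match the 2048×2048 row:

```python
def gemm_tflops(n: int, elapsed_s: float) -> float:
    """Throughput of an n x n x n GEMM: 2*n^3 FLOPs
    (one multiply plus one add per multiply-accumulate)."""
    return 2 * n**3 / elapsed_s / 1e12

def attention_tflops(b: int, s: int, d: int, elapsed_s: float) -> float:
    """Approximate attention cost: 2*b*s*s*d FLOPs each
    for Q @ K^T and for P @ V."""
    return 4 * b * s * s * d / elapsed_s / 1e12

# A 2048^3 GEMM finishing in ~0.49 ms works out to ~35.1 TFLOPS,
# matching the 2048x2048 row above.
print(round(gemm_tflops(2048, 0.49e-3), 1))
```

Note that attention FLOP counts conventionally ignore the softmax itself, which is memory-bound rather than compute-bound.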

🚀 Roadmap

  • FP16/FP32 GEMM with Tensor Cores
  • Fused Attention kernel
  • PyPI package release
  • ROCm (AMD) support
  • Metal (Apple) support
  • Full PyTorch autograd integration

📚 Part of OktoSeek Ecosystem

OktoBLAS is part of the OktoSeek ecosystem:

Project      Description                  Link
OktoScript   AI programming language      GitHub
OktoEngine   Native ML inference engine   Coming soon
OktoStudio   AI Development IDE           Coming soon
OktoBLAS     High-performance BLAS        GitHub
OkTensor     GPU tensor library           Part of OktoEngine

📜 License

Proprietary License - Free for personal and commercial use.

Copyright (c) 2025 OktoSeek AI. All Rights Reserved.

See LICENSE.txt for details.


🙏 Credits

Built with ❤️ by OktoSeek AI.


Star us on GitHub!



Download files


Source Distributions

No source distribution files are available for this release.

Built Distribution


oktoblas-1.0.1-cp310-cp310-win_amd64.whl (198.4 kB)

Uploaded: CPython 3.10, Windows x86-64


File hashes

Hashes for oktoblas-1.0.1-cp310-cp310-win_amd64.whl

Algorithm     Hash digest
SHA256        269799967bf1df3539c06d143d5749250d6a32f80949a63d1c94ca48f5838f0c
MD5           9a5968c8e621af04d6b6567830a5f260
BLAKE2b-256   b5bdb6eaf6bf22f5ba5b6fd4f29efd84fecb47cdac868a93588171b13d1df770

