Cross-Platform ML Optimization Framework with ONNX Interpreter
Zenith
Cross-Platform ML Optimization Framework
Zenith is a model-agnostic, hardware-agnostic framework that unifies and optimizes Machine Learning workloads. In the benchmarks reported below (NVIDIA Tesla T4), its optimizations outperform PyTorch in both inference and training workloads.
Project History
Zenith was conceived and architecturally designed on December 11, 2024, with the creation of its comprehensive blueprint document (CetakBiru.md) that outlines a 36-month development roadmap across 6 implementation phases. Active development began on January 12, 2025, and after 11 months of internal development, research, and rigorous testing, Zenith was publicly released on GitHub on December 16, 2025.
The project represents nearly a year of dedicated work in building a production-ready ML optimization framework from the ground up, implementing CUDA backends with cuDNN/cuBLAS integration, graph optimization passes, mixed precision support, and comprehensive testing infrastructure.
Performance Highlights
| Benchmark | Workload | Result |
|---|---|---|
| GPU Memory Pool | MatMul 1024x1024 | 50x faster than PyTorch |
| BERT Inference | 12-layer encoder | 1.09x faster than PyTorch |
| Training Loop | 6-layer Transformer | 1.02x faster than PyTorch |
| Memory Efficiency | Zero-copy allocation | 93.5% cache hit rate |
| INT8 Quantization | Model compression | 4x memory reduction |
Benchmarked on NVIDIA Tesla T4 (Google Colab). See BENCHMARK_REPORT.md for full results.
Features
Core Capabilities
- Unified API for PyTorch, TensorFlow, JAX, and ONNX models
- Automatic graph optimizations (operator fusion, constant folding, dead code elimination)
- Multi-backend support (CPU with SIMD, CUDA with cuDNN/cuBLAS)
- Mixed precision inference (FP16, BF16, INT8)
- Zero-copy GPU memory pooling for minimal allocation overhead
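To make the INT8 path listed above more concrete, here is a minimal, dependency-free sketch of symmetric per-tensor quantization with calibration. The function names (`calibrate_scale`, `quantize`, `dequantize`) and the max-magnitude calibration rule are assumptions chosen for illustration; this README does not document which scheme Zenith actually uses.

```python
import struct

def calibrate_scale(samples):
    """Symmetric per-tensor scale from the largest magnitude seen during calibration."""
    max_abs = max(abs(v) for batch in samples for v in batch)
    return max_abs / 127.0 if max_abs > 0 else 1.0

def quantize(values, scale):
    """Map floats to the int8 range [-127, 127] with rounding and clamping."""
    return [max(-127, min(127, round(v / scale))) for v in values]

def dequantize(qvalues, scale):
    """Recover approximate floats from quantized integers."""
    return [q * scale for q in qvalues]

# Hypothetical calibration data: a few representative activation batches.
calibration_batches = [[-1.0, 0.25, 0.5], [0.75, -0.5, 1.27]]
scale = calibrate_scale(calibration_batches)

# 2.0 lies outside the calibrated range, so it clamps to the int8 maximum.
q = quantize([0.5, -1.27, 2.0], scale)
restored = dequantize(q, scale)

# float32 uses 4 bytes per value and int8 uses 1, hence a 4x storage reduction.
compression = struct.calcsize("f") // struct.calcsize("b")
```

The storage ratio follows directly from the element sizes. The accuracy cost depends on how well the calibration data covers the real activation range.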
Optimization Passes
- Conv-BatchNorm-ReLU fusion
- Linear-GELU fusion (BERT-optimized)
- LayerNorm-Add fusion
- Constant folding and dead code elimination
- INT8 quantization with calibration
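Conv-BatchNorm fusion usually works by folding the BatchNorm scale and shift into the convolution's weights and bias at inference time, so the normalization costs nothing at run time. The sketch below checks that identity on a 1x1 convolution at a single pixel, in plain Python. All names and values are hypothetical; it illustrates the general technique and is not Zenith's implementation.

```python
import math

def conv1x1(x, weight, bias):
    """1x1 convolution at one pixel: a dot product per output channel."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def batchnorm(z, gamma, beta, mean, var, eps=1e-5):
    """Inference-mode BatchNorm using stored running statistics."""
    return [g * (zi - m) / math.sqrt(v + eps) + b
            for zi, g, b, m, v in zip(z, gamma, beta, mean, var)]

def relu(z):
    return [max(0.0, zi) for zi in z]

def fold_batchnorm(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Return conv weights and bias equivalent to conv followed by BatchNorm."""
    scales = [g / math.sqrt(v + eps) for g, v in zip(gamma, var)]
    new_weight = [[w * s for w in row] for row, s in zip(weight, scales)]
    new_bias = [(b - m) * s + bt for b, m, s, bt in zip(bias, mean, scales, beta)]
    return new_weight, new_bias

# Illustrative values: 3 input channels, 2 output channels.
x = [1.0, -2.0, 0.5]
weight = [[0.2, -0.1, 0.4], [0.3, 0.8, -0.5]]
bias = [0.1, -0.2]
gamma, beta = [1.5, 0.7], [0.05, -0.3]
mean, var = [0.2, -0.1], [0.9, 1.6]

# Three separate ops versus one folded conv followed by ReLU.
unfused = relu(batchnorm(conv1x1(x, weight, bias), gamma, beta, mean, var))
fused_w, fused_b = fold_batchnorm(weight, bias, gamma, beta, mean, var)
fused = relu(conv1x1(x, fused_w, fused_b))
```

Both paths produce the same activations up to floating-point rounding. The fused path needs one pass over the data instead of three.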
Hardware Support
- CPU: AVX2/FMA SIMD optimizations
- NVIDIA GPU: CUDA 12.x with cuDNN 8.x and cuBLAS
- AMD GPU: ROCm support (planned)
- Intel: OneAPI support (planned)
Installation
Quick Install
pip install pyzenith
Installation Options
Choose the right installation based on your needs:
| Command | Use Case | What's Included |
|---|---|---|
| pip install pyzenith | Quick start, testing | Core only (numpy) |
| pip install pyzenith[pytorch] | PyTorch users | + PyTorch 2.0+ |
| pip install pyzenith[onnx] | Model deployment, inference | + ONNX + ONNX Runtime |
| pip install pyzenith[tensorflow] | TensorFlow users | + TensorFlow + tf2onnx |
| pip install pyzenith[jax] | JAX/Flax users | + JAX + JAXlib |
| pip install pyzenith[all] | Full functionality | All frameworks |
| pip install pyzenith[dev] | Contributors | + pytest, black, mypy, ruff |
Recommended Installation
# For most ML users (PyTorch + ONNX export)
pip install pyzenith[pytorch,onnx]
# For full framework support
pip install pyzenith[all]
# For development/contribution
pip install pyzenith[dev]
Development Installation
git clone https://github.com/vibeswithkk/ZENITH.git
cd ZENITH
pip install -e ".[dev]"
CUDA Build (for Maximum GPU Performance)
To enable Zenith's native CUDA kernels (the 50x MatMul speedup listed under Performance Highlights), build from source:
# On Google Colab or Linux with CUDA
git clone https://github.com/vibeswithkk/ZENITH.git
cd ZENITH
bash build_cuda.sh
# Verify installation
python -c "from zenith._zenith_core import backends; print(backends.list_available())"
# Output: ['cpu', 'cuda']
Note: The CUDA build is optional. Without it, Zenith still uses the PyTorch/TensorFlow CUDA backends for GPU execution.
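A script that should work with or without the CUDA build can probe for the native extension before relying on it. The sketch below uses only the import path and `list_available()` call from the verification command above. The helper name `available_backends` is hypothetical, and treating a failed import as "no native backends" is an assumption.

```python
def available_backends():
    """Return Zenith's native backends, or an empty list if the extension is missing."""
    try:
        # Import path and call taken from the verification command in this README.
        from zenith._zenith_core import backends
    except ImportError:
        return []
    return list(backends.list_available())

native = available_backends()
if "cuda" in native:
    print("Native CUDA kernels available")
else:
    print("Native CUDA kernels not found; run build_cuda.sh to enable them")
```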
Quick Start
Basic Usage
import zenith
from zenith.core import GraphIR, DataType, Shape, TensorDescriptor
from zenith.optimization import PassManager
# Create a computation graph
graph = GraphIR(name="my_model")
graph.add_input(TensorDescriptor("x", Shape([1, 3, 224, 224]), DataType.Float32))
# Apply optimizations
pm = PassManager()
pm.add("constant_folding")
pm.add("dead_code_elimination")
pm.add("operator_fusion")
optimized = pm.run(graph)
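As a rough picture of what the `constant_folding` and `dead_code_elimination` passes named above do, here is a minimal sketch on a toy graph. The `Node` class and the dictionary-based graph are hypothetical stand-ins, not Zenith's `GraphIR`, and the sketch assumes nodes are stored in topological order.

```python
import operator
from dataclasses import dataclass
from typing import Optional, Tuple

OPS = {"add": operator.add, "mul": operator.mul}

@dataclass(frozen=True)
class Node:
    op: str                        # "input", "const", "add", or "mul"
    inputs: Tuple[str, ...] = ()   # names of producer nodes
    value: Optional[float] = None  # set only for "const" nodes

def fold_constants(graph):
    """Replace ops whose inputs are all constants with a single const node."""
    folded = {}
    for name, node in graph.items():  # relies on topological insertion order
        args = [folded[i] for i in node.inputs]
        if node.op in OPS and all(a.op == "const" for a in args):
            result = OPS[node.op](*(a.value for a in args))
            folded[name] = Node("const", value=result)
        else:
            folded[name] = node
    return folded

def eliminate_dead_code(graph, outputs):
    """Keep only the nodes reachable from the requested outputs."""
    live, stack = set(), list(outputs)
    while stack:
        name = stack.pop()
        if name not in live:
            live.add(name)
            stack.extend(graph[name].inputs)
    return {name: node for name, node in graph.items() if name in live}

graph = {
    "x": Node("input"),
    "two": Node("const", value=2.0),
    "three": Node("const", value=3.0),
    "six": Node("mul", ("two", "three")),  # foldable: 2 * 3
    "y": Node("add", ("x", "six")),
    "unused": Node("add", ("x", "two")),   # never reaches the output
}
optimized_toy = eliminate_dead_code(fold_constants(graph), outputs=["y"])
```

Running folding first matters here: once `six` becomes a constant, its former inputs `two` and `three` are no longer referenced by any live node and are removed along with `unused`.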
CUDA Operations
import numpy as np
from zenith._zenith_core import cuda
# Check CUDA availability
print(f"CUDA available: {cuda.is_available()}")
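# Matrix multiplication (reported as 50x faster than PyTorch for 1024x1024 on a Tesla T4)
A = np.random.randn(1024, 1024).astype(np.float32)
B = np.random.randn(1024, 1024).astype(np.float32)
C = cuda.matmul(A, B)
# Other GPU operations (the input shapes below are illustrative assumptions)
input_tensor = np.random.randn(8, 768).astype(np.float32)
gamma = np.ones(768, dtype=np.float32)
beta = np.zeros(768, dtype=np.float32)
cuda.gelu(input_tensor)
cuda.layernorm(input_tensor, gamma, beta, eps=1e-5)
cuda.softmax(input_tensor)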
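When validating GPU kernels like the ones above, it helps to have slow but obviously correct reference versions to compare against. The following dependency-free sketch implements the standard definitions for a single row. It assumes the exact (erf-based) GELU; this README does not say whether `cuda.gelu` uses the exact form or the tanh approximation, so small differences may be expected.

```python
import math

def gelu(xs):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return [0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0))) for x in xs]

def layernorm(xs, gamma, beta, eps=1e-5):
    """Normalize one feature vector to zero mean and unit variance, then scale and shift."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (x - mean) * inv_std + b for x, g, b in zip(xs, gamma, beta)]

def softmax(xs):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

row = [1.0, 2.0, 3.0, 4.0]
probs = softmax(row)                                       # sums to 1
normed = layernorm(row, gamma=[1.0] * 4, beta=[0.0] * 4)   # zero mean
activated = gelu([0.0, 1.0])
```

A comparison against GPU output would typically use a tolerance (for example 1e-5 for FP32, looser for FP16) instead of exact equality.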
Architecture
+-------------------------------------------------------------+
| Python User Interface |
| (zenith.api, zenith.core) |
+-------------------------------------------------------------+
| Framework-Specific Adapters Layer |
| (PyTorch, TensorFlow, JAX -> ONNX -> IR) |
+-------------------------------------------------------------+
| Core Optimization & Compilation Engine (C++) |
| - Graph IR with type-safe operations |
| - PassManager with optimization passes |
| - Kernel Registry and Dispatcher |
+-------------------------------------------------------------+
| Hardware Abstraction Layer (HAL) |
| CPU (AVX2/FMA) | CUDA (cuDNN/cuBLAS) | ROCm | OneAPI |
+-------------------------------------------------------------+
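The "Kernel Registry and Dispatcher" box in the diagram can be pictured as a lookup table keyed by operation and backend. The sketch below shows that pattern in Python with a CPU fallback. Every name in it (`register_kernel`, `dispatch`, `relu_cpu`) is hypothetical, and Zenith's real registry lives in C++ and may be organized differently.

```python
from typing import Callable, Dict, Tuple

_KERNELS: Dict[Tuple[str, str], Callable] = {}

def register_kernel(op: str, backend: str):
    """Decorator that files a function under an (op, backend) key."""
    def decorator(fn):
        _KERNELS[(op, backend)] = fn
        return fn
    return decorator

def dispatch(op: str, preferred: str, *args):
    """Run the preferred backend's kernel, falling back to CPU if it is absent."""
    for backend in (preferred, "cpu"):
        kernel = _KERNELS.get((op, backend))
        if kernel is not None:
            return backend, kernel(*args)
    raise KeyError(f"no kernel registered for op {op!r}")

@register_kernel("relu", "cpu")
def relu_cpu(values):
    return [max(0.0, v) for v in values]

# No CUDA kernel is registered in this sketch, so the call falls back to CPU.
used_backend, result = dispatch("relu", "cuda", [-1.0, 2.0])
```

Keeping dispatch separate from the kernels is what lets planned backends such as ROCm or OneAPI be added without touching the layers above the HAL.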
Benchmarks
BERT-Base Inference (12 layers, batch=1, seq=128)
| Mode | Latency | vs PyTorch |
|---|---|---|
| Pure PyTorch | 10.60 ms | baseline |
| Zenith + PyTorch | 9.74 ms | 1.09x faster |
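The "vs PyTorch" column is the ratio of baseline latency to Zenith latency. The sketch below reproduces that arithmetic and shows one common way to measure latency (warmup runs, then the median of timed runs). The helper names are hypothetical, and the actual methodology is described in BENCHMARK_REPORT.md, not here.

```python
import statistics
import time

def median_latency_ms(fn, warmup=10, iters=100):
    """Median wall-clock latency of fn() in milliseconds, after warmup runs."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)

def speedup(baseline_ms, candidate_ms):
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms

# Latencies taken from the table above.
bert_speedup = speedup(10.60, 9.74)
print(f"{bert_speedup:.2f}x faster")

# Stand-in CPU workload; GPU code must synchronize the device inside fn,
# otherwise the timer only measures kernel launch time.
demo_ms = median_latency_ms(lambda: sum(range(10_000)), warmup=2, iters=5)
```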
ResNet-50 Throughput
| Batch Size | Throughput |
|---|---|
| 1 | 150 img/sec |
| 64 | 377 img/sec |
| 512 | 359 img/sec |
GPU Memory Pool
| Metric | Value |
|---|---|
| Cache Hit Rate | 93.5% |
| Speedup vs naive | 330x |
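A cache hit rate like the one above arises when freed buffers are kept and reused instead of being returned to the driver. The sketch below simulates that idea with host-side `bytearray` objects standing in for device allocations. The class `CachingPool` is hypothetical, models only the caching behavior (not zero-copy transfers), and is not Zenith's allocator.

```python
from collections import defaultdict

class CachingPool:
    """Reuses freed buffers of the same size instead of allocating again."""

    def __init__(self):
        self._free = defaultdict(list)  # size in bytes -> idle buffers
        self.hits = 0
        self.misses = 0

    def allocate(self, size):
        if self._free[size]:
            self.hits += 1
            return self._free[size].pop()
        self.misses += 1
        return bytearray(size)  # stand-in for a real device allocation

    def release(self, buf):
        """Return a buffer to the pool so a later request can reuse it."""
        self._free[len(buf)].append(buf)

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

pool = CachingPool()
for _ in range(100):           # e.g. 100 inference steps
    a = pool.allocate(4096)    # the same two buffer sizes every step
    b = pool.allocate(1024)
    pool.release(a)
    pool.release(b)
print(f"hit rate: {pool.hit_rate:.1%}")
```

Only the first step pays for real allocations; every later request is served from the cache. Workloads with varying tensor shapes see lower hit rates, which is one reason a measured rate sits below 100%.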
Testing
# Run all Python tests
pytest tests/python/ -v
# Run with coverage
pytest tests/python/ --cov=zenith --cov-report=term-missing
# Run C++ unit tests (after CUDA build)
./build/tests/test_core
# Security scan
bandit -r zenith/ -ll
Test Status
- Python Tests: 198+ passed
- C++ Tests: 34/34 passed
- Code Coverage: 66%+
- Security Issues: 0 HIGH severity
Documentation
- Benchmark Report - Comprehensive performance benchmarks
- API Reference - Python API documentation
- Architecture - System design documentation
Project Status
Zenith is currently in active development with the following milestones completed:
- Phase 1: Core Graph IR and C++ foundation
- Phase 2: CUDA backend with cuDNN/cuBLAS integration
- Phase 3: Optimization passes and quantization
- Phase 4: Quality assurance and documentation
Contributing
Contributions are welcome. Please ensure all tests pass before submitting pull requests.
# Setup development environment
pip install -e ".[dev]"
# Run tests before committing
pytest tests/python/ -v
Author
Wahyu Ardiansyah - Lead Architect and Developer
License
Apache License 2.0 - See LICENSE for details.
Copyright 2025 Wahyu Ardiansyah. All rights reserved.
File details
Details for the file pyzenith-0.2.2.tar.gz.
File metadata
- Download URL: pyzenith-0.2.2.tar.gz
- Upload date:
- Size: 406.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7f6a5a7126421d1defde6417891b79db82cfc343ec182670275f4190e3932d92 |
| MD5 | c00dc8faacd2a28d2baf840ce882fcaf |
| BLAKE2b-256 | 0bf22762674e7e729828b5fc6a11b83ff00203b64519f643cffab37ce18706a4 |
File details
Details for the file pyzenith-0.2.2-py3-none-any.whl.
File metadata
- Download URL: pyzenith-0.2.2-py3-none-any.whl
- Upload date:
- Size: 390.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f5ccc432692f9e84ba51bf49c6192a6b55f2b2392bca9827f9e532e1853378d8 |
| MD5 | 29f4c148f36d3ee093717eda7bc87f2f |
| BLAKE2b-256 | c8ff56a695e7e5d11f74916f5eeb7f14c678e652ec4b4d6c4e19c2d2b403850b |