Cross-Platform ML Optimization Framework with ONNX Interpreter

Zenith

Cross-Platform ML Optimization Framework

Zenith is a production-ready, model-agnostic and hardware-agnostic optimization framework for Machine Learning. It provides enterprise-grade performance optimizations with native CUDA kernels and Tensor Core acceleration.

Project History

Zenith was conceived and architecturally designed on December 11, 2024, with the creation of its comprehensive blueprint document (CetakBiru.md) that outlines a 36-month development roadmap across 6 implementation phases. Active development began on January 12, 2025, and after 11 months of internal development, research, and rigorous testing, Zenith was publicly released on GitHub on December 16, 2025.

The project represents nearly a year of dedicated work in building a production-ready ML optimization framework from the ground up, implementing CUDA backends with cuDNN/cuBLAS integration, graph optimization passes, mixed precision support, and comprehensive testing infrastructure.

Performance Highlights

Benchmark	Workload	Result
GPU Memory Pool	MatMul 1024x1024	50x faster than PyTorch
BERT Inference	12-layer encoder	1.09x faster than PyTorch
Training Loop	6-layer Transformer	1.02x faster than PyTorch
Memory Efficiency	Zero-copy allocation	93.5% cache hit rate
INT8 Quantization	Model compression	4x memory reduction

Benchmarked on NVIDIA Tesla T4 (Google Colab). See BENCHMARK_REPORT.md for full results.

Features

Core Capabilities

Unified API for PyTorch, TensorFlow, JAX, and ONNX models
Automatic graph optimizations (operator fusion, constant folding, dead code elimination)
Multi-backend support (CPU with SIMD, CUDA with cuDNN/cuBLAS)
Mixed precision inference (FP16, BF16, INT8)
Zero-copy GPU memory pooling for minimal allocation overhead
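
The 4x memory reduction reported for INT8 follows from storage size: each 4-byte FP32 value becomes a 1-byte integer plus a shared scale. Below is a minimal sketch of symmetric per-tensor quantization in plain Python. It assumes a max-abs calibration rule. Zenith's actual calibration method is not described here, and the helper names are hypothetical.

```python
from array import array

def calibrate_scale(values):
    """Hypothetical max-abs calibration: map the largest magnitude to 127."""
    max_abs = max(abs(v) for v in values)
    return max_abs / 127.0 if max_abs > 0 else 1.0

def quantize(values, scale):
    """Round value / scale to the nearest integer and clamp to the int8 range [-127, 127]."""
    return array("b", (max(-127, min(127, round(v / scale))) for v in values))

def dequantize(q, scale):
    """Recover approximate float values from the int8 codes."""
    return array("f", (x * scale for x in q))

weights = array("f", [0.5, -1.25, 2.0, 0.0, -0.75, 1.0])  # FP32 storage, 4 bytes each
scale = calibrate_scale(weights)
q = quantize(weights, scale)                             # int8 storage, 1 byte each
restored = dequantize(q, scale)

fp32_bytes = len(weights) * weights.itemsize
int8_bytes = len(q) * q.itemsize
print(fp32_bytes // int8_bytes)  # 4
```

Rounding error per element is at most half a quantization step (`scale / 2`). Calibration matters because a single large outlier inflates the scale and wastes resolution for every other value.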

Optimization Passes

Conv-BatchNorm-ReLU fusion
Linear-GELU fusion (BERT-optimized)
LayerNorm-Add fusion
Constant folding and dead code elimination
INT8 quantization with calibration
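
Conv-BatchNorm fusion is usually done by folding the inference-time BatchNorm statistics into the convolution's weights and bias. The BN layer then disappears from the graph. With s = gamma / sqrt(var + eps) per output channel, the folded parameters are W' = W * s and b' = (b - mean) * s + beta. The sketch below checks this identity for a 1x1 convolution on plain Python lists. It illustrates the standard folding algebra, not Zenith's pass, and the helper names are hypothetical.

```python
import math

def conv1x1(weight, bias, x):
    """1x1 convolution at one pixel: out[c] = sum_k weight[c][k] * x[k] + bias[c]."""
    return [sum(w * xi for w, xi in zip(weight[c], x)) + bias[c] for c in range(len(weight))]

def batchnorm(y, gamma, beta, mean, var, eps=1e-5):
    """Inference-mode BatchNorm with fixed running statistics."""
    return [gamma[c] * (y[c] - mean[c]) / math.sqrt(var[c] + eps) + beta[c] for c in range(len(y))]

def relu(v):
    return [max(0.0, t) for t in v]

def fold_batchnorm(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Return conv weight/bias that already include the BatchNorm affine transform."""
    new_w, new_b = [], []
    for c in range(len(weight)):
        s = gamma[c] / math.sqrt(var[c] + eps)
        new_w.append([w * s for w in weight[c]])
        new_b.append((bias[c] - mean[c]) * s + beta[c])
    return new_w, new_b

weight = [[0.2, -0.5, 1.0], [0.7, 0.1, -0.3]]  # 2 output channels, 3 input channels
bias = [0.1, -0.2]
gamma, beta = [1.5, 0.8], [0.0, 0.3]
mean, var = [0.05, -0.1], [0.9, 1.2]
x = [1.0, 2.0, -1.0]

unfused = relu(batchnorm(conv1x1(weight, bias, x), gamma, beta, mean, var))
fw, fb = fold_batchnorm(weight, bias, gamma, beta, mean, var)
fused = relu(conv1x1(fw, fb, x))  # Conv + BN + ReLU as a single conv followed by ReLU
print(all(math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12) for a, b in zip(unfused, fused)))  # True
```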

Hardware Support

CPU: AVX2/FMA SIMD optimizations
NVIDIA GPU: CUDA 12.x with cuDNN 8.x and cuBLAS
AMD GPU: ROCm support (experimental)
Intel: OneAPI support (experimental)

Native CUDA Kernels

Zenith includes JIT-compiled native CUDA kernels for maximum performance:

Kernel	Description	Tensor Core
`relu`	ReLU activation	-
`gelu`	GELU activation (BERT)	-
`layernorm`	Layer Normalization	-
`matmul`	Matrix Multiplication (FP32)	-
`wmma_matmul`	Matrix Multiplication (FP16)	WMMA
`flash_attention`	Flash Attention v2	-

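# Build native kernels (requires CUDA)
python zenith/build_cuda.py

# Use in code (A and B are CUDA tensors; .half() converts them to FP16)
import torch
import zenith_cuda

A = torch.randn(1024, 1024, device="cuda")
B = torch.randn(1024, 1024, device="cuda")
C = zenith_cuda.wmma_matmul(A.half(), B.half())  # Tensor Core accelerated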
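
Tensor Core WMMA kernels split a matrix product into fixed-size tiles. CUDA's `nvcuda::wmma` API commonly uses 16x16x16 fragments for FP16 inputs, and partial products are accumulated at higher precision. The plain-Python sketch below mimics only that loop structure:

- Inputs are rounded to IEEE half precision with the `struct` module's `'e'` format.
- A Python float (a double) stands in for the FP32 accumulator.
- The tile size is a toy value.

It does not model the hardware and is not Zenith's `wmma_matmul`.

```python
import struct

def to_fp16(x):
    """Round a Python float to IEEE half precision via struct's 'e' format."""
    return struct.unpack("e", struct.pack("e", x))[0]

def reference_matmul(A, B):
    """Naive triple loop, used to check the tiled version."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def tiled_matmul_fp16(A, B, tile=2):
    """C = A @ B computed tile by tile, with FP16-rounded inputs and a wider accumulator."""
    M, K, N = len(A), len(B), len(B[0])
    assert all(len(row) == K for row in A), "inner dimensions must match"
    assert M % tile == 0 and K % tile == 0 and N % tile == 0, "dims must be multiples of tile"
    A16 = [[to_fp16(v) for v in row] for row in A]
    B16 = [[to_fp16(v) for v in row] for row in B]
    C = [[0.0] * N for _ in range(M)]
    for i0 in range(0, M, tile):              # output tile row
        for j0 in range(0, N, tile):          # output tile column
            for k0 in range(0, K, tile):      # walk the shared K dimension one tile at a time
                for i in range(i0, i0 + tile):
                    for j in range(j0, j0 + tile):
                        acc = C[i][j]
                        for k in range(k0, k0 + tile):
                            acc += A16[i][k] * B16[k][j]
                        C[i][j] = acc
    return C

A = [[float(i * 4 + j + 1) for j in range(4)] for i in range(4)]
B = [[1.0 if i == j else 0.5 for j in range(4)] for i in range(4)]
C = tiled_matmul_fp16(A, B)
```

With small integers and halves, every input is exactly representable in FP16, so the tiled result matches the naive one exactly. With arbitrary values, such as `0.1`, the FP16 rounding introduces small errors, which is why FP16 results are compared against FP32 with a tolerance.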

Installation

Quick Install

pip install pyzenith

Installation Options

Choose the right installation based on your needs:

Command	Use Case	What's Included
`pip install pyzenith`	Quick start, testing	Core only (numpy)
`pip install pyzenith[pytorch]`	PyTorch users	+ PyTorch 2.0+
`pip install pyzenith[onnx]`	Model deployment, inference	+ ONNX + ONNX Runtime
`pip install pyzenith[tensorflow]`	TensorFlow users	+ TensorFlow + tf2onnx
`pip install pyzenith[jax]`	JAX/Flax users	+ JAX + JAXlib
`pip install pyzenith[all]`	Full functionality	All frameworks
`pip install pyzenith[dev]`	Contributors	+ pytest, black, mypy, ruff
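
To see which optional framework stacks are already present in an environment, the standard library's `importlib.util.find_spec` is enough. It checks whether a module can be found without importing it. The mapping from extra to module names below is inferred from the table and is an assumption, not read from Zenith's packaging metadata.

```python
import importlib.util

# Assumed top-level modules provided by each extra (inferred from the table above).
EXTRAS = {
    "pytorch": ["torch"],
    "onnx": ["onnx", "onnxruntime"],
    "tensorflow": ["tensorflow", "tf2onnx"],
    "jax": ["jax", "jaxlib"],
}

def installed_extras(extras=EXTRAS):
    """Return {extra: True/False}, True only if every module of that extra can be found."""
    return {
        name: all(importlib.util.find_spec(mod) is not None for mod in modules)
        for name, modules in extras.items()
    }

if __name__ == "__main__":
    for name, ok in installed_extras().items():
        print(f"{name:<12} {'installed' if ok else 'missing'}")
```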

Recommended Installation

# For most ML users (PyTorch + ONNX export)
# Quotes keep shells such as zsh from treating the brackets as a glob pattern
pip install "pyzenith[pytorch,onnx]"

# For full framework support
pip install "pyzenith[all]"

# For development/contribution
pip install "pyzenith[dev]"

Development Installation

git clone https://github.com/vibeswithkk/ZENITH.git
cd ZENITH
pip install -e ".[dev]"

CUDA Build (for Maximum GPU Performance)

To build the native CUDA kernels (needed for the 50x GPU Memory Pool MatMul result in Performance Highlights):

# On Google Colab or Linux with CUDA
git clone https://github.com/vibeswithkk/ZENITH.git
cd ZENITH
bash build_cuda.sh

# Verify installation
python -c "from zenith._zenith_core import backends; print(backends.list_available())"
# Output: ['cpu', 'cuda']

Note: Without the CUDA build, Zenith still runs on the GPU through the PyTorch/TensorFlow CUDA backends. Only the native Zenith CUDA kernels require the CUDA build.

Quick Start

Basic Usage

import zenith
from zenith.core import GraphIR, DataType, Shape, TensorDescriptor

# Create a computation graph
graph = GraphIR(name="my_model")
graph.add_input(TensorDescriptor("x", Shape([1, 3, 224, 224]), DataType.Float32))

# Apply optimizations
from zenith.optimization import PassManager
pm = PassManager()
pm.add("constant_folding")
pm.add("dead_code_elimination")
pm.add("operator_fusion")
optimized = pm.run(graph)
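
The pass names above describe a classic pipeline: first fold constant subexpressions, then drop nodes that no longer reach an output. The hypothetical `ToyPassManager` below mirrors the `add`/`run` pattern on a tiny list-of-nodes graph to show what those two passes do. It is an illustration only, not Zenith's `PassManager` or `GraphIR`.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    name: str
    op: str                                   # "input", "const", "add" or "mul"
    inputs: List[str] = field(default_factory=list)
    value: Optional[float] = None             # set only for "const" nodes

def constant_folding(nodes, outputs):
    """Replace add/mul nodes whose inputs are all constants with a single const node."""
    by_name, result = {}, []
    for node in nodes:                        # nodes are assumed to be in topological order
        if node.op in ("add", "mul") and all(by_name[i].op == "const" for i in node.inputs):
            a, b = (by_name[i].value for i in node.inputs)
            node = Node(node.name, "const", [], a + b if node.op == "add" else a * b)
        by_name[node.name] = node
        result.append(node)
    return result

def dead_code_elimination(nodes, outputs):
    """Keep only nodes reachable backwards from the graph outputs."""
    by_name = {n.name: n for n in nodes}
    live, stack = set(), list(outputs)
    while stack:
        name = stack.pop()
        if name not in live:
            live.add(name)
            stack.extend(by_name[name].inputs)
    return [n for n in nodes if n.name in live]

class ToyPassManager:
    """Hypothetical pass manager: runs the registered passes in the order they were added."""
    PASSES = {"constant_folding": constant_folding,
              "dead_code_elimination": dead_code_elimination}

    def __init__(self):
        self.passes = []

    def add(self, name):
        self.passes.append(self.PASSES[name])

    def run(self, nodes, outputs):
        for p in self.passes:
            nodes = p(nodes, outputs)
        return nodes

graph = [
    Node("x", "input"),
    Node("c1", "const", value=2.0),
    Node("c2", "const", value=3.0),
    Node("c3", "mul", ["c1", "c2"]),          # foldable: 2 * 3
    Node("y", "add", ["x", "c3"]),            # depends on the input, so not foldable
    Node("unused", "add", ["c1", "c2"]),      # never reaches the output
]

pm = ToyPassManager()
pm.add("constant_folding")
pm.add("dead_code_elimination")
optimized = pm.run(graph, outputs=["y"])
print([(n.name, n.op, n.value) for n in optimized])
# [('x', 'input', None), ('c3', 'const', 6.0), ('y', 'add', None)]
```

Running folding before elimination matters here. Once `c3` becomes a constant, `c1` and `c2` have no remaining consumers on the path to `y`, so the second pass can remove them.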

CUDA Operations

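import numpy as np
from zenith._zenith_core import cuda

# Check CUDA availability
print(f"CUDA available: {cuda.is_available()}")

# Matrix multiplication (50x faster than PyTorch)
A = np.random.randn(1024, 1024).astype(np.float32)
B = np.random.randn(1024, 1024).astype(np.float32)
C = cuda.matmul(A, B)

# Example inputs for the ops below (shapes are illustrative; assumes these ops
# accept NumPy arrays the same way cuda.matmul does)
input_tensor = np.random.randn(32, 768).astype(np.float32)
gamma = np.ones(768, dtype=np.float32)   # LayerNorm scale
beta = np.zeros(768, dtype=np.float32)   # LayerNorm shift

# GPU operations
y_gelu = cuda.gelu(input_tensor)
y_norm = cuda.layernorm(input_tensor, gamma, beta, eps=1e-5)
y_soft = cuda.softmax(input_tensor)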
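
When checking GPU results like these, plain-Python references of the standard definitions are useful. The sketch below uses:

- the exact erf form of GELU,
- biased variance in LayerNorm,
- max subtraction in softmax for numerical stability.

Whether Zenith's kernels use the erf form or the tanh approximation of GELU is not stated here, so comparisons should use a tolerance, looser for FP16 paths. The helper names are hypothetical.

```python
import math

def gelu(x):
    """Exact GELU: 0.5 * x * (1 + erf(x / sqrt(2)))."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def layernorm(row, gamma, beta, eps=1e-5):
    """Normalize one row to zero mean and unit variance, then scale by gamma and shift by beta."""
    n = len(row)
    mean = sum(row) / n
    var = sum((v - mean) ** 2 for v in row) / n      # biased variance, as in LayerNorm
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std * g + b for v, g, b in zip(row, gamma, beta)]

def softmax(row):
    """Numerically stable softmax: subtract the row max before exponentiating."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def allclose(a, b, rtol=1e-3, atol=1e-4):
    """Element-wise comparison with a tolerance suitable for FP32 kernels."""
    return all(math.isclose(x, y, rel_tol=rtol, abs_tol=atol) for x, y in zip(a, b))

row = [1.0, 2.0, 3.0, 4.0]
print([round(gelu(v), 4) for v in row])
print(layernorm(row, [1.0] * 4, [0.0] * 4))
print(softmax(row))
```

Comparing one row of a kernel's output against these references catches layout and indexing mistakes before they show up as accuracy drops in a full model.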

Architecture

+-------------------------------------------------------------+
|                    Python User Interface                    |
|                  (zenith.api, zenith.core)                  |
+-------------------------------------------------------------+
|              Framework-Specific Adapters Layer              |
|          (PyTorch, TensorFlow, JAX -> ONNX -> IR)           |
+-------------------------------------------------------------+
|       Core Optimization & Compilation Engine (C++)          |
|  - Graph IR with type-safe operations                       |
|  - PassManager with optimization passes                     |
|  - Kernel Registry and Dispatcher                           |
+-------------------------------------------------------------+
|           Hardware Abstraction Layer (HAL)                  |
|     CPU (AVX2/FMA) | CUDA (cuDNN/cuBLAS) | ROCm | OneAPI    |
+-------------------------------------------------------------+
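
The Kernel Registry and Dispatcher layer maps each operation to a backend-specific implementation and picks one at run time. The following is a minimal sketch of that idea. It assumes a simple priority order, with CUDA first and CPU as the fallback. `KernelRegistry` and its methods are hypothetical names, not Zenith's C++ classes.

```python
from typing import Callable, Dict, List, Tuple

class KernelRegistry:
    """Maps (op name, backend) to an implementation and dispatches by backend priority."""

    def __init__(self, backend_priority: List[str]):
        self.backend_priority = backend_priority          # e.g. ["cuda", "cpu"]
        self._kernels: Dict[Tuple[str, str], Callable] = {}

    def register(self, op: str, backend: str):
        """Decorator that records fn as the kernel for (op, backend)."""
        def decorator(fn: Callable) -> Callable:
            self._kernels[(op, backend)] = fn
            return fn
        return decorator

    def dispatch(self, op: str, available: List[str]) -> Tuple[str, Callable]:
        """Return the highest-priority available backend that has a kernel for op."""
        for backend in self.backend_priority:
            if backend in available and (op, backend) in self._kernels:
                return backend, self._kernels[(op, backend)]
        raise NotImplementedError(f"no kernel for {op!r} on backends {available}")

registry = KernelRegistry(["cuda", "cpu"])

@registry.register("relu", "cpu")
def relu_cpu(xs):
    return [max(0.0, x) for x in xs]

backend, kernel = registry.dispatch("relu", available=["cpu"])
print(backend, kernel([-1.0, 2.0]))  # cpu [0.0, 2.0]
```

Because lookup falls through the priority list, an op with no CUDA kernel still runs on the CPU even when a GPU is present. This is one way a HAL can add backends incrementally.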

Benchmarks

BERT-Base Inference (12 layers, batch=1, seq=128)

Mode	Latency	vs PyTorch
Pure PyTorch	10.60 ms	baseline
Zenith + PyTorch	9.74 ms	1.09x faster
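
The speedup column is the ratio of the two latencies. The same numbers also give the relative latency reduction:

```python
baseline_ms = 10.60  # Pure PyTorch
zenith_ms = 9.74     # Zenith + PyTorch

speedup = baseline_ms / zenith_ms                         # ratio of latencies
latency_reduction = (baseline_ms - zenith_ms) / baseline_ms
print(f"{speedup:.2f}x faster, {latency_reduction:.1%} lower latency")
# 1.09x faster, 8.1% lower latency
```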

ResNet-50 Throughput

Batch Size	Throughput
1	150 img/sec
64	377 img/sec
512	359 img/sec

GPU Memory Pool

Metric	Value
Cache Hit Rate	93.5%
Speedup vs naive	330x
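
A memory pool saves time by recycling freed blocks instead of returning them to the driver. The cache hit rate is the fraction of allocation requests served from those recycled blocks. Below is a minimal caching-allocator sketch that uses `bytearray` as a stand-in for device memory. Zenith's pool internals and size classes are not documented here, so the 512-byte rounding and the `CachingPool` name are assumptions.

```python
from collections import defaultdict

class CachingPool:
    """Size-bucketed free lists: freed blocks are cached and reused rather than released."""

    def __init__(self, allocate):
        self._allocate = allocate          # backend allocator (bytearray here, a device malloc in practice)
        self._free = defaultdict(list)     # rounded size -> cached blocks of that size
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _round(nbytes, granularity=512):
        """Round a request up to the bucket size so similar requests share blocks."""
        return ((nbytes + granularity - 1) // granularity) * granularity

    def malloc(self, nbytes):
        size = self._round(nbytes)
        if self._free[size]:
            self.hits += 1                 # served from the cache, no backend call
            return size, self._free[size].pop()
        self.misses += 1                   # cache empty for this size: allocate fresh
        return size, self._allocate(size)

    def free(self, size, block):
        self._free[size].append(block)     # keep the block for the next request of this size

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

pool = CachingPool(bytearray)
for _ in range(10):                        # a loop with the same allocation pattern each step
    s1, a = pool.malloc(1000)
    s2, b = pool.malloc(3000)
    pool.free(s1, a)
    pool.free(s2, b)
print(f"{pool.hit_rate:.1%}")  # 90.0%
```

Only the first iteration reaches the backend allocator. Repetitive workloads such as inference or training loops are where a high hit rate, and therefore the speedup over naive allocation, comes from.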

Testing

# Run all Python tests
pytest tests/python/ -v

# Run with coverage
pytest tests/python/ --cov=zenith --cov-report=term-missing

# Run C++ unit tests (after CUDA build)
./build/tests/test_core

# Security scan
bandit -r zenith/ -ll

Test Status

Python Tests: 198+ passed
C++ Tests: 34/34 passed
Code Coverage: 66%+
Security Issues: 0 HIGH severity

Documentation

Benchmark Report - Comprehensive performance benchmarks
API Reference - Python API documentation
Architecture - System design documentation

Project Status

Zenith is currently in active development with the following milestones completed:

Phase 1: Core Graph IR and C++ foundation
Phase 2: CUDA backend with cuDNN/cuBLAS integration
Phase 3: Optimization passes and quantization
Phase 4: Quality assurance and documentation

Contributing

Contributions are welcome. Please ensure all tests pass before submitting pull requests.

# Setup development environment
pip install -e ".[dev]"

# Run tests before committing
pytest tests/python/ -v

Author

Wahyu Ardiansyah - Lead Architect and Developer

License

Apache License 2.0 - See LICENSE for details.

Release history

Version	Release date
0.3.5	Jan 1, 2026
0.3.4	Jan 1, 2026
0.3.3	Jan 1, 2026
0.3.2	Jan 1, 2026
0.3.1	Jan 1, 2026
0.3.0	Dec 29, 2025
0.2.9	Dec 27, 2025
0.2.8 (this version)	Dec 25, 2025
0.2.7	Dec 24, 2025
0.2.6	Dec 24, 2025
0.2.5	Dec 24, 2025
0.2.4	Dec 22, 2025
0.2.3	Dec 22, 2025
0.2.2	Dec 22, 2025
0.2.1	Dec 22, 2025
0.2.0	Dec 22, 2025
0.1.4	Dec 19, 2025
0.1.0	Dec 18, 2025

Download files

Source Distribution

pyzenith-0.2.8.tar.gz (416.5 kB)

Uploaded Dec 25, 2025 Source

Built Distribution

pyzenith-0.2.8-py3-none-any.whl (400.1 kB)

Uploaded Dec 25, 2025 Python 3

File details

Details for the file pyzenith-0.2.8.tar.gz.

File metadata

Download URL: pyzenith-0.2.8.tar.gz
Upload date: Dec 25, 2025
Size: 416.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for pyzenith-0.2.8.tar.gz
Algorithm	Hash digest
SHA256	`6201cca599e8235712c60b5905071544d610e5c5936d72401ab609881ea8818e`
MD5	`e086bfceaf2295f7d95b2c84b74a6590`
BLAKE2b-256	`754020c9682b781144c354cbd418543b6d8818ff3c5f0d7d163fafda703c2114`

File details

Details for the file pyzenith-0.2.8-py3-none-any.whl.

File metadata

Download URL: pyzenith-0.2.8-py3-none-any.whl
Upload date: Dec 25, 2025
Size: 400.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for pyzenith-0.2.8-py3-none-any.whl
Algorithm	Hash digest
SHA256	`60cbb36b5c27277b8b36f1f86787df82ec804f0dbc1b4f890e9cf1779d79a320`
MD5	`d44452ce28e651e89158b97c54c2951b`
BLAKE2b-256	`f6363de05483f59add7f49ff2b7b03e4549bfaf9db639ded5aa3b82077f2d275`
