SympleX – Polyhedral Tensor Superoptimizer with JAX-style purity enforcement and x86-64 JIT compilation
Project description
SympleX — Polyhedral Tensor Superoptimizer
SympleX is a C++20 compiler engine that combines equality saturation (e-graph superoptimization) with the polyhedral model to automatically discover mathematically equivalent but faster programs for AI training. It explores the space of valid program transformations — not just tile sizes — and maps the best discovered program directly onto GPU Tensor Cores, SRAM hierarchies, and distributed cluster topologies.
Architecture
[AI Model Graph / Iteration Space]
│
▼
┌──────────────────────────────────────────────────────┐
│ Level 1: Mathematical Superoptimizer (E-Graph) │
│ ┌────────────────────────────────────────────────┐ │
│ │ E-Graph: compactly represents exponentially │ │
│ │ many equivalent programs as equivalence classes │ │
│ └────────────────────────────────────────────────┘ │
│ Rewrite Rules: │
│ - A*B + A*C → A*(B+C) (factor / distribute) │
│ - ReLU(MatMul(A,B)) → FusedMatMulReLU(A,B) │
│ - MatMul(A,B)+bias → FusedMatMulAdd(A,B,bias) │
│ - A + 0 == A, A * 1 == A, A * 0 == 0 │
│ - (A@B)^T == B^T@A^T, Transpose(Transpose(A))=A │
│ - Softmax decomposition, LayerNorm decomposition │
│ - Tiling decomposition, tile-pointwise distribution│
│ Polyhedral Guardrails: │
│ - Rejects rewrites that violate dependencies │
│ Cost-Guided Extraction: │
│ - Picks cheapest program from saturated e-graph │
└──────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────┐
│ Level 2: Hardware-Mapping Search │
│ Phase 1: Roofline pruning (memory-bound filtering) │
│ Phase 2: Compute-symmetry alignment (TC multiples) │
│ Phase 3: Hardware occupancy sieve (analytical/emp.) │
└──────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────┐
│ Code Generator (PTX Emitter) │
│ - WMMA/MMA Tensor Core instructions │
│ - Shared memory swizzling (XOR bank-conflict avoid) │
│ - Async TMA / cp.async pipelines │
│ - Double-buffering & software pipelining │
└──────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────┐
│ Distributed Sharding & Fault Tolerance │
│ - 2D mesh (TP × PP × DP) │
│ - NCCL collective scheduling │
│ - 1F1B pipeline overlap │
│ - Resilient forward recovery (no rollback) │
│ - Dynamic micro-batching │
│ - Activation checkpointing (save/recompute/offload) │
└──────────────────────────────────────────────────────┘
│
▼
[Optimized GPU Binary / PTX]
Modules
| Module | Path | Description |
|---|---|---|
| Polyhedral | include/symplex/polyhedral/ |
Integer polytopes, affine maps, dependency analysis, iteration spaces |
| Schedule | include/symplex/schedule/ |
Schedule trees, tiling, operator fusion, GPU parallelization mapping |
| Hardware | include/symplex/hardware/ |
GPU topology, Tensor Core specs, memory hierarchy, roofline model |
| Optimizer | include/symplex/optimizer/ |
E-graph superoptimizer + 3-phase hardware search (roofline → symmetry → occupancy) |
| Cost Model | include/symplex/costmodel/ |
Roofline, analytical, empirical, and hybrid cost models |
| Codegen | include/symplex/codegen/ |
PTX emitter, WMMA/MMA instruction generation, swizzling, register allocation |
| Distributed | include/symplex/distributed/ |
Cluster mesh, SPMD sharding, NCCL bridge, pipeline overlap |
| Fault Tolerance | include/symplex/fault_tolerance/ |
Health monitoring, forward recovery, communicator repair, activation checkpointing |
| Training | include/symplex/training/ |
Training loop orchestrator, dynamic batch sizing, memory watchdog, JIT compiler pipeline |
Quick Start
Prerequisites
- C++20 compiler (GCC 12+, Clang 15+)
- CMake 3.20+
- Optional: CUDA Toolkit 12+ (for empirical profiling and GPU execution)
- Optional: ISL (Integer Set Library) for enhanced polyhedral analysis
- Optional: NCCL (for distributed training)
Build
git clone https://github.com/hollowguy898-cloud/SympleX.git
cd SympleX
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)
Run Tests
./symplex_tests
Run the Matmul Optimization Example
./example_matmul
This optimizes a 4096×4096×2048 matrix multiplication for the H100 GPU target, running the full 3-phase superoptimizer search and generating PTX code.
Usage
Optimize a Matrix Multiplication
#include "symplex/training/compiler_pipeline.h"
#include "symplex/hardware/hardware_target.h"
using namespace symplex::training;
using namespace symplex::hardware;
// Target: NVIDIA H100
HardwareTarget target = HardwareTarget::H100();
// Create compiler pipeline
CompilerPipeline pipeline(target);
// Optimize matmul C[M,N] += A[M,K] * B[K,N]
auto result = pipeline.compile_matmul(4096, 4096, 2048);
// result.ptx_source — generated PTX kernel
// result.estimated_latency_ns — predicted latency
// result.speedup_vs_naive — speedup over naive tiling
// result.grid_dims / block_dims — GPU launch parameters
Use the Superoptimizer Directly
#include "symplex/optimizer/superoptimizer.h"
#include "symplex/polyhedral/iteration_space.h"
#include "symplex/hardware/hardware_target.h"
using namespace symplex::optimizer;
using namespace symplex::polyhedral;
using namespace symplex::hardware;
HardwareTarget target = HardwareTarget::H100();
Superoptimizer opt(target);
auto ispace = make_matmul_iteration_space(1024, 1024, 512);
auto result = opt.optimize(ispace);
// result.best_tile — optimal TileConfig
// result.estimated_latency_ns — predicted latency
// result.speedup_vs_naive — speedup vs smallest Tensor Core tile
Distributed Training with Fault Tolerance
#include "symplex/training/training_loop.h"
#include "symplex/hardware/hardware_target.h"
using namespace symplex::training;
using namespace symplex::hardware;
TrainingConfig config;
config.global_batch_size = 2048;
config.enable_fault_tolerance = true;
config.enable_dynamic_batching = true;
HardwareTarget target = HardwareTarget::H100();
TrainingLoop loop(config, target);
auto ispace = make_matmul_iteration_space(4096, 4096, 2048);
loop.initialize(ispace);
auto results = loop.execute_epoch();
Key Mathematical Concepts
Iteration Space (I)
Every AI loop nest is modeled as an integer polytope:
I = { i ∈ Z^n | A·i + b ≥ 0 }
Data Dependency Polyhedron (D)
Dependencies are vectors in the polyhedral space that must remain lexicographically positive:
d = i_sink - i_source, d ≥ 0
Schedule Map (Φ)
The central optimization maps iteration points to hardware coordinates and time:
Φ(i) → (DeviceID, SM_ID, Warp_ID, Thread_ID, TimeStep)
3-Phase Superoptimizer Search (Level 2)
- Roofline Pruning: Drop 90% of tile configurations using analytical operational intensity bounds
- Compute-Symmetry Alignment: Only evaluate tile sizes that are exact multiples of Tensor Core fragment dimensions (16×8×16 for H100)
- Hardware Occupancy Sieve: Micro-benchmark the top candidates, selecting the configuration that maximizes SM occupancy
Equality Saturation (Level 1 — the real superoptimizer)
Unlike traditional autotuners that only sweep hardware parameters, SympleX's superoptimizer explores the space of equivalent programs using equality saturation:
- E-Graph Construction: The input program is represented as an e-graph — a data structure that compactly represents exponentially many equivalent expressions as equivalence classes
- Rewrite Rule Application: Algebraic identities, fusion patterns, and tiling decompositions are applied iteratively, growing the e-graph to represent all discovered equivalent programs
- Polyhedral Guardrails: Before any extracted program is accepted, it is validated against the original computation's data dependencies — rewrites that would violate semantics are rejected
- Cost-Guided Extraction: The cheapest program is extracted from the saturated e-graph using a bottom-up dynamic programming approach, where fused operations (e.g.,
FusedMatMulReLU) have lower cost than their unfused equivalents (ReLU(MatMul(A,B)))
Hardware Targets
Built-in profiles for:
| GPU | SMs | Tensor Core | HBM BW | SRAM/SM |
|---|---|---|---|---|
| H100 (Hopper) | 132 | 16×8×16 FP16 | 3.35 TB/s | 228 KB |
| B200 (Blackwell) | 160 | 16×8×32 FP16 | 8.0 TB/s | 304 KB |
| Generic | 84 | 16×8×16 FP16 | 2.0 TB/s | 164 KB |
Custom targets can be constructed via HardwareTarget fields.
Project Structure
SympleX/
├── CMakeLists.txt
├── LICENSE # GNU AGPL v3
├── README.md
├── include/symplex/
│ ├── polyhedral/ # Core polyhedral types
│ │ ├── integer_polytope.h
│ │ ├── affine_map.h
│ │ ├── dependency.h
│ │ ├── iteration_space.h
│ │ └── union_map.h
│ ├── schedule/ # Schedule tree & transformations
│ │ ├── schedule_tree.h
│ │ ├── tiling.h
│ │ ├── fusion.h
│ │ ├── parallelization.h
│ │ └── schedule_map.h
│ ├── hardware/ # GPU hardware models
│ │ └── hardware_target.h
│ ├── optimizer/ # Superoptimizer search
│ │ ├── tile_config.h
│ │ ├── search_phase1.h
│ │ ├── search_phase2.h
│ │ ├── search_phase3.h
│ │ └── superoptimizer.h
│ ├── costmodel/ # Performance cost models
│ │ ├── roofline.h
│ │ ├── analytical.h
│ │ ├── empirical.h
│ │ └── cost_model.h
│ ├── codegen/ # PTX code generation
│ │ ├── wmma.h
│ │ ├── swizzle.h
│ │ ├── register_allocator.h
│ │ ├── ptx_emitter.h
│ │ └── code_generator.h
│ ├── distributed/ # Distributed training
│ │ ├── mesh.h
│ │ ├── sharding.h
│ │ ├── nccl_bridge.h
│ │ └── pipeline_overlap.h
│ ├── fault_tolerance/ # Fault tolerance
│ │ ├── health_monitor.h
│ │ ├── forward_recovery.h
│ │ ├── communicator_repair.h
│ │ └── checkpoint.h
│ └── training/ # Training orchestrator
│ ├── dynamic_batch.h
│ ├── memory_watchdog.h
│ ├── training_loop.h
│ └── compiler_pipeline.h
├── src/ # Implementation files (mirrors include/)
├── tests/ # Unit tests
├── benchmarks/ # Performance benchmarks
└── examples/ # Usage examples
License
GNU Affero General Public License v3 — see LICENSE.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file simplex_tensor-1.1.0.tar.gz.
File metadata
- Download URL: simplex_tensor-1.1.0.tar.gz
- Upload date:
- Size: 479.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.13
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
f937a4c6fb9ad5a783c23cfafa02c53f05ff942de752b67064cfac24e9b21a9e
|
|
| MD5 |
12d1042965d206cf8e15e393e3de083d
|
|
| BLAKE2b-256 |
e0cfb0ceec6cefb0c3c3fe0b3a1e8e8c1b792c3dd188ca1188040e7d72dbdc5b
|
File details
Details for the file simplex_tensor-1.1.0-cp312-cp312-manylinux_2_34_x86_64.whl.
File metadata
- Download URL: simplex_tensor-1.1.0-cp312-cp312-manylinux_2_34_x86_64.whl
- Upload date:
- Size: 1.0 MB
- Tags: CPython 3.12, manylinux: glibc 2.34+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.13
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
5fa4eacaa102a746fca9969e7627b357e118b2f95eb9a6c281ddc81b0f11baeb
|
|
| MD5 |
2ff9dd2d76b04f1d716630d8a0f5cdd6
|
|
| BLAKE2b-256 |
16b71e6d854f110e806036c9e65e5215dc79ec17dce01d15853b91b8175a27eb
|