Skip to main content

SympleX – Polyhedral Tensor Superoptimizer with JAX-style purity enforcement and x86-64 JIT compilation

Project description

SympleX — Polyhedral Tensor Superoptimizer

SympleX is a C++20 compiler engine that combines equality saturation (e-graph superoptimization) with the polyhedral model to automatically discover mathematically equivalent but faster programs for AI training. It explores the space of valid program transformations — not just tile sizes — and maps the best discovered program directly onto GPU Tensor Cores, SRAM hierarchies, and distributed cluster topologies.

Architecture

[AI Model Graph / Iteration Space]
              │
              ▼
┌──────────────────────────────────────────────────────┐
│  Level 1: Mathematical Superoptimizer (E-Graph)      │
│  ┌────────────────────────────────────────────────┐  │
│  │  E-Graph: compactly represents exponentially    │  │
│  │  many equivalent programs as equivalence classes │  │
│  └────────────────────────────────────────────────┘  │
│  Rewrite Rules:                                      │
│   - A*B + A*C → A*(B+C)   (factor / distribute)    │
│   - ReLU(MatMul(A,B)) → FusedMatMulReLU(A,B)       │
│   - MatMul(A,B)+bias → FusedMatMulAdd(A,B,bias)    │
│   - A + 0 == A,  A * 1 == A,  A * 0 == 0           │
│   - (A@B)^T == B^T@A^T,  Transpose(Transpose(A))=A │
│   - Softmax decomposition, LayerNorm decomposition   │
│   - Tiling decomposition, tile-pointwise distribution│
│  Polyhedral Guardrails:                              │
│   - Rejects rewrites that violate dependencies       │
│  Cost-Guided Extraction:                             │
│   - Picks cheapest program from saturated e-graph    │
└──────────────────────────────────────────────────────┘
              │
              ▼
┌──────────────────────────────────────────────────────┐
│  Level 2: Hardware-Mapping Search                    │
│  Phase 1: Roofline pruning (memory-bound filtering)  │
│  Phase 2: Compute-symmetry alignment (TC multiples)  │
│  Phase 3: Hardware occupancy sieve (analytical/emp.) │
└──────────────────────────────────────────────────────┘
              │
              ▼
┌──────────────────────────────────────────────────────┐
│  Code Generator (PTX Emitter)                        │
│  - WMMA/MMA Tensor Core instructions                 │
│  - Shared memory swizzling (XOR bank-conflict avoid) │
│  - Async TMA / cp.async pipelines                    │
│  - Double-buffering & software pipelining             │
└──────────────────────────────────────────────────────┘
              │
              ▼
┌──────────────────────────────────────────────────────┐
│  Distributed Sharding & Fault Tolerance              │
│  - 2D mesh (TP × PP × DP)                           │
│  - NCCL collective scheduling                         │
│  - 1F1B pipeline overlap                             │
│  - Resilient forward recovery (no rollback)           │
│  - Dynamic micro-batching                            │
│  - Activation checkpointing (save/recompute/offload)  │
└──────────────────────────────────────────────────────┘
              │
              ▼
     [Optimized GPU Binary / PTX]

Modules

Module Path Description
Polyhedral include/symplex/polyhedral/ Integer polytopes, affine maps, dependency analysis, iteration spaces
Schedule include/symplex/schedule/ Schedule trees, tiling, operator fusion, GPU parallelization mapping
Hardware include/symplex/hardware/ GPU topology, Tensor Core specs, memory hierarchy, roofline model
Optimizer include/symplex/optimizer/ E-graph superoptimizer + 3-phase hardware search (roofline → symmetry → occupancy)
Cost Model include/symplex/costmodel/ Roofline, analytical, empirical, and hybrid cost models
Codegen include/symplex/codegen/ PTX emitter, WMMA/MMA instruction generation, swizzling, register allocation
Distributed include/symplex/distributed/ Cluster mesh, SPMD sharding, NCCL bridge, pipeline overlap
Fault Tolerance include/symplex/fault_tolerance/ Health monitoring, forward recovery, communicator repair, activation checkpointing
Training include/symplex/training/ Training loop orchestrator, dynamic batch sizing, memory watchdog, JIT compiler pipeline

Quick Start

Prerequisites

  • C++20 compiler (GCC 12+, Clang 15+)
  • CMake 3.20+
  • Optional: CUDA Toolkit 12+ (for empirical profiling and GPU execution)
  • Optional: ISL (Integer Set Library) for enhanced polyhedral analysis
  • Optional: NCCL (for distributed training)

Build

git clone https://github.com/hollowguy898-cloud/SympleX.git
cd SympleX
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)

Run Tests

./symplex_tests

Run the Matmul Optimization Example

./example_matmul

This optimizes a 4096×4096×2048 matrix multiplication for the H100 GPU target, running the full 3-phase superoptimizer search and generating PTX code.

Usage

Optimize a Matrix Multiplication

#include "symplex/training/compiler_pipeline.h"
#include "symplex/hardware/hardware_target.h"

using namespace symplex::training;
using namespace symplex::hardware;

// Target: NVIDIA H100
HardwareTarget target = HardwareTarget::H100();

// Create compiler pipeline
CompilerPipeline pipeline(target);

// Optimize matmul C[M,N] += A[M,K] * B[K,N]
auto result = pipeline.compile_matmul(4096, 4096, 2048);

// result.ptx_source     — generated PTX kernel
// result.estimated_latency_ns  — predicted latency
// result.speedup_vs_naive     — speedup over naive tiling
// result.grid_dims / block_dims — GPU launch parameters

Use the Superoptimizer Directly

#include "symplex/optimizer/superoptimizer.h"
#include "symplex/polyhedral/iteration_space.h"
#include "symplex/hardware/hardware_target.h"

using namespace symplex::optimizer;
using namespace symplex::polyhedral;
using namespace symplex::hardware;

HardwareTarget target = HardwareTarget::H100();
Superoptimizer opt(target);

auto ispace = make_matmul_iteration_space(1024, 1024, 512);
auto result = opt.optimize(ispace);

// result.best_tile         — optimal TileConfig
// result.estimated_latency_ns — predicted latency
// result.speedup_vs_naive  — speedup vs smallest Tensor Core tile

Distributed Training with Fault Tolerance

#include "symplex/training/training_loop.h"
#include "symplex/hardware/hardware_target.h"

using namespace symplex::training;
using namespace symplex::hardware;

TrainingConfig config;
config.global_batch_size = 2048;
config.enable_fault_tolerance = true;
config.enable_dynamic_batching = true;

HardwareTarget target = HardwareTarget::H100();
TrainingLoop loop(config, target);

auto ispace = make_matmul_iteration_space(4096, 4096, 2048);
loop.initialize(ispace);

auto results = loop.execute_epoch();

Key Mathematical Concepts

Iteration Space (I)

Every AI loop nest is modeled as an integer polytope:

I = { i ∈ Z^n | A·i + b ≥ 0 }

Data Dependency Polyhedron (D)

Dependencies are vectors in the polyhedral space that must remain lexicographically positive:

d = i_sink - i_source,  d ≥ 0

Schedule Map (Φ)

The central optimization maps iteration points to hardware coordinates and time:

Φ(i) → (DeviceID, SM_ID, Warp_ID, Thread_ID, TimeStep)

3-Phase Superoptimizer Search (Level 2)

  1. Roofline Pruning: Drop 90% of tile configurations using analytical operational intensity bounds
  2. Compute-Symmetry Alignment: Only evaluate tile sizes that are exact multiples of Tensor Core fragment dimensions (16×8×16 for H100)
  3. Hardware Occupancy Sieve: Micro-benchmark the top candidates, selecting the configuration that maximizes SM occupancy

Equality Saturation (Level 1 — the real superoptimizer)

Unlike traditional autotuners that only sweep hardware parameters, SympleX's superoptimizer explores the space of equivalent programs using equality saturation:

  1. E-Graph Construction: The input program is represented as an e-graph — a data structure that compactly represents exponentially many equivalent expressions as equivalence classes
  2. Rewrite Rule Application: Algebraic identities, fusion patterns, and tiling decompositions are applied iteratively, growing the e-graph to represent all discovered equivalent programs
  3. Polyhedral Guardrails: Before any extracted program is accepted, it is validated against the original computation's data dependencies — rewrites that would violate semantics are rejected
  4. Cost-Guided Extraction: The cheapest program is extracted from the saturated e-graph using a bottom-up dynamic programming approach, where fused operations (e.g., FusedMatMulReLU) have lower cost than their unfused equivalents (ReLU(MatMul(A,B)))

Hardware Targets

Built-in profiles for:

GPU SMs Tensor Core HBM BW SRAM/SM
H100 (Hopper) 132 16×8×16 FP16 3.35 TB/s 228 KB
B200 (Blackwell) 160 16×8×32 FP16 8.0 TB/s 304 KB
Generic 84 16×8×16 FP16 2.0 TB/s 164 KB

Custom targets can be constructed via HardwareTarget fields.

Project Structure

SympleX/
├── CMakeLists.txt
├── LICENSE                          # GNU AGPL v3
├── README.md
├── include/symplex/
│   ├── polyhedral/                  # Core polyhedral types
│   │   ├── integer_polytope.h
│   │   ├── affine_map.h
│   │   ├── dependency.h
│   │   ├── iteration_space.h
│   │   └── union_map.h
│   ├── schedule/                    # Schedule tree & transformations
│   │   ├── schedule_tree.h
│   │   ├── tiling.h
│   │   ├── fusion.h
│   │   ├── parallelization.h
│   │   └── schedule_map.h
│   ├── hardware/                    # GPU hardware models
│   │   └── hardware_target.h
│   ├── optimizer/                   # Superoptimizer search
│   │   ├── tile_config.h
│   │   ├── search_phase1.h
│   │   ├── search_phase2.h
│   │   ├── search_phase3.h
│   │   └── superoptimizer.h
│   ├── costmodel/                   # Performance cost models
│   │   ├── roofline.h
│   │   ├── analytical.h
│   │   ├── empirical.h
│   │   └── cost_model.h
│   ├── codegen/                     # PTX code generation
│   │   ├── wmma.h
│   │   ├── swizzle.h
│   │   ├── register_allocator.h
│   │   ├── ptx_emitter.h
│   │   └── code_generator.h
│   ├── distributed/                 # Distributed training
│   │   ├── mesh.h
│   │   ├── sharding.h
│   │   ├── nccl_bridge.h
│   │   └── pipeline_overlap.h
│   ├── fault_tolerance/             # Fault tolerance
│   │   ├── health_monitor.h
│   │   ├── forward_recovery.h
│   │   ├── communicator_repair.h
│   │   └── checkpoint.h
│   └── training/                    # Training orchestrator
│       ├── dynamic_batch.h
│       ├── memory_watchdog.h
│       ├── training_loop.h
│       └── compiler_pipeline.h
├── src/                             # Implementation files (mirrors include/)
├── tests/                           # Unit tests
├── benchmarks/                      # Performance benchmarks
└── examples/                        # Usage examples

License

GNU Affero General Public License v3 — see LICENSE.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

simplex_tensor-1.5.2-cp312-cp312-manylinux_2_34_x86_64.whl (1.1 MB view details)

Uploaded CPython 3.12manylinux: glibc 2.34+ x86-64

File details

Details for the file simplex_tensor-1.5.2-cp312-cp312-manylinux_2_34_x86_64.whl.

File metadata

File hashes

Hashes for simplex_tensor-1.5.2-cp312-cp312-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 566df180b3a61b77b26457581e228b7bcf7f77cb84da002929fcd0288e20b4d9
MD5 c34e241ca684af6f8e76c587e737c2ae
BLAKE2b-256 7e5ba23ca9235ac7b1bfcf8c9c6cfbdc35475f8278415f89b970fed51553102f

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page