SympleX – Polyhedral Tensor Superoptimizer with JAX-style purity enforcement and x86-64 JIT compilation

These details have not been verified by PyPI

Project links

Project description

SympleX — Polyhedral Tensor Superoptimizer

SympleX is a C++20 compiler engine that combines equality saturation (e-graph superoptimization) with the polyhedral model to automatically discover mathematically equivalent but faster programs for AI training. It explores the space of valid program transformations — not just tile sizes — and maps the best discovered program directly onto GPU Tensor Cores, SRAM hierarchies, and distributed cluster topologies.

Architecture

[AI Model Graph / Iteration Space]
              │
              ▼
┌──────────────────────────────────────────────────────┐
│  Level 1: Mathematical Superoptimizer (E-Graph)      │
│  ┌────────────────────────────────────────────────┐  │
│  │  E-Graph: compactly represents exponentially    │  │
│  │  many equivalent programs as equivalence classes │  │
│  └────────────────────────────────────────────────┘  │
│  Rewrite Rules:                                      │
│   - A*B + A*C → A*(B+C)   (factor / distribute)    │
│   - ReLU(MatMul(A,B)) → FusedMatMulReLU(A,B)       │
│   - MatMul(A,B)+bias → FusedMatMulAdd(A,B,bias)    │
│   - A + 0 == A,  A * 1 == A,  A * 0 == 0           │
│   - (A@B)^T == B^T@A^T,  Transpose(Transpose(A))=A │
│   - Softmax decomposition, LayerNorm decomposition   │
│   - Tiling decomposition, tile-pointwise distribution│
│  Polyhedral Guardrails:                              │
│   - Rejects rewrites that violate dependencies       │
│  Cost-Guided Extraction:                             │
│   - Picks cheapest program from saturated e-graph    │
└──────────────────────────────────────────────────────┘
              │
              ▼
┌──────────────────────────────────────────────────────┐
│  Level 2: Hardware-Mapping Search                    │
│  Phase 1: Roofline pruning (memory-bound filtering)  │
│  Phase 2: Compute-symmetry alignment (TC multiples)  │
│  Phase 3: Hardware occupancy sieve (analytical/emp.) │
└──────────────────────────────────────────────────────┘
              │
              ▼
┌──────────────────────────────────────────────────────┐
│  Code Generator (PTX Emitter)                        │
│  - WMMA/MMA Tensor Core instructions                 │
│  - Shared memory swizzling (XOR bank-conflict avoid) │
│  - Async TMA / cp.async pipelines                    │
│  - Double-buffering & software pipelining             │
└──────────────────────────────────────────────────────┘
              │
              ▼
┌──────────────────────────────────────────────────────┐
│  Distributed Sharding & Fault Tolerance              │
│  - 2D mesh (TP × PP × DP)                           │
│  - NCCL collective scheduling                         │
│  - 1F1B pipeline overlap                             │
│  - Resilient forward recovery (no rollback)           │
│  - Dynamic micro-batching                            │
│  - Activation checkpointing (save/recompute/offload)  │
└──────────────────────────────────────────────────────┘
              │
              ▼
     [Optimized GPU Binary / PTX]

Modules

Module	Path	Description
Polyhedral	`include/symplex/polyhedral/`	Integer polytopes, affine maps, dependency analysis, iteration spaces
Schedule	`include/symplex/schedule/`	Schedule trees, tiling, operator fusion, GPU parallelization mapping
Hardware	`include/symplex/hardware/`	GPU topology, Tensor Core specs, memory hierarchy, roofline model
Optimizer	`include/symplex/optimizer/`	E-graph superoptimizer + 3-phase hardware search (roofline → symmetry → occupancy)
Cost Model	`include/symplex/costmodel/`	Roofline, analytical, empirical, and hybrid cost models
Codegen	`include/symplex/codegen/`	PTX emitter, WMMA/MMA instruction generation, swizzling, register allocation
Distributed	`include/symplex/distributed/`	Cluster mesh, SPMD sharding, NCCL bridge, pipeline overlap
Fault Tolerance	`include/symplex/fault_tolerance/`	Health monitoring, forward recovery, communicator repair, activation checkpointing
Training	`include/symplex/training/`	Training loop orchestrator, dynamic batch sizing, memory watchdog, JIT compiler pipeline

Quick Start

Prerequisites

C++20 compiler (GCC 12+, Clang 15+)
CMake 3.20+
Optional: CUDA Toolkit 12+ (for empirical profiling and GPU execution)
Optional: ISL (Integer Set Library) for enhanced polyhedral analysis
Optional: NCCL (for distributed training)

Build

git clone https://github.com/hollowguy898-cloud/SympleX.git
cd SympleX
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)

Run Tests

./symplex_tests

Run the Matmul Optimization Example

./example_matmul

This optimizes a 4096×4096×2048 matrix multiplication for the H100 GPU target, running the full 3-phase superoptimizer search and generating PTX code.

Usage

Optimize a Matrix Multiplication

#include "symplex/training/compiler_pipeline.h"
#include "symplex/hardware/hardware_target.h"

using namespace symplex::training;
using namespace symplex::hardware;

// Target: NVIDIA H100
HardwareTarget target = HardwareTarget::H100();

// Create compiler pipeline
CompilerPipeline pipeline(target);

// Optimize matmul C[M,N] += A[M,K] * B[K,N]
auto result = pipeline.compile_matmul(4096, 4096, 2048);

// result.ptx_source     — generated PTX kernel
// result.estimated_latency_ns  — predicted latency
// result.speedup_vs_naive     — speedup over naive tiling
// result.grid_dims / block_dims — GPU launch parameters

Use the Superoptimizer Directly

#include "symplex/optimizer/superoptimizer.h"
#include "symplex/polyhedral/iteration_space.h"
#include "symplex/hardware/hardware_target.h"

using namespace symplex::optimizer;
using namespace symplex::polyhedral;
using namespace symplex::hardware;

HardwareTarget target = HardwareTarget::H100();
Superoptimizer opt(target);

auto ispace = make_matmul_iteration_space(1024, 1024, 512);
auto result = opt.optimize(ispace);

// result.best_tile         — optimal TileConfig
// result.estimated_latency_ns — predicted latency
// result.speedup_vs_naive  — speedup vs smallest Tensor Core tile

Distributed Training with Fault Tolerance

#include "symplex/training/training_loop.h"
#include "symplex/hardware/hardware_target.h"

using namespace symplex::training;
using namespace symplex::hardware;

TrainingConfig config;
config.global_batch_size = 2048;
config.enable_fault_tolerance = true;
config.enable_dynamic_batching = true;

HardwareTarget target = HardwareTarget::H100();
TrainingLoop loop(config, target);

auto ispace = make_matmul_iteration_space(4096, 4096, 2048);
loop.initialize(ispace);

auto results = loop.execute_epoch();

Key Mathematical Concepts

Iteration Space (I)

Every AI loop nest is modeled as an integer polytope:

I = { i ∈ Z^n | A·i + b ≥ 0 }

Data Dependency Polyhedron (D)

Dependencies are vectors in the polyhedral space that must remain lexicographically positive:

d = i_sink - i_source,  d ≥ 0

Schedule Map (Φ)

The central optimization maps iteration points to hardware coordinates and time:

Φ(i) → (DeviceID, SM_ID, Warp_ID, Thread_ID, TimeStep)

3-Phase Superoptimizer Search (Level 2)

Roofline Pruning: Drop 90% of tile configurations using analytical operational intensity bounds
Compute-Symmetry Alignment: Only evaluate tile sizes that are exact multiples of Tensor Core fragment dimensions (16×8×16 for H100)
Hardware Occupancy Sieve: Micro-benchmark the top candidates, selecting the configuration that maximizes SM occupancy

Equality Saturation (Level 1 — the real superoptimizer)

Unlike traditional autotuners that only sweep hardware parameters, SympleX's superoptimizer explores the space of equivalent programs using equality saturation:

E-Graph Construction: The input program is represented as an e-graph — a data structure that compactly represents exponentially many equivalent expressions as equivalence classes
Rewrite Rule Application: Algebraic identities, fusion patterns, and tiling decompositions are applied iteratively, growing the e-graph to represent all discovered equivalent programs
Polyhedral Guardrails: Before any extracted program is accepted, it is validated against the original computation's data dependencies — rewrites that would violate semantics are rejected
Cost-Guided Extraction: The cheapest program is extracted from the saturated e-graph using a bottom-up dynamic programming approach, where fused operations (e.g., FusedMatMulReLU) have lower cost than their unfused equivalents (ReLU(MatMul(A,B)))

Hardware Targets

Built-in profiles for:

GPU	SMs	Tensor Core	HBM BW	SRAM/SM
H100 (Hopper)	132	16×8×16 FP16	3.35 TB/s	228 KB
B200 (Blackwell)	160	16×8×32 FP16	8.0 TB/s	304 KB
Generic	84	16×8×16 FP16	2.0 TB/s	164 KB

Custom targets can be constructed via HardwareTarget fields.

Project Structure

SympleX/
├── CMakeLists.txt
├── LICENSE                          # GNU AGPL v3
├── README.md
├── include/symplex/
│   ├── polyhedral/                  # Core polyhedral types
│   │   ├── integer_polytope.h
│   │   ├── affine_map.h
│   │   ├── dependency.h
│   │   ├── iteration_space.h
│   │   └── union_map.h
│   ├── schedule/                    # Schedule tree & transformations
│   │   ├── schedule_tree.h
│   │   ├── tiling.h
│   │   ├── fusion.h
│   │   ├── parallelization.h
│   │   └── schedule_map.h
│   ├── hardware/                    # GPU hardware models
│   │   └── hardware_target.h
│   ├── optimizer/                   # Superoptimizer search
│   │   ├── tile_config.h
│   │   ├── search_phase1.h
│   │   ├── search_phase2.h
│   │   ├── search_phase3.h
│   │   └── superoptimizer.h
│   ├── costmodel/                   # Performance cost models
│   │   ├── roofline.h
│   │   ├── analytical.h
│   │   ├── empirical.h
│   │   └── cost_model.h
│   ├── codegen/                     # PTX code generation
│   │   ├── wmma.h
│   │   ├── swizzle.h
│   │   ├── register_allocator.h
│   │   ├── ptx_emitter.h
│   │   └── code_generator.h
│   ├── distributed/                 # Distributed training
│   │   ├── mesh.h
│   │   ├── sharding.h
│   │   ├── nccl_bridge.h
│   │   └── pipeline_overlap.h
│   ├── fault_tolerance/             # Fault tolerance
│   │   ├── health_monitor.h
│   │   ├── forward_recovery.h
│   │   ├── communicator_repair.h
│   │   └── checkpoint.h
│   └── training/                    # Training orchestrator
│       ├── dynamic_batch.h
│       ├── memory_watchdog.h
│       ├── training_loop.h
│       └── compiler_pipeline.h
├── src/                             # Implementation files (mirrors include/)
├── tests/                           # Unit tests
├── benchmarks/                      # Performance benchmarks
└── examples/                        # Usage examples

License

GNU Affero General Public License v3 — see LICENSE.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

1.5.3

Jun 5, 2026

This version

1.5.2

Jun 5, 2026

1.5.1

Jun 5, 2026

1.5.0

Jun 4, 2026

1.4.3

Jun 4, 2026

1.4.2

Jun 4, 2026

1.4.1

Jun 4, 2026

1.4.0

Jun 4, 2026

1.3.0

Jun 4, 2026

1.2.0

Jun 4, 2026

1.1.0

Jun 4, 2026

1.0.0

Jun 4, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

simplex_tensor-1.5.2-cp312-cp312-manylinux_2_34_x86_64.whl (1.1 MB view details)

Uploaded Jun 5, 2026 CPython 3.12manylinux: glibc 2.34+ x86-64

File details

Details for the file simplex_tensor-1.5.2-cp312-cp312-manylinux_2_34_x86_64.whl.

File metadata

Download URL: simplex_tensor-1.5.2-cp312-cp312-manylinux_2_34_x86_64.whl
Upload date: Jun 5, 2026
Size: 1.1 MB
Tags: CPython 3.12, manylinux: glibc 2.34+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for simplex_tensor-1.5.2-cp312-cp312-manylinux_2_34_x86_64.whl
Algorithm	Hash digest
SHA256	`566df180b3a61b77b26457581e228b7bcf7f77cb84da002929fcd0288e20b4d9`
MD5	`c34e241ca684af6f8e76c587e737c2ae`
BLAKE2b-256	`7e5ba23ca9235ac7b1bfcf8c9c6cfbdc35475f8278415f89b970fed51553102f`

See more details on using hashes here.

simplex-tensor 1.5.2

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

SympleX — Polyhedral Tensor Superoptimizer

Architecture

Modules

Quick Start

Prerequisites

Build

Run Tests

Run the Matmul Optimization Example

Usage

Optimize a Matrix Multiplication

Use the Superoptimizer Directly

Distributed Training with Fault Tolerance

Key Mathematical Concepts

Iteration Space (I)

Data Dependency Polyhedron (D)

Schedule Map (Φ)

3-Phase Superoptimizer Search (Level 2)

Equality Saturation (Level 1 — the real superoptimizer)

Hardware Targets

Project Structure

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distributions

Built Distribution

File details

File metadata

File hashes