
Stream - Multi-core accelerator design space exploration with layer-fused scheduling


🌊 Stream


Stream is a design space exploration (DSE) and constraint-optimization framework for heterogeneous dataflow accelerators: accelerator systems built by combining cores that each have their own dataflow and performance model (AIE and TPU-like are two example core types among others). Scheduling is layer-fused, and the TETRA constraint optimization uses MILP (Mixed-Integer Linear Programming) to decide tensor placement and transfer paths across the cores of such a system. Stream builds on top of ZigZag for per-core cost estimation.

📖 Explore the Documentation

🚀 Getting Started Guide


✨ Key Features

Heterogeneous dataflow cores: compose an accelerator from cores that each carry their own dataflow and cost model (AIE, TPU-like, pooling, SIMD, and more).

Layer-fused scheduling across the whole system of cores.

TETRA constraint optimization: a MILP (TransferAndTensorAllocator) decides tensor placement and transfer-path routing.

Pluggable solver backends: OR-Tools GSCIP (default, license-free), OR-Tools HiGHS, and Gurobi behind one unified SolverModel API.

ONNX workloads with auto-generated or hand-written mappings.

AMD AIE code generation: emit aie / aiex MLIR for the Ryzen AI NPU, ready for the mlir-aie / IRON toolchain.

Built for AI agents: an MCP server and typed IR models expose the pipeline programmatically.

The pipeline runs as a chain of stages: parse → tile → cost → MILP allocation → memory estimation.
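
One way to picture the stage chain is as a sequence of callables that pass a shared context along. The sketch below only illustrates that general pattern: the function names, the dict-based context, and the stage labels are hypothetical stand-ins, not Stream's actual stage classes.

```python
from typing import Callable

Context = dict[str, object]
Stage = Callable[[Context], Context]

# Labels mirror the stage order described above; the identifiers are illustrative.
STAGE_NAMES = ["parse", "tile", "cost", "milp_allocation", "memory_estimation"]


def make_stage(name: str) -> Stage:
    """Build a placeholder stage that only records that it ran."""

    def stage(ctx: Context) -> Context:
        # Return a new dict rather than mutating the incoming context.
        trace = [*ctx.get("trace", []), name]
        return {**ctx, "trace": trace}

    return stage


def run_pipeline(stages: list[Stage], ctx: Context) -> Context:
    """Feed the context through each stage in order."""
    for stage in stages:
        ctx = stage(ctx)
    return ctx


result = run_pipeline([make_stage(n) for n in STAGE_NAMES], {"workload": "example.onnx"})
print(" -> ".join(result["trace"]))
# parse -> tile -> cost -> milp_allocation -> memory_estimation
```

Each stage reads what earlier stages produced and adds its own results, which is why the final context can expose values such as the total latency.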


🚀 Installation

Python >=3.12 is required.

Full install with MCP server support (from the repo root):

pip install -e ".[mcp]"

Base install (no MCP server):

pip install -e .

The authoritative dependency source is pyproject.toml (package stream-dse). The base install pulls in zigzag-dse, ortools>=9.15 (the default, license-free MILP backend), pydantic, pydot, and xdsl. Optional extras: [mcp] adds fastmcp (required for the MCP server); [gurobi] adds gurobipy (commercial solver, opt-in).

AIE code generation

AIE-target MLIR codegen and tracing additionally need the AMD AIE toolchain (mlir_aie, llvm-aie, xdsl-aie, snax-mlir, aie-python-extras). These are git/URL installs that PyPI does not allow in package metadata, so a console script installs them after the base install rather than via an extra:

pip install -e .       # or, once published: pip install stream-dse
stream-setup-aie       # installs the AIE toolchain into the current environment

stream-setup-aie --dry-run prints exactly what it will install without making changes.

⚠️ Platform caveat: the AIE toolchain is Linux x86_64 only (manylinux wheels), CPython 3.12 or 3.13.
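
The platform caveat can be checked up front before running stream-setup-aie. This is a minimal sketch using only the standard library; the helper names are hypothetical and the check encodes nothing beyond the constraints stated above (Linux, x86_64, CPython 3.12 or 3.13).

```python
import platform
import sys

SUPPORTED_PYTHONS = {(3, 12), (3, 13)}


def aie_toolchain_supported(system: str, machine: str, py: tuple) -> bool:
    """True when the given platform triple matches the stated caveat."""
    return system == "Linux" and machine == "x86_64" and tuple(py) in SUPPORTED_PYTHONS


def current_platform_supported() -> bool:
    """Apply the same check to the running interpreter."""
    return platform.python_implementation() == "CPython" and aie_toolchain_supported(
        platform.system(), platform.machine(), sys.version_info[:2]
    )


if not current_platform_supported():
    print("AIE toolchain wheels are not expected to install on this platform.")
```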

💡 Solver license note: OR-Tools (ortools_gscip, the default backend) is open-source and needs no license. Gurobi requires the [gurobi] extra (pip install -e ".[gurobi]") plus a separate commercial license; backend="gurobi" errors at solve time without a valid license.
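
A caller that wants Gurobi when available but a safe fallback otherwise could select the backend name as sketched below. The helper pick_backend is hypothetical; the two backend strings are the ones named in this README. Note the limitation: finding the gurobipy package says nothing about whether a valid license is present, so a solve can still fail.

```python
import importlib.util


def pick_backend(prefer_gurobi: bool = False) -> str:
    """Return "gurobi" only if requested and gurobipy is importable, else the default."""
    if prefer_gurobi and importlib.util.find_spec("gurobipy") is not None:
        # Package found; license validity is NOT checked here.
        return "gurobi"
    return "ortools_gscip"


print(pick_backend())
# ortools_gscip
```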

Optional pre-commit setup:

pre-commit install

⚡ Quick Start

Run the CO pipeline on a small two-Conv workload (a committed test fixture) with an auto-generated mapping (approximately 11 seconds):

python scripts/main_stream_co.py \
  --hardware stream/inputs/examples/hardware/tpu_like_quad_core.yaml \
  --workload stream/inputs/testing/workload/2conv_1_8_32_32_16_32_3.onnx

Alternatively, run just co-2conv: this repo uses just as a task runner, and the recipe defaults to tpu_like_quad_core (see the matrix below). Because --mapping is omitted, the pipeline auto-generates the mapping; the hardware is a TPU-like quad-core system.

Expected output:

Total latency: 14344.0
  Group 0: 14344 (100.0%, wall=9.4s)

A YAML summary is written to outputs/.../summary.yaml with total_latency: 14344.0, plus workload/tiling/schedule PNG visualizations.


🧩 Hardware and Core Types

An accelerator in Stream is described as a system of heterogeneous dataflow cores. Core roles include compute, memory, shim, and offchip; example dataflow core types include AIE, TPU-like, and pooling.

Hardware and mapping files are organized as follows:

  • stream/inputs/examples/hardware/ - system-level hardware YAMLs (e.g. tpu_like_quad_core.yaml, eyeriss_like_*.yaml, simba*.yaml, fusemax.yaml).
  • stream/inputs/examples/hardware/cores/ - per-core-type YAMLs (e.g. tpu_like.yaml, pooling.yaml, simd.yaml, offchip.yaml, eyeriss_like.yaml).
  • stream/inputs/aie/hardware/ and stream/inputs/aie/hardware/cores/ - AMD AIE example core types (e.g. aie_tile.yaml, mem_tile_256KB.yaml, shim_dma.yaml).
  • stream/inputs/examples/mapping/, stream/inputs/aie/mapping/, and stream/inputs/testing/mapping/ - mapping descriptions.

A mapping can be auto-generated (as in Quick Start above) or hand-written and passed via --mapping.


📊 Workload × Hardware Matrix

The generic CO pipeline runs any ONNX workload on any of the example hardware systems. The repo ships two small workloads and exercises them across all eight non-AIE example architectures, both from the scripts/main_stream_co.py entry point and from the pytest suite (tests/test_hardware_combinations.py).

Workloads - committed test fixtures under stream/inputs/testing/workload/. Weight values are cleared because only tensor shapes matter for cost estimation, which keeps the ONNX files tiny; just gen-workloads regenerates them via the builders:

  • 2-conv - two chained Conv layers (make_2_conv.py).
  • swiglu - a 5-node SwiGLU block: two Gemms, SiLU, an elementwise Mul, and a down-projection Gemm (make_swiglu.py).

| Hardware (stream/inputs/examples/hardware/) | Description | 2-conv | swiglu |
|---|---|---|---|
| eyeriss_like_single_core | one Eyeriss-like compute core (+ pooling, SIMD, DRAM) | ✓ | ✓ |
| eyeriss_like_dual_core | two Eyeriss-like compute cores | ✓ | ✓ |
| eyeriss_like_quad_core | four Eyeriss-like compute cores | ✓ | ✓ |
| tpu_like_quad_core | four TPU-like compute cores | ✓ | ✓ |
| simba_small | small Simba chiplet mesh | ✓ | ✓ |
| simba | 36-core Simba chiplet mesh | ✓ | ✓ |
| fusemax | FuseMax array + vector + DRAM | ✓ | ✓ |
| meta_prototype_dual_core_simd_offchip | two Meta-prototype compute cores (+ pooling, SIMD, DRAM) | ✓ | ✓ |

✓ = completes through the generic CO pipeline. All combinations run in the default fast suite; on these small single-fusion-group workloads even the 36-core simba mesh finishes in seconds.

Run one combination - the justfile wraps scripts/main_stream_co.py; hw is any hardware stem from the table (default tpu_like_quad_core):

just co-2conv fusemax           # 2-conv on an architecture
just co-swiglu simba_small      # swiglu on an architecture

Equivalently, the raw entry-point call:

python scripts/main_stream_co.py \
  --hardware stream/inputs/examples/hardware/fusemax.yaml \
  --workload stream/inputs/testing/workload/2conv_1_8_32_32_16_32_3.onnx

Run the whole matrix - the justfile wraps pytest tests/test_hardware_combinations.py, which runs 2-conv + swiglu over all eight architectures plus a parse-only check confirming every hardware definition loads:

just matrix          # parse + 2-conv + swiglu over all 8 architectures (incl. simba)
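
For scripted sweeps outside pytest, the per-combination just invocations can be enumerated as below. This is only a sketch: the hardware stems and the co-2conv / co-swiglu recipe names are taken from this README, the helper matrix_commands is hypothetical, and actually executing the commands (for example with subprocess.run) is left to the caller.

```python
from itertools import product

HARDWARE_STEMS = [
    "eyeriss_like_single_core",
    "eyeriss_like_dual_core",
    "eyeriss_like_quad_core",
    "tpu_like_quad_core",
    "simba_small",
    "simba",
    "fusemax",
    "meta_prototype_dual_core_simd_offchip",
]
JUST_RECIPES = ["co-2conv", "co-swiglu"]


def matrix_commands() -> list:
    """One `just <recipe> <hardware>` argv list per matrix cell."""
    return [["just", recipe, hw] for hw, recipe in product(HARDWARE_STEMS, JUST_RECIPES)]


commands = matrix_commands()
for cmd in commands:
    print(" ".join(cmd))
```

The supported way to run the full matrix remains just matrix; the enumeration is useful mainly when individual cells need separate logs or timeouts.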

🖥️ Command-Line Entry Points

All entry-point scripts live in scripts/ and are run from the repo root (so relative input paths resolve and stream imports as the installed package).

| Script | Purpose |
|---|---|
| scripts/main_stream_co.py | Generic CO pipeline for any workload + hardware pair; manual or auto-generated mapping; YAML summary output. General-purpose (non-AIE). |
| scripts/main_gemm.py | CO allocation + optional AIE MLIR codegen for GEMM workloads (AMD Strix AIE). |
| scripts/main_swiglu.py | CO allocation + optional AIE MLIR codegen for SwiGLU workloads (AMD Strix AIE). |
| scripts/main_swiglu_dse_single.py | Single-mapping SwiGLU DSE evaluation (AIE). |
| scripts/main_swiglu_dse.py | Multi-mapping SwiGLU DSE sweep over tile sizes (AIE). |
| scripts/main_aie_co.py | CO allocation for a hard-coded single AIE tile workload (no args; run as python scripts/main_aie_co.py). |
| scripts/main_gemm_codegen.py | Direct GEMM → AIE MLIR codegen via xDSL transforms (no CO pipeline); --M/--N/--K. |

scripts/main_stream_co.py is the general-purpose entry point. The others are AIE-specific: they hardwire AMD Strix or single-tile AIE hardware, and codegen requires NPU hardware. Note that scripts/main_aie_co.py takes no arguments (all paths are hard-coded). Plotting and trace post-processing utilities live in scripts/analysis/.

Full scripts/main_stream_co.py CLI syntax:

python scripts/main_stream_co.py \
  --hardware PATH_TO_HW_YAML \
  --workload PATH_TO_ONNX \
  [--mapping PATH_TO_MAPPING_YAML]  # omit for auto-generated mapping
  [--output OUTPUT_DIR]             # default: "outputs"
  [--experiment-id ID]
  [--skip-if-exists]
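
When the entry point is driven from another Python program, the argument list can be assembled as follows. The flags are exactly those in the synopsis above; the helper build_co_command is hypothetical, and the sketch only builds the argv list without launching anything.

```python
import sys
from typing import Optional


def build_co_command(
    hardware: str,
    workload: str,
    mapping: Optional[str] = None,
    output: Optional[str] = None,
    experiment_id: Optional[str] = None,
    skip_if_exists: bool = False,
) -> list:
    """Assemble argv for scripts/main_stream_co.py; optional flags are added only when set."""
    cmd = [sys.executable, "scripts/main_stream_co.py", "--hardware", hardware, "--workload", workload]
    if mapping is not None:
        cmd += ["--mapping", mapping]  # omitted -> auto-generated mapping
    if output is not None:
        cmd += ["--output", output]
    if experiment_id is not None:
        cmd += ["--experiment-id", experiment_id]
    if skip_if_exists:
        cmd.append("--skip-if-exists")
    return cmd


cmd = build_co_command(
    hardware="stream/inputs/examples/hardware/fusemax.yaml",
    workload="stream/inputs/testing/workload/2conv_1_8_32_32_16_32_3.onnx",
    skip_if_exists=True,
)
print(" ".join(cmd[1:]))
```

Passing the resulting list to subprocess.run(cmd, check=True) from the repo root would then run the pipeline, since the script expects relative input paths to resolve from there.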

🐍 Public API

The public API lives in stream/api.py.

The primary entry point is optimize_allocation_co_generic, which auto-generates the mapping from the workload and hardware (no hand-written mapping YAML needed). This snippet is confirmed to run and print total_latency: 14344.0 (the 2-conv ONNX it references is produced by just gen-workloads):

import tempfile
from stream.api import configure_logging, optimize_allocation_co_generic

configure_logging()

with tempfile.TemporaryDirectory() as tmp:
    ctx = optimize_allocation_co_generic(
        hardware="stream/inputs/examples/hardware/tpu_like_quad_core.yaml",
        workload="stream/inputs/testing/workload/2conv_1_8_32_32_16_32_3.onnx",
        experiment_id="my-first-run",
        output_path=tmp,
    )
    print("total_latency:", ctx.get("total_latency"))
    print("group_latencies:", ctx.get("group_latencies"))

Expected output: total_latency: 14344.0.

The other two public functions:

  • optimize_allocation_co_with_mapping(hardware, workload, mapping, experiment_id, output_path, ...) - runs CO with a hand-written mapping YAML. optimize_allocation_co is a backward-compatible alias for it (both names importable).
  • optimize_mapping(hardware, workload, experiment_id, output_path, max_nb_mappings=20, ...) - DSE pipeline: enumerates mapping variants and runs CO for each.

All three return a StageContext. Useful keys: ctx.get("total_latency"), ctx.get("group_latencies"), ctx.get("scheduler"), ctx.get("workload"), ctx.get("accelerator").
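
Because every function returns a context with a total_latency key, results from several runs can be compared with a few lines of glue code. The sketch below is an assumption-laden illustration: rank_by_latency is a hypothetical helper, plain dicts stand in for StageContext objects (only .get is used), and the latency numbers are placeholders, not measured results.

```python
from typing import Any, Mapping, Protocol


class SupportsGet(Protocol):
    """Anything exposing ctx.get(key), as StageContext is described to do."""

    def get(self, key: str) -> Any: ...


def rank_by_latency(runs: Mapping[str, SupportsGet]) -> list:
    """Return (run name, total latency) pairs, fastest first."""
    scored = [(name, float(ctx.get("total_latency"))) for name, ctx in runs.items()]
    return sorted(scored, key=lambda item: item[1])


# Placeholder contexts; in practice these would come from the API calls above.
placeholder_runs = {
    "run_a": {"total_latency": 300.0},
    "run_b": {"total_latency": 100.0},
    "run_c": {"total_latency": 200.0},
}
ranking = rank_by_latency(placeholder_runs)
print(ranking[0])
# ('run_b', 100.0)
```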


🤖 MCP Server (for AI agents)

Stream ships an MCP server (stream/mcp/server.py, server name stream) that lets an AI agent submit and inspect TETRA CO jobs. It requires the [mcp] extra (pip install -e ".[mcp]").

⚠️ Install caveat: [mcp] does not currently resolve against the pinned PyPI xdsl 0.29.1 - fastmcp's dependency tree needs newer typing-extensions/pydantic than xdsl 0.29.1 permits. For now it installs only in the dev environment that uses the git build of xdsl; a clean fix awaits the xdsl upgrade.

Launch command (from the repo root):

python3 -c "from stream.mcp.server import mcp; mcp.run(transport='stdio')"

The server runs on STDIO (JSON-RPC) transport and blocks until the client disconnects.

The 6 tools:

| Tool | Purpose |
|---|---|
| run_optimization(hardware, workload, mapping, output_path, backend, ...) | Submit a TETRA CO job; returns a job_id immediately; solve runs in the background. |
| poll_optimization(job_id) | Check job status (pending / running / complete / failed / not_found). |
| get_workload_ir(workload=None, experiment_id=None) | Return the workload DAG as WorkloadIR JSON. |
| get_accelerator_ir(hardware=None, experiment_id=None) | Return the hardware model as AcceleratorIR JSON. |
| get_allocation_ir(job_id) | Return the TETRA allocation result as AllocationIR JSON (3 persona views). |
| get_solve_stats(job_id) | Return MILP solve statistics (objective, time, gap, node count, backend). |

Run / poll / inspect flow:

  1. run_optimization(...) returns {"job_id": "...", "status": "pending"}.
  2. Poll poll_optimization(job_id) until {"status": "complete"}.
  3. Inspect with get_allocation_ir(job_id) for the AllocationIR (algorithmic / hardware / compiler views) and get_solve_stats(job_id) for solve statistics.
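
The polling step in the flow above can be sketched as a small loop with a timeout. The code is client-agnostic: wait_for_job and the poll callable are hypothetical, with poll standing in for however the agent's MCP client invokes poll_optimization. Only the status strings come from the tool table; treating failed and not_found as terminal is an assumption.

```python
import time
from typing import Callable

TERMINAL_STATUSES = {"complete", "failed", "not_found"}


def wait_for_job(
    poll: Callable[[str], dict],
    job_id: str,
    interval_s: float = 1.0,
    timeout_s: float = 600.0,
) -> dict:
    """Call poll(job_id) until a terminal status is returned or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        reply = poll(job_id)
        if reply.get("status") in TERMINAL_STATUSES:
            return reply
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {reply.get('status')!r} after {timeout_s}s")
        time.sleep(interval_s)


# Stand-in poller that walks through the documented status sequence.
_statuses = iter(["pending", "running", "complete"])


def fake_poll(job_id: str) -> dict:
    return {"job_id": job_id, "status": next(_statuses)}


final = wait_for_job(fake_poll, "demo-job", interval_s=0.0)
print(final["status"])
# complete
```

Callers should check the returned status before requesting the AllocationIR, since the loop also returns for failed and not_found.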

🧠 Working in This Repo (AI agents)

Programmatic / IR API for structured JSON output:

from stream.ir import WorkloadIR, AcceleratorIR, AllocationIR

# After running optimize_allocation_co_generic(...)
workload_ir = WorkloadIR.from_internal(ctx.get("workload"))
accelerator_ir = AcceleratorIR.from_internal(ctx.get("accelerator"))
allocation_ir = AllocationIR.from_internal(ctx.get("scheduler"))

workload_data = workload_ir.model_dump()      # JSON-compatible dict
hardware_data = accelerator_ir.model_dump()
allocation_data = allocation_ir.model_dump()

AllocationIR offers .algorithmic_view(), .hardware_view(), and .compiler_view() persona views.
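
The dicts returned by model_dump() can be persisted for later inspection with the standard json module. In this sketch write_ir_json is a hypothetical helper and a small stand-in dict replaces real IR data; default=str is a defensive assumption in case a value is not directly JSON-serializable.

```python
import json
import tempfile
from pathlib import Path


def write_ir_json(data: dict, path: Path) -> None:
    """Write an IR dict as indented JSON; non-serializable values fall back to str()."""
    path.write_text(json.dumps(data, indent=2, sort_keys=True, default=str), encoding="utf-8")


# Stand-in for e.g. workload_ir.model_dump(); the keys here are illustrative only.
stand_in = {"name": "example", "nodes": [1, 2, 3]}

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "workload_ir.json"
    write_ir_json(stand_in, target)
    round_trip = json.loads(target.read_text(encoding="utf-8"))

print(round_trip == stand_in)
# True
```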


📚 Further Documentation
