Stream - Multi-core accelerator design space exploration with layer-fused scheduling
🌊 Stream
Stream is a design space exploration (DSE) and constraint-optimization framework for heterogeneous dataflow accelerators: accelerator systems built by combining cores that each have their own dataflow and performance model (AIE and TPU-like are two example core types among others). Scheduling is layer-fused, and the TETRA constraint optimization uses MILP (Mixed-Integer Linear Programming) to decide tensor placement and transfer paths across the cores of such a system. Stream builds on top of ZigZag for per-core cost estimation.
📖 Explore the Documentation
🚀 Getting Started Guide
✨ Key Features
✔ Heterogeneous dataflow cores: compose an accelerator from cores that each carry their own dataflow and cost model (AIE, TPU-like, pooling, SIMD, and more).
✔ Layer-fused scheduling across the whole system of cores.
✔ TETRA constraint optimization: a MILP (TransferAndTensorAllocator) decides tensor placement and transfer-path routing.
✔ Pluggable solver backends: OR-Tools GSCIP (default, license-free), OR-Tools HiGHS, and Gurobi behind one unified SolverModel API.
✔ ONNX workloads with auto-generated or hand-written mappings.
✔ AMD AIE code generation: emit aie / aiex MLIR for the Ryzen AI NPU, ready for the mlir-aie / IRON toolchain.
✔ Built for AI agents: an MCP server and typed IR models expose the pipeline programmatically.
The pipeline runs as a chain of stages: parse → tile → cost → MILP allocation → memory estimation.
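The stage chain can be pictured as functions that each read a shared context and hand an extended copy to the next stage. The following is a minimal sketch of that pattern only: `make_stage`, `run_pipeline`, and the placeholder stage bodies are hypothetical and are not Stream's actual stage classes.

```python
from typing import Callable

# A context is a plain dict that every stage reads from and adds to.
Context = dict
Stage = Callable[[Context], Context]


def make_stage(name: str) -> Stage:
    """Build a placeholder stage that only records that it ran."""

    def stage(ctx: Context) -> Context:
        completed = list(ctx.get("completed", []))
        completed.append(name)
        # Return a new dict so earlier contexts are never mutated.
        return {**ctx, "completed": completed}

    return stage


def run_pipeline(stages: list, ctx: Context) -> Context:
    """Run the stages in order, feeding each one the previous context."""
    for stage in stages:
        ctx = stage(ctx)
    return ctx


# Stage names follow the order listed above; the bodies are stand-ins.
STAGE_NAMES = ["parse", "tile", "cost", "milp_allocation", "memory_estimation"]
result = run_pipeline(
    [make_stage(name) for name in STAGE_NAMES],
    {"workload": "example.onnx"},  # hypothetical input, not a real fixture
)
print(result["completed"])
```

Keeping each stage a pure function of the context makes it easy to stop after any stage and inspect what has been produced so far.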
🚀 Installation
Python >=3.12 is required.
Full install with MCP server support (from the repo root):
pip install -e ".[mcp]"
Base install (no MCP server):
pip install -e .
The authoritative dependency source is pyproject.toml (package stream-dse). The base install pulls in zigzag-dse, ortools>=9.15 (the default, license-free MILP backend), pydantic, pydot, and xdsl. Optional extras: [mcp] adds fastmcp (required for the MCP server); [gurobi] adds gurobipy (commercial solver, opt-in).
AIE code generation
AIE-target MLIR codegen and tracing additionally need the AMD AIE toolchain (mlir_aie, llvm-aie, xdsl-aie, snax-mlir, aie-python-extras). These are git/URL installs that PyPI does not allow in package metadata, so a console script installs them after the base install rather than via an extra:
pip install -e . # or, once published: pip install stream-dse
stream-setup-aie # installs the AIE toolchain into the current environment
stream-setup-aie --dry-run prints exactly what it will install without making changes.
⚠️ Platform caveat: the AIE toolchain is Linux x86_64 only (manylinux wheels), CPython 3.12 or 3.13.
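A quick pre-flight check can tell you whether an environment matches this caveat before you run stream-setup-aie. This is a sketch under the assumption that "Linux x86_64" corresponds to `platform.system() == "Linux"` and `platform.machine() == "x86_64"`; the helper names are hypothetical and not part of Stream.

```python
import platform
import sys


def aie_toolchain_supported(
    system: str,
    machine: str,
    version: tuple,
    implementation: str = "CPython",
) -> bool:
    """Return True when the given platform matches the stated caveat."""
    return (
        system == "Linux"
        and machine == "x86_64"
        and implementation == "CPython"
        and tuple(version) in {(3, 12), (3, 13)}
    )


def current_platform_supported() -> bool:
    """Apply the same check to the interpreter running this code."""
    return aie_toolchain_supported(
        platform.system(),
        platform.machine(),
        (sys.version_info.major, sys.version_info.minor),
        platform.python_implementation(),
    )


print("AIE toolchain supported here:", current_platform_supported())
```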
💡 Solver license note: OR-Tools (ortools_gscip, the default backend) is open-source and needs no license. Gurobi requires the [gurobi] extra (pip install -e ".[gurobi]") plus a separate commercial license; backend="gurobi" errors at solve time without a valid license.
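One way to act on this note is to fall back to the license-free backend whenever the Gurobi Python package is not importable. The sketch below is an assumption-laden helper, not part of Stream: `choose_backend` is a hypothetical name, and it only returns the backend strings mentioned above. Note that finding the gurobipy package does not prove a valid license is present, so a solve can still fail later.

```python
import importlib.util
from typing import Optional


def choose_backend(want_gurobi: bool, gurobi_installed: Optional[bool] = None) -> str:
    """Pick a backend name, preferring Gurobi only when it is importable."""
    if gurobi_installed is None:
        # find_spec returns None when the package cannot be located.
        gurobi_installed = importlib.util.find_spec("gurobipy") is not None
    if want_gurobi and gurobi_installed:
        return "gurobi"
    return "ortools_gscip"  # default, license-free backend


print(choose_backend(want_gurobi=True))
```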
Optional pre-commit setup:
pre-commit install
⚡ Quick Start
Run the CO pipeline on a small two-Conv workload (a committed test fixture) with an auto-generated mapping (approximately 11 seconds):
python scripts/main_stream_co.py \
--hardware stream/inputs/examples/hardware/tpu_like_quad_core.yaml \
--workload stream/inputs/testing/workload/2conv_1_8_32_32_16_32_3.onnx
Or use the shorthand just co-2conv (this repo uses just as a task runner; the recipe defaults to tpu_like_quad_core, see the matrix below). Because --mapping is omitted, the pipeline auto-generates the mapping; the hardware is a TPU-like quad-core system.
Expected output:
Total latency: 14344.0
Group 0: 14344 (100.0%, wall=9.4s)
A YAML summary is written to outputs/.../summary.yaml with total_latency: 14344.0, plus workload/tiling/schedule PNG visualizations.
🧩 Hardware and Core Types
An accelerator in Stream is described as a system of heterogeneous dataflow cores. Core roles include compute, memory, shim, and offchip; example dataflow core types include AIE, TPU-like, and pooling.
Hardware and mapping files are organized as follows:
- stream/inputs/examples/hardware/ - system-level hardware YAMLs (e.g. tpu_like_quad_core.yaml, eyeriss_like_*.yaml, simba*.yaml, fusemax.yaml).
- stream/inputs/examples/hardware/cores/ - per-core-type YAMLs (e.g. tpu_like.yaml, pooling.yaml, simd.yaml, offchip.yaml, eyeriss_like.yaml).
- stream/inputs/aie/hardware/ and stream/inputs/aie/hardware/cores/ - AMD AIE example core types (e.g. aie_tile.yaml, mem_tile_256KB.yaml, shim_dma.yaml).
- stream/inputs/examples/mapping/, stream/inputs/aie/mapping/, and stream/inputs/testing/mapping/ - mapping descriptions.
A mapping can be auto-generated (as in Quick Start above) or hand-written and passed via --mapping.
📊 Workload × Hardware Matrix
The generic CO pipeline runs any ONNX workload on any of the example hardware systems. The repo ships two small workloads and exercises them across all eight non-AIE example architectures, both from the scripts/main_stream_co.py entry point and from the pytest suite (tests/test_hardware_combinations.py).
Workloads - committed test fixtures under stream/inputs/testing/workload/. Weight values are cleared because only tensor shapes matter for cost estimation, so the ONNX files stay tiny; just gen-workloads regenerates them via the builders:
- 2-conv - two chained Conv layers (make_2_conv.py).
- swiglu - a 5-node SwiGLU block: two Gemms, SiLU, an elementwise Mul, and a down-projection Gemm (make_swiglu.py).
| Hardware (stream/inputs/examples/hardware/) | Description | 2-conv | swiglu |
|---|---|---|---|
| eyeriss_like_single_core | one Eyeriss-like compute core (+ pooling, SIMD, DRAM) | ✓ | ✓ |
| eyeriss_like_dual_core | two Eyeriss-like compute cores | ✓ | ✓ |
| eyeriss_like_quad_core | four Eyeriss-like compute cores | ✓ | ✓ |
| tpu_like_quad_core | four TPU-like compute cores | ✓ | ✓ |
| simba_small | small Simba chiplet mesh | ✓ | ✓ |
| simba | 36-core Simba chiplet mesh | ✓ | ✓ |
| fusemax | FuseMax array + vector + DRAM | ✓ | ✓ |
| meta_prototype_dual_core_simd_offchip | two Meta-prototype compute cores (+ pooling, SIMD, DRAM) | ✓ | ✓ |
✓ = completes through the generic CO pipeline. All combinations run in the default fast suite; on these small single-fusion-group workloads even the 36-core simba mesh finishes in seconds.
Run one combination - the justfile wraps scripts/main_stream_co.py; hw is any hardware stem from the table (default tpu_like_quad_core):
just co-2conv fusemax # 2-conv on an architecture
just co-swiglu simba_small # swiglu on an architecture
Equivalently, the raw entry-point call:
python scripts/main_stream_co.py \
--hardware stream/inputs/examples/hardware/fusemax.yaml \
--workload stream/inputs/testing/workload/2conv_1_8_32_32_16_32_3.onnx
Run the whole matrix - the justfile wraps pytest tests/test_hardware_combinations.py, which runs 2-conv + swiglu over all eight architectures plus a parse-only check confirming every hardware definition loads:
just matrix # parse + 2-conv + swiglu over all 8 architectures (incl. simba)
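If you prefer to drive the raw entry point yourself rather than go through just, the per-architecture commands can be generated from the hardware stems in the table. This is a minimal sketch that only builds and prints the argument lists. It assumes each stem maps to a `<stem>.yaml` file in stream/inputs/examples/hardware/ (as the fusemax and tpu_like_quad_core examples suggest), and `co_command` is a hypothetical helper name. Only the 2-conv fixture is used because its file name is the only one given here.

```python
HARDWARE_DIR = "stream/inputs/examples/hardware"
WORKLOAD_2CONV = "stream/inputs/testing/workload/2conv_1_8_32_32_16_32_3.onnx"

# Hardware stems copied from the matrix table.
HARDWARE_STEMS = [
    "eyeriss_like_single_core",
    "eyeriss_like_dual_core",
    "eyeriss_like_quad_core",
    "tpu_like_quad_core",
    "simba_small",
    "simba",
    "fusemax",
    "meta_prototype_dual_core_simd_offchip",
]


def co_command(hw_stem: str, workload: str = WORKLOAD_2CONV) -> list:
    """Build the argv list for one hardware x workload combination."""
    return [
        "python",
        "scripts/main_stream_co.py",
        "--hardware",
        f"{HARDWARE_DIR}/{hw_stem}.yaml",
        "--workload",
        workload,
    ]


commands = [co_command(stem) for stem in HARDWARE_STEMS]
for command in commands:
    print(" ".join(command))
```

Each list can be passed to `subprocess.run(command, check=True)` from the repo root to execute the run.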
🖥️ Command-Line Entry Points
All entry-point scripts live in scripts/ and are run from the repo root (so relative input paths resolve and stream imports as the installed package).
| Script | Purpose |
|---|---|
| scripts/main_stream_co.py | Generic CO pipeline for any workload + hardware pair; manual or auto-generated mapping; YAML summary output. General-purpose (non-AIE). |
| scripts/main_gemm.py | CO allocation + optional AIE MLIR codegen for GEMM workloads (AMD Strix AIE). |
| scripts/main_swiglu.py | CO allocation + optional AIE MLIR codegen for SwiGLU workloads (AMD Strix AIE). |
| scripts/main_swiglu_dse_single.py | Single-mapping SwiGLU DSE evaluation (AIE). |
| scripts/main_swiglu_dse.py | Multi-mapping SwiGLU DSE sweep over tile sizes (AIE). |
| scripts/main_aie_co.py | CO allocation for a hard-coded single AIE tile workload (no args; run as python scripts/main_aie_co.py). |
| scripts/main_gemm_codegen.py | Direct GEMM → AIE MLIR codegen via xDSL transforms (no CO pipeline); --M/--N/--K. |
scripts/main_stream_co.py is the general-purpose entry point. The others are AIE-specific: they hardwire AMD Strix or single-tile AIE hardware, and codegen requires NPU hardware. Note that scripts/main_aie_co.py takes no arguments (all paths are hard-coded). Plotting and trace post-processing utilities live in scripts/analysis/.
Full scripts/main_stream_co.py CLI syntax:
python scripts/main_stream_co.py \
--hardware PATH_TO_HW_YAML \
--workload PATH_TO_ONNX \
[--mapping PATH_TO_MAPPING_YAML] # omit for auto-generated mapping
[--output OUTPUT_DIR] # default: "outputs"
[--experiment-id ID]
[--skip-if-exists]
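To make the required and optional flags concrete, the sketch below mirrors the documented syntax with argparse. It is an illustration of the interface only, not the script's actual parser: treating --skip-if-exists as a boolean switch and leaving --mapping and --experiment-id as None when omitted are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser that mirrors the documented flags."""
    parser = argparse.ArgumentParser(prog="main_stream_co.py")
    parser.add_argument("--hardware", required=True, help="path to hardware YAML")
    parser.add_argument("--workload", required=True, help="path to ONNX workload")
    # Omitting --mapping means the mapping is auto-generated.
    parser.add_argument("--mapping", default=None, help="path to mapping YAML")
    parser.add_argument("--output", default="outputs", help="output directory")
    parser.add_argument("--experiment-id", default=None)
    parser.add_argument("--skip-if-exists", action="store_true")
    return parser


args = build_parser().parse_args(
    [
        "--hardware",
        "stream/inputs/examples/hardware/tpu_like_quad_core.yaml",
        "--workload",
        "stream/inputs/testing/workload/2conv_1_8_32_32_16_32_3.onnx",
    ]
)
print(args.mapping, args.output, args.skip_if_exists)
```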
🐍 Public API
The public API lives in stream/api.py.
The primary entry point is optimize_allocation_co_generic, which auto-generates the mapping from the workload and hardware (no hand-written mapping YAML needed). This snippet is confirmed to run and print total_latency: 14344.0 (the 2-conv ONNX it references is produced by just gen-workloads):
import tempfile
from stream.api import configure_logging, optimize_allocation_co_generic
configure_logging()
with tempfile.TemporaryDirectory() as tmp:
    ctx = optimize_allocation_co_generic(
        hardware="stream/inputs/examples/hardware/tpu_like_quad_core.yaml",
        workload="stream/inputs/testing/workload/2conv_1_8_32_32_16_32_3.onnx",
        experiment_id="my-first-run",
        output_path=tmp,
    )
    print("total_latency:", ctx.get("total_latency"))
    print("group_latencies:", ctx.get("group_latencies"))
Expected output: total_latency: 14344.0.
The other two public functions:
- optimize_allocation_co_with_mapping(hardware, workload, mapping, experiment_id, output_path, ...) - runs CO with a hand-written mapping YAML. optimize_allocation_co is a backward-compatible alias for it (both names importable).
- optimize_mapping(hardware, workload, experiment_id, output_path, max_nb_mappings=20, ...) - DSE pipeline: enumerates mapping variants and runs CO for each.
All three return a StageContext. Useful keys: ctx.get("total_latency"), ctx.get("group_latencies"), ctx.get("scheduler"), ctx.get("workload"), ctx.get("accelerator").
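As an example of post-processing these keys, the sketch below computes each fusion group's share of the total latency, similar to the percentage shown in the Quick Start output. It uses a plain dict as a stand-in for the returned context and assumes group_latencies is a sequence of numbers; the real container type may differ, and `latency_shares` is a hypothetical helper.

```python
def latency_shares(total_latency: float, group_latencies: list) -> list:
    """Return each group's latency as a percentage of the total."""
    if total_latency <= 0:
        raise ValueError("total_latency must be positive")
    return [100.0 * latency / total_latency for latency in group_latencies]


# Stand-in for the returned context, filled with the Quick Start values.
ctx = {"total_latency": 14344.0, "group_latencies": [14344.0]}

shares = latency_shares(ctx["total_latency"], ctx["group_latencies"])
for index, (latency, share) in enumerate(zip(ctx["group_latencies"], shares)):
    print(f"Group {index}: {latency:.0f} ({share:.1f}%)")
# prints: Group 0: 14344 (100.0%)
```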
🤖 MCP Server (for AI agents)
Stream ships an MCP server (stream/mcp/server.py, server name stream) that lets an AI agent submit and inspect TETRA CO jobs. Requires the [mcp] extra (pip install -e ".[mcp]").
Launch command (from the repo root):
python3 -c "from stream.mcp.server import mcp; mcp.run(transport='stdio')"
The server runs on STDIO (JSON-RPC) transport and blocks until the client disconnects.
The 6 tools:
| Tool | Purpose |
|---|---|
| run_optimization(hardware, workload, mapping, output_path, backend, ...) | Submit a TETRA CO job; returns a job_id immediately; solve runs in the background. |
| poll_optimization(job_id) | Check job status (pending / running / complete / failed / not_found). |
| get_workload_ir(workload=None, experiment_id=None) | Return the workload DAG as WorkloadIR JSON. |
| get_accelerator_ir(hardware=None, experiment_id=None) | Return the hardware model as AcceleratorIR JSON. |
| get_allocation_ir(job_id) | Return the TETRA allocation result as AllocationIR JSON (3 persona views). |
| get_solve_stats(job_id) | Return MILP solve statistics (objective, time, gap, node count, backend). |
Run / poll / inspect flow:
1. run_optimization(...) returns {"job_id": "...", "status": "pending"}.
2. Poll poll_optimization(job_id) until {"status": "complete"}.
3. Inspect with get_allocation_ir(job_id) for the AllocationIR (algorithmic / hardware / compiler views) and get_solve_stats(job_id) for solve statistics.
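The polling step of this flow can be sketched as a small loop that stops on any terminal status, so a failed or unknown job does not spin forever. This is a client-side illustration under stated assumptions: `wait_for_job` and `make_fake_poll` are hypothetical, the fake poll function stands in for a real MCP call to poll_optimization, and the reply is assumed to be a dict with a "status" key as shown above.

```python
import time
from typing import Callable

# Statuses after which polling should stop (taken from the tool table).
TERMINAL_STATUSES = {"complete", "failed", "not_found"}


def wait_for_job(
    poll: Callable[[str], dict],
    job_id: str,
    interval_s: float = 1.0,
    max_polls: int = 600,
) -> dict:
    """Poll until the job reaches a terminal status or the budget runs out."""
    for _ in range(max_polls):
        reply = poll(job_id)
        if reply.get("status") in TERMINAL_STATUSES:
            return reply
        time.sleep(interval_s)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")


def make_fake_poll() -> Callable[[str], dict]:
    """Stand-in for poll_optimization: pending, then running, then complete."""
    statuses = iter(["pending", "running", "complete"])

    def poll(job_id: str) -> dict:
        return {"job_id": job_id, "status": next(statuses)}

    return poll


final = wait_for_job(make_fake_poll(), "demo-job", interval_s=0.0)
print(final["status"])
```

In a real client, `poll` would wrap the MCP tool call, and the caller would check for "failed" or "not_found" before requesting the allocation IR.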
🧠 Working in This Repo (AI agents)
Programmatic / IR API for structured JSON output:
from stream.ir import WorkloadIR, AcceleratorIR, AllocationIR
# After running optimize_allocation_co_generic(...)
workload_ir = WorkloadIR.from_internal(ctx.get("workload"))
accelerator_ir = AcceleratorIR.from_internal(ctx.get("accelerator"))
allocation_ir = AllocationIR.from_internal(ctx.get("scheduler"))
workload_data = workload_ir.model_dump() # JSON-compatible dict
hardware_data = accelerator_ir.model_dump()
allocation_data = allocation_ir.model_dump()
AllocationIR offers .algorithmic_view(), .hardware_view(), and .compiler_view() persona views.
📚 Further Documentation
- Hosted documentation site: kuleuven-micas.github.io/stream, the human-facing docs (installation, getting started, the workload/hardware/mapping input formats, and driving Stream from an AI agent via the MCP server and IR models), rebuilt from docs/ on every push to main.
- Stream paper (IEEE): A. Symons, L. Mei, S. Colleman, P. Houshmand, S. Karl and M. Verhelst, "Stream: Design Space Exploration of Layer-Fused DNNs on Heterogeneous Dataflow Accelerators".
- ZigZag: zigzag-project.github.io/zigzag, the per-core cost-estimation framework Stream builds on.