Stream - Multi-core accelerator design space exploration with layer-fused scheduling

🌊 Stream

Stream is a design space exploration (DSE) and constraint-optimization framework for heterogeneous dataflow accelerators: accelerator systems built by combining cores that each have their own dataflow and performance model (AIE and TPU-like are two example core types among others). Scheduling is layer-fused, and the TETRA constraint optimization uses MILP (Mixed-Integer Linear Programming) to decide tensor placement and transfer paths across the cores of such a system. Stream builds on top of ZigZag for per-core cost estimation.

📖 Explore the Documentation

🚀 Getting Started Guide

✨ Key Features

✔ Heterogeneous dataflow cores: compose an accelerator from cores that each carry their own dataflow and cost model (AIE, TPU-like, pooling, SIMD, and more).

✔ Layer-fused scheduling across the whole system of cores.

✔ TETRA constraint optimization: a MILP (TransferAndTensorAllocator) decides tensor placement and transfer-path routing.

✔ Pluggable solver backends: OR-Tools GSCIP (default, license-free), OR-Tools HiGHS, and Gurobi behind one unified SolverModel API.

✔ ONNX workloads with auto-generated or hand-written mappings.

✔ AMD AIE code generation: emit aie / aiex MLIR for the Ryzen AI NPU, ready for the mlir-aie / IRON toolchain.

✔ Built for AI agents: an MCP server and typed IR models expose the pipeline programmatically.

The pipeline runs as a chain of stages: parse → tile → cost → MILP allocation → memory estimation.
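To make the stage chain concrete, here is a minimal sketch of how a context object can flow through an ordered list of stages. It is illustrative only: the placeholder stages, the `make_stage` and `run_pipeline` helpers, and the `completed` key are hypothetical and are not Stream's actual stage classes; only the stage order comes from the description above.

```python
from typing import Callable

Context = dict[str, object]
Stage = Callable[[Context], Context]

# Stage order taken from the README; the identifiers themselves are illustrative.
STAGE_NAMES = ["parse", "tile", "cost", "milp_allocation", "memory_estimation"]


def make_stage(name: str) -> Stage:
    """Return a placeholder stage that only records that it ran."""

    def stage(ctx: Context) -> Context:
        completed = list(ctx.get("completed", []))
        completed.append(name)
        # Each stage returns a new context instead of mutating its input.
        return {**ctx, "completed": completed}

    return stage


def run_pipeline(ctx: Context, stages: list[Stage]) -> Context:
    """Feed the context through every stage in order."""
    for stage in stages:
        ctx = stage(ctx)
    return ctx


result = run_pipeline({"workload": "2conv"}, [make_stage(n) for n in STAGE_NAMES])
print(result["completed"])
```

In a chain like this, each stage reads what earlier stages left in the context and adds its own results, which is why the final context can be queried for keys such as the total latency.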

🚀 Installation

Python >=3.12 is required.

Full install with MCP server support (from the repo root):

pip install -e ".[mcp]"

Base install (no MCP server):

pip install -e .

The authoritative dependency source is pyproject.toml (package stream-dse). The base install pulls in zigzag-dse, ortools>=9.15 (the default, license-free MILP backend), pydantic, pydot, and xdsl. Optional extras: [mcp] adds fastmcp (required for the MCP server); [gurobi] adds gurobipy (commercial solver, opt-in).
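After installing, it can be handy to confirm which of these distributions are actually present in the environment. The sketch below uses only the standard library; the helper name `installed_versions` is hypothetical, and the distribution names are the ones listed above.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as listed in the README's dependency summary.
BASE_DEPENDENCIES = ["stream-dse", "zigzag-dse", "ortools", "pydantic", "pydot", "xdsl"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(BASE_DEPENDENCIES).items():
    print(f"{name}: {ver or 'not installed'}")
```

This only reports what is installed; it does not check the version constraints, which remain defined in pyproject.toml.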

AIE code generation

AIE-target MLIR codegen and tracing additionally need the AMD AIE toolchain (mlir_aie, llvm-aie, xdsl-aie, snax-mlir, aie-python-extras). These are git/URL installs that PyPI does not allow in package metadata, so a console script installs them after the base install rather than via an extra:

pip install -e .       # or, once published: pip install stream-dse
stream-setup-aie       # installs the AIE toolchain into the current environment

stream-setup-aie --dry-run prints exactly what it will install without making changes.

⚠️ Platform caveat: the AIE toolchain is Linux x86_64 only (manylinux wheels), CPython 3.12 or 3.13.
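A quick pre-flight check along these lines can tell whether the current interpreter matches that caveat before running stream-setup-aie. This is a sketch, not part of Stream: the function name `aie_toolchain_supported` is hypothetical, and it only encodes the platform constraints stated above.

```python
import platform
import sys

SUPPORTED_PYTHONS = {(3, 12), (3, 13)}


def aie_toolchain_supported(system=None, machine=None, python=None, implementation=None):
    """Return True if the given (or current) platform matches the stated caveat."""
    system = sys.platform if system is None else system
    machine = platform.machine() if machine is None else machine
    python = tuple(sys.version_info[:2]) if python is None else tuple(python)
    if implementation is None:
        implementation = platform.python_implementation()
    return (
        system.startswith("linux")
        and machine == "x86_64"
        and implementation == "CPython"
        and python in SUPPORTED_PYTHONS
    )


print("AIE toolchain supported here:", aie_toolchain_supported())
```

The arguments exist so the check can be exercised for other platforms; called with no arguments it inspects the running interpreter.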

💡 Solver license note: OR-Tools (ortools_gscip, the default backend) is open-source and needs no license. Gurobi requires the [gurobi] extra (pip install -e ".[gurobi]") plus a separate commercial license; backend="gurobi" errors at solve time without a valid license.
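One way to act on this note is to fall back to the default backend whenever the Gurobi Python package is not installed. The sketch below is an assumption-laden helper, not a Stream API: `pick_backend` is a hypothetical name, and it only checks that gurobipy is importable, not that a valid license exists (a missing license would still surface at solve time).

```python
import importlib.util


def pick_backend(prefer_gurobi=False):
    """Return "gurobi" only if requested and gurobipy is importable, else the default."""
    if prefer_gurobi and importlib.util.find_spec("gurobipy") is not None:
        return "gurobi"
    # ortools_gscip is the license-free default named in the README.
    return "ortools_gscip"


print(pick_backend(prefer_gurobi=True))
```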

Optional pre-commit setup:

pre-commit install

⚡ Quick Start

Run the CO pipeline on a small two-Conv workload (a committed test fixture) with an auto-generated mapping (approximately 11 seconds):

python scripts/main_stream_co.py \
  --hardware stream/inputs/examples/hardware/tpu_like_quad_core.yaml \
  --workload stream/inputs/testing/workload/2conv_1_8_32_32_16_32_3.onnx

Or run the equivalent just recipe, just co-2conv (this repo uses just as a task runner; the recipe defaults to tpu_like_quad_core, see the matrix below). Because --mapping is omitted, the pipeline auto-generates the mapping; the hardware is a TPU-like quad-core system.

Expected output:

Total latency: 14344.0
  Group 0: 14344 (100.0%, wall=9.4s)

A YAML summary is written to outputs/.../summary.yaml with total_latency: 14344.0, plus workload/tiling/schedule PNG visualizations.
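To pick the latency back up from a script, the summary files can be located and scanned without extra dependencies. This is a rough sketch under two assumptions: that total_latency appears as a top-level `total_latency: <number>` line in summary.yaml, and that searching recursively under outputs/ is acceptable (the exact subdirectory layout is not spelled out here). The helper names are hypothetical.

```python
import re
from pathlib import Path

# Assumes a top-level "total_latency: <number>" line in the summary file.
_TOTAL_LATENCY = re.compile(r"^total_latency:\s*([0-9.eE+-]+)\s*$", re.MULTILINE)


def read_total_latency(summary_path):
    """Return total_latency from a summary file as a float, or None if absent."""
    text = Path(summary_path).read_text(encoding="utf-8")
    match = _TOTAL_LATENCY.search(text)
    return float(match.group(1)) if match else None


def find_summaries(output_root="outputs"):
    """Return every summary.yaml below output_root (empty if the folder is missing)."""
    root = Path(output_root)
    if not root.is_dir():
        return []
    return sorted(root.rglob("summary.yaml"))


for path in find_summaries():
    print(path, read_total_latency(path))
```

A full YAML parser would be more robust if the summary format is nested; the regular expression is used here only to keep the sketch dependency-free.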

🧩 Hardware and Core Types

An accelerator in Stream is described as a system of heterogeneous dataflow cores. Core roles include compute, memory, shim, and offchip; example dataflow core types include AIE, TPU-like, and pooling.

Hardware and mapping files are organized as follows:

stream/inputs/examples/hardware/ - system-level hardware YAMLs (e.g. tpu_like_quad_core.yaml, eyeriss_like_*.yaml, simba*.yaml, fusemax.yaml).
stream/inputs/examples/hardware/cores/ - per-core-type YAMLs (e.g. tpu_like.yaml, pooling.yaml, simd.yaml, offchip.yaml, eyeriss_like.yaml).
stream/inputs/aie/hardware/ and stream/inputs/aie/hardware/cores/ - AMD AIE example core types (e.g. aie_tile.yaml, mem_tile_256KB.yaml, shim_dma.yaml).
stream/inputs/examples/mapping/, stream/inputs/aie/mapping/, and stream/inputs/testing/mapping/ - mapping descriptions.
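Given this layout, the available system-level hardware descriptions can be listed by globbing the top-level directory. The following is a small sketch with a hypothetical helper name; it assumes it is run from the repo root and that system-level files sit directly in the hardware folder while per-core files live in cores/.

```python
from pathlib import Path


def hardware_stems(root="stream/inputs/examples/hardware"):
    """Return the stems of system-level hardware YAMLs directly under root."""
    root = Path(root)
    if not root.is_dir():
        return []
    # Non-recursive glob, so per-core files under cores/ are not included.
    return sorted(path.stem for path in root.glob("*.yaml"))


print(hardware_stems())
```

Each returned stem is the name without the .yaml suffix, for example tpu_like_quad_core.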

A mapping can be auto-generated (as in Quick Start above) or hand-written and passed via --mapping.

📊 Workload × Hardware Matrix

The generic CO pipeline runs any ONNX workload on any of the example hardware systems. The repo ships two small workloads and exercises them across all eight non-AIE example architectures, both from the scripts/main_stream_co.py entry point and from the pytest suite (tests/test_hardware_combinations.py).

Workloads - committed test fixtures under stream/inputs/testing/workload/. Weight values are cleared because only tensor shapes matter for cost estimation, so the ONNX files stay tiny; just gen-workloads regenerates them via the builders:

2-conv - two chained Conv layers (make_2_conv.py).
swiglu - a 5-node SwiGLU block: two Gemms, SiLU, an elementwise Mul, and a down-projection Gemm (make_swiglu.py).

Hardware (`stream/inputs/examples/hardware/`)	Description	2-conv	swiglu
`eyeriss_like_single_core`	one Eyeriss-like compute core (+ pooling, SIMD, DRAM)	✓	✓
`eyeriss_like_dual_core`	two Eyeriss-like compute cores	✓	✓
`eyeriss_like_quad_core`	four Eyeriss-like compute cores	✓	✓
`tpu_like_quad_core`	four TPU-like compute cores	✓	✓
`simba_small`	small Simba chiplet mesh	✓	✓
`simba`	36-core Simba chiplet mesh	✓	✓
`fusemax`	FuseMax array + vector + DRAM	✓	✓
`meta_prototype_dual_core_simd_offchip`	two Meta-prototype compute cores (+ pooling, SIMD, DRAM)	✓	✓

✓ = completes through the generic CO pipeline. All combinations run in the default fast suite; on these small single-fusion-group workloads even the 36-core simba mesh finishes in seconds.

Run one combination - the justfile wraps scripts/main_stream_co.py; hw is any hardware stem from the table (default tpu_like_quad_core):

just co-2conv fusemax           # 2-conv on an architecture
just co-swiglu simba_small      # swiglu on an architecture

Equivalently, the raw entry-point call:

python scripts/main_stream_co.py \
  --hardware stream/inputs/examples/hardware/fusemax.yaml \
  --workload stream/inputs/testing/workload/2conv_1_8_32_32_16_32_3.onnx

Run the whole matrix - the justfile wraps pytest tests/test_hardware_combinations.py, which runs 2-conv + swiglu over all eight architectures plus a parse-only check confirming every hardware definition loads:

just matrix          # parse + 2-conv + swiglu over all 8 architectures (incl. simba)
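For reference, the combinations covered by the matrix can also be enumerated as individual just invocations. This sketch only builds the command lists from the table and recipes above (co-2conv and co-swiglu with a hardware stem); it does not run anything, and the variable names are illustrative.

```python
from itertools import product

# Hardware stems from the matrix table above.
HARDWARE_STEMS = [
    "eyeriss_like_single_core",
    "eyeriss_like_dual_core",
    "eyeriss_like_quad_core",
    "tpu_like_quad_core",
    "simba_small",
    "simba",
    "fusemax",
    "meta_prototype_dual_core_simd_offchip",
]
RECIPES = ["co-2conv", "co-swiglu"]

# One ["just", recipe, hardware] command per workload/hardware pair.
commands = [["just", recipe, hw] for recipe, hw in product(RECIPES, HARDWARE_STEMS)]

for command in commands:
    print(" ".join(command))
```

Each list can be handed to subprocess.run(command, check=True) from the repo root if running the pairs one at a time is preferred over just matrix.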

🖥️ Command-Line Entry Points

All entry-point scripts live in scripts/ and are run from the repo root (so relative input paths resolve and stream imports as the installed package).

Script	Purpose
`scripts/main_stream_co.py`	Generic CO pipeline for any workload + hardware pair; manual or auto-generated mapping; YAML summary output. General-purpose (non-AIE).
`scripts/main_gemm.py`	CO allocation + optional AIE MLIR codegen for GEMM workloads (AMD Strix AIE).
`scripts/main_swiglu.py`	CO allocation + optional AIE MLIR codegen for SwiGLU workloads (AMD Strix AIE).
`scripts/main_swiglu_dse_single.py`	Single-mapping SwiGLU DSE evaluation (AIE).
`scripts/main_swiglu_dse.py`	Multi-mapping SwiGLU DSE sweep over tile sizes (AIE).
`scripts/main_aie_co.py`	CO allocation for a hard-coded single AIE tile workload (no args; run as `python scripts/main_aie_co.py`).
`scripts/main_gemm_codegen.py`	Direct GEMM → AIE MLIR codegen via xDSL transforms (no CO pipeline); `--M/--N/--K`.

scripts/main_stream_co.py is the general-purpose entry point. The others are AIE-specific: they hardwire AMD Strix or single-tile AIE hardware, and codegen requires NPU hardware. Note that scripts/main_aie_co.py takes no arguments (all paths are hard-coded). Plotting and trace post-processing utilities live in scripts/analysis/.

Full scripts/main_stream_co.py CLI syntax:

python scripts/main_stream_co.py \
  --hardware PATH_TO_HW_YAML \
  --workload PATH_TO_ONNX \
  [--mapping PATH_TO_MAPPING_YAML]  # omit for auto-generated mapping
  [--output OUTPUT_DIR]             # default: "outputs"
  [--experiment-id ID]
  [--skip-if-exists]
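When driving the script from Python, the optional flags can be assembled programmatically. The helper below is a sketch: `build_co_argv` is a hypothetical name, and it assumes --skip-if-exists is a plain switch with no value, as the synopsis above suggests.

```python
import sys


def build_co_argv(hardware, workload, mapping=None, output=None,
                  experiment_id=None, skip_if_exists=False):
    """Build an argument list for scripts/main_stream_co.py from the synopsis above."""
    argv = [
        sys.executable, "scripts/main_stream_co.py",
        "--hardware", str(hardware),
        "--workload", str(workload),
    ]
    if mapping is not None:  # omit for an auto-generated mapping
        argv += ["--mapping", str(mapping)]
    if output is not None:  # the script defaults to "outputs"
        argv += ["--output", str(output)]
    if experiment_id is not None:
        argv += ["--experiment-id", str(experiment_id)]
    if skip_if_exists:
        argv.append("--skip-if-exists")
    return argv


print(build_co_argv(
    "stream/inputs/examples/hardware/tpu_like_quad_core.yaml",
    "stream/inputs/testing/workload/2conv_1_8_32_32_16_32_3.onnx",
    experiment_id="my-first-run",
))
```

Passing the result to subprocess.run(argv, check=True) from the repo root avoids shell quoting issues with paths.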

🐍 Public API

The public API lives in stream/api.py.

The primary entry point is optimize_allocation_co_generic, which auto-generates the mapping from the workload and hardware (no hand-written mapping YAML needed). This snippet is confirmed to run and print total_latency: 14344.0. The 2-conv ONNX it references is the committed test fixture, which just gen-workloads can regenerate:

import tempfile
from stream.api import configure_logging, optimize_allocation_co_generic

configure_logging()

with tempfile.TemporaryDirectory() as tmp:
    ctx = optimize_allocation_co_generic(
        hardware="stream/inputs/examples/hardware/tpu_like_quad_core.yaml",
        workload="stream/inputs/testing/workload/2conv_1_8_32_32_16_32_3.onnx",
        experiment_id="my-first-run",
        output_path=tmp,
    )
    print("total_latency:", ctx.get("total_latency"))
    print("group_latencies:", ctx.get("group_latencies"))

Expected output: total_latency: 14344.0.

The other two public functions:

optimize_allocation_co_with_mapping(hardware, workload, mapping, experiment_id, output_path, ...) - runs CO with a hand-written mapping YAML. optimize_allocation_co is a backward-compatible alias for it (both names importable).
optimize_mapping(hardware, workload, experiment_id, output_path, max_nb_mappings=20, ...) - DSE pipeline: enumerates mapping variants and runs CO for each.

All three return a StageContext. Useful keys: ctx.get("total_latency"), ctx.get("group_latencies"), ctx.get("scheduler"), ctx.get("workload"), ctx.get("accelerator").
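Since every entry point returns a context with a total_latency key, results from several runs can be compared with a few lines. The sketch below is hypothetical glue code: `rank_by_latency` is not part of Stream, it relies only on `ctx.get("total_latency")`, and the demo uses plain dicts as stand-ins for StageContext objects.

```python
def rank_by_latency(runs):
    """Sort (label, total_latency) pairs from lowest to highest latency.

    runs maps a label to any object exposing .get("total_latency").
    Runs without a latency value are skipped.
    """
    scored = [(label, ctx.get("total_latency")) for label, ctx in runs.items()]
    scored = [(label, latency) for label, latency in scored if latency is not None]
    return sorted(scored, key=lambda item: item[1])


# Stand-in contexts; the second latency is a placeholder, not a measured result.
demo_runs = {
    "tpu_like_quad_core": {"total_latency": 14344.0},
    "placeholder_hardware": {"total_latency": 20000.0},
}
for label, latency in rank_by_latency(demo_runs):
    print(f"{label}: {latency}")
```

In practice the dict values would be the contexts returned by optimize_allocation_co_generic for different hardware files.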

🤖 MCP Server (for AI agents)

Stream ships an MCP server (stream/mcp/server.py, server name stream) that lets an AI agent submit and inspect TETRA CO jobs. Requires the [mcp] extra (pip install -e ".[mcp]").

⚠️ Install caveat: [mcp] does not currently resolve against the pinned PyPI xdsl 0.29.1 - fastmcp's dependency tree needs newer typing-extensions/pydantic than xdsl 0.29.1 permits. For now it installs only in the dev environment that uses the git build of xdsl; a clean fix awaits the xdsl upgrade.

Launch command (from the repo root):

python3 -c "from stream.mcp.server import mcp; mcp.run(transport='stdio')"

The server runs on STDIO (JSON-RPC) transport and blocks until the client disconnects.

The 6 tools:

Tool	Purpose
`run_optimization(hardware, workload, mapping, output_path, backend, ...)`	Submit a TETRA CO job; returns a `job_id` immediately; solve runs in the background.
`poll_optimization(job_id)`	Check job status (`pending` / `running` / `complete` / `failed` / `not_found`).
`get_workload_ir(workload=None, experiment_id=None)`	Return the workload DAG as `WorkloadIR` JSON.
`get_accelerator_ir(hardware=None, experiment_id=None)`	Return the hardware model as `AcceleratorIR` JSON.
`get_allocation_ir(job_id)`	Return the TETRA allocation result as `AllocationIR` JSON (3 persona views).
`get_solve_stats(job_id)`	Return MILP solve statistics (objective, time, gap, node count, backend).

Run / poll / inspect flow:

run_optimization(...) returns {"job_id": "...", "status": "pending"}.
Poll poll_optimization(job_id) until {"status": "complete"}.
Inspect with get_allocation_ir(job_id) for the AllocationIR (algorithmic / hardware / compiler views) and get_solve_stats(job_id) for solve statistics.
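The run / poll / inspect flow above amounts to a polling loop with a timeout. The sketch below shows the loop in isolation: `wait_for_job` is a hypothetical helper, the `poll` argument stands in for however the client invokes poll_optimization, and it assumes each reply is a dict with a "status" key using the status values listed in the tool table.

```python
import time

# Statuses after which polling should stop (from the tool table above).
TERMINAL_STATUSES = {"complete", "failed", "not_found"}


def wait_for_job(poll, job_id, interval_s=1.0, timeout_s=600.0, sleep=time.sleep):
    """Call poll(job_id) until a terminal status is returned or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        reply = poll(job_id)
        if reply.get("status") in TERMINAL_STATUSES:
            return reply
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {reply.get('status')!r} after {timeout_s}s")
        sleep(interval_s)


# Stand-in poller that mimics a job moving through its states.
_states = iter(["pending", "running", "complete"])
final = wait_for_job(lambda job_id: {"status": next(_states)}, "demo-job", interval_s=0.0)
print(final)
```

Only a reply with status complete should be followed by get_allocation_ir and get_solve_stats; failed and not_found end the loop as well so the caller can report them.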

🧠 Working in This Repo (AI agents)

Programmatic / IR API for structured JSON output:

from stream.ir import WorkloadIR, AcceleratorIR, AllocationIR

# After running optimize_allocation_co_generic(...)
workload_ir = WorkloadIR.from_internal(ctx.get("workload"))
accelerator_ir = AcceleratorIR.from_internal(ctx.get("accelerator"))
allocation_ir = AllocationIR.from_internal(ctx.get("scheduler"))

workload_data = workload_ir.model_dump()      # JSON-compatible dict
hardware_data = accelerator_ir.model_dump()
allocation_data = allocation_ir.model_dump()

AllocationIR offers .algorithmic_view(), .hardware_view(), and .compiler_view() persona views.

📚 Further Documentation

Hosted documentation site: kuleuven-micas.github.io/stream, the human-facing docs (installation, getting started, the workload/hardware/mapping input formats, and driving Stream from an AI agent via the MCP server and IR models), rebuilt from docs/ on every push to main.
Stream paper (IEEE): A. Symons, L. Mei, S. Colleman, P. Houshmand, S. Karl and M. Verhelst, "Stream: Design Space Exploration of Layer-Fused DNNs on Heterogeneous Dataflow Accelerators".
ZigZag: zigzag-project.github.io/zigzag, the per-core cost-estimation framework Stream builds on.

Release history

Version	Released
`1.13.3` (this version)	Jun 17, 2026
`1.13.2`	Jun 17, 2026
`1.13.1`	Jun 16, 2026
`1.13.0`	Jun 15, 2026
`0.0.8`	Sep 5, 2023
`0.0.7`	Feb 9, 2023
`0.0.6`	Feb 9, 2023
`0.0.5`	Feb 9, 2023
`0.0.4`	Feb 8, 2023
`0.0.3`	Feb 8, 2023
`0.0.2`	Feb 8, 2023
`0.0.1`	Feb 6, 2023
`0.0.0`	Feb 6, 2023

Download files

Source Distribution

stream_dse-1.13.3.tar.gz (270.1 kB), uploaded Jun 17, 2026, source

Built Distribution

stream_dse-1.13.3-py3-none-any.whl (309.5 kB), uploaded Jun 17, 2026, Python 3

File details

Details for the file stream_dse-1.13.3.tar.gz.

File metadata

Download URL: stream_dse-1.13.3.tar.gz
Upload date: Jun 17, 2026
Size: 270.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for stream_dse-1.13.3.tar.gz
Algorithm	Hash digest
SHA256	`4da2938db41c1cebf9f722f14c3f41bb4c8f08cdb1a2309d60623989c199191a`
MD5	`184cc54d2a9942a409104e9ec18f2955`
BLAKE2b-256	`16a3b3c564e82417e0373822ec4405b7ac32b589e4510af3c0038e798dcbf7be`
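A downloaded file can be checked against the SHA256 digest listed above with the standard library. This is a generic sketch, not something shipped with Stream; the helper names are hypothetical and the local filename in the usage note is assumed to match the download.

```python
import hashlib
import hmac

# SHA256 digest of stream_dse-1.13.3.tar.gz as listed in the table above.
EXPECTED_SDIST_SHA256 = "4da2938db41c1cebf9f722f14c3f41bb4c8f08cdb1a2309d60623989c199191a"


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path, expected_hex):
    """Compare the file's digest with the expected hex string."""
    return hmac.compare_digest(sha256_of(path), expected_hex.strip().lower())
```

For example, matches_digest("stream_dse-1.13.3.tar.gz", EXPECTED_SDIST_SHA256) should return True for an intact download.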

File details

Details for the file stream_dse-1.13.3-py3-none-any.whl.

File metadata

Download URL: stream_dse-1.13.3-py3-none-any.whl
Upload date: Jun 17, 2026
Size: 309.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for stream_dse-1.13.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e3af37125c2c0fb144afa4428fd8956a42defae2bbae0bd4466f828416762a9a`
MD5	`a1b66e2697655393892da9fe58899e3a`
BLAKE2b-256	`90e5f6e83163a5f182af9ed238b12c2a3e97af1fd421f8903fbbebd0b5b83d11`
