
Stream - Multi-core accelerator design space exploration with layer-fused scheduling


🌊 Stream


Stream is a design space exploration (DSE) and constraint-optimization framework for heterogeneous dataflow accelerators: accelerator systems built by combining cores that each have their own dataflow and performance model (AIE and TPU-like are two example core types among others). Scheduling is layer-fused, and the TETRA constraint optimization uses MILP (Mixed-Integer Linear Programming) to decide tensor placement and transfer paths across the cores of such a system. Stream builds on top of ZigZag for per-core cost estimation.

📖 Explore the Documentation

🚀 Getting Started Guide


✨ Key Features

Heterogeneous dataflow cores: compose an accelerator from cores that each carry their own dataflow and cost model (AIE, TPU-like, pooling, SIMD, and more).

Layer-fused scheduling across the whole system of cores.

TETRA constraint optimization: a MILP (TransferAndTensorAllocator) decides tensor placement and transfer-path routing.

Pluggable solver backends: OR-Tools GSCIP (default, license-free), OR-Tools HiGHS, and Gurobi behind one unified SolverModel API.

ONNX workloads with auto-generated or hand-written mappings.

AMD AIE code generation: emit aie / aiex MLIR for the Ryzen AI NPU, ready for the mlir-aie / IRON toolchain.

Built for AI agents: an MCP server and typed IR models expose the pipeline programmatically.

The pipeline runs as a chain of stages: parse → tile → cost → MILP allocation → memory estimation.
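To make the stage chain concrete, here is a minimal sketch of how such a pipeline can be wired: each stage takes a context and returns an updated one. The names Context, make_stage, and run_pipeline are hypothetical and exist only for this illustration; Stream's actual stage classes and StageContext may be structured differently.

```python
from typing import Callable

# Hypothetical stand-in for the pipeline's shared context; the real
# StageContext in Stream may look quite different.
Context = dict[str, object]
Stage = Callable[[Context], Context]


def make_stage(name: str) -> Stage:
    """Build a placeholder stage that only records that it ran."""

    def stage(ctx: Context) -> Context:
        done = list(ctx.get("done", []))
        done.append(name)
        # Return a new context so earlier snapshots are never mutated.
        return {**ctx, "done": done}

    return stage


def run_pipeline(stages: list[Stage], ctx: Context) -> Context:
    """Run the stages in order, feeding each stage's output to the next."""
    for stage in stages:
        ctx = stage(ctx)
    return ctx


# Stage order as listed above: parse, tile, cost, MILP allocation, memory estimation.
STAGE_NAMES = ["parse", "tile", "cost", "milp_allocation", "memory_estimation"]
result = run_pipeline([make_stage(n) for n in STAGE_NAMES], {"workload": "2conv"})
print(result["done"])
```

Keeping every stage a pure function of the context is one possible design; it makes it easy to drop or reorder stages while experimenting.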


🚀 Installation

Python >=3.12 is required.

Full install with MCP server support (from the repo root):

pip install -e ".[mcp]"

Base install (no MCP server):

pip install -e .

The authoritative dependency source is pyproject.toml (package stream-dse). The base install pulls in zigzag-dse, ortools>=9.15 (the default, license-free MILP backend), pydantic, pydot, and xdsl. Optional extras: [mcp] adds fastmcp (required for the MCP server); [gurobi] adds gurobipy (commercial solver, opt-in).

AIE code generation

AIE-target MLIR codegen and tracing additionally need the AMD AIE toolchain (mlir_aie, llvm-aie, xdsl-aie, snax-mlir, aie-python-extras). These are git/URL installs that PyPI does not allow in package metadata, so a console script installs them after the base install rather than via an extra:

pip install -e .       # or, once published: pip install stream-dse
stream-setup-aie       # installs the AIE toolchain into the current environment

stream-setup-aie --dry-run prints exactly what it will install without making changes.

⚠️ Platform caveat: the AIE toolchain is Linux x86_64 only (manylinux wheels), CPython 3.12 or 3.13.

💡 Solver license note: OR-Tools (ortools_gscip, the default backend) is open-source and needs no license. Gurobi requires the [gurobi] extra (pip install -e ".[gurobi]") plus a separate commercial license; backend="gurobi" errors at solve time without a valid license.
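A caller that wants Gurobi when it is available, and the license-free default otherwise, could guard the choice as sketched below. pick_backend is a hypothetical helper and not part of Stream; it only checks whether the gurobipy package is importable, so a missing or expired license would still surface at solve time.

```python
import importlib.util


def pick_backend(preferred: str = "ortools_gscip") -> str:
    """Return a backend name, falling back to the default if gurobipy is absent.

    Hypothetical helper: it detects the Python package only, not a license.
    """
    if preferred == "gurobi" and importlib.util.find_spec("gurobipy") is None:
        return "ortools_gscip"
    return preferred


print(pick_backend("gurobi"))
```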

Optional pre-commit setup:

pre-commit install

⚡ Quick Start

Run the CO pipeline on a small two-Conv workload (a committed test fixture) with an auto-generated mapping; the run takes approximately 11 seconds:

python scripts/main_stream_co.py \
  --hardware stream/inputs/examples/hardware/tpu_like_quad_core.yaml \
  --workload stream/inputs/testing/workload/2conv_1_8_32_32_16_32_3.onnx

Alternatively, run just co-2conv (this repo uses just as a task runner; the recipe defaults to tpu_like_quad_core, see the matrix below). Because --mapping is omitted, the pipeline auto-generates the mapping; the hardware is a TPU-like quad-core system.

Expected output:

Total latency: 14344.0
  Group 0: 14344 (100.0%, wall=9.4s)

A YAML summary is written to outputs/.../summary.yaml with total_latency: 14344.0, plus workload/tiling/schedule PNG visualizations.
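For scripted checks it can be handy to pull the numbers out of the printed report. The sketch below parses the two-line format shown above with regular expressions; it assumes the console output keeps exactly this layout, which is not guaranteed, and parse_report is a hypothetical helper rather than part of Stream. For anything robust, reading summary.yaml is likely the better route.

```python
import re

# Sample copied from the expected output above.
SAMPLE = """Total latency: 14344.0
  Group 0: 14344 (100.0%, wall=9.4s)
"""

TOTAL_RE = re.compile(r"^Total latency:\s*([0-9.]+)", re.MULTILINE)
GROUP_RE = re.compile(
    r"^\s*Group (\d+):\s*([0-9.]+) \(([0-9.]+)%, wall=([0-9.]+)s\)",
    re.MULTILINE,
)


def parse_report(text: str) -> dict:
    """Extract the total latency and per-group figures from the printed report."""
    total = TOTAL_RE.search(text)
    if total is None:
        raise ValueError("no 'Total latency' line found")
    groups = {
        int(idx): {
            "latency": float(latency),
            "share_pct": float(share),
            "wall_s": float(wall),
        }
        for idx, latency, share, wall in GROUP_RE.findall(text)
    }
    return {"total_latency": float(total.group(1)), "groups": groups}


report = parse_report(SAMPLE)
print(report["total_latency"], report["groups"][0]["wall_s"])
```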


🧩 Hardware and Core Types

An accelerator in Stream is described as a system of heterogeneous dataflow cores. Core roles include compute, memory, shim, and offchip; example dataflow core types include AIE, TPU-like, and pooling.

Hardware and mapping files are organized as follows:

  • stream/inputs/examples/hardware/ - system-level hardware YAMLs (e.g. tpu_like_quad_core.yaml, eyeriss_like_*.yaml, simba*.yaml, fusemax.yaml).
  • stream/inputs/examples/hardware/cores/ - per-core-type YAMLs (e.g. tpu_like.yaml, pooling.yaml, simd.yaml, offchip.yaml, eyeriss_like.yaml).
  • stream/inputs/aie/hardware/ and stream/inputs/aie/hardware/cores/ - AMD AIE example core types (e.g. aie_tile.yaml, mem_tile_256KB.yaml, shim_dma.yaml).
  • stream/inputs/examples/mapping/, stream/inputs/aie/mapping/, and stream/inputs/testing/mapping/ - mapping descriptions.

A mapping can be auto-generated (as in Quick Start above) or hand-written and passed via --mapping.


📊 Workload × Hardware Matrix

The generic CO pipeline runs any ONNX workload on any of the example hardware systems. The repo ships two small workloads and exercises them across all eight non-AIE example architectures, both from the scripts/main_stream_co.py entry point and from the pytest suite (tests/test_hardware_combinations.py).

Workloads - committed test fixtures under stream/inputs/testing/workload/. Weight values are cleared because only tensor shapes matter for cost estimation, so the ONNX files stay tiny; just gen-workloads regenerates them via the builders:

  • 2-conv - two chained Conv layers (make_2_conv.py).
  • swiglu - a 5-node SwiGLU block: two Gemms, SiLU, an elementwise Mul, and a down-projection Gemm (make_swiglu.py).

| Hardware (stream/inputs/examples/hardware/) | Description | 2-conv | swiglu |
|---|---|---|---|
| eyeriss_like_single_core | one Eyeriss-like compute core (+ pooling, SIMD, DRAM) | ✓ | ✓ |
| eyeriss_like_dual_core | two Eyeriss-like compute cores | ✓ | ✓ |
| eyeriss_like_quad_core | four Eyeriss-like compute cores | ✓ | ✓ |
| tpu_like_quad_core | four TPU-like compute cores | ✓ | ✓ |
| simba_small | small Simba chiplet mesh | ✓ | ✓ |
| simba | 36-core Simba chiplet mesh | ✓ | ✓ |
| fusemax | FuseMax array + vector + DRAM | ✓ | ✓ |
| meta_prototype_dual_core_simd_offchip | two Meta-prototype compute cores (+ pooling, SIMD, DRAM) | ✓ | ✓ |

✓ = completes through the generic CO pipeline. All combinations run in the default fast suite; on these small single-fusion-group workloads even the 36-core simba mesh finishes in seconds.

Run one combination - the justfile wraps scripts/main_stream_co.py; hw is any hardware stem from the table (default tpu_like_quad_core):

just co-2conv fusemax           # 2-conv on an architecture
just co-swiglu simba_small      # swiglu on an architecture

Equivalently, the raw entry-point call:

python scripts/main_stream_co.py \
  --hardware stream/inputs/examples/hardware/fusemax.yaml \
  --workload stream/inputs/testing/workload/2conv_1_8_32_32_16_32_3.onnx

Run the whole matrix - the justfile wraps pytest tests/test_hardware_combinations.py, which runs 2-conv + swiglu over all eight architectures plus a parse-only check confirming every hardware definition loads:

just matrix          # parse + 2-conv + swiglu over all 8 architectures (incl. simba)
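To script the raw entry point across the table without just, the command lines can be generated from the hardware stems. The sketch below only builds and prints the commands for the 2-conv workload and does not execute them; build_command is a hypothetical helper, and the swiglu workload is left out because its ONNX file name is not given here.

```python
import shlex

HARDWARE_DIR = "stream/inputs/examples/hardware"
WORKLOAD = "stream/inputs/testing/workload/2conv_1_8_32_32_16_32_3.onnx"

# Hardware stems from the matrix table above.
HARDWARE_STEMS = [
    "eyeriss_like_single_core",
    "eyeriss_like_dual_core",
    "eyeriss_like_quad_core",
    "tpu_like_quad_core",
    "simba_small",
    "simba",
    "fusemax",
    "meta_prototype_dual_core_simd_offchip",
]


def build_command(stem: str, workload: str = WORKLOAD) -> list[str]:
    """Return the argv list for one hardware/workload combination."""
    return [
        "python",
        "scripts/main_stream_co.py",
        "--hardware",
        f"{HARDWARE_DIR}/{stem}.yaml",
        "--workload",
        workload,
    ]


commands = [build_command(stem) for stem in HARDWARE_STEMS]
for cmd in commands:
    print(shlex.join(cmd))
```

Each argv list can be handed to subprocess.run(cmd, check=True) from the repo root if you want to execute the runs sequentially.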

🖥️ Command-Line Entry Points

All entry-point scripts live in scripts/ and are run from the repo root (so relative input paths resolve and stream imports as the installed package).

| Script | Purpose |
|---|---|
| scripts/main_stream_co.py | Generic CO pipeline for any workload + hardware pair; manual or auto-generated mapping; YAML summary output. General-purpose (non-AIE). |
| scripts/main_gemm.py | CO allocation + optional AIE MLIR codegen for GEMM workloads (AMD Strix AIE). |
| scripts/main_swiglu.py | CO allocation + optional AIE MLIR codegen for SwiGLU workloads (AMD Strix AIE). |
| scripts/main_swiglu_dse_single.py | Single-mapping SwiGLU DSE evaluation (AIE). |
| scripts/main_swiglu_dse.py | Multi-mapping SwiGLU DSE sweep over tile sizes (AIE). |
| scripts/main_aie_co.py | CO allocation for a hard-coded single AIE tile workload (no args; run as python scripts/main_aie_co.py). |
| scripts/main_gemm_codegen.py | Direct GEMM → AIE MLIR codegen via xDSL transforms (no CO pipeline); --M/--N/--K. |

scripts/main_stream_co.py is the general-purpose entry point. The others are AIE-specific: they hardwire AMD Strix or single-tile AIE hardware, and codegen requires NPU hardware. Note that scripts/main_aie_co.py takes no arguments (all paths are hard-coded). Plotting and trace post-processing utilities live in scripts/analysis/.

Full scripts/main_stream_co.py CLI syntax:

python scripts/main_stream_co.py \
  --hardware PATH_TO_HW_YAML \
  --workload PATH_TO_ONNX \
  [--mapping PATH_TO_MAPPING_YAML]  # omit for auto-generated mapping
  [--output OUTPUT_DIR]             # default: "outputs"
  [--experiment-id ID]
  [--skip-if-exists]
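The flag set above can be mirrored in a small argparse parser, which is useful when wrapping the script from other tooling. This is a sketch reconstructed from the documented syntax only, not the script's actual parser: treating --hardware and --workload as required is inferred from the missing brackets, and build_parser is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Mirror the documented flags; the real script's parser may differ."""
    parser = argparse.ArgumentParser(prog="main_stream_co.py")
    parser.add_argument("--hardware", required=True, help="path to hardware YAML")
    parser.add_argument("--workload", required=True, help="path to ONNX workload")
    # Omitting --mapping means the mapping is auto-generated.
    parser.add_argument("--mapping", default=None, help="path to mapping YAML")
    parser.add_argument("--output", default="outputs", help="output directory")
    parser.add_argument("--experiment-id", default=None)
    parser.add_argument("--skip-if-exists", action="store_true")
    return parser


args = build_parser().parse_args(
    [
        "--hardware",
        "stream/inputs/examples/hardware/fusemax.yaml",
        "--workload",
        "stream/inputs/testing/workload/2conv_1_8_32_32_16_32_3.onnx",
    ]
)
print(args.mapping, args.output, args.skip_if_exists)
```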

🐍 Public API

The public API lives in stream/api.py.

The primary entry point is optimize_allocation_co_generic, which auto-generates the mapping from the workload and hardware (no hand-written mapping YAML needed). This snippet is confirmed to run and print total_latency: 14344.0 (the 2-conv ONNX it references is a committed test fixture that just gen-workloads can regenerate):

import tempfile
from stream.api import configure_logging, optimize_allocation_co_generic

configure_logging()

with tempfile.TemporaryDirectory() as tmp:
    ctx = optimize_allocation_co_generic(
        hardware="stream/inputs/examples/hardware/tpu_like_quad_core.yaml",
        workload="stream/inputs/testing/workload/2conv_1_8_32_32_16_32_3.onnx",
        experiment_id="my-first-run",
        output_path=tmp,
    )
    print("total_latency:", ctx.get("total_latency"))
    print("group_latencies:", ctx.get("group_latencies"))

Expected output: total_latency: 14344.0.

The other two public functions:

  • optimize_allocation_co_with_mapping(hardware, workload, mapping, experiment_id, output_path, ...) - runs CO with a hand-written mapping YAML. optimize_allocation_co is a backward-compatible alias for it (both names importable).
  • optimize_mapping(hardware, workload, experiment_id, output_path, max_nb_mappings=20, ...) - DSE pipeline: enumerates mapping variants and runs CO for each.

All three return a StageContext. Useful keys: ctx.get("total_latency"), ctx.get("group_latencies"), ctx.get("scheduler"), ctx.get("workload"), ctx.get("accelerator").


🤖 MCP Server (for AI agents)

Stream ships an MCP server (stream/mcp/server.py, server name stream) that lets an AI agent submit and inspect TETRA CO jobs. It requires the [mcp] extra (pip install -e ".[mcp]").

Launch command (from the repo root):

python3 -c "from stream.mcp.server import mcp; mcp.run(transport='stdio')"

The server runs on STDIO (JSON-RPC) transport and blocks until the client disconnects.

The 6 tools:

| Tool | Purpose |
|---|---|
| run_optimization(hardware, workload, mapping, output_path, backend, ...) | Submit a TETRA CO job; returns a job_id immediately; solve runs in the background. |
| poll_optimization(job_id) | Check job status (pending / running / complete / failed / not_found). |
| get_workload_ir(workload=None, experiment_id=None) | Return the workload DAG as WorkloadIR JSON. |
| get_accelerator_ir(hardware=None, experiment_id=None) | Return the hardware model as AcceleratorIR JSON. |
| get_allocation_ir(job_id) | Return the TETRA allocation result as AllocationIR JSON (3 persona views). |
| get_solve_stats(job_id) | Return MILP solve statistics (objective, time, gap, node count, backend). |

Run / poll / inspect flow:

  1. run_optimization(...) returns {"job_id": "...", "status": "pending"}.
  2. Poll poll_optimization(job_id) until {"status": "complete"}.
  3. Inspect with get_allocation_ir(job_id) for the AllocationIR (algorithmic / hardware / compiler views) and get_solve_stats(job_id) for solve statistics.
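The poll step in the flow above amounts to a loop that stops on any terminal status. The sketch below shows one way a client might write it. wait_for_job and fake_poll are hypothetical: the poll callable stands in for however your MCP client invokes poll_optimization, and the reply shape beyond the status field is an assumption.

```python
import time
from typing import Callable

# Statuses after which polling should stop (taken from the tool table above).
TERMINAL_STATUSES = {"complete", "failed", "not_found"}


def wait_for_job(
    poll: Callable[[str], dict],
    job_id: str,
    interval_s: float = 1.0,
    timeout_s: float = 60.0,
) -> dict:
    """Call poll(job_id) until a terminal status is reported or time runs out."""
    deadline = time.monotonic() + timeout_s
    while True:
        reply = poll(job_id)
        if reply.get("status") in TERMINAL_STATUSES:
            return reply
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {reply.get('status')!r}")
        time.sleep(interval_s)


# Fake poller for demonstration only: pending, then running, then complete.
_statuses = iter(["pending", "running", "complete"])


def fake_poll(job_id: str) -> dict:
    return {"job_id": job_id, "status": next(_statuses)}


final = wait_for_job(fake_poll, "job-123", interval_s=0.0)
print(final["status"])
```

Returning the reply for failed and not_found as well, instead of raising, leaves the decision about error handling to the caller.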

🧠 Working in This Repo (AI agents)

Programmatic / IR API for structured JSON output:

from stream.ir import WorkloadIR, AcceleratorIR, AllocationIR

# After running optimize_allocation_co_generic(...)
workload_ir = WorkloadIR.from_internal(ctx.get("workload"))
accelerator_ir = AcceleratorIR.from_internal(ctx.get("accelerator"))
allocation_ir = AllocationIR.from_internal(ctx.get("scheduler"))

workload_data = workload_ir.model_dump()      # JSON-compatible dict
hardware_data = accelerator_ir.model_dump()
allocation_data = allocation_ir.model_dump()

AllocationIR offers .algorithmic_view(), .hardware_view(), and .compiler_view() persona views.


📚 Further Documentation
