SODA: Stencil with Optimized Dataflow Architecture

Publication

SODA DSL Example

# comments start with a hash sign (#)

kernel: blur      # the kernel name, will be used as the kernel name in HLS
burst width: 512  # DRAM burst I/O width in bits, for Xilinx platform by default it's 512
unroll factor: 16 # how many pixels are generated per cycle

# specify the dram bank, type, name, and dimension of the input tile
# the last dimension is not needed and a placeholder '*' must be given
# dram bank is optional
# multiple inputs can be specified but 1 and only 1 must specify the dimensions
input dram 0 uint16: input(2000, *)

# specify an intermediate stage of computation, may appear 0 or more times
local uint16: tmp(0, 0) = (input(-1, 0) + input(0, 0) + input(1, 0)) / 3

# specify the output
# dram bank is optional
output dram 1 uint16: output(0, 0) = (tmp(0, -1) + tmp(0, 0) + tmp(0, 1)) / 3

# how many times the whole computation is repeated (only works if input matches output)
iterate: 2

# how to deal with border, currently only 'ignore' is available
border: ignore

# how to cluster modules, currently only 'none' is available
cluster: none

# constant values that may be referenced as coefficients or lookup tables (implementation currently broken)
# array partitioning information can be passed to HLS code
param uint16, partition cyclic factor=2 dim=1, partition cyclic factor=2 dim=2: p1[20][30]
# keyword 'dup' allows simultaneous access to the same parameter
param uint16, dup 3, partition complete: p2[20]
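
To make the semantics of the blur example concrete, here is a minimal software reference in Python. It is a sketch under stated assumptions, not SODA's generated code: the first stencil index is taken to be the fastest-varying (horizontal) dimension, `/ 3` is treated as integer division, and `border: ignore` is modeled by leaving border pixels unchanged. The function name `blur` and its `iterations` parameter are hypothetical.

```python
def blur(image, iterations=1):
    """Two-stage 3-point blur on a list of rows; a reference sketch only.

    Assumptions (not confirmed by the SODA source): border pixels are
    copied through unchanged, and division truncates like integer '/'.
    """
    height = len(image)
    width = len(image[0])
    for _ in range(iterations):  # mirrors 'iterate: N' in the DSL
        # Stage 1 ('local tmp'): average along the first (horizontal) dimension.
        tmp = [row[:] for row in image]
        for y in range(height):
            for x in range(1, width - 1):
                tmp[y][x] = (image[y][x - 1] + image[y][x] + image[y][x + 1]) // 3
        # Stage 2 ('output'): average along the second (vertical) dimension.
        out = [row[:] for row in image]
        for y in range(1, height - 1):
            for x in range(1, width - 1):
                out[y][x] = (tmp[y - 1][x] + tmp[y][x] + tmp[y + 1][x]) // 3
        image = out
    return image


print(blur([[0, 0, 0], [0, 9, 0], [0, 0, 0]]))  # [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
```

A reference like this can be handy for checking C-Sim results on small inputs, provided the border and rounding assumptions are adjusted to match the actual generated kernel.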

TODOs

  • support multiple inputs & outputs
  • use RTL flow to accelerate HLS

Design Considerations

  • All keywords are mandatory except local (intermediate stages) and param (extra parameters), which are optional
  • For a non-iterative stencil, the unroll factor should be determined by the DRAM bandwidth, i.e. chosen to saturate the external bandwidth, since on-chip resources are usually not the bottleneck
  • For an iterative stencil, whether to use more PEs in a single iteration or to implement more iterations has yet to be explored
  • Currently, math.h functions can be parsed, but type induction is not fully implemented
  • Note that a literal such as 2.0 is a double. To generate a float, write 2.0f; this may help reduce DSP usage
  • SODA is tiling-based, and the tile size is specified with the input keyword. The last dimension is omitted because reuse buffer generation does not need it
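
The bandwidth consideration above can be illustrated with a back-of-the-envelope calculation. The sketch below assumes that a single DRAM bank delivers one burst word per kernel cycle, which the source does not state; the helper name `max_pixels_per_cycle` is hypothetical. Under that assumption, the number of pixels per burst word is an upper bound on the unroll factor one bank can feed.

```python
def max_pixels_per_cycle(burst_width_bits, pixel_bits):
    """Pixels carried by one burst word.

    Assumes one burst word per cycle from a single bank (an assumption,
    not taken from the SODA documentation).
    """
    if burst_width_bits % pixel_bits != 0:
        raise ValueError("burst width must be a multiple of the pixel width")
    return burst_width_bits // pixel_bits


print(max_pixels_per_cycle(512, 16))  # 32 for uint16 pixels
print(max_pixels_per_cycle(512, 32))  # 16 for 32-bit pixels such as float
```

The blur example uses an unroll factor of 16 with uint16 pixels, which stays below this bound; the achievable bandwidth on a real board may differ from the idealized figure.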

Getting Started

Prerequisites

  • Python 3.3+
  • Python dependencies installed via python3 -m pip install -r requirements.txt
  • SDAccel 2018.3 (earlier versions might work but won't be supported)

Clone the Repo

git clone https://github.com/UCLA-VAST/soda.git
cd soda

Generate HLS kernel code

make kernel

Run C-Sim

make csim

Generate HDL code

make hls SYNTHESIS_FLOW=rtl

Run Co-Sim

make cosim SYNTHESIS_FLOW=rtl

Generate FPGA Bitstream

make bitstream SYNTHESIS_FLOW=rtl

Run Bitstream

make hw SYNTHESIS_FLOW=rtl # requires actual FPGA hardware and driver

Code Snippets

Configuration

  • 5-point 2D Jacobi: t0(0, 0) = (t1(0, 1) + t1(1, 0) + t1(0, 0) + t1(0, -1) + t1(-1, 0)) * 0.2f
  • tile size is (2000, *)
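
For reference, one application of this 5-point Jacobi kernel can be written in plain Python as follows. This is a sketch for checking results on small grids, not SODA output; it assumes the first index is the horizontal dimension and that border cells are left unchanged, and the name `jacobi_5pt` is hypothetical.

```python
def jacobi_5pt(t1):
    """One 5-point Jacobi sweep over a list of rows; borders are copied as-is (assumed)."""
    height = len(t1)
    width = len(t1[0])
    t0 = [row[:] for row in t1]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            # t0(0, 0) = (t1(0, 1) + t1(1, 0) + t1(0, 0) + t1(0, -1) + t1(-1, 0)) * 0.2f
            t0[y][x] = (t1[y + 1][x] + t1[y][x + 1] + t1[y][x]
                        + t1[y - 1][x] + t1[y][x - 1]) * 0.2
    return t0
```

Python floats are double precision, so results can differ slightly from a kernel that uses 0.2f.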

Each function in the code snippets below is synthesized into an RTL module. Their arguments are all hls::stream FIFOs. Without unrolling, a simple line-buffer pipeline is generated, producing 1 pixel per cycle. With unrolling, a SODA microarchitecture pipeline is generated, producing 2 pixels per cycle.

Without Unrolling

#pragma HLS dataflow
Module1Func(
  /*output*/ &from_t1_offset_0_to_t1_offset_1999,
  /*output*/ &from_t1_offset_0_to_t0_pe_0,
  /* input*/ &from_super_source_to_t1_offset_0);
Module2Func(
  /*output*/ &from_t1_offset_1999_to_t1_offset_2000,
  /*output*/ &from_t1_offset_1999_to_t0_pe_0,
  /* input*/ &from_t1_offset_0_to_t1_offset_1999);
Module3Func(
  /*output*/ &from_t1_offset_2000_to_t1_offset_2001,
  /*output*/ &from_t1_offset_2000_to_t0_pe_0,
  /* input*/ &from_t1_offset_1999_to_t1_offset_2000);
Module3Func(
  /*output*/ &from_t1_offset_2001_to_t1_offset_4000,
  /*output*/ &from_t1_offset_2001_to_t0_pe_0,
  /* input*/ &from_t1_offset_2000_to_t1_offset_2001);
Module4Func(
  /*output*/ &from_t1_offset_4000_to_t0_pe_0,
  /* input*/ &from_t1_offset_2001_to_t1_offset_4000);
Module5Func(
  /*output*/ &from_t0_pe_0_to_super_sink,
  /* input*/ &from_t1_offset_0_to_t0_pe_0,
  /* input*/ &from_t1_offset_1999_to_t0_pe_0,
  /* input*/ &from_t1_offset_2000_to_t0_pe_0,
  /* input*/ &from_t1_offset_4000_to_t0_pe_0,
  /* input*/ &from_t1_offset_2001_to_t0_pe_0);

In the above code snippet, Module1Func to Module4Func are forwarding modules; they constitute the line buffer. The line buffer size is approximately two lines of pixels, i.e. 4000 pixels. Module5Func is a computing module; it implements the computation kernel. The whole design is fully pipelined; however, with only 1 computing module, it can only produce 1 pixel per cycle.
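
The offsets in the FIFO names above (0, 1999, 2000, 2001, 4000) can be reproduced by linearizing each stencil access over the tile width and measuring how far it lags behind the newest input. The Python sketch below is inferred from the names in the snippet, not taken from the SODA source; `reuse_offsets` is a hypothetical helper.

```python
def reuse_offsets(stencil, tile_width):
    """Map each (x, y) stencil access to its delay, in pixels, behind the newest input."""
    linear = {point: point[0] + point[1] * tile_width for point in stencil}
    newest = max(linear.values())
    return {point: newest - value for point, value in linear.items()}


# Accesses of the 5-point Jacobi kernel from the configuration above.
JACOBI_5PT = [(0, 1), (1, 0), (0, 0), (0, -1), (-1, 0)]

offsets = reuse_offsets(JACOBI_5PT, tile_width=2000)
print(sorted(offsets.values()))  # [0, 1999, 2000, 2001, 4000]
print(max(offsets.values()))     # 4000, i.e. about two lines of pixels
```

Only the tile width enters the calculation, which is consistent with the last tile dimension being left as '*' in the input declaration.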

Unroll 2 Times

#pragma HLS dataflow
Module1Func(
  /*output*/ &from_t1_offset_1_to_t1_offset_1999,
  /*output*/ &from_t1_offset_1_to_t0_pe_0,
  /* input*/ &from_super_source_to_t1_offset_1);
Module1Func(
  /*output*/ &from_t1_offset_0_to_t1_offset_2000,
  /*output*/ &from_t1_offset_0_to_t0_pe_1,
  /* input*/ &from_super_source_to_t1_offset_0);
Module2Func(
  /*output*/ &from_t1_offset_1999_to_t1_offset_2001,
  /*output*/ &from_t1_offset_1999_to_t0_pe_1,
  /* input*/ &from_t1_offset_1_to_t1_offset_1999);
Module3Func(
  /*output*/ &from_t1_offset_2000_to_t1_offset_2002,
  /*output*/ &from_t1_offset_2000_to_t0_pe_1,
  /*output*/ &from_t1_offset_2000_to_t0_pe_0,
  /* input*/ &from_t1_offset_0_to_t1_offset_2000);
Module4Func(
  /*output*/ &from_t1_offset_2001_to_t1_offset_4001,
  /*output*/ &from_t1_offset_2001_to_t0_pe_1,
  /*output*/ &from_t1_offset_2001_to_t0_pe_0,
  /* input*/ &from_t1_offset_1999_to_t1_offset_2001);
Module5Func(
  /*output*/ &from_t1_offset_2002_to_t1_offset_4000,
  /*output*/ &from_t1_offset_2002_to_t0_pe_0,
  /* input*/ &from_t1_offset_2000_to_t1_offset_2002);
Module6Func(
  /*output*/ &from_t1_offset_4001_to_t0_pe_0,
  /* input*/ &from_t1_offset_2001_to_t1_offset_4001);
Module7Func(
  /*output*/ &from_t0_pe_0_to_super_sink,
  /* input*/ &from_t1_offset_1_to_t0_pe_0,
  /* input*/ &from_t1_offset_2000_to_t0_pe_0,
  /* input*/ &from_t1_offset_2001_to_t0_pe_0,
  /* input*/ &from_t1_offset_4001_to_t0_pe_0,
  /* input*/ &from_t1_offset_2002_to_t0_pe_0);
Module8Func(
  /*output*/ &from_t1_offset_4000_to_t0_pe_1,
  /* input*/ &from_t1_offset_2002_to_t1_offset_4000);
Module7Func(
  /*output*/ &from_t0_pe_1_to_super_sink,
  /* input*/ &from_t1_offset_0_to_t0_pe_1,
  /* input*/ &from_t1_offset_1999_to_t0_pe_1,
  /* input*/ &from_t1_offset_2000_to_t0_pe_1,
  /* input*/ &from_t1_offset_4000_to_t0_pe_1,
  /* input*/ &from_t1_offset_2001_to_t0_pe_1);

In the above code snippet, Module1Func to Module6Func and Module8Func are forwarding modules; they constitute the line buffers of the SODA microarchitecture. Although unrolled, the line buffer size is still approximately two lines of pixels, i.e. 4000 pixels. Module7Func is a computing module; it is instantiated twice. The whole design is fully pipelined and can produce 2 pixels per cycle. In general, the unroll factor can be set to any number that satisfies the throughput requirement.
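
The FIFO names in the unrolled snippet follow a pattern that can be sketched in Python: each PE reads the non-unrolled offsets shifted by its lane, and the resulting offsets fall into one forwarding chain per residue modulo the unroll factor. This is inferred from the names in the snippet rather than from the SODA source; `DELAYS`, `pe_inputs`, `unrolled_chains`, and `buffered_pixels` are hypothetical names.

```python
# Offsets of the non-unrolled 5-point Jacobi pipeline with tile width 2000.
DELAYS = [0, 1999, 2000, 2001, 4000]


def pe_inputs(delays, unroll, pe):
    """Offsets read by processing element `pe` (pe 0 lags furthest behind the input)."""
    shift = unroll - 1 - pe
    return sorted(d + shift for d in delays)


def unrolled_chains(delays, unroll):
    """Group every needed offset into one forwarding chain per residue mod `unroll`."""
    needed = {d + lane for d in delays for lane in range(unroll)}
    chains = {}
    for offset in sorted(needed):
        chains.setdefault(offset % unroll, []).append(offset)
    return chains


def buffered_pixels(chains, unroll):
    """Estimated total FIFO depth: each chain holds every `unroll`-th pixel of its span."""
    return sum((chain[-1] - chain[0]) // unroll for chain in chains.values())


chains = unrolled_chains(DELAYS, 2)
print(pe_inputs(DELAYS, 2, 0))     # [1, 2000, 2001, 2002, 4001]
print(pe_inputs(DELAYS, 2, 1))     # [0, 1999, 2000, 2001, 4000]
print(chains)                      # {0: [0, 2000, 2002, 4000], 1: [1, 1999, 2001, 4001]}
print(buffered_pixels(chains, 2))  # 4000
```

The two chains match the forwarding paths in the snippet (0 to 2000 to 2002 to 4000, and 1 to 1999 to 2001 to 4001), and the estimated total depth stays at 4000 pixels, in line with the buffer size stated above.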

Projects Using SODA
