SODA: Stencil with Optimized Dataflow Architecture

Publication

SODA DSL Example

# comments start with hashtag(#)

kernel: blur      # the kernel name, will be used as the kernel name in HLS
burst width: 512  # I/O width in bits, for Xilinx platform 512 works the best
unroll factor: 16 # how many pixels are generated per cycle

# specify the dram bank, type, name, and dimension of the input tile
# the last dimension is not needed and a placeholder '*' must be given
# dram bank is optional
# multiple inputs can be specified but 1 and only 1 must specify the dimensions
input dram 0 uint16: input(2000, *)

# specify an intermediate stage of computation, may appear 0 or more times
local uint16: tmp(0, 0) = (input(-1, 0) + input(0, 0) + input(1, 0)) / 3

# specify the output
# dram bank is optional
output dram 1 uint16: output(0, 0) = (tmp(0, -1) + tmp(0, 0) + tmp(0, 1)) / 3

# how many times the whole computation is repeated (only works if input matches output)
iterate: 2
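
As a rough software reference for what the example above computes, the two stages can be modeled in plain Python. This is a minimal sketch, not SODA output. It assumes that the first index is the position along a row (x) and the second is the row (y), that division is integer division because the type is uint16, and that border pixels are simply copied through. The boundary handling in particular is an assumption, and the function names are hypothetical.

```python
def blur_stage_x(img):
    """tmp(0, 0) = (input(-1, 0) + input(0, 0) + input(1, 0)) / 3"""
    height, width = len(img), len(img[0])
    out = [row[:] for row in img]  # assumption: borders are copied unchanged
    for y in range(height):
        for x in range(1, width - 1):
            out[y][x] = (img[y][x - 1] + img[y][x] + img[y][x + 1]) // 3
    return out


def blur_stage_y(img):
    """output(0, 0) = (tmp(0, -1) + tmp(0, 0) + tmp(0, 1)) / 3"""
    height, width = len(img), len(img[0])
    out = [row[:] for row in img]  # assumption: borders are copied unchanged
    for y in range(1, height - 1):
        for x in range(width):
            out[y][x] = (img[y - 1][x] + img[y][x] + img[y + 1][x]) // 3
    return out


def blur(img, iterations=2):
    """Repeat the whole two-stage computation, mirroring `iterate: 2`."""
    for _ in range(iterations):
        img = blur_stage_y(blur_stage_x(img))
    return img


if __name__ == "__main__":
    image = [[0, 0, 0], [0, 9, 0], [0, 0, 0]]
    print(blur(image, iterations=1))
```

Such a model can help when checking the results of a generated kernel on small inputs, provided the boundary assumption is adjusted to match the actual design.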

Getting Started

Prerequisites

  • Python 3.6+ and corresponding pip
How to install Python 3.6+ on Ubuntu 16.04+ and CentOS 7?

Ubuntu 16.04

sudo apt install software-properties-common python3-pip
sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt update
sudo apt install python3.6

Ubuntu 18.04+

sudo apt install python3 python3-pip

CentOS 7

sudo yum install python3 python3-pip

Install SODA

Install from PyPI

python3 -m pip install --user --upgrade sodac
  • Replace python3 with a more specific Python version higher than or equal to python3.6, if necessary.
  • Make sure ~/.local/bin is in your PATH, or replace sodac with python3 -m soda.sodac below.
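
The two notes above can be checked with a short script before installing. This is only a convenience sketch; the function names are hypothetical and not part of SODA.

```python
import os
import sys


def python_is_supported(version_info=sys.version_info):
    """Return True if the interpreter is Python 3.6 or newer."""
    return tuple(version_info[:2]) >= (3, 6)


def user_bin_on_path(path_env=None, home=None):
    """Return True if ~/.local/bin appears as an entry of PATH."""
    if path_env is None:
        path_env = os.environ.get("PATH", "")
    if home is None:
        home = os.path.expanduser("~")
    user_bin = os.path.normpath(os.path.join(home, ".local", "bin"))
    entries = [os.path.normpath(p) for p in path_env.split(os.pathsep) if p]
    return user_bin in entries


if __name__ == "__main__":
    print("Python 3.6+:", python_is_supported())
    print("~/.local/bin on PATH:", user_bin_on_path())
```

If the second check prints False, either extend PATH or invoke the compiler as python3 -m soda.sodac.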

Generate Vivado HLS kernel code

sodac tests/src/blur.soda --xocl-kernel blur_kernel.cpp

Generate Intel OpenCL kernel code

sodac tests/src/blur.soda --iocl-kernel blur_kernel.cl

Generate Xilinx Object file with AXI MMAP Master Interface

Requires vivado_hls.

sodac tests/src/blur.soda --xocl-hw-xo blur_kernel.hw.xo

Generate Xilinx Object file with AXI Stream Interface

Requires vivado_hls.

sodac tests/src/blur.soda --xocl-hw-xo blur_kernel.hw.xo --xocl-interface axis

Apply Computation Reuse

sodac tests/src/blur.soda --computation-reuse --xocl-kernel blur_kernel.cpp
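
When the commands above are driven from a build script, the argument list can be assembled programmatically. The helper below is a hypothetical sketch: its name and keyword arguments are invented for illustration, and it uses only the sodac options that appear in the commands above.

```python
def sodac_command(source, *, xocl_kernel=None, iocl_kernel=None,
                  xocl_hw_xo=None, xocl_interface=None,
                  computation_reuse=False, executable="sodac"):
    """Build an argv list for sodac from the options shown in this README."""
    argv = [executable, source]
    if computation_reuse:
        argv.append("--computation-reuse")
    if xocl_kernel is not None:
        argv += ["--xocl-kernel", xocl_kernel]
    if iocl_kernel is not None:
        argv += ["--iocl-kernel", iocl_kernel]
    if xocl_hw_xo is not None:
        argv += ["--xocl-hw-xo", xocl_hw_xo]
    if xocl_interface is not None:
        argv += ["--xocl-interface", xocl_interface]
    return argv


if __name__ == "__main__":
    # To actually run the compiler, pass the list to
    # subprocess.run(argv, check=True); here it is only printed.
    print(sodac_command("tests/src/blur.soda",
                        xocl_hw_xo="blur_kernel.hw.xo",
                        xocl_interface="axis"))
```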

Code Snippets

Configuration

  • 5-point 2D Jacobi: t0(0, 0) = (t1(0, 1) + t1(1, 0) + t1(0, 0) + t1(0, -1) + t1(-1, 0)) * 0.2f
  • tile size is (2000, *)

Each function in the code snippets below is synthesized into an RTL module, and all of its arguments are hls::stream FIFOs. Without unrolling, a simple line-buffer pipeline is generated, producing 1 pixel per cycle. With unrolling, a SODA microarchitecture pipeline is generated, producing 2 pixels per cycle.

Without Unrolling

#pragma HLS dataflow
Module1Func(
  /*output*/ &from_t1_offset_0_to_t1_offset_1999,
  /*output*/ &from_t1_offset_0_to_t0_pe_0,
  /* input*/ &from_super_source_to_t1_offset_0);
Module2Func(
  /*output*/ &from_t1_offset_1999_to_t1_offset_2000,
  /*output*/ &from_t1_offset_1999_to_t0_pe_0,
  /* input*/ &from_t1_offset_0_to_t1_offset_1999);
Module3Func(
  /*output*/ &from_t1_offset_2000_to_t1_offset_2001,
  /*output*/ &from_t1_offset_2000_to_t0_pe_0,
  /* input*/ &from_t1_offset_1999_to_t1_offset_2000);
Module3Func(
  /*output*/ &from_t1_offset_2001_to_t1_offset_4000,
  /*output*/ &from_t1_offset_2001_to_t0_pe_0,
  /* input*/ &from_t1_offset_2000_to_t1_offset_2001);
Module4Func(
  /*output*/ &from_t1_offset_4000_to_t0_pe_0,
  /* input*/ &from_t1_offset_2001_to_t1_offset_4000);
Module5Func(
  /*output*/ &from_t0_pe_0_to_super_sink,
  /* input*/ &from_t1_offset_0_to_t0_pe_0,
  /* input*/ &from_t1_offset_1999_to_t0_pe_0,
  /* input*/ &from_t1_offset_2000_to_t0_pe_0,
  /* input*/ &from_t1_offset_4000_to_t0_pe_0,
  /* input*/ &from_t1_offset_2001_to_t0_pe_0);

In the above code snippet, Module1Func to Module4Func are forwarding modules; they constitute the line buffer. The line buffer size is approximately two lines of pixels, i.e. 4000 pixels. Module5Func is a computing module; it implements the computation kernel. The whole design is fully pipelined; however, with only 1 computing module, it can only produce 1 pixel per cycle.
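
The offsets in the FIFO names above (0, 1999, 2000, 2001, 4000) are consistent with flattening each stencil point over a tile width of 2000 and measuring its distance from the most recently received element. The sketch below reproduces that arithmetic. It is an interpretation inferred from the snippet, not code taken from SODA, and the function names are hypothetical.

```python
TILE_WIDTH = 2000
# Stencil points (x, y) of the 5-point 2D Jacobi from the configuration.
JACOBI_5PT = [(0, 1), (1, 0), (0, 0), (0, -1), (-1, 0)]


def flatten(point, tile_width):
    """Map a 2D stencil point to a position in the row-major input stream."""
    x, y = point
    return y * tile_width + x


def reuse_offsets(stencil, tile_width):
    """Distance of every stencil point from the newest element it needs."""
    flat = [flatten(p, tile_width) for p in stencil]
    newest = max(flat)
    return sorted(newest - f for f in flat)


offsets = reuse_offsets(JACOBI_5PT, TILE_WIDTH)
buffer_size = offsets[-1] - offsets[0]

print(offsets)      # [0, 1999, 2000, 2001, 4000]
print(buffer_size)  # 4000
```

The span between the smallest and largest offset gives the 4000-pixel line buffer size mentioned above.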

Unroll 2 Times

#pragma HLS dataflow
Module1Func(
  /*output*/ &from_t1_offset_1_to_t1_offset_1999,
  /*output*/ &from_t1_offset_1_to_t0_pe_0,
  /* input*/ &from_super_source_to_t1_offset_1);
Module1Func(
  /*output*/ &from_t1_offset_0_to_t1_offset_2000,
  /*output*/ &from_t1_offset_0_to_t0_pe_1,
  /* input*/ &from_super_source_to_t1_offset_0);
Module2Func(
  /*output*/ &from_t1_offset_1999_to_t1_offset_2001,
  /*output*/ &from_t1_offset_1999_to_t0_pe_1,
  /* input*/ &from_t1_offset_1_to_t1_offset_1999);
Module3Func(
  /*output*/ &from_t1_offset_2000_to_t1_offset_2002,
  /*output*/ &from_t1_offset_2000_to_t0_pe_1,
  /*output*/ &from_t1_offset_2000_to_t0_pe_0,
  /* input*/ &from_t1_offset_0_to_t1_offset_2000);
Module4Func(
  /*output*/ &from_t1_offset_2001_to_t1_offset_4001,
  /*output*/ &from_t1_offset_2001_to_t0_pe_1,
  /*output*/ &from_t1_offset_2001_to_t0_pe_0,
  /* input*/ &from_t1_offset_1999_to_t1_offset_2001);
Module5Func(
  /*output*/ &from_t1_offset_2002_to_t1_offset_4000,
  /*output*/ &from_t1_offset_2002_to_t0_pe_0,
  /* input*/ &from_t1_offset_2000_to_t1_offset_2002);
Module6Func(
  /*output*/ &from_t1_offset_4001_to_t0_pe_0,
  /* input*/ &from_t1_offset_2001_to_t1_offset_4001);
Module7Func(
  /*output*/ &from_t0_pe_0_to_super_sink,
  /* input*/ &from_t1_offset_1_to_t0_pe_0,
  /* input*/ &from_t1_offset_2000_to_t0_pe_0,
  /* input*/ &from_t1_offset_2001_to_t0_pe_0,
  /* input*/ &from_t1_offset_4001_to_t0_pe_0,
  /* input*/ &from_t1_offset_2002_to_t0_pe_0);
Module8Func(
  /*output*/ &from_t1_offset_4000_to_t0_pe_1,
  /* input*/ &from_t1_offset_2002_to_t1_offset_4000);
Module7Func(
  /*output*/ &from_t0_pe_1_to_super_sink,
  /* input*/ &from_t1_offset_0_to_t0_pe_1,
  /* input*/ &from_t1_offset_1999_to_t0_pe_1,
  /* input*/ &from_t1_offset_2000_to_t0_pe_1,
  /* input*/ &from_t1_offset_4000_to_t0_pe_1,
  /* input*/ &from_t1_offset_2001_to_t0_pe_1);

In the above code snippet, Module1Func to Module6Func and Module8Func are forwarding modules; they constitute the line buffers of the SODA microarchitecture. Although unrolled, the line buffer size is still approximately two lines of pixels, i.e. 4000 pixels. Module7Func is a computing module; it is instantiated twice. The whole design is fully pipelined and can produce 2 pixels per cycle. In general, the unroll factor can be set to any number that satisfies the throughput requirement.
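
The FIFO names in the unrolled snippet suggest a simple pattern: each PE reads the non-unrolled offsets shifted by its lane, and forwarding modules chain together offsets that share the same remainder modulo the unroll factor. The sketch below reproduces the names seen above under that reading. It is inferred from the snippet rather than taken from SODA, and the function names are hypothetical.

```python
BASE_OFFSETS = [0, 1999, 2000, 2001, 4000]  # offsets without unrolling


def pe_offsets(base_offsets, unroll, pe):
    """Offsets read by one PE; PE 0 sees the largest shift."""
    shift = unroll - 1 - pe
    return sorted(o + shift for o in base_offsets)


def forwarding_chains(base_offsets, unroll):
    """Group all used offsets into one forwarding chain per input lane."""
    used = set()
    for pe in range(unroll):
        used.update(pe_offsets(base_offsets, unroll, pe))
    return {lane: sorted(o for o in used if o % unroll == lane)
            for lane in range(unroll)}


def total_buffer_size(chains, unroll):
    """Elements held across all chains; each chain advances `unroll` per step."""
    return sum((c[-1] - c[0]) // unroll for c in chains.values())


chains = forwarding_chains(BASE_OFFSETS, 2)
print(pe_offsets(BASE_OFFSETS, 2, 0))  # [1, 2000, 2001, 2002, 4001]
print(pe_offsets(BASE_OFFSETS, 2, 1))  # [0, 1999, 2000, 2001, 4000]
print(chains[0])                       # [0, 2000, 2002, 4000]
print(chains[1])                       # [1, 1999, 2001, 4001]
print(total_buffer_size(chains, 2))    # 4000
```

Under this reading the total buffered data stays at 4000 pixels, matching the statement that unrolling does not enlarge the line buffer.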

Best Practices

  • For a non-iterative stencil, choose the unroll factor so that it saturates the external DRAM bandwidth, since on-chip resources are usually not the bottleneck.
  • For an iterative stencil, whether to use more PEs in a single iteration or to implement more iterations is yet to be explored.
  • Note that 2.0 is a double literal. To generate float, write 2.0f; this may help reduce DSP usage.
  • SODA is tiling-based, and the tile size is specified with the input keyword. The last dimension is omitted because it is not needed for reuse buffer generation.
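
As a back-of-the-envelope aid for the first point, the number of pixels that fit in one burst word gives a rough ceiling on the unroll factor a single memory interface can feed. The sketch below assumes that one interface delivers exactly one full burst word per cycle; that assumption, and the function name, are illustrative and not taken from SODA.

```python
def pixels_per_burst_word(burst_width_bits, pixel_bits):
    """Pixels carried by one burst word of the given width."""
    if pixel_bits <= 0 or burst_width_bits % pixel_bits != 0:
        raise ValueError("burst width must be a positive multiple of pixel width")
    return burst_width_bits // pixel_bits


print(pixels_per_burst_word(512, 16))  # 32 (uint16 pixels)
print(pixels_per_burst_word(512, 32))  # 16 (float pixels)
```

The achievable bandwidth of a real platform may be lower, so the result is best treated as an upper bound to start from.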

Projects Using SODA
