
Stencil with optimized dataflow architecture


SODA: Stencil with Optimized Dataflow Architecture

Publication

SODA DSL Example

# comments start with a hash sign (#)

kernel: blur      # kernel name, reused as the kernel name in the generated HLS code
burst width: 512  # I/O width in bits; 512 works best on Xilinx platforms
unroll factor: 16 # number of pixels produced per cycle

# specify the dram bank, type, name, and dimension of the input tile
# the last dimension is not needed and a placeholder '*' must be given
# dram bank is optional
# multiple inputs can be specified but 1 and only 1 must specify the dimensions
input dram 0 uint16: input(2000, *)

# specify an intermediate stage of computation, may appear 0 or more times
local uint16: tmp(0, 0) = (input(-1, 0) + input(0, 0) + input(1, 0)) / 3

# specify the output
# dram bank is optional
output dram 1 uint16: output(0, 0) = (tmp(0, -1) + tmp(0, 0) + tmp(0, 1)) / 3

# how many times the whole computation is repeated (only works if input matches output)
iterate: 2
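
As a software cross-check of what the DSL example describes, the two stages can be sketched in plain Python. This is only an illustrative reference model, not code generated by SODA. The function names are hypothetical, the first index is assumed to be the horizontal one, and the boundary rule (a point keeps its input value when the stencil window would leave the image) is an assumption of this sketch, since the example above does not specify border handling.

```python
def blur_once(img):
    """Apply the tmp and output stages once to img[y][x] (non-negative ints)."""
    height, width = len(img), len(img[0])
    # Stage "tmp": horizontal 3-tap average, integer division as for uint16.
    # Assumption: columns without a full window keep the input value.
    tmp = [row[:] for row in img]
    for y in range(height):
        for x in range(1, width - 1):
            tmp[y][x] = (img[y][x - 1] + img[y][x] + img[y][x + 1]) // 3
    # Stage "output": vertical 3-tap average over tmp.
    # Assumption: rows without a full window keep the tmp value.
    out = [row[:] for row in tmp]
    for y in range(1, height - 1):
        for x in range(width):
            out[y][x] = (tmp[y - 1][x] + tmp[y][x] + tmp[y + 1][x]) // 3
    return out


def blur(img, iterate=2):
    """Repeat the whole two-stage computation, mirroring `iterate: 2`."""
    for _ in range(iterate):
        img = blur_once(img)
    return img


impulse = [[0, 0, 0], [0, 9, 0], [0, 0, 0]]
print(blur_once(impulse))  # [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
```

Feeding the output back in as the next input is only possible because the input and output share the same type and shape, which is the condition the `iterate` comment refers to.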

Getting Started

Prerequisites

  • Python 3.6+ and corresponding pip
How to install Python 3.6+ on Ubuntu 16.04+ and CentOS 7?

Ubuntu 16.04

sudo apt install software-properties-common python3-pip
sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt update
sudo apt install python3.6

Ubuntu 18.04+

sudo apt install python3 python3-pip

CentOS 7

sudo yum install python3 python3-pip

Install SODA

Install from PyPI

python3 -m pip install --user --upgrade sodac
  • If necessary, replace python3 with a specific interpreter that is Python 3.6 or newer (for example, python3.6).
  • Make sure ~/.local/bin is in your PATH, or replace sodac with python3 -m soda.sodac in the commands below.

Generate Vivado HLS kernel code

sodac tests/src/blur.soda --xocl-kernel blur_kernel.cpp

Generate Intel OpenCL kernel code

sodac tests/src/blur.soda --iocl-kernel blur_kernel.cl

Generate Xilinx Object file with AXI MMAP Master Interface

Requires vivado_hls.

sodac tests/src/blur.soda --xocl-hw-xo blur_kernel.hw.xo

Generate Xilinx Object file with AXI Stream Interface

Requires vivado_hls.

sodac tests/src/blur.soda --xocl-hw-xo blur_kernel.hw.xo --interface axis

Apply Computation Reuse

sodac tests/src/blur.soda --computation-reuse --xocl-kernel blur_kernel.cpp
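
When these invocations are scripted, it can help to assemble the argument list in one place. The following is a minimal sketch that uses only the flags shown in the commands above. The helper name `sodac_command` and the target keys (`xilinx_hls`, `intel_opencl`, `xilinx_xo`) are hypothetical and not part of sodac. The helper only builds the command and does not run it.

```python
import shlex

# Output flags that appear in the commands above; no other flags are assumed.
OUTPUT_FLAGS = {
    "xilinx_hls": "--xocl-kernel",   # Vivado HLS kernel code
    "intel_opencl": "--iocl-kernel", # Intel OpenCL kernel code
    "xilinx_xo": "--xocl-hw-xo",     # Xilinx Object file (needs vivado_hls)
}


def sodac_command(source, target, output, interface=None, computation_reuse=False):
    """Return the argv list for one sodac invocation (nothing is executed)."""
    argv = ["sodac", source]
    if computation_reuse:
        argv.append("--computation-reuse")
    argv += [OUTPUT_FLAGS[target], output]
    if interface is not None:
        argv += ["--interface", interface]
    return argv


cmd = sodac_command(
    "tests/src/blur.soda", "xilinx_xo", "blur_kernel.hw.xo", interface="axis"
)
# Print a shell-safe rendering of the command for inspection.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once sodac is installed and on the PATH.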

Code Snippets

Configuration

  • 5-point 2D Jacobi: t0(0, 0) = (t1(0, 1) + t1(1, 0) + t1(0, 0) + t1(0, -1) + t1(-1, 0)) * 0.2f
  • tile size is (2000, *)

Each function in the code snippets below is synthesized into an RTL module, and all of their arguments are hls::stream FIFOs. Without unrolling, a simple line-buffer pipeline is generated, producing 1 pixel per cycle. With unrolling, a SODA microarchitecture pipeline is generated, producing 2 pixels per cycle.

Without Unrolling

#pragma HLS dataflow
Module1Func(
  /*output*/ &from_t1_offset_0_to_t1_offset_1999,
  /*output*/ &from_t1_offset_0_to_t0_pe_0,
  /* input*/ &from_super_source_to_t1_offset_0);
Module2Func(
  /*output*/ &from_t1_offset_1999_to_t1_offset_2000,
  /*output*/ &from_t1_offset_1999_to_t0_pe_0,
  /* input*/ &from_t1_offset_0_to_t1_offset_1999);
Module3Func(
  /*output*/ &from_t1_offset_2000_to_t1_offset_2001,
  /*output*/ &from_t1_offset_2000_to_t0_pe_0,
  /* input*/ &from_t1_offset_1999_to_t1_offset_2000);
Module3Func(
  /*output*/ &from_t1_offset_2001_to_t1_offset_4000,
  /*output*/ &from_t1_offset_2001_to_t0_pe_0,
  /* input*/ &from_t1_offset_2000_to_t1_offset_2001);
Module4Func(
  /*output*/ &from_t1_offset_4000_to_t0_pe_0,
  /* input*/ &from_t1_offset_2001_to_t1_offset_4000);
Module5Func(
  /*output*/ &from_t0_pe_0_to_super_sink,
  /* input*/ &from_t1_offset_0_to_t0_pe_0,
  /* input*/ &from_t1_offset_1999_to_t0_pe_0,
  /* input*/ &from_t1_offset_2000_to_t0_pe_0,
  /* input*/ &from_t1_offset_4000_to_t0_pe_0,
  /* input*/ &from_t1_offset_2001_to_t0_pe_0);

In the above code snippet, Module1Func to Module4Func are forwarding modules; they constitute the line buffer. The line buffer size is approximately two lines of pixels, i.e. 4000 pixels. Module5Func is a computing module; it implements the computation kernel. The whole design is fully pipelined; however, with only 1 computing module, it can only produce 1 pixel per cycle.
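
The offsets in the FIFO names above (0, 1999, 2000, 2001, 4000) can be reproduced from the stencil taps and the tile width. The sketch below is an interpretation inferred from the snippet rather than SODA's actual code: it assumes each offset is the distance, in pixels of the linearized tile, between a tap and the most recently arrived tap. The function name `reuse_delays` is hypothetical.

```python
TILE_WIDTH = 2000  # first dimension of the tile, from the configuration above

# Taps of the 5-point Jacobi stencil as (x, y) offsets into t1,
# in the order they appear in the stencil expression.
TAPS = [(0, 1), (1, 0), (0, 0), (0, -1), (-1, 0)]


def reuse_delays(taps, width):
    """Map each tap to how many pixels it lags behind the newest tap."""
    # Linearize (x, y) in row-major order with the given row width.
    linear = {tap: tap[1] * width + tap[0] for tap in taps}
    newest = max(linear.values())
    return {tap: newest - pos for tap, pos in linear.items()}


delays = reuse_delays(TAPS, TILE_WIDTH)
print(sorted(delays.values()))  # [0, 1999, 2000, 2001, 4000]
print(max(delays.values()))     # 4000, i.e. about two lines of pixels
```

Under this reading, the largest delay is the number of pixels the line buffer has to hold, which agrees with the two-line estimate stated above.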

Unroll 2 Times

#pragma HLS dataflow
Module1Func(
  /*output*/ &from_t1_offset_1_to_t1_offset_1999,
  /*output*/ &from_t1_offset_1_to_t0_pe_0,
  /* input*/ &from_super_source_to_t1_offset_1);
Module1Func(
  /*output*/ &from_t1_offset_0_to_t1_offset_2000,
  /*output*/ &from_t1_offset_0_to_t0_pe_1,
  /* input*/ &from_super_source_to_t1_offset_0);
Module2Func(
  /*output*/ &from_t1_offset_1999_to_t1_offset_2001,
  /*output*/ &from_t1_offset_1999_to_t0_pe_1,
  /* input*/ &from_t1_offset_1_to_t1_offset_1999);
Module3Func(
  /*output*/ &from_t1_offset_2000_to_t1_offset_2002,
  /*output*/ &from_t1_offset_2000_to_t0_pe_1,
  /*output*/ &from_t1_offset_2000_to_t0_pe_0,
  /* input*/ &from_t1_offset_0_to_t1_offset_2000);
Module4Func(
  /*output*/ &from_t1_offset_2001_to_t1_offset_4001,
  /*output*/ &from_t1_offset_2001_to_t0_pe_1,
  /*output*/ &from_t1_offset_2001_to_t0_pe_0,
  /* input*/ &from_t1_offset_1999_to_t1_offset_2001);
Module5Func(
  /*output*/ &from_t1_offset_2002_to_t1_offset_4000,
  /*output*/ &from_t1_offset_2002_to_t0_pe_0,
  /* input*/ &from_t1_offset_2000_to_t1_offset_2002);
Module6Func(
  /*output*/ &from_t1_offset_4001_to_t0_pe_0,
  /* input*/ &from_t1_offset_2001_to_t1_offset_4001);
Module7Func(
  /*output*/ &from_t0_pe_0_to_super_sink,
  /* input*/ &from_t1_offset_1_to_t0_pe_0,
  /* input*/ &from_t1_offset_2000_to_t0_pe_0,
  /* input*/ &from_t1_offset_2001_to_t0_pe_0,
  /* input*/ &from_t1_offset_4001_to_t0_pe_0,
  /* input*/ &from_t1_offset_2002_to_t0_pe_0);
Module8Func(
  /*output*/ &from_t1_offset_4000_to_t0_pe_1,
  /* input*/ &from_t1_offset_2002_to_t1_offset_4000);
Module7Func(
  /*output*/ &from_t0_pe_1_to_super_sink,
  /* input*/ &from_t1_offset_0_to_t0_pe_1,
  /* input*/ &from_t1_offset_1999_to_t0_pe_1,
  /* input*/ &from_t1_offset_2000_to_t0_pe_1,
  /* input*/ &from_t1_offset_4000_to_t0_pe_1,
  /* input*/ &from_t1_offset_2001_to_t0_pe_1);

In the above code snippet, Module1Func to Module6Func and Module8Func are forwarding modules; they constitute the line buffers of the SODA microarchitecture. Although the design is unrolled, the line buffer size is still approximately two lines of pixels, i.e. 4000 pixels. Module7Func is a computing module; it is instantiated twice. The whole design is fully pipelined and can produce 2 pixels per cycle. In general, the unroll factor can be set to any number that satisfies the throughput requirement.
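
The claim that the buffer size does not grow with unrolling can be checked with a small model of the FIFO names in the snippet. This sketch rests on assumptions inferred from those names, not on SODA's source: PE k reads each unrolled-by-1 offset shifted by (unroll - 1 - k), offsets that agree modulo the unroll factor share one forwarding chain, and a chain stores its offset span divided by the unroll factor. The function name `unrolled_chains` is hypothetical.

```python
def unrolled_chains(base_delays, unroll):
    """Group the offsets needed by all PEs into one FIFO chain per lane."""
    # Assumption: PE k reads each base offset shifted by (unroll - 1 - k),
    # so the last PE sees the unshifted offsets.
    per_pe = {
        k: [d + (unroll - 1 - k) for d in base_delays] for k in range(unroll)
    }
    needed = sorted({d for offsets in per_pe.values() for d in offsets})
    # Offsets that agree modulo the unroll factor sit on the same chain.
    chains = {
        lane: [d for d in needed if d % unroll == lane] for lane in range(unroll)
    }
    # Each chain carries every `unroll`-th pixel, so it stores span / unroll.
    storage = sum((c[-1] - c[0]) // unroll for c in chains.values())
    return per_pe, chains, storage


# Offsets of the 5-point Jacobi stencil without unrolling, in input order.
per_pe, chains, storage = unrolled_chains([0, 1999, 2000, 4000, 2001], unroll=2)
print(per_pe[0])  # [1, 2000, 2001, 4001, 2002]
print(per_pe[1])  # [0, 1999, 2000, 4000, 2001]
print(chains)     # {0: [0, 2000, 2002, 4000], 1: [1, 1999, 2001, 4001]}
print(storage)    # 4000
```

The two chains match the even and odd offset sequences in the snippet, and their combined storage stays at 4000 pixels.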

Best Practices

  • For a non-iterative stencil, choose the unroll factor to saturate the external DRAM bandwidth, since on-chip resources are usually not the bottleneck
  • For an iterative stencil, whether to use more PEs in a single iteration or to implement more iterations is yet to be explored
  • Note that the literal 2.0 is a double. To generate float arithmetic, write 2.0f; this may help reduce DSP usage
  • SODA is tiling-based, and the tile size is specified with the input keyword. The last dimension is omitted because it is not needed to generate the reuse buffer
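
As a rough starting point for the first bullet, the number of elements carried by one burst word gives an upper bound on how many pixels a single DRAM bank can supply per transfer. The sketch below is plain arithmetic. It assumes one burst word per cycle per bank, which depends on the platform, and the function name `elements_per_burst` is hypothetical.

```python
def elements_per_burst(burst_width_bits, element_bits):
    """Number of elements delivered by one burst word of the given width."""
    if burst_width_bits % element_bits:
        raise ValueError("burst width must be a multiple of the element width")
    return burst_width_bits // element_bits


# Assumption: one burst word arrives per cycle from one bank.
print(elements_per_burst(512, 16))  # 32 uint16 pixels per 512-bit word
print(elements_per_burst(512, 32))  # 16 float pixels per 512-bit word
```

An unroll factor above this bound would leave PEs waiting for data under the stated assumption, so the bound is a reasonable first value to try before measuring.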

Projects Using SODA

