Stencil with optimized dataflow architecture
Project description
SODA: Stencil with Optimized Dataflow Architecture
Publication
- Yuze Chi, Jason Cong. Exploiting Computation Reuse for Stencil Accelerators. In DAC, 2020. [PDF] [Code]
- Yuze Chi, Jason Cong, Peng Wei, Peipei Zhou. SODA: Stencil with Optimized Dataflow Architecture. In ICCAD, 2018. (Best Paper Candidate) [PDF] [Slides] [Code]
SODA DSL Example
# comments start with hashtag(#)
kernel: blur # the kernel name, will be used as the kernel name in HLS
burst width: 512 # I/O width in bits, for Xilinx platform 512 works the best
unroll factor: 16 # how many pixels are generated per cycle
# specify the dram bank, type, name, and dimension of the input tile
# the last dimension is not needed and a placeholder '*' must be given
# dram bank is optional
# multiple inputs can be specified but 1 and only 1 must specify the dimensions
input dram 0 uint16: input(2000, *)
# specify an intermediate stage of computation, may appear 0 or more times
local uint16: tmp(0, 0) = (input(-1, 0) + input(0, 0) + input(1, 0)) / 3
# specify the output
# dram bank is optional
output dram 1 uint16: output(0, 0) = (tmp(0, -1) + tmp(0, 0) + tmp(0, 1)) / 3
# how many times the whole computation is repeated (only works if input matches output)
iterate: 2
Getting Started
Prerequisites
- Python 3.6+ and corresponding
pip
How to install Python 3.6+ on Ubuntu 16.04+ and CentOS 7?
Ubuntu 16.04
sudo apt install software-properties-common python3-pip
sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt update
sudo apt install python3.6
Ubuntu 18.04+
sudo apt install python3 python3-pip
CentOS 7
sudo yum install python3 python3-pip
Install SODA
Install from PyPI
python3 -m pip install --user --upgrade sodac
- Replace
python3
with a more specific Python version higher than or equal topython3.6
, if necessary. - Make sure
~/.local/bin
is in yourPATH
, or replacesodac
withpython3 -m soda.sodac
below.
Generate Vivado HLS kernel code
sodac tests/src/blur.soda --xocl-kernel blur_kernel.cpp
Generate Intel OpenCL kernel code
sodac tests/src/blur.soda --iocl-kernel blur_kernel.cl
Generate Xilinx Object file with AXI MMAP Master Interface
Requries vivado_hls
.
sodac tests/src/blur.soda --xocl-hw-xo blur_kernel.hw.xo
Generate Xilinx Object file with AXI Stream Interface
Requries vivado_hls
.
sodac tests/src/blur.soda --xocl-hw-xo blur_kernel.hw.xo --xocl-interface axis
Apply Computation Reuse
sodac tests/src/blur.soda --computation-reuse --xocl-kernel blur_kernel.cpp
Code Snippets
Configuration
- 5-point 2D Jacobi:
t0(0, 0) = (t1(0, 1) + t1(1, 0) + t1(0, 0) + t1(0, -1) + t1(-1, 0)) * 0.2f
- tile size is
(2000, *)
Each function in the below code snippets is synthesized into an RTL module.
Their arguments are all hls::stream
FIFOs; Without unrolling, a simple line-buffer pipeline is generated, producing 1 pixel per cycle.
With unrolling, a SODA microarchitecture pipeline is generated, procuding 2 pixeles per cycle.
Without Unrolling
#pragma HLS dataflow
Module1Func(
/*output*/ &from_t1_offset_0_to_t1_offset_1999,
/*output*/ &from_t1_offset_0_to_t0_pe_0,
/* input*/ &from_super_source_to_t1_offset_0);
Module2Func(
/*output*/ &from_t1_offset_1999_to_t1_offset_2000,
/*output*/ &from_t1_offset_1999_to_t0_pe_0,
/* input*/ &from_t1_offset_0_to_t1_offset_1999);
Module3Func(
/*output*/ &from_t1_offset_2000_to_t1_offset_2001,
/*output*/ &from_t1_offset_2000_to_t0_pe_0,
/* input*/ &from_t1_offset_1999_to_t1_offset_2000);
Module3Func(
/*output*/ &from_t1_offset_2001_to_t1_offset_4000,
/*output*/ &from_t1_offset_2001_to_t0_pe_0,
/* input*/ &from_t1_offset_2000_to_t1_offset_2001);
Module4Func(
/*output*/ &from_t1_offset_4000_to_t0_pe_0,
/* input*/ &from_t1_offset_2001_to_t1_offset_4000);
Module5Func(
/*output*/ &from_t0_pe_0_to_super_sink,
/* input*/ &from_t1_offset_0_to_t0_pe_0,
/* input*/ &from_t1_offset_1999_to_t0_pe_0,
/* input*/ &from_t1_offset_2000_to_t0_pe_0,
/* input*/ &from_t1_offset_4000_to_t0_pe_0,
/* input*/ &from_t1_offset_2001_to_t0_pe_0);
In the above code snippet, Module1Func
to Module4Func
are forwarding modules; they constitute the line buffer.
The line buffer size is approximately two lines of pixels, i.e. 4000 pixels.
Module5Func
is a computing module; it implements the computation kernel.
The whole design is fully pipelined; however, with only 1 computing module, it can only produce 1 pixel per cycle.
Unroll 2 Times
#pragma HLS dataflow
Module1Func(
/*output*/ &from_t1_offset_1_to_t1_offset_1999,
/*output*/ &from_t1_offset_1_to_t0_pe_0,
/* input*/ &from_super_source_to_t1_offset_1);
Module1Func(
/*output*/ &from_t1_offset_0_to_t1_offset_2000,
/*output*/ &from_t1_offset_0_to_t0_pe_1,
/* input*/ &from_super_source_to_t1_offset_0);
Module2Func(
/*output*/ &from_t1_offset_1999_to_t1_offset_2001,
/*output*/ &from_t1_offset_1999_to_t0_pe_1,
/* input*/ &from_t1_offset_1_to_t1_offset_1999);
Module3Func(
/*output*/ &from_t1_offset_2000_to_t1_offset_2002,
/*output*/ &from_t1_offset_2000_to_t0_pe_1,
/*output*/ &from_t1_offset_2000_to_t0_pe_0,
/* input*/ &from_t1_offset_0_to_t1_offset_2000);
Module4Func(
/*output*/ &from_t1_offset_2001_to_t1_offset_4001,
/*output*/ &from_t1_offset_2001_to_t0_pe_1,
/*output*/ &from_t1_offset_2001_to_t0_pe_0,
/* input*/ &from_t1_offset_1999_to_t1_offset_2001);
Module5Func(
/*output*/ &from_t1_offset_2002_to_t1_offset_4000,
/*output*/ &from_t1_offset_2002_to_t0_pe_0,
/* input*/ &from_t1_offset_2000_to_t1_offset_2002);
Module6Func(
/*output*/ &from_t1_offset_4001_to_t0_pe_0,
/* input*/ &from_t1_offset_2001_to_t1_offset_4001);
Module7Func(
/*output*/ &from_t0_pe_0_to_super_sink,
/* input*/ &from_t1_offset_1_to_t0_pe_0,
/* input*/ &from_t1_offset_2000_to_t0_pe_0,
/* input*/ &from_t1_offset_2001_to_t0_pe_0,
/* input*/ &from_t1_offset_4001_to_t0_pe_0,
/* input*/ &from_t1_offset_2002_to_t0_pe_0);
Module8Func(
/*output*/ &from_t1_offset_4000_to_t0_pe_1,
/* input*/ &from_t1_offset_2002_to_t1_offset_4000);
Module7Func(
/*output*/ &from_t0_pe_1_to_super_sink,
/* input*/ &from_t1_offset_0_to_t0_pe_1,
/* input*/ &from_t1_offset_1999_to_t0_pe_1,
/* input*/ &from_t1_offset_2000_to_t0_pe_1,
/* input*/ &from_t1_offset_4000_to_t0_pe_1,
/* input*/ &from_t1_offset_2001_to_t0_pe_1);
In the above code snippet, Module1Func
to Module6Func
and Module8Func
are forwarding modules; they constitute the line buffers of the SODA microarchitecture.
Although unrolled, the line buffer size is still approximately two lines of pixels, i.e. 4000 pixels.
Module7Func
is a computing module; it is instanciated twice.
The whole design is fully pipelined and can produce 2 pixel per cycle.
In general, the unroll factor can be set to any number that satisfies the throughput requirement.
Best Practices
- For non-iterative stencil,
unroll factor
shall be determined by the DRAM bandwidth, i.e. saturate the external bandwidth, since the resource is usually not the bottleneck - For iterative stencil, to use more PEs in a single iteration or to implement more iterations is yet to be explored
- Note that
2.0
will be adouble
number. To generatefloat
, use2.0f
. This may help reduce DSP usage - SODA is tiling-based and the size of the tile is specified in the
input
keyword. The last dimension is omitted because it is not needed in the reuse buffer generation
Projects Using SODA
- Licheng Guo, Jason Lau, Yuze Chi, Jie Wang, Cody Hao Yu, Zhe Chen, Zhiru Zhang, Jason Cong . Analysis and Optimization of the Implicit Broadcasts in FPGA HLS to Improve Maximum Frequency. In DAC, 2020. [PDF]
- Jiajie Li, Yuze Chi, Jason Cong. HeteroHalide: From Image Processing DSL to Efficient FPGA Acceleration. In FPGA, 2020. [PDF] [Slides]
- Young-kyu Choi, Yuze Chi, Jie Wang, Jason Cong. FLASH: Fast, ParalleL, and Accurate Simulator for HLS. In TCAD, 2020. [PDF]
- Yi-Hsiang Lai, Yuze Chi, Yuwei Hu, Jie Wang, Cody Hao Yu, Yuan Zhou, Jason Cong, Zhiru Zhang. HeteroCL: A Multi-Paradigm Programming Infrastructure for Software-Defined Reconfigurable Computing. In FPGA, 2019. (Best Paper Candidate) [PDF] [Slides] [Code]
- Yuze Chi, Young-kyu Choi, Jason Cong, Jie Wang. Rapid Cycle-Accurate Simulator for High-Level Synthesis. In FPGA, 2019. [PDF] [Slides]
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distributions
File details
Details for the file sodac-0.0.20220505.dev2.tar.gz
.
File metadata
- Download URL: sodac-0.0.20220505.dev2.tar.gz
- Upload date:
- Size: 82.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.0 CPython/3.9.12
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | cafd4ac4c9b17a36c1b2be9770962cd7115d67bb97784812262c5bb7f19c72d7 |
|
MD5 | ee392eee13641e8da682b3dbd30b5ff3 |
|
BLAKE2b-256 | 0e2ecd240a7cd41dcf9d78984701235fe4ee37b2829f4165f59cc2d8bb3420ea |
File details
Details for the file sodac-0.0.20220505.dev2-py3.7.egg
.
File metadata
- Download URL: sodac-0.0.20220505.dev2-py3.7.egg
- Upload date:
- Size: 220.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.0 CPython/3.9.12
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 41a87d9122831442c8ab1739ed12d562ea2bd8e0064b7e08134e080acb32ceb7 |
|
MD5 | d72afb46a6bdacab0fa5f14d1b356db4 |
|
BLAKE2b-256 | 896cb1f120b965c2476c50663b78959990a09c86bc15607a74705a8236759be1 |
File details
Details for the file sodac-0.0.20220505.dev2-py3-none-any.whl
.
File metadata
- Download URL: sodac-0.0.20220505.dev2-py3-none-any.whl
- Upload date:
- Size: 90.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.0 CPython/3.9.12
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | a6ce1a89b203814b7ace19555d38acf11e88ddb41e8ed378d26c1b4235ba13ca |
|
MD5 | 8035aa424fa295ad4e0b21b01be58929 |
|
BLAKE2b-256 | 910c0e5f495fc8f2aa464fdbc61f72ac3925df654347df66f152a1e0eee139e5 |