Skip to main content

Extending High-Level Synthesis for Task-Parallel Programs

Project description

Extending High-Level Synthesis for Task-Parallel Programs

Feature Synopsis

Programmability Improvement

Beyond Programmability

TAPA vs State-of-the-Art

In this project, we target PCIe-based FPGA accelerators that are commonly available in data centers (e.g., Amazon EC2 F1, Intel PAC D5005, Xilinx Alveo U250). In additional to the traditional register-transfer level (RTL) programming approach, these accelerators can also be programmed natively using vendor high-level synthesis (HLS) tools, e.g., Intel FPGA SDK for OpenCL and Xilinx Vitis. Both tools support task-parallel programs in HLS.

Intel FPGA SDK for OpenCL supports multiple OpenCL kernels running in parallel. The kernels communicate with each other via channel. Blocking/non-blocking read/write are supported. Both the kernel instances and the communication channels are global variables, which has several implications:

  • Hierarchical designs must be flattened;
  • Sharing code with channel accesses among kernels is difficult;
  • Accessed channels are not clearly visible for each kernel;
  • Functionally identical kernels must be high-level–synthesized separately.

The number of kernels is limited to 256 due to limitations in software simulation.

Xilinx Vitis supports two different task-parallel granularities. Fine-grained task-parallelism is supported via #pragma HLS dataflow. In a dataflow region, C++ function calls are interpreted as if they run in parallel. The communication interface for each C++ function is hls::stream<T>&, which generates ap_fifo RTL interface with simple ready-valid handshake. Blocking/non-blocking read/write are supported. The communication channels are instantiated using the same hls::stream<T> C++ class in the dataflow region. Hierarchical dataflow regions are possible. However, functionally identical kernels are still high-level–synthesized separately.

Coarse-grained task-parallelism is supported by linking multiple OpenCL kernels together (“k2k streaming”). Each OpenCL kernel itself may leverage #pragma HLS dataflow, thus enabling task-parallelism at a higher-level of granularity. Inter–OpenCL kernel communication uses a more-complicated and less-general hls::stream<ap_axiu<width, 0, 0, 0>>& construct, which generates AXI4-Stream (axis) RTL interface. In addition to blocking/non-blocking read/write operations, axis also provides a last bit that can implement transactions. However, leveraging such task-parallelism requires additional information beyond the source code at link time, which complicates the overall workflow.

Compared with these state of the art, TAPA provides a more flexible and productive development experience. In many cases, TAPA can even improve the quality of result (QoR). Details are as follows.

Kernel Communication Interface

In addition to blocking/non-blocking read/write operations, TAPA also supports peeking and transactions. Peeking is defined as reading a token from a channel without removing the token from the channel. This makes it possible to conditionally read tokens based on the content of tokens. Transactions are supported by providing a special end-of-transaction (eot) token for each channel. Such a token can be used to indicate the completion of a transaction without additional coding constructs.

TAPA uses separate tapa::istream<T>& and tapa::ostream<T>& classes for communication interfaces. This makes the communication directions of each interface visually obvious in the source code. IDEs will also only suggest available functions for code completion (e.g., read won't show up for tapa::ostream<T>&).

Host-Kernel Interface

Both Intel FPGA SDK for OpenCL and Xilinx Vitis provide high-level host-kernel interface as specified in the OpenCL standard. While programmers are relived from writing their own operating system drivers, they still have to write lengthy OpenCL host API calls. For example, the hello world example in Vitis Accel Examples spent 57 lines of code just to call the accelerator from the host. Worse, running software simulation, hardware simulation, and on-board execution requires different environment variable setup in addition to different bitstream specifications.

TAPA cuts this down to a single function call to tapa::invoke and takes care of environmental variable setup. Programmers only need to change the bitstream specification for different targets.

tapa::invoke(
  vadd,                                     // Top-level kernel function.
  bitstream,                                // Path to bitstream. If empty, run software simulation.
  tapa::read_only_mmap<unsigned int>(in1),  // Kernel argument 0, read-only array.
  tapa::read_only_mmap<unsigned int>(in2),  // Kernel argument 1, read-only array.
  tapa::write_only_mmap<int>(out_r),        // Kernel argument 2, write-only array.
  size);                                    // Kernel argument 3, scalar.

Software Simulation

C++ functions in a #pragma HLS dataflow region is called sequentially for software simulation. This works correctly only for programs that do not have cyclic communication patterns (e.g., an image-processing pipeline).

A common approach that fixes the correctness issue is to launch a thread for each task (e.g., OpenCL kernel). This approach however, has limited scalability due to contention caused by preemptive threads managed by the operating system.

TAPA takes a coroutine-based approach to solve the scalability problem. Coroutines are managed by user space programs and do not have the contention issue. Moreover, context switch between coroutines are much faster than threads.

RTL Code Generation

Although Vitis HLS can reuse HLS results for different instances of the same C++ function, it is not supported in a #pragma HLS dataflow region. Both Intel FPGA SDK for OpenCL and Xilinx Vitis HLS run HLS for each instance of tasks as if they are different for task-parallel programs. This can lead to long compilation time for programs with many identical task instances, e.g., as in SODA and AutoSA. To work around this problem, SODA implements an RTL backend that calls Vivado HLS for each task and then create the top-level RTL code by itself. TAPA generalizes this approach and makes it available to all task-parallel programs.

Memory System

Vitis HLS provides DRAM access via AXI4 memory-mapped (m_axi) interfaces. However, in a #pragma HLS dataflow regions, task instances may not access the same m_axi interface.

TAPA provides tapa::mmap that has the same functionality as Vitis HLS's m_axi interfaces. TAPA additionally provides asynchronous accesses to memory-mapped arguments, via separate address/data channels exposed as tapa::i/ostream. This makes it possible to achieve a high random access throughput.

TAPA also makes it possible to access the same memory-mapped interface from multiple task instances. This is implemented by instantiating AXI interconnect where an AXI interface is shared. Note that with great power comes great responsibility. The AXI interconnect do not provide built-in locking mechanisms. Programmers must take care of correctness with concurrent reads and writes.

Detached Tasks

Controlling tasks as OpenCL kernels can be expensive, yet often times each task instance is small and do not need runtime management. Intel FPGA SDK for OpenCL supports “autorun” kernels, which are started as soon as the accelerator starts and is restarted when it stops. However, they cannot have any arguments other than the global communication channels. Similarly, Xilinx Vitis supports “free-running” OpenCL kernels. While the state of these kernels do not have to be maintained at runtime and thus resources are saved, the degree of freedom is also greatly limited. #pragma HLS dataflow do not support “autorun” or “free-running” instances, therefore all task instances must be properly terminated. This can lead to unnecessary complexity to propagate termination signals.

TAPA makes it possible to “detach” a task instance once instantiated. By default, a TAPA task is not considered finished until all its children instances finish. Detach task instances resembles the concept of std::thread::detach(), and the parent task will not wait for it to finish. This makes it possible to do necessary initialization before entering an infinite loop in detached tasks, thus having both resource efficiency and programming flexibility.

Hardware FIFO

Vitis HLS can generate two different implementations for the communication channels, using shift-registers (SLR) and block RAMs (BRAM). However, it seems to unnecessarily prefer BRAM in many cases (e.g., when the FIFO is wide but shallow). TAPA uses a larger threshold that reduces BRAM usage. Moreover, when URAM is more efficient (width ≥ 36 and depth ≥ 4096), it is used in place of BRAM.

Timing Closure

TAPA natively supports AutoBridge. By specifying a filename for the output TCL constraints, TAPA can generate coarse-grained floorplans that were often necessary to achieve good timing closure.

Related Publications

Getting Started

Prerequisites

  • Ubuntu 16.04+
    • Coroutine-based simulator only works on Ubuntu 18.04+
  • Vitis HLS 2020.2+
  • Vivado

Install from Binary

./install.sh

Install from Source

Build Prerequisites

  • CMake 3.13+
  • A C++ 11 compiler (e.g. g++-9)
  • Python 3
  • Google glog library (libgoogle-glog-dev)
  • Clang 8 and its headers (clang-8, libclang-8-dev)
  • Boost coroutine library (libboost-coroutine-dev)
  • Icarus Verilog (iverilog)
  • FPGA Runtime

Build tapacc

mkdir build
cd build
cmake ..
make
make test
cd ..
sudo ln -s backend/python/tapac /usr/local/bin/
sudo ln -s build/backend/tapacc /usr/local/bin/

Known Issues

  • Template functions cannot be tasks
  • Vitis HLS include paths (e.g., /opt/Xilinx/Vitis_HLS/2020.2/include) must not be specified in tapac --cflags;
    • Workaround is to use CPATH environment variable (e.g., export CPATH=/opt/Xilinx/Vitis_HLS/2020.2/include)

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

tapa-0.0.20210801.1.tar.gz (45.0 kB view hashes)

Uploaded Source

Built Distributions

tapa-0.0.20210801.1-py3.6.egg (108.2 kB view hashes)

Uploaded Source

tapa-0.0.20210801.1-py3-none-any.whl (54.3 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page