A Graph-based Program Representation for Data Flow Analysis and Compiler Optimizations
ProGraML: Program Graphs for Machine Learning
| | |
|---|---|
| OS | GNU/Linux, macOS ≥ 10.15 |
| Python Versions | 3.6, 3.7, 3.8, 3.9 |
- Overview
- Getting Started
- Installation
- Datasets
- Constructing the ProGraML Representation
  - Step 1: Compiler IR
  - Step 2: Control-flow
  - Step 3: Data-flow
  - Step 4: Call graph
- Usage
  - End-to-end C++ flow
  - Dataflow experiments
- Contributing
- Acknowledgements
Overview
ProGraML is a representation for programs as input to a machine learning model.
Key features are:
- Expressiveness: We represent programs as graphs, capturing all of the control, data, and call relations. Each node in the graph represents an instruction, variable, or constant, and edges are positional such that non-commutative operations can be differentiated (see the short C example after this list).
- Portability: ProGraML is derived from compiler IRs, making it independent of the source language (e.g. we have trained models to reason across five different source languages at a time). It is easy to target new IRs (we currently support LLVM and XLA).
- Extensibility: Features and labels can easily be added at the whole-program level, per-instruction level, or for individual relations.
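To make the positional-edge point concrete, here is a tiny illustrative C snippet (not taken from the project's code or datasets):

```c
/* Illustrative only: subtraction is non-commutative, so operand order matters.
 * In the ProGraML graph, both functions contain the same kinds of nodes, but
 * the data edges feeding each subtraction carry different positions
 * (operand 0 vs. operand 1), which keeps the two computations distinguishable. */
int sub_ab(int a, int b) { return a - b; }
int sub_ba(int a, int b) { return b - a; }
```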
Getting Started
To get stuck in and play around with our graph representation, try the interactive demo notebook linked from the project repository. Or, if papers are more your ☕, have a read of ours: ProGraML: A Graph-based Program Representation for Data Flow Analysis and Compiler Optimizations (ICML 2021).
Installation
Install the latest release of the Python package using:
pip install -U programl
See INSTALL.md for alternative installation options.
Datasets
Please see this doc for download links for our publicly available datasets of LLVM-IRs, ProGraML graphs, and data flow analysis labels.
Constructing the ProGraML Representation
The ProGraML representation is constructed in multiple stages. Here we describe the process for a simple recursive Fibonacci implementation in C. For instructions on how to run this process, see Usage below.
Step 1: Compiler IR
We start by lowering the program to a compiler IR. In this case, we'll use LLVM-IR. This can be done using:

clang -emit-llvm -S -O3 fib.c
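For reference, a minimal fib.c along the lines described above might look like the following. This is an illustrative sketch rather than the exact source used to produce the figures; it uses a switch because the later steps refer to a switch instruction in the lowered IR:

```c
/* fib.c -- an illustrative recursive Fibonacci implementation.
 * The case structure typically lowers to the switch instruction that the
 * control-flow and call-graph steps below refer to. */
int fib(int n) {
  switch (n) {
    case 0:
      return 0;
    case 1:
      return 1;
    default:
      return fib(n - 1) + fib(n - 2);
  }
}
```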
Step 2: Control-flow
We begin building a graph by constructing a full-flow graph of the program. In a
full-flow graph, every instruction is a node and the edges are control-flow.
Note that edges are positional so that we can differentiate the branching control flow in that switch instruction.
Step 3: Data-flow
Then we add a graph node for every variable and constant. In the drawing above, the diamonds are constants and the ovals are variables. We add data-flow edges to describe the relations between constants and the instructions that use them, and between variables and the instructions that define or use them. Like control edges, data edges are positional; for data edges, the position encodes the order of a data element in the instruction's operand list.
Step 4: Call graph
Finally, we add call edges (green) from callsites to the function entry instruction, and return edges from function exits to the callsite. Since this is a graph of a recursive function, the callsites refer back to the entry of the function (the switch). The external node is used to represent a call from an external site.
The process described above can be run locally using our clang2graph and graph2dot tools:

clang clang2graph -O3 fib.c | graph2dot
Usage
End-to-end C++ flow
In the manner of Unix Zen, creating and manipulating ProGraML graphs is done using command-line tools which act as filters, reading in graphs from stdin and emitting graphs to stdout. The structure for graphs is described through a series of protocol buffers.
This section provides an example step-by-step guide for generating a program graph for a C++ application.
- Install LLVM-10 and the ProGraML command line tools.
- Compile your C++ code to LLVM-IR. The way to do this is to modify your build system so that clang is passed the -emit-llvm -S flags (a small example source file is sketched after this list). For a single-source application, the command line invocation would be:
$ clang-10 -emit-llvm -S -c my_app.cpp -o my_app.ll
For a multi-source application, you can compile each file to LLVM-IR separately and then link the results. For example:
$ clang-10 -emit-llvm -S -c foo.cpp -o foo.ll
$ clang-10 -emit-llvm -S -c bar.cpp -o bar.ll
$ llvm-link foo.ll bar.ll -S -o my_app.ll
- Generate a ProGraML graph protocol buffer from the LLVM-IR using the llvm2graph command:
$ llvm2graph < my_app.ll > my_app.pbtxt
The generated file my_app.pbtxt uses a human-readable ProgramGraph format which you can inspect using a text editor. In this case, we will render it to an image file using Graphviz.
- Generate a Graphviz dotfile from the ProGraML graph using graph2dot:
$ graph2dot < my_app.pbtxt > my_app.dot
- Render the dotfile to a PNG image using Graphviz:
$ dot -Tpng my_app.dot -o my_app.png
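If you just want to try the pipeline end to end, any small translation unit will do. The file below is a hypothetical stand-in for my_app.cpp, written in the common C/C++ subset so it compiles unchanged with the clang-10 invocation from step 2:

```c
/* my_app.cpp -- a hypothetical single-source input for the walkthrough above.
 * square() gives the program a second function, so the call and return edges
 * described earlier show up in the generated ProgramGraph. */
int square(int x) { return x * x; }

int main(void) {
  return square(4) - 16; /* exit status 0 */
}
```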
Dataflow experiments
- Follow the instructions for building from source
- Download and unpack our dataflow dataset
- Train and evaluate a graph neural network model using:
bazel run -c opt //tasks/dataflow:train_ggnn -- \
--analysis reachability \
--path=$HOME/programl
where --analysis is the name of the analysis you want to evaluate, and --path is the root of the unpacked dataset. There are a lot of options that you can use to control the behavior of the experiment; see --helpfull for a full list. Some useful ones include:
- --batch_size controls the number of nodes in each batch of graphs.
- --layer_timesteps defines the layers of the GGNN model and the number of timesteps used for each.
- --learning_rate sets the initial learning rate of the optimizer.
- --lr_decay_rate sets the rate at which the learning rate decays.
- --lr_decay_steps sets the number of gradient steps between learning rate decays.
- --train_graph_counts lists the number of graphs to train on between runs of the validation set.
🏗️ Under construction: We are in the process of refactoring the dataflow experiments with a revamped API. There are currently bugs in the data loader which may affect training jobs; see #147.
Contributing
Patches, bug reports, and feature requests are welcome! Please use the issue tracker to file a bug report or ask a question. If you would like to help out with the code, please read this document.
Acknowledgements
Made with ❤️️ by Chris Cummins and Zach Fisches, with help from folks at the University of Edinburgh and ETH Zurich: Tal Ben-Nun, Torsten Hoefler, Hugh Leather, and Michael O'Boyle.
Funding sources: HiPEAC Travel Grant.
Download files
Hashes for programl-0.2.0-py3-none-manylinux2014_x86_64.whl
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3f8db1ddf11bd9e6cb7bc7d6bf477777f129e3fa0b9a2d45dedf4b9116d33e23 |
| MD5 | cf5260fc2b8f7d3beb19673fc65d1717 |
| BLAKE2b-256 | b9d15d7232bfbc25e4c88a60c190954c529cafa6dd89e7e93c0b070bbf02ab3f |
Hashes for programl-0.2.0-py3-none-macosx_10_15_x86_64.whl
| Algorithm | Hash digest |
|---|---|
| SHA256 | e1c48828ab927e67ed41f28612a37fbef0488d2d1ea86560f2e0806a5d0140dd |
| MD5 | c86a0b4511d0969df0efb9d39ace2a0c |
| BLAKE2b-256 | 282bb70754444fdc56b1a5008e3a054828312d0ee605d57f3efb7f3a502f7054 |