
LLVM IR based Program Embeddings for Compiler Optimizations and Program Comprehension


IR2Vec

IR2Vec is an LLVM IR based framework that generates distributed representations of source code in an unsupervised manner. These representations can serve as the input to machine learning tasks that operate on programs.

This repo contains the source code and relevant information described in the paper (arXiv). Please see here for more details.

IR2Vec: LLVM IR Based Scalable Program Embeddings, S. VenkataKeerthy, Rohit Aggarwal, Shalini Jain, Maunendra Sankar Desarkar, Ramakrishna Upadrasta, and Y. N. Srikant



LLVM Version Archive

LLVM Version    Branch
LLVM 16.0.1     main
LLVM 14.0.1     llvm14
LLVM 12.0.0     llvm12
LLVM 10.0.1     llvm10
LLVM 8.0.1      llvm8


Installation

IR2Vec can be installed in two ways. Python users can install the prebuilt wheel with pip. If you want to integrate IR2Vec with a compiler pass, or need finer control over the build, you can build the C++ binary and libraries from source. The following sections describe both setups.

Python

If you prefer working with Python, you can easily install IR2Vec using pip.

pip install -U ir2vec

Now, you can import and use IR2Vec in your Python projects. Make sure you have a good understanding of Python and its package management system.

We are actively working on improving the Python interfaces and providing better support. If you find any good-to-have interfaces that you may need for your use case missing, please feel free to raise a request.

Cpp

If you're a C++ developer and require low-level control, optimization, or integration with C++ projects, you can build IR2Vec from source. First, ensure the below requirements are satisfied, then follow the steps mentioned in the Building from source section.

Requirements

(Experiments are done on an Ubuntu 20.04 machine)

Building from source

  1. mkdir build && cd build
  2. IR2Vec uses the Eigen library. If your system already has Eigen (3.3.7) set up, you can skip this step.
    1. Download and extract the released version.
      • wget https://gitlab.com/libeigen/eigen/-/archive/3.3.7/eigen-3.3.7.tar.gz
      • tar -xvzf eigen-3.3.7.tar.gz
    2. mkdir eigen-build && cd eigen-build
    3. cmake ../eigen-3.3.7 && make
    4. cd ../
  3. cmake -DLT_LLVM_INSTALL_DIR=<path_to_LLVM_build_dir> -DEigen3_DIR=<path_to_eigen_build_dir> [-DCMAKE_INSTALL_PREFIX=<install_dir>] ../src
  4. make [&& make install]

This process generates the ir2vec binary under the build/bin directory, and libIR2Vec.a and libIR2Vec.so under the build/lib directory.

To verify the correctness of the build, run make verify-all
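As a quick sanity check before running the verification target, the sketch below reports which of the artifacts named above are missing from a build directory. The helper name missing_artifacts and the constant EXPECTED_ARTIFACTS are illustrative and not part of IR2Vec; the paths are simply the ones listed in this section.

```python
from pathlib import Path

# Artifact locations described above, relative to the build directory.
EXPECTED_ARTIFACTS = ("bin/ir2vec", "lib/libIR2Vec.a", "lib/libIR2Vec.so")


def missing_artifacts(build_dir):
    """Return the expected artifacts that are not present under build_dir."""
    root = Path(build_dir)
    return [rel for rel in EXPECTED_ARTIFACTS if not (root / rel).is_file()]
```

For example, calling missing_artifacts("build") from the directory that contains build/ returns an empty list when all three files exist.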

Generating program representations

IR2Vec can be used either as a stand-alone tool through the ir2vec binary, or integrated into third-party tools through its libraries. Usage instructions for both follow.

Using Binary

ir2vec -<mode> -vocab <seedEmbedding-file-path> -o <output-file> -level <p|f> -class <class-number> -funcName=<function-name> <input-ll-file>

Command-Line options

  • mode - can be one of sym/fa
    • sym denotes Symbolic representation
    • fa denotes Flow-Aware representation
  • vocab - the path to the seed embeddings file
  • o - file to which the embeddings are written (Note: if the file does not exist, it is created; otherwise the embeddings are appended to it)
  • level - can be one of chars p/f.
    • p denotes program level encoding
    • f denotes function level encoding
  • class - optional argument. Specifies the class label for classification tasks (to be used with level p). Defaults to -1. When it is not -1, the pass prints class-number followed by the corresponding embeddings
  • funcName - optional argument. Generates embeddings only for the functions with the given name. level should be f when using this option

Please use --help for further details.
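When the binary is driven from a script, it can help to assemble the argument list in one place. The following is a minimal sketch that uses only the options listed above; the function name build_ir2vec_command and its parameter names are hypothetical, and it assumes the ir2vec binary is reachable through the given name or path.

```python
def build_ir2vec_command(input_ll, vocab, output, mode="fa", level="p",
                         class_number=None, func_name=None, binary="ir2vec"):
    """Assemble an ir2vec invocation as an argument list (hypothetical helper)."""
    if mode not in ("sym", "fa"):
        raise ValueError("mode must be 'sym' or 'fa'")
    if level not in ("p", "f"):
        raise ValueError("level must be 'p' or 'f'")
    if func_name is not None and level != "f":
        raise ValueError("funcName requires level 'f'")

    cmd = [binary, f"-{mode}", "-vocab", vocab, "-o", output, "-level", level]
    if class_number is not None:
        # Omitting -class leaves the documented default of -1 in effect.
        cmd += ["-class", str(class_number)]
    if func_name is not None:
        cmd.append(f"-funcName={func_name}")
    cmd.append(input_ll)
    return cmd
```

The returned list can be passed to subprocess.run(cmd, check=True). Using a list rather than a shell string avoids quoting problems with paths that contain spaces.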

Format of the output embeddings in output_file

  • If the level is p:
<class-number> <Embeddings>

class-number would be printed only if it is not -1

  • If the level is f
<function-name> = <Embeddings>
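The two line formats above can be read back with a few lines of Python. This is a sketch under one assumption that is not stated in this section: the embedding values are whitespace-separated decimal numbers. The function names are illustrative only.

```python
def parse_program_line(line, has_class):
    """Parse a program-level line: '<class-number> <Embeddings>'.

    has_class must be False when the file was written with class -1,
    since the class number is then not printed.
    """
    tokens = line.split()
    if has_class:
        return int(tokens[0]), [float(t) for t in tokens[1:]]
    return None, [float(t) for t in tokens]


def parse_function_line(line):
    """Parse a function-level line: '<function-name> = <Embeddings>'."""
    # Split on the last '=' so that names containing '=' stay intact.
    name, sep, values = line.rpartition("=")
    if not sep:
        raise ValueError("expected '<function-name> = <Embeddings>'")
    return name.strip(), [float(t) for t in values.split()]
```

Splitting on the last "=" relies on the embedding values themselves never containing that character, which holds for plain numeric output.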

Flow-Aware Embeddings

For all functions

  • ir2vec -fa -vocab vocabulary/seedEmbeddingVocab.txt -o <output_file> -level <p|f> -class <class-number> <input_ll_file>

For a specific function

  • ir2vec -fa -vocab vocabulary/seedEmbeddingVocab.txt -o <output_file> -level f -class <class-number> -funcName=<function-name> <input_ll_file>

Symbolic Embeddings

For all functions

  • ir2vec -sym -vocab vocabulary/seedEmbeddingVocab.txt -o <output_file> -level <p|f> -class <class-number> <input_ll_file>

For a specific function

  • ir2vec -sym -vocab vocabulary/seedEmbeddingVocab.txt -o <output_file> -level f -class <class-number> -funcName=<function-name> <input_ll_file>

Using Libraries

The libraries can be installed by passing the installation location to the CMAKE_INSTALL_PREFIX flag during cmake, followed by make install. The interfaces are declared in IR2Vec.h. External projects can use IR2Vec through these interfaces by including IR2Vec.h from the installed location and linking against the library either statically or dynamically.

  • If the project does not use LLVM, LLVM dependencies have to be linked and included separately.
  • Please ensure that the IR2Vec libraries are compiled with compatible LLVM.
    • If you are getting errors, please recompile IR2Vec by passing the current LLVM install directory path to LT_LLVM_INSTALL_DIR during cmake.

The following template can be used to link the IR2Vec libraries in a CMake-based project.

set(IR2VEC_INSTALL_DIR "" CACHE PATH "IR2Vec installation directory")
include_directories("${IR2VEC_INSTALL_DIR}/include")
target_link_libraries(<your_executable_or_library> PUBLIC ${IR2VEC_INSTALL_DIR}/lib/<libIR2Vec.a or libIR2Vec.so>)

Then pass the location of IR2Vec's install prefix through the IR2VEC_INSTALL_DIR variable (that is, -DIR2VEC_INSTALL_DIR=<install_dir>) during cmake.

The following example snippet shows how to query the exposed vector representations.

#include "IR2Vec.h"

// Creating object to generate FlowAware representation
auto ir2vec =
    IR2Vec::Embeddings(<LLVM Module>, IR2Vec::IR2VecMode::FlowAware,
                       "./vocabulary/seedEmbeddingVocab.txt");

// Getting Instruction vectors corresponding to the instructions in <LLVM Module>
auto instVecMap = ir2vec.getInstVecMap();
// Access the generated vectors
for (auto instVec : instVecMap) {
  outs() << "Instruction : ";
  instVec.first->print(outs());
  outs() << ": ";

  for (auto val : instVec.second)
    outs() << val << "\t";
}

// Getting vectors corresponding to the functions in <LLVM Module>
auto funcVecMap = ir2vec.getFunctionVecMap();
// Access the generated vectors
for (auto funcVec : funcVecMap) {
  outs() << "Function : " << funcVec.first->getName() << "\n";
  for (auto val : funcVec.second)
    outs() << val << "\t";
}

// Getting the program vector
auto pgmVec = ir2vec.getProgramVector();
// Access the generated vector
for (auto val : pgmVec)
  outs() << val << "\t";

Using Python package (IR2Vec-Wheels)

Initialization - ir2vec.initEmbedding

Description: Initialize IR2Vec embedding for an LLVM IR file.

Parameters:

  • file_path: str - Path to the .ll or .bc file.
  • encoding_type: str - Choose fa (Flow-Aware) or sym (Symbolic).
  • level: str - Choose p for program-level or f for function-level.

Returns:

  • IR2VecObject: Initialized object for accessing embeddings.

Example:

import ir2vec
import numpy as np

initObj = ir2vec.initEmbedding("/path/to/file.ll", "fa", "p")

getProgramVector

Description: Gets the program-level vector representation.

Parameters: optional

Returns:

  • progVector: ndarray - The program-level embedding vector.

Example:

# Getting the program-level vector
progVector = initObj.getProgramVector()
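A common downstream use of program-level vectors is comparing two programs. The sketch below computes cosine similarity in plain Python; it is not part of the IR2Vec package, the name cosine_similarity is illustrative, and it only assumes that each vector can be iterated as a sequence of numbers (which holds for an ndarray).

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    u = [float(x) for x in u]
    v = [float(x) for x in v]
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention chosen for this sketch: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)
```

Given two objects created with initEmbedding for different files, their program vectors could be compared with cosine_similarity(objA.getProgramVector(), objB.getProgramVector()), where objA and objB are placeholder names.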

getFunctionVectors

Description: Gets function-level vectors for all functions in the LLVM IR file.

Parameters: optional

Returns:

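  • functionVectorMap: dict - A dictionary keyed by function name. As used in the full example at the end of this section, each value is itself a dictionary with the fields demangledName, actualName, and vector (an ndarray holding the function-level embedding).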

Example:

# Getting function-level vectors
functionVectorMap = initObj.getFunctionVectors()

getInstructionVectors

Description: Gets instruction-level vectors for all instructions in the LLVM IR file.

Parameters: optional

Returns:

  • instructionVectorsList: list - A list of lists, where each inner list holds the embedding vector of one instruction.

Example:

# Getting instruction-level vectors
instructionVectorsList = initObj.getInstructionVectors()

Example

  • The following code snippet contains an example to demonstrate the usage of the package.
import ir2vec
import numpy as np

# IR2Vec Python APIs can be used in two ways. As shown below.
initObj = ir2vec.initEmbedding("/path/to/file.ll", "fa", "p")

#Approach 1
progVector1 = ir2vec.getProgramVector(initObj)
functionVectorMap1 = ir2vec.getFunctionVectors(initObj)
instructionVectorsList1 = ir2vec.getInstructionVectors(initObj)

#Approach 2
progVector2 = initObj.getProgramVector()
functionVectorMap2 = initObj.getFunctionVectors()
instructionVectorsList2 = initObj.getInstructionVectors()

# Both the approaches would result in same outcomes
assert(np.allclose(progVector1,progVector2))

for fun, funcObj in functionVectorMap1.items():
    assert fun == funcObj["demangledName"]
    functionOutput1 = ir2vec.getFunctionVectors(
        initObj,
        funcObj["actualName"],
    )
    functionOutput2 = initObj.getFunctionVectors(
        funcObj["actualName"]
    )
    assert(np.allclose(functionOutput1[fun]["vector"],functionOutput2[fun]["vector"]))
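For feeding function-level embeddings to a model, it is often convenient to flatten the map into a list of names and a matching list of rows. The sketch below assumes the structure used in the example above, where each value of the map carries its embedding under the "vector" key; the helper name function_vectors_to_rows is hypothetical.

```python
def function_vectors_to_rows(function_vector_map):
    """Flatten a function-vector map into (names, rows).

    Assumes every value exposes its embedding under the "vector" key,
    as in the example above. Names are sorted for a stable row order.
    """
    names = sorted(function_vector_map)
    rows = [
        [float(x) for x in function_vector_map[name]["vector"]]
        for name in names
    ]
    return names, rows
```

With the objects from the example, names, rows = function_vectors_to_rows(functionVectorMap1) yields one row per function, and rows can be handed to np.array(rows) if a matrix is needed.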

Binaries, Libraries and Wheels - Artifacts

Binaries, Libraries (.a and .so), and whl files are autogenerated for every relevant check-in using GitHub Actions. Such generated artifacts are tagged along with the successful runs of Publish and Build Wheels actions.

Experiments

Note

The results in the experiment scripts and in the published version have not been updated for this branch, so experimental results obtained on this branch will differ from the published ones. For comparison, use the release corresponding to v0.1.0.

Citation

@article{VenkataKeerthy-2020-IR2Vec,
author = {VenkataKeerthy, S. and Aggarwal, Rohit and Jain, Shalini and Desarkar, Maunendra Sankar and Upadrasta, Ramakrishna and Srikant, Y. N.},
title = {{IR2Vec: LLVM IR Based Scalable Program Embeddings}},
year = {2020},
issue_date = {December 2020},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
volume = {17},
number = {4},
issn = {1544-3566},
url = {https://doi.org/10.1145/3418463},
doi = {10.1145/3418463},
journal = {ACM Trans. Archit. Code Optim.},
month = dec,
articleno = {32},
numpages = {27},
keywords = {heterogeneous systems, representation learning, compiler optimizations, LLVM, intermediate representations}
}

Contributions

Please feel free to raise issues to file a bug, pose a question, or initiate any related discussions. Pull requests are welcome :)

License

IR2Vec is released under a BSD 4-Clause License. See the LICENSE file for more details.
