
Infrastructure to build LLVM IR-based datasets.


LLVM-IR Dataset Utilities

This repository contains utilities to construct large LLVM IR datasets from multiple sources.

Getting Started

To get started with the dataset construction utilities, we suggest using the packaged pipenv or poetry configuration to isolate the Python environment from your system installation and from other environments.

Pipenv

To get started with pipenv, run

pipenv install

or, if you want to install from the packaged lockfile,

pipenv sync

After that, activate the environment and install the dataset construction utilities into it:

pipenv shell && pip install .

If you want to develop the package, this becomes

pipenv shell && pip install -e .
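
As a quick sanity check (a suggestion, not part of the documented workflow), you can verify that the package is importable from the environment:

pipenv run python -c "import llvm_ir_dataset_utils"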

Poetry

To get started with poetry, run

poetry install

which will install the exact versions pinned in the packaged lockfile and install the dataset construction utilities in editable mode into the environment. To install only the dependencies, run

poetry install --no-root

To develop inside poetry's virtual environment, you can launch a shell with

poetry shell
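
Alternatively, if you prefer not to spawn a nested shell, commands can be prefixed with poetry run, which executes them inside the project's virtual environment (a standard poetry feature, shown here with a simple import check):

poetry run python -c "import llvm_ir_dataset_utils"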

Creating First Data

To create your first small batch of IR data, run the following from the root directory of the package:

python3 ./llvm_ir_dataset_utils/tools/corpus_from_description.py \
  --source_dir=/path/to/store/dataset/to/source \
  --corpus_dir=/path/to/store/dataset/to/corpus \
  --build_dir=/path/to/store/dataset/to/build \
  --corpus_description=./corpus_descriptions_test/manual_tree.json

Beware! You'll need a version of llvm-objcopy on your $PATH. If you are missing llvm-objcopy, an easy way to obtain it is to install an LLVM release from your preferred package manager, such as apt, dnf, or pacman, or to build LLVM from source, in which case only the LLVM project itself needs to be enabled during the build, i.e. -DLLVM_ENABLE_PROJECTS="llvm".
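
To quickly check whether a suitable llvm-objcopy is already available (plain shell, nothing specific to this package):

llvm-objcopy --version

If the command is not found, install LLVM through your package manager as described above; the exact package name that provides llvm-objcopy varies by distribution.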

You'll then find a set of .bc files in /path/to/store/dataset/to/corpus/tree, which you can convert into textual LLVM IR with llvm-dis, e.g. from inside the folder:

llvm-dis *.bc
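
If your llvm-dis build does not accept multiple input files at once, an equivalent shell loop works as well (a minor convenience, not required by the package):

for f in *.bc; do llvm-dis "$f"; done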

The final steps for loading this data into a dataloader are still to be described here.

Corpus Description

The basics of the corpus description format are still to be outlined here, so that it is easy to point the package at a new source.

IR Sources

The package contains a number of builders that extract IR from LLVM-based languages and build systems:

  • Individual projects (C/C++)
  • Rust crates
  • Spack packages
  • Autoconf
  • CMake
  • Julia packages
  • Swift packages
