
Generate combined multi-code view graphs


Tree Sitter Multi Codeview Generator

Tree Sitter Multi Codeview Generator generates combined multi-code view graphs that can be used with various types of machine learning models (sequence models, graph neural networks, etc.). Parsing is done with tree-sitter, which is highly efficient and supports more than 40 languages. Currently, this repository supports codeviews for Java in over 40 possible combinations of codeviews. It has been structured so that support for other source code languages can be added easily. If you wish to add support for more languages, please refer to the contributing guide.

Comex

comex is a rebuild of Tree Sitter Multi Codeview Generator for easier invocation as a Python package. This rebuild also includes a CLI for easier usage. It isolates the logic for generating and combining codeviews from the other tasks involved in the IBM OSCP Project.

Installation

comex is published on the Python Package Index (PyPI) and can be installed via pip:

pip install comex

Note: You need to install Graphviz (dot) for the graph visualizations to be generated.
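Before generating visualizations, it can help to confirm that the Graphviz `dot` executable is reachable. The following is a minimal sketch using only the Python standard library; the helper name `graphviz_dot_available` is hypothetical and not part of comex.

```python
import shutil


def graphviz_dot_available() -> bool:
    """Return True if a `dot` executable is found on the current PATH."""
    # shutil.which returns the full path of the executable, or None if absent.
    return shutil.which("dot") is not None


if __name__ == "__main__":
    if graphviz_dot_available():
        print("Graphviz 'dot' found on PATH.")
    else:
        print("Graphviz 'dot' not found on PATH; install Graphviz first.")
```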


To set up comex for development from the source code in your Python environment:

pip install -r requirements-dev.txt

Note: Please clone the repository recursively so that submodules are set up correctly:

git clone --recursive {...}

The pip install -r requirements-dev.txt step performs an editable install, so comex is available throughout your environment (particularly relevant if you use conda or a similar tool). You can import from comex just like any other package, and changes to the source code take effect without any further manual steps.


Usage as a CLI

This is the recommended way to get started with comex, as it is the most user-friendly.

The arguments and options supported by the CLI are documented in its help output, which can be viewed by running:

comex --help

For example, to generate a combined CFG and DFG graph for a Java file, you can run:

comex --lang "java" --code-file ./test.java --graphs "cfg,dfg"
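If the CLI needs to be driven from a script, the same command can be assembled and launched with the standard library. This is only a sketch: it uses just the three flags shown in the example above, the helper names `build_comex_command` and `run_comex` are hypothetical, and running it assumes the `comex` executable is installed and on your PATH.

```python
import subprocess
from typing import List, Sequence


def build_comex_command(lang: str, code_file: str, graphs: Sequence[str]) -> List[str]:
    """Build the argument list for the comex CLI from the flags shown above."""
    # The --graphs flag takes a comma-separated list, e.g. "cfg,dfg".
    return ["comex", "--lang", lang, "--code-file", code_file, "--graphs", ",".join(graphs)]


def run_comex(lang: str, code_file: str, graphs: Sequence[str]) -> int:
    """Run comex as a subprocess and return its exit code (requires comex on PATH)."""
    completed = subprocess.run(build_comex_command(lang, code_file, graphs), check=False)
    return completed.returncode


if __name__ == "__main__":
    # Only print the command here; call run_comex(...) to actually execute it.
    print(" ".join(build_comex_command("java", "./test.java", ["cfg", "dfg"])))
```

Passing the arguments as a list avoids shell quoting issues with file paths that contain spaces.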

Usage as a Python Package

The comex package can be used by importing required drivers as follows:

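from comex.codeviews.combined_graph.combined_driver import CombinedDriver

# lang, code, output, and codeviews are placeholders supplied by the caller;
# each parameter is described below.
CombinedDriver(
    src_language=lang,
    src_code=code,
    output_file="output.json",
    graph_format=output,
    codeviews=codeviews
)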

In most cases the required combination can be obtained via the combined_driver module as shown above.

src_language: denotes one of the supported languages, currently "java" or "cs"

src_code: denotes the source code to be parsed

output_file: denotes the output file to which the generated graph is written

graph_format: denotes the format of the output graph. Currently supported formats are "dot" and "json". To generate both, pass "all"

codeviews: refers to the configuration passed for each codeview
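The parameter descriptions above suggest a simple pre-flight check before constructing a driver. The sketch below is not part of comex: the helper `check_driver_args` and the two constant sets are hypothetical, and the allowed values are copied from the descriptions above, so they may lag behind the installed version.

```python
# Allowed values taken from the parameter descriptions above (assumption:
# these match the installed comex version).
SUPPORTED_LANGUAGES = {"java", "cs"}
SUPPORTED_FORMATS = {"dot", "json", "all"}


def check_driver_args(src_language: str, graph_format: str) -> None:
    """Raise ValueError if either argument is outside the documented values."""
    if src_language not in SUPPORTED_LANGUAGES:
        raise ValueError(
            f"Unsupported src_language {src_language!r}; "
            f"expected one of {sorted(SUPPORTED_LANGUAGES)}"
        )
    if graph_format not in SUPPORTED_FORMATS:
        raise ValueError(
            f"Unsupported graph_format {graph_format!r}; "
            f"expected one of {sorted(SUPPORTED_FORMATS)}"
        )


if __name__ == "__main__":
    check_driver_args("java", "all")  # passes silently
    print("arguments look valid")
```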

Output Example:

Combined simple AST+CFG+DFG for a simple Java program that finds the maximum among 2 numbers:

Code Organization

The code is structured in the following way:

  1. For each code-view, the source code is first parsed using the tree-sitter parser, and then the code-view is generated. The tree_parser directory implements the Parser and ParserDriver, which provide functionality commonly required by all code-views. Language-specific features are implemented in the language-specific parsers, which are also placed in this directory.
  2. The codeviews directory contains the core logic for the various codeviews. Each codeview has a driver class and a codeview class; the codeview class is inherited and extended per language for code-views that require a language-specific implementation.
  3. The cli.py file is the CLI implementation. It is responsible for parsing the source code and generating the codeviews. The drivers can also be imported directly and used like a Python package.
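The driver/codeview split described in the list above can be illustrated with a toy sketch. Every class here (`ExampleCodeview`, `ExampleJavaCodeview`, `ExampleDriver`) is hypothetical and does not reflect the actual comex classes; the "parsing" is a stand-in that treats each non-empty line as a node.

```python
from typing import Dict, List, Type


class ExampleCodeview:
    """Hypothetical language-agnostic codeview: one node per non-empty line."""

    def __init__(self, src_code: str) -> None:
        self.src_code = src_code

    def node_labels(self) -> List[str]:
        return [line.strip() for line in self.src_code.splitlines() if line.strip()]


class ExampleJavaCodeview(ExampleCodeview):
    """Hypothetical language-specific extension: drops `//` line comments."""

    def node_labels(self) -> List[str]:
        return [label for label in super().node_labels() if not label.startswith("//")]


class ExampleDriver:
    """Hypothetical driver: selects the codeview class for a language and runs it."""

    _by_language: Dict[str, Type[ExampleCodeview]] = {"java": ExampleJavaCodeview}

    def __init__(self, src_language: str, src_code: str) -> None:
        # Fall back to the generic codeview when no language-specific one exists.
        codeview_cls = self._by_language.get(src_language, ExampleCodeview)
        self.codeview = codeview_cls(src_code)

    def run(self) -> List[str]:
        return self.codeview.node_labels()


if __name__ == "__main__":
    print(ExampleDriver("java", "// comment\nint x = 1;\n").run())
```

The point of the pattern is that callers only interact with the driver, while language-specific behavior lives in subclasses of the codeview.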

Testing

The repo is set up to automatically run CI tests on pull requests to the main and development branches. To test locally:

Run specific test

  • Say you wish to run the test_cfg function
  • The '[...]' part is formatted as [extension-filename]; drop it to run all tests in a file
  • --no-cov prevents the coverage report from being printed
pytest -k 'test_cfg[cs-test7]' --no-cov

Run all tests and get coverage report

pytest

Use the verbose output to analyze the deviation report produced by deepdiff. This helps to quickly identify differences from the gold file:

pytest -k 'test_cfg[cs-test7]' --no-cov -vv

Publishing

Make sure to bump the version in setup.cfg.

Then run the following commands:

rm -rf build dist
python setup.py sdist bdist_wheel

Then upload it to PyPI using twine (pip install twine if not installed):

twine upload dist/*

About the IBM OSCP Project

This tool was developed for research purposes as a part of the OSCP Project. Efficient representation of source code is essential for various software engineering tasks that use AI pipelines, such as code translation, code search, and code clone detection. Code representation aims to extract both the syntactic and semantic features of source code and represent them as a vector that can be readily used for downstream tasks. Multiple works attempt to encode code as sequential data in order to leverage state-of-the-art neural network models such as transformers, but this leads to a loss of information. Graphs are a natural representation for code, yet very few works (MVG-AAAI’22) have tried to represent the different code features obtained from different code views, such as the Program Dependency Graph and Data Flow Graph, as a multi-view graph. In this work, we want to explore more code views and their relevance to different code tasks, as well as leverage transformer models for multi-code view graphs. We believe such work will help to

  1. Establish influence of specific code views for common tasks
  2. Demonstrate how graphs can be combined with transformers
  3. Create re-usable models

Team

This tool is based on the ongoing joint research effort between IBM and Risha Lab at IIT Tirupati to explore the effects of different code representations on code based tasks involving:

