ASTAligner is designed to align tokens from source code snippets to Abstract Syntax Tree (AST) nodes using Tree-sitter for AST generation and various HuggingFace tokenizers for language tokenization. The library supports a wide range of programming languages and Fast tokenizers, enabling precise mapping between source code elements and their AST representations.

AST-Alignment Tool

Aligns the tokens from a code snippet to their corresponding nodes in an AST representation.

Description

A Large Language Model (LLM) is a type of AI model designed to understand and generate human-like text based on vast amounts of data. Trained on diverse source code datasets, LLMs can automate Software Engineering tasks such as code translation, code summarization, test-case generation, and code completion. A critical component of an LLM is the tokenizer, which breaks text down into smaller units, typically words or subwords. This step is essential because it converts source code into a format the model can process. In the context of Interpretability for AI, post-hoc techniques such as ASTScore rely on alignment functions (phi) to match the tokens generated by an LLM's tokenizer with their corresponding nodes in the AST representation of a snippet. ASTAligner is a library for performing this alignment.

Goals

This project has two goals:

(1) Create a library for aligning the tokens from a code snippet to their corresponding nodes in the AST representation

(2) Create a tool to visualize the alignment of the tokens with their matching AST.

Additional Information

For more information regarding this project's background and dependencies, please refer to these readings:

(1) Evaluating and Explaining Large Language Models for Code Using Syntactic Structures

(2) Tree-Sitter Programming Language Parser

(3) Hugging Face Tokenizer

Installation

Use the package manager pip to install the ASTAligner package.

pip install ASTAligner

Supported Features

Library Usage

ASTalign

Using the ASTalign method asks that the user provide:

  • A snippet of code as a string, or a filepath to a text file containing code.
  • The language of the code snippet as one of the following strings:
    • python
    • c
    • cpp
    • csharp
    • java
    • javascript
    • ruby
    • html
    • go
    • kotlin
    • rust
    • haskell
  • A tokenizer specification as one of the following strings:
    • codellama
    • gpt2
    • bert-base-uncased
    • roberta-base
    • dialogpt
    • qwen
  • Alternatively, an AutoTokenizer-compliant name or path, which gives access to tokenizers available through the Hugging Face model hub.
  • (Optional) A tokenizer object can be passed in the tokenizer field instead of a string when using the library through PyPI. The preset tokenizer strings are recommended, as custom tokenizers are not guaranteed to work as intended.
  • (Optional) include_whitespace_and_special_tokens flag. Defaults to False. It controls whether whitespace and special tokens are shown in the returned tokens.

The method returns a dictionary that maps each TSTree node to the list of tokens from the code snippet that overlap with that node. Example usage:

alignments[node] yields [tok1, tok2, ... , tokn]
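The overlap rule described above can be illustrated without the library. The following is a minimal sketch, not ASTAligner's implementation: the Span class, the overlaps and align helpers, and the hand-written character spans are all assumptions that stand in for Tree-sitter nodes and tokenizer offsets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A labelled half-open character span [start, end) in the snippet (hypothetical type)."""
    label: str
    start: int
    end: int


def overlaps(a: Span, b: Span) -> bool:
    # Two half-open spans overlap when each one starts before the other ends.
    return a.start < b.end and b.start < a.end


def align(nodes, tokens):
    """Map each node span to the labels of the token spans that overlap it."""
    return {node: [tok.label for tok in tokens if overlaps(node, tok)] for node in nodes}


code = "x = y + z"
# Hand-written spans standing in for AST nodes (assumed, not parsed by Tree-sitter).
nodes = [
    Span("assignment", 0, 9),
    Span("identifier", 0, 1),
    Span("binary_operator", 4, 9),
]
# One single-character token per non-space character of the snippet.
tokens = [Span(code[i], i, i + 1) for i in (0, 2, 4, 6, 8)]

alignments = align(nodes, tokens)
for node, toks in alignments.items():
    print(node.label, toks)
```

Under these assumed spans, the binary_operator node collects the tokens y, + and z, which mirrors the shape of the dictionary that ASTalign is described as returning.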

printAlignmentsTree

The printAlignmentsTree method recursively prints out an entire tree with the provided node as the root of the tree. The method prints the type of each node and the tokens that are aligned to it. This method returns nothing.

Using the method asks that the user provide:

  • A node inside the tree (such as the root node)
  • The alignments object returned by ASTalign

Example usage:

test = r"""x = y + z"""
alignments = ASTalign(test, 'python', "bert-base-uncased")
root = getRootNode(alignments)
printAlignmentsTree(root, alignments)

Output:

-> 0  'module'
      ['x', '=', 'y', '+', 'z']

    -> 1  'expression_statement'
          ['x', '=', 'y', '+', 'z']

        -> 2  'assignment'
              ['x', '=', 'y', '+', 'z']

            -> 3  'identifier'
                  ['x']

            -> 3  '='
                  ['=']

            -> 3  'binary_operator'
                  ['y', '+', 'z']

                -> 4  'identifier'
                      ['y']

                -> 4  '+'
                      ['+']

                -> 4  'identifier'
                      ['z']
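The recursive traversal behind this kind of output can be sketched as follows. This is an illustration under stated assumptions rather than the library's code: TreeNode is a hypothetical stand-in for a Tree-sitter node, format_tree is a hypothetical helper, and the indentation only approximates the layout shown above.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # eq=False keeps identity hashing so nodes can be dictionary keys
class TreeNode:
    """Hypothetical stand-in for a Tree-sitter node: a type plus child nodes."""
    type: str
    children: list = field(default_factory=list)


def format_tree(node, alignments, depth=0):
    """Return the lines for `node` and, recursively, for all of its descendants."""
    indent = "    " * depth
    lines = [
        f"{indent}-> {depth}  {node.type!r}",
        f"{indent}      {alignments.get(node, [])}",
    ]
    for child in node.children:
        lines.extend(format_tree(child, alignments, depth + 1))
    return lines


left, op, right = TreeNode("identifier"), TreeNode("+"), TreeNode("identifier")
root = TreeNode("binary_operator", [left, op, right])
alignments = {root: ["y", "+", "z"], left: ["y"], op: ["+"], right: ["z"]}

lines = format_tree(root, alignments)
print("\n".join(lines))
```

Each call handles one node and then recurses into its children with the depth increased by one, which is why any node, not only the root, can serve as the starting point.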

printAlignmentsNode

The printAlignmentsNode method prints out the type of the provided node and the tokens that are aligned to it. This method returns nothing.

Using the method asks that the user provide:

  • A node inside the tree (such as the root node)
  • The alignments object returned by the ASTalign method

Example Usage:

test = r"""x = y + z"""
alignments = ASTalign(test, 'python', "bert-base-uncased")
root = getRootNode(alignments)
printAlignmentsNode(root, alignments)

Output:

module
['x', '=', 'y', '+', 'z']

getRootNode

The getRootNode method returns the root node of the tree from the provided alignments object.

Using the method asks that the user provide:

  • An alignments object created by the ASTalign method

Example Usage:

test = r"""x = y + z"""
alignments = ASTalign(test, 'python', "bert-base-uncased")
root = getRootNode(alignments)
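One way a root can be recovered from an alignments dictionary is to follow parent links upward from any key. The sketch below assumes nodes expose a parent attribute, as Tree-sitter nodes do. FakeNode and find_root are hypothetical names, and this is not necessarily how getRootNode is implemented.

```python
class FakeNode:
    """Hypothetical stand-in for a Tree-sitter node; only type and parent are modelled."""

    def __init__(self, type_, parent=None):
        self.type = type_
        self.parent = parent


def find_root(alignments):
    """Climb parent links from an arbitrary aligned node until no parent remains."""
    node = next(iter(alignments))
    while node.parent is not None:
        node = node.parent
    return node


module = FakeNode("module")
statement = FakeNode("expression_statement", parent=module)
identifier = FakeNode("identifier", parent=statement)
alignments = {
    identifier: ["x"],
    statement: ["x", "=", "y"],
    module: ["x", "=", "y"],
}

root = find_root(alignments)
print(root.type)
```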

rangeFinder

The rangeFinder method returns an index range (start, end] for a TSTree node in a string.

Using the method asks that the user provide:

  • The range of a TSTree node.
  • A snippet of code as a string, or a filepath to a text file containing code.

Example usage:

If a node identifier_node corresponds to num in the code string snippet = "num = 1", then

rangeFinder(identifier_node.range, snippet)

yields tuple (0, 3).
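Tree-sitter reports node positions as (row, column) points, so a conversion to string indices is needed for multi-line snippets. The following sketch shows one such conversion. The helper names point_to_index and span_to_indices are hypothetical, and the code assumes ASCII input so that byte columns and character columns coincide; it is not taken from rangeFinder itself.

```python
def point_to_index(code: str, row: int, column: int) -> int:
    """Convert a zero-based (row, column) point into an offset within `code`."""
    lines = code.split("\n")
    # Every earlier line contributes its length plus one newline character.
    return sum(len(line) + 1 for line in lines[:row]) + column


def span_to_indices(code, start_point, end_point):
    """Convert a pair of (row, column) points into a (start, end) index tuple."""
    return (point_to_index(code, *start_point), point_to_index(code, *end_point))


snippet = "num = 1\ntotal = num + 2"
# "num" on the first line spans points (0, 0) to (0, 3).
first = span_to_indices(snippet, (0, 0), (0, 3))
# "total" on the second line spans points (1, 0) to (1, 5).
second = span_to_indices(snippet, (1, 0), (1, 5))

print(first, snippet[first[0]:first[1]])
print(second, snippet[second[0]:second[1]])
```

For a single-line snippet the column values already equal the string indices, which is consistent with the (0, 3) result in the example above.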

ASTtokenFinder

The ASTtokenFinder method takes an index range in a string of code (as a tuple), a code snippet, a language, and a tokenizer. It returns a dictionary that maps each node whose text overlaps with the range to the tokens aligned to that node.

Note that the method constructs an alignments dictionary from the provided code before selecting the target nodes from the resulting alignments.

Using the method asks that the user provide:

  • An index range in a code string as (start, end].
  • A snippet of code as a string, or a filepath to a text file containing code.
  • The language of the code snippet as a string (see ASTalign section for language strings).
  • A tokenizer specification as a string (see ASTalign section for tokenizer strings).
  • (Optional) include_whitespace_and_special_tokens flag. Defaults to False. It controls whether whitespace and special tokens are shown in the returned tokens.
  • (Optional) use_fast flag. Defaults to True. It controls whether the Fast implementation of the chosen tokenizer (if available) or the Slow implementation is used.

Example usage:

If a code string snippet = "num = 1" produces a tree of the form

| assignment_expr -> "num = 1"
--| identifier -> "num"
--| assignment_op -> "="
--| value -> "1"

then

ASTtokenFinder((0,3), snippet, language, tokenizer)

may yield

{assignment_expr : ['num', '=', '1'], identifier : ['num']}

as the text of the assignment_expr and identifier nodes overlaps the range (0, 3] in the code string.
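The selection step can be sketched in plain Python. The sketch makes several assumptions: node names are used as dictionary keys in place of node objects, the spans in node_spans are written by hand, ranges are treated as half-open [start, end) for simplicity, and nodes_in_range is a hypothetical helper rather than part of the library.

```python
def nodes_in_range(target, node_spans, alignments):
    """Keep the alignment entries whose node span overlaps the target range."""
    start, end = target
    return {
        name: alignments[name]
        for name, (node_start, node_end) in node_spans.items()
        if node_start < end and start < node_end
    }


snippet = "num = 1"
# Hand-written character spans for the example tree above (assumed values).
node_spans = {
    "assignment_expr": (0, 7),
    "identifier": (0, 3),
    "assignment_op": (4, 5),
    "value": (6, 7),
}
alignments = {
    "assignment_expr": ["num", "=", "1"],
    "identifier": ["num"],
    "assignment_op": ["="],
    "value": ["1"],
}

selected = nodes_in_range((0, 3), node_spans, alignments)
print(selected)
```

Only the two nodes that cover the characters of num survive the filter, while the operator and value nodes start after the target range ends.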

Contributing

Semeru Lab ASTAligner Team: Lillie Ayer, Cassie Baker, Daniel Biedron, Peter Buddendeck, Cristian Charette-Lopez, and Stephen Ramotowski

License

MIT
