ASTAligner is designed to align tokens from source code snippets to Abstract Syntax Tree (AST) nodes using Tree-sitter for AST generation and various HuggingFace tokenizers for language tokenization. The library supports a wide range of programming languages and Fast tokenizers, enabling precise mapping between source code elements and their AST representations.
Project description
AST-Alignment Tool
Aligns the tokens from a code snippet to their corresponding nodes in an AST representation.
Description
A Large Language Model (LLM) is a type of AI model designed to understand and generate human-like text based on vast amounts of data. Trained on diverse source code datasets, LLMs can automate Software Engineering tasks such as code translation, code summarization, test-case generation, and code completion. A critical component of an LLM is its tokenizer, which breaks text down into smaller units, typically words or subwords, that the model can process. The tokenizer converts source code into a format the model can consume, so its behavior directly affects how code is processed and generated. In the context of interpretability for AI, post-hoc techniques such as ASTScore rely on alignment functions (phi) to match the tokens produced by an LLM's tokenizer with their corresponding nodes in the AST representation of a snippet.
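The alignment idea can be illustrated with a minimal sketch. This is not ASTAligner's API: it assumes Python's standard-library `ast` module as a stand-in for Tree-sitter and a simple regular expression as a stand-in for a Hugging Face tokenizer, and the helper names (`node_spans`, `token_spans`, `align`) are hypothetical. Each token is mapped to the smallest AST node whose character span fully contains it.

```python
import ast
import re


def node_spans(source):
    """Collect (start, end, node_type) spans for every positioned AST node.

    Assumption: a single-line ASCII snippet, so column offsets equal
    character offsets. Uses the stdlib parser, not Tree-sitter.
    """
    spans = []
    for node in ast.walk(ast.parse(source)):
        if hasattr(node, "col_offset") and hasattr(node, "end_col_offset"):
            spans.append((node.col_offset, node.end_col_offset, type(node).__name__))
    return spans


def token_spans(source):
    """Stand-in tokenizer: words/numbers and single non-space symbols."""
    return [(m.start(), m.end(), m.group()) for m in re.finditer(r"\w+|[^\w\s]", source)]


def align(source):
    """Map each token to the smallest AST node that fully contains its span."""
    nodes = node_spans(source)
    alignment = []
    for start, end, text in token_spans(source):
        covering = [n for n in nodes if n[0] <= start and end <= n[1]]
        best = min(covering, key=lambda n: n[1] - n[0]) if covering else None
        alignment.append((text, best[2] if best else None))
    return alignment


result = align("x = foo(1) + y")
for token, node_type in result:
    print(f"{token!r:8} -> {node_type}")
```

In this sketch, identifiers and literals land on leaf nodes (`Name`, `Constant`), while punctuation such as `=` or `(` falls back to the nearest enclosing node (`Assign`, `Call`). A real subword tokenizer can split one identifier into several tokens, all of which would map to the same node under this rule.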
Goals
This project has two goals:
(1) Create a library for aligning the tokens from a code snippet to their corresponding nodes in the AST representation.
(2) Create a tool to visualize the alignment of the tokens with their matching AST.
Additional Information
For more information regarding this project's background and dependencies, please refer to these readings:
(1) Evaluating and Explaining Large Language Models for Code Using Syntactic Structures
(2) Tree-Sitter Programming Language Parser
Installation
Use the package manager pip to install the backend dependencies needed for the AST-Alignment Tool. All required backend packages are listed in requirements.txt, which can be found in the base repository.
pip install -r /path/to/requirements.txt
Supported Features
- 11 supported languages
  - Python
  - C
  - C++
  - C#
  - Java
  - JavaScript
  - Ruby
  - HTML
  - Go
  - Kotlin
  - Rust
- 6 tokenizers
Library Usage
EXPLAIN HOW TO USE THE PYTHON LIBRARY
Contributing
License
Download files
File details
Details for the file astaligner-0.1.0.tar.gz.
File metadata
- Download URL: astaligner-0.1.0.tar.gz
- Upload date:
- Size: 8.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.12.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1c172a17c908f23b1a3e52e0ad312c99c6e50f546eb8d2c289be401d21393ccb |
| MD5 | 3c6ce79497f9e2523f72438e8a065808 |
| BLAKE2b-256 | 24c6e6060f9a9c0560ce45620a19c25d5528d374af6a7b8d810a61a6de6f0e74 |
File details
Details for the file ASTAligner-0.1.0-py3-none-any.whl.
File metadata
- Download URL: ASTAligner-0.1.0-py3-none-any.whl
- Upload date:
- Size: 7.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.12.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ca9ba1ed6524a8444aab206ea1f2913fc0d4369e4efb67f7f1bc6ff71bb4a702 |
| MD5 | 0de24ca03b0c0f4a601a791a97358e97 |
| BLAKE2b-256 | e02a17880146d91129dc79211aa7fc5531f4e655254b32d89b516da240b8f3ce |