Aligning BPE and AST
code_tokenizers
This library is built on top of the awesome transformers and tree-sitter libraries. It provides a simple interface to align the tokens produced by a BPE tokenizer with the tokens produced by a tree-sitter parser.
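Conceptually, the alignment works by relating each BPE token's character span to the spans of the AST's leaf nodes. Here is a minimal, self-contained sketch of that idea (this is an illustration only, not the library's actual implementation; the token offsets and node spans below are made up):

```python
# Toy sketch of BPE <-> AST alignment via character offsets.
# Spans are (start, end) character offsets with an exclusive end.

def align(token_offsets, node_spans):
    """Map each BPE token to the index of the AST leaf whose span
    contains it, or -1 if no leaf covers it (e.g. whitespace)."""
    ids = []
    for tok_start, tok_end in token_offsets:
        match = -1
        for i, (node_start, node_end) in enumerate(node_spans):
            if node_start <= tok_start and tok_end <= node_end:
                match = i
                break
        ids.append(match)
    return ids

# "def foo():" tokenized as ["def", " ", "foo", "(", ")", ":"]
token_offsets = [(0, 3), (3, 4), (4, 7), (7, 8), (8, 9), (9, 10)]
# AST leaves: "def", identifier "foo", "(", ")", ":"
node_spans = [(0, 3), (4, 7), (7, 8), (8, 9), (9, 10)]

print(align(token_offsets, node_spans))  # [0, -1, 1, 2, 3, 4]
```

The whitespace token maps to -1 because no AST leaf covers it, which mirrors the N/A tokens shown later in this walkthrough.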
Install
pip install code_tokenizers
How to use
First, you need to make sure you have the tree-sitter grammars for the languages you want to use. To simplify this process, this library ships with a CLI tool that will download the grammars for you:
!download_grammars --help
usage: download_grammars [-h] [--languages LANGUAGES [LANGUAGES ...]]
Download Tree-sitter grammars
options:
-h, --help show this help message and exit
--languages LANGUAGES [LANGUAGES ...]
Languages to download (default: all)
This will download the grammars to the grammars directory inside the directory where this library is installed. Let’s continue this example with the Python grammar:
!download_grammars --languages python
Now, we can create a CodeTokenizer object:
from code_tokenizers.core import CodeTokenizer
py_tokenizer = CodeTokenizer.from_pretrained("gpt2", "python")
You can specify any pretrained BPE tokenizer from the Hugging Face Hub or a local directory, along with the language whose AST you want to parse.
Now, we can tokenize some code:
from pprint import pprint
code = """
def foo():
print("Hello world!")
"""
encoding = py_tokenizer(code)
pprint(encoding, depth=1)
{'ast_ids': [...],
'attention_mask': [...],
'input_ids': [...],
'offset_mapping': [...],
'parent_ast_ids': [...]}
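The fields appear to be parallel, per-token lists: position i in ast_ids and parent_ast_ids describes the token at position i in input_ids. A quick sanity check on a toy encoding (made-up values for illustration, not real output from the library):

```python
# Hypothetical encoding with the same keys as above; values are invented.
encoding = {
    "input_ids": [9906, 1917, 0],
    "attention_mask": [1, 1, 1],
    "offset_mapping": [(0, 5), (5, 11), (11, 12)],
    "ast_ids": [3, -1, 5],
    "parent_ast_ids": [8, -1, 9],
}

# Every per-token list has exactly one entry per input token.
n = len(encoding["input_ids"])
assert all(len(v) == n for v in encoding.values())
print(n)  # 3
```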
And we can print out the associated AST types:
Note: Here the N/As are the tokens that are not part of the AST, such as spaces and newline characters; their IDs are set to -1.
for ast_id, parent_ast_id in zip(encoding["ast_ids"], encoding["parent_ast_ids"]):
    if ast_id != -1:
        print(py_tokenizer.node_types[parent_ast_id], py_tokenizer.node_types[ast_id])
    else:
        print("N/A")
N/A
function_definition def
function_definition identifier
parameters (
N/A
N/A
N/A
N/A
call identifier
argument_list (
argument_list string
argument_list string
argument_list string
argument_list )
N/A
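Because non-AST tokens are marked with -1, it is straightforward to filter an encoding down to only the syntax-bearing tokens. A self-contained sketch with toy data (the IDs below are invented, not output from the library):

```python
# Toy parallel lists; -1 in ast_ids marks tokens outside the AST
# (whitespace, newlines, etc.).
input_ids = [101, 202, 303, 404, 505]
ast_ids = [-1, 7, 7, -1, 12]

# Keep only the tokens that map onto an AST node.
syntax_tokens = [tok for tok, ast in zip(input_ids, ast_ids) if ast != -1]
print(syntax_tokens)  # [202, 303, 505]
```

The same pattern could be used, for example, to build a loss mask that ignores purely lexical tokens during training.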
Hashes for code_tokenizers-0.0.3-py3-none-any.whl

Algorithm | Hash digest
---|---
SHA256 | 52594f354d7c611ecaed1206a192633dfdd2ab138e10d77841f524cefff929fa
MD5 | 1b65f8bf27fd3dae94a4f4d196142f92
BLAKE2b-256 | aaa10f3cd52e07abf0abb312f3990aa8bd1de5f48dd0ef3cdfc6f09f70940a01