
Aligning BPE and AST


code_tokenizers

This library is built on top of the awesome transformers and tree-sitter libraries. It provides a simple interface to align the tokens produced by a BPE tokenizer with the AST nodes produced by a tree-sitter parser.
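The core idea, matching each BPE token to the AST node that covers it via character offsets, can be sketched in plain Python. This is only an illustration of the alignment concept; the function and data below are invented for the example and are not the library's internal implementation:

```python
# Illustrative sketch: align BPE tokens to tree-sitter leaf nodes by
# overlapping character spans. All names and data here are hypothetical.

def align(bpe_offsets, node_spans):
    """For each BPE token (start, end), return the type of the first
    AST leaf node whose span overlaps it, or None if no node matches
    (e.g. for whitespace tokens)."""
    aligned = []
    for start, end in bpe_offsets:
        match = None
        for n_start, n_end, n_type in node_spans:
            if start < n_end and n_start < end:  # spans overlap
                match = n_type
                break
        aligned.append(match)
    return aligned

# The source "def foo(" tokenized by a BPE tokenizer as "def", " foo", "("
bpe_offsets = [(0, 3), (3, 7), (7, 8)]
# tree-sitter leaf nodes as (start, end, type) -- "(" omitted on purpose
node_spans = [(0, 3, "def"), (4, 7, "identifier")]

print(align(bpe_offsets, node_spans))  # ['def', 'identifier', None]
```

The `None` here plays the same role as the `-1` IDs the library uses for tokens that fall outside the AST.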

Install

pip install code_tokenizers

How to use

First, you need to make sure you have the tree-sitter grammars for the languages you want to use. To simplify this process, this library ships with a CLI tool that will download the grammars for you:

!download_grammars --help
usage: download_grammars [-h] [--languages LANGUAGES [LANGUAGES ...]]

Download Tree-sitter grammars

options:
  -h, --help                            show this help message and exit
  --languages LANGUAGES [LANGUAGES ...]
                                        Languages to download (default: all)

This downloads the grammars into the grammars directory inside the library's installation directory. Let’s continue this example with the Python grammar:

!download_grammars --languages python

Now, we can create a CodeTokenizer object:

from code_tokenizers.core import CodeTokenizer

py_tokenizer = CodeTokenizer.from_pretrained("gpt2", "python")

You can specify any pretrained BPE tokenizer from the Hugging Face Hub or a local directory, along with the language whose AST should be parsed.

Now, we can tokenize some code:

from pprint import pprint

code = """
def foo():
    print("Hello world!")
"""

encoding = py_tokenizer(code)
pprint(encoding, depth=1)
{'ast_ids': [...],
 'attention_mask': [...],
 'input_ids': [...],
 'offset_mapping': [...],
 'parent_ast_ids': [...]}

And we can print out the associated AST types:

Note: Here the N/As are the tokens that are not part of the AST, such as the spaces and the newline characters. Their IDs are set to -1.

for ast_id, parent_ast_id in zip(encoding["ast_ids"], encoding["parent_ast_ids"]):
    if ast_id != -1:
        print(py_tokenizer.node_types[parent_ast_id], py_tokenizer.node_types[ast_id])
    else:
        print("N/A")
N/A
function_definition def
function_definition identifier
parameters (
N/A
N/A
N/A
N/A
call identifier
argument_list (
argument_list string
argument_list string
argument_list string
argument_list )
N/A
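Because `ast_ids` runs parallel to `input_ids`, with `-1` marking tokens outside the AST, it is easy to compute simple structural statistics over a tokenized snippet. A small sketch of that convention, using hypothetical sample data rather than a real encoding:

```python
# Sketch: count how many BPE tokens fall inside vs. outside the AST,
# using the ast_ids convention (-1 = token not part of the AST).
# This list is hypothetical sample data, not output from the library.
ast_ids = [-1, 5, 12, 7, -1, -1, -1, -1, 12, 9, 3, 3, 3, 9, -1]

in_ast = sum(1 for i in ast_ids if i != -1)
print(f"{in_ast} of {len(ast_ids)} tokens are part of the AST")
# 9 of 15 tokens are part of the AST
```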
