
Aligning BPE and AST


code_tokenizers

This library is built on top of the awesome transformers and tree-sitter libraries. It provides a simple interface to align the tokens produced by a BPE tokenizer with the AST nodes produced by a tree-sitter parser.
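The core idea, matching each BPE token to the AST node that covers it via character offsets, can be sketched in plain Python. This is only an illustration of the alignment concept; the function and data below are invented for the example and are not the library's internal implementation:

```python
# Illustrative sketch: align BPE tokens to tree-sitter leaf nodes by
# overlapping character spans. All names and data here are hypothetical.

def align(bpe_offsets, node_spans):
    """For each BPE token (start, end), return the type of the first
    AST leaf node whose span overlaps it, or None if no node matches
    (e.g. for whitespace tokens)."""
    aligned = []
    for start, end in bpe_offsets:
        match = None
        for n_start, n_end, n_type in node_spans:
            if start < n_end and n_start < end:  # spans overlap
                match = n_type
                break
        aligned.append(match)
    return aligned

# The source "def foo(" tokenized by a BPE tokenizer as "def", " foo", "("
bpe_offsets = [(0, 3), (3, 7), (7, 8)]
# tree-sitter leaf nodes as (start, end, type) -- "(" omitted on purpose
node_spans = [(0, 3, "def"), (4, 7, "identifier")]

print(align(bpe_offsets, node_spans))  # ['def', 'identifier', None]
```

The `None` here plays the same role as the `-1` IDs the library uses for tokens that fall outside the AST.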

Install

pip install code_tokenizers

How to use

First, you need to make sure you have the tree-sitter grammars for the languages you want to use. To simplify this process, this library ships with a CLI tool that will download the grammars for you:

!download_grammars --help
usage: download_grammars [-h] [--languages LANGUAGES [LANGUAGES ...]]

Download Tree-sitter grammars

options:
  -h, --help                            show this help message and exit
  --languages LANGUAGES [LANGUAGES ...]
                                        Languages to download (default: all)

This downloads the grammars into the grammars directory inside the library's installation directory. Let’s continue this example with the Python grammar:

!download_grammars --languages python

Now, we can create a CodeTokenizer object:

from code_tokenizers.core import CodeTokenizer

py_tokenizer = CodeTokenizer.from_pretrained("gpt2", "python")

You can specify any pretrained BPE tokenizer from the Hugging Face Hub or a local directory, along with the language whose AST should be parsed.

Now, we can tokenize some code:

from pprint import pprint

code = """
def foo():
    print("Hello world!")
"""

encoding = py_tokenizer(code)
pprint(encoding, depth=1)
{'ast_ids': [...],
 'attention_mask': [...],
 'input_ids': [...],
 'offset_mapping': [...],
 'parent_ast_ids': [...]}

And we can print out the associated AST types:

Note: Here the N/As are the tokens that are not part of the AST, such as the spaces and the newline characters. Their IDs are set to -1.

for ast_id, parent_ast_id in zip(encoding["ast_ids"], encoding["parent_ast_ids"]):
    if ast_id != -1:
        print(py_tokenizer.node_types[parent_ast_id], py_tokenizer.node_types[ast_id])
    else:
        print("N/A")
N/A
function_definition def
function_definition identifier
parameters (
N/A
N/A
N/A
N/A
call identifier
argument_list (
argument_list string
argument_list string
argument_list string
argument_list )
N/A
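Because `ast_ids` runs parallel to `input_ids`, with `-1` marking tokens outside the AST, it is easy to compute simple structural statistics over a tokenized snippet. A small sketch of that convention, using hypothetical sample data rather than a real encoding:

```python
# Sketch: count how many BPE tokens fall inside vs. outside the AST,
# using the ast_ids convention (-1 = token not part of the AST).
# This list is hypothetical sample data, not output from the library.
ast_ids = [-1, 5, 12, 7, -1, -1, -1, -1, 12, 9, 3, 3, 3, 9, -1]

in_ast = sum(1 for i in ast_ids if i != -1)
print(f"{in_ast} of {len(ast_ids)} tokens are part of the AST")
# 9 of 15 tokens are part of the AST
```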
