
Aligning BPE and AST

Project description

code_tokenizers

This library is built on top of the awesome transformers and tree-sitter libraries. It provides a simple interface for aligning the tokens produced by a BPE tokenizer with the nodes of the abstract syntax tree (AST) produced by a tree-sitter parser.
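The core idea can be sketched without the library itself: a BPE tokenizer reports each token's character span (its offset mapping), and a tree-sitter parse reports each AST node's character span, so a token can be assigned to the node that covers it. The function below is a hypothetical illustration of that alignment, not the library's actual implementation:

```python
# Conceptual sketch (NOT code_tokenizers' implementation): map each BPE
# token's (start, end) character span to the first AST node span that
# fully contains it, or -1 if no node does (e.g. whitespace).

def align_tokens_to_nodes(token_offsets, node_spans):
    ids = []
    for tok_start, tok_end in token_offsets:
        match = -1
        for i, (node_start, node_end) in enumerate(node_spans):
            if node_start <= tok_start and tok_end <= node_end:
                match = i
                break
        ids.append(match)
    return ids

# Toy data for "def foo": token 0 covers "def", token 1 covers "foo".
token_offsets = [(0, 3), (4, 7)]
node_spans = [(0, 3), (4, 7)]   # "def" keyword node, "foo" identifier node
print(align_tokens_to_nodes(token_offsets, node_spans))  # [0, 1]
```

A token spanning only the space between the two nodes, e.g. `(3, 4)`, would map to `-1`, which is how unaligned tokens surface later in the output.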

Install

pip install code_tokenizers

How to use

The main interface of code_tokenizers is the CodeTokenizer class. You can use a pretrained BPE tokenizer from the popular transformers library, and a tree-sitter parser from the tree-sitter library.

To specify a CodeTokenizer using the gpt2 BPE tokenizer and the python tree-sitter parser, you can do:

from code_tokenizers.core import CodeTokenizer

py_tokenizer = CodeTokenizer.from_pretrained("gpt2", "python")
(If neither PyTorch, TensorFlow, nor Flax is installed, transformers prints a warning that models won't be available. Only the tokenizer is needed here, so the warning can be ignored.)

You can specify any pretrained BPE tokenizer from the Hugging Face Hub or a local directory, along with the language to parse the AST for.

Now, we can tokenize some code:

from pprint import pprint

code = """
def foo():
    print("Hello world!")
"""

encoding = py_tokenizer(code)
pprint(encoding, depth=1)
{'ast_ids': [...],
 'attention_mask': [...],
 'input_ids': [...],
 'is_builtins': [...],
 'is_internal_methods': [...],
 'merged_ast': [...],
 'offset_mapping': [...],
 'parent_ast_ids': [...]}

And we can print out the associated AST types:

Note

Here the N/As are the tokens that are not part of the AST, such as spaces and newline characters; their IDs are set to -1.

for ast_id, parent_ast_id in zip(encoding["ast_ids"], encoding["parent_ast_ids"]):
    if ast_id != -1:
        print(py_tokenizer.node_types[parent_ast_id], py_tokenizer.node_types[ast_id])
    else:
        print("N/A")
N/A
function_definition def
function_definition identifier
parameters (
N/A
N/A
N/A
N/A
call identifier
argument_list (
argument_list string
argument_list string
argument_list string
argument_list )
N/A
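Because `ast_ids` and `offset_mapping` are parallel lists, the encoding can be post-processed without re-parsing. As a hedged sketch (the sample `encoding` and `node_types` below are made up for illustration, not real library output), here is how one might collect the source text of every token with a given AST type:

```python
# Hypothetical post-processing sketch: walk the parallel lists from a
# CodeTokenizer-style encoding and pull out the source text of tokens
# whose AST node type matches a target. The data below is fabricated.

def tokens_of_type(code, encoding, node_types, target="identifier"):
    out = []
    for ast_id, (start, end) in zip(encoding["ast_ids"],
                                    encoding["offset_mapping"]):
        if ast_id != -1 and node_types[ast_id] == target:
            out.append(code[start:end])
    return out

code = "def foo():"
node_types = ["def", "identifier", "("]   # toy node-type table
encoding = {
    "ast_ids": [0, 1, -1],                # third token is not in the AST
    "offset_mapping": [(0, 3), (4, 7), (7, 8)],
}
print(tokens_of_type(code, encoding, node_types))  # ['foo']
```

The same pattern works for `parent_ast_ids`, e.g. to group tokens by their enclosing `function_definition` or `call` node.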
