Aligning BPE and AST
Project description
code_tokenizers
This library is built on top of the awesome transformers and tree-sitter libraries. It provides a simple interface to align the tokens produced by a BPE tokenizer with the tokens produced by a tree-sitter parser.
Install
pip install code_tokenizers
How to use
The main interface of code_tokenizers is the CodeTokenizer class. You can use a pretrained BPE tokenizer from the popular transformers library, and a tree-sitter parser from the tree-sitter library.
To specify a CodeTokenizer using the gpt2 BPE tokenizer and the python tree-sitter parser, you can do:
from code_tokenizers.core import CodeTokenizer
py_tokenizer = CodeTokenizer.from_pretrained("gpt2", "python")
You can specify any pretrained BPE tokenizer from the huggingface hub or a local directory and the language to parse the AST for.
Now, we can tokenize some code:
from pprint import pprint
code = """
def foo():
print("Hello world!")
"""
encoding = py_tokenizer(code)
pprint(encoding, depth=1)
{'ast_ids': [...],
'attention_mask': [...],
'input_ids': [...],
'is_builtins': [...],
'is_internal_methods': [...],
'merged_ast': [...],
'offset_mapping': [...],
'parent_ast_ids': [...]}
And we can print out the associated AST types:
Note: Here the N/As are the tokens that are not part of the AST, such as the spaces and the newline characters. Their IDs are set to -1.
for ast_id, parent_ast_id in zip(encoding["ast_ids"], encoding["parent_ast_ids"]):
    if ast_id != -1:
        print(py_tokenizer.node_types[parent_ast_id], py_tokenizer.node_types[ast_id])
    else:
        print("N/A")
N/A
function_definition def
function_definition identifier
parameters (
N/A
N/A
N/A
N/A
call identifier
argument_list (
argument_list string
argument_list string
argument_list string
argument_list )
N/A
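To build intuition for how this alignment works, here is a minimal, self-contained sketch of the underlying idea: BPE tokenizers can report character offsets for each token (the offset_mapping field above), and a tree-sitter parse gives character spans for each AST node, so tokens can be matched to nodes by span overlap. This is a conceptual illustration only, not the internals of code_tokenizers; the function name and the example spans are hypothetical.

```python
# Conceptual sketch (NOT code_tokenizers internals): align BPE token
# character offsets with AST node character spans by overlap.

def align(token_offsets, node_spans):
    """For each (start, end) token offset, return the index of the first
    AST node span that overlaps it, or -1 if none does (e.g. whitespace)."""
    ids = []
    for t_start, t_end in token_offsets:
        match = -1
        for i, (n_start, n_end) in enumerate(node_spans):
            if t_start < n_end and n_start < t_end:  # half-open spans overlap
                match = i
                break
        ids.append(match)
    return ids

# For the text "def foo\n": tokens "def" (0,3), " foo" (3,7), "\n" (7,8).
# Hypothetical AST spans: "def" keyword node (0,3), "foo" identifier (4,7).
token_offsets = [(0, 3), (3, 7), (7, 8)]
node_spans = [(0, 3), (4, 7)]
print(align(token_offsets, node_spans))  # → [0, 1, -1]
```

The newline token overlaps no AST node span, so it gets -1, mirroring the N/A entries shown above.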