Aligning BPE and AST
Project description
code_tokenizers
This library is built on top of the awesome transformers and tree-sitter libraries. It provides a simple interface to align the tokens produced by a BPE tokenizer with the tokens produced by a tree-sitter parser.
Install
pip install code_tokenizers
How to use
The main interface of code_tokenizers is the CodeTokenizer class. You can use a pretrained BPE tokenizer from the popular transformers library, and a tree-sitter parser from the tree-sitter library.
To specify a CodeTokenizer using the gpt2 BPE tokenizer and the python tree-sitter parser, you can do:
from code_tokenizers.core import CodeTokenizer
py_tokenizer = CodeTokenizer.from_pretrained("gpt2", "python")
You can specify any pretrained BPE tokenizer from the huggingface hub or a local directory and the language to parse the AST for.
Now, we can tokenize some code:
from pprint import pprint
code = """
def foo():
print("Hello world!")
"""
encoding = py_tokenizer(code)
pprint(encoding, depth=1)
{'ast_ids': [...],
'attention_mask': [...],
'input_ids': [...],
'is_builtins': [...],
'is_internal_methods': [...],
'merged_ast': [...],
'offset_mapping': [...],
'parent_ast_ids': [...]}
And we can print out the associated AST types:
Note: Here the N/As are the tokens that are not part of the AST, such as the spaces and the newline characters. Their IDs are set to -1.
for ast_id, parent_ast_id in zip(encoding["ast_ids"], encoding["parent_ast_ids"]):
    if ast_id != -1:
        print(py_tokenizer.node_types[parent_ast_id], py_tokenizer.node_types[ast_id])
    else:
        print("N/A")
N/A
function_definition def
function_definition identifier
parameters (
N/A
N/A
N/A
N/A
call identifier
argument_list (
argument_list string
argument_list string
argument_list string
argument_list )
N/A
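To build intuition for how this alignment works, here is a minimal, self-contained sketch of the underlying idea: BPE tokenizers can report character offsets for each token (the offset_mapping field above), and a tree-sitter parse gives character spans for each AST node, so tokens can be matched to nodes by span overlap. This is a conceptual illustration only, not the internals of code_tokenizers; the function name and the example spans are hypothetical.

```python
# Conceptual sketch (NOT code_tokenizers internals): align BPE token
# character offsets with AST node character spans by overlap.

def align(token_offsets, node_spans):
    """For each (start, end) token offset, return the index of the first
    AST node span that overlaps it, or -1 if none does (e.g. whitespace)."""
    ids = []
    for t_start, t_end in token_offsets:
        match = -1
        for i, (n_start, n_end) in enumerate(node_spans):
            if t_start < n_end and n_start < t_end:  # half-open spans overlap
                match = i
                break
        ids.append(match)
    return ids

# For the text "def foo\n": tokens "def" (0,3), " foo" (3,7), "\n" (7,8).
# Hypothetical AST spans: "def" keyword node (0,3), "foo" identifier (4,7).
token_offsets = [(0, 3), (3, 7), (7, 8)]
node_spans = [(0, 3), (4, 7)]
print(align(token_offsets, node_spans))  # → [0, 1, -1]
```

The newline token overlaps no AST node span, so it gets -1, mirroring the N/A entries shown above.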