Aligning BPE and AST
Project description
code_tokenizers
This library is built on top of the awesome transformers and tree-sitter libraries. It provides a simple interface to align the tokens produced by a BPE tokenizer with the tokens produced by a tree-sitter parser.
Install
pip install code_tokenizers
How to use
The main interface of code_tokenizers is the CodeTokenizer class. You can use a pretrained BPE tokenizer from the popular transformers library, and a tree-sitter parser from the tree-sitter library.
To specify a CodeTokenizer using the gpt2 BPE tokenizer and the python tree-sitter parser, you can do:
from code_tokenizers.core import CodeTokenizer
py_tokenizer = CodeTokenizer.from_pretrained("gpt2", "python")
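The tokenizer exposes a node_types attribute that maps the integer AST IDs used below back to tree-sitter node type names. As a quick sanity check that the parser loaded, you can peek at it; this is just an illustrative sketch, assuming node_types is a plain list (its integer indexing later in this README suggests it is):
# Inspect the AST node-type vocabulary used by the ast_ids below.
print(len(py_tokenizer.node_types))   # number of known node types
print(py_tokenizer.node_types[:10])   # a few example node type names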
You can specify any pretrained BPE tokenizer from the Hugging Face Hub or a local directory, along with the language whose AST you want to parse.
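For example, a JavaScript setup reusing a RoBERTa BPE vocabulary might look like the sketch below. The choices "roberta-base" and "javascript" are purely illustrative, and the snippet assumes the JavaScript tree-sitter grammar is available in your installation:
from code_tokenizers.core import CodeTokenizer

# Illustrative only: any Hub tokenizer name and any supported language should work.
js_tokenizer = CodeTokenizer.from_pretrained("roberta-base", "javascript")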
Now, we can tokenize some code:
from pprint import pprint
code = """
def foo():
    print("Hello world!")
"""
encoding = py_tokenizer(code)
pprint(encoding, depth=1)
{'ast_ids': [...],
'attention_mask': [...],
'input_ids': [...],
'is_builtins': [...],
'is_internal_methods': [...],
'merged_ast': [...],
'offset_mapping': [...],
'parent_ast_ids': [...]}
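Each field in the encoding is aligned per BPE token. In particular, offset_mapping holds character spans into the original source (assuming it follows the usual transformers (start, end) convention), so each token's text can be recovered by slicing code and paired with its AST type. A minimal sketch:
for (start, end), ast_id in zip(encoding["offset_mapping"], encoding["ast_ids"]):
    token_text = code[start:end]
    ast_type = py_tokenizer.node_types[ast_id] if ast_id != -1 else "N/A"
    print(repr(token_text), "->", ast_type)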
And we can print out the associated AST types:
Note: here the N/As are the tokens that are not part of the AST, such as the spaces and the newline characters. Their IDs are set to -1.
for ast_id, parent_ast_id in zip(encoding["ast_ids"], encoding["parent_ast_ids"]):
    if ast_id != -1:
        print(py_tokenizer.node_types[parent_ast_id], py_tokenizer.node_types[ast_id])
    else:
        print("N/A")
N/A
function_definition def
function_definition identifier
parameters (
N/A
N/A
N/A
N/A
call identifier
argument_list (
argument_list string
argument_list string
argument_list string
argument_list )
N/A
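The encoding also carries per-token flags named is_builtins and is_internal_methods. Going by their names alone (an assumption, not behaviour documented above), they mark tokens whose identifiers resolve to Python builtins or to methods defined within the parsed code. A hedged sketch for inspecting the builtin flags, assuming they are parallel to the other per-token fields:
# Assumption: is_builtins is a per-token flag aligned with offset_mapping.
for (start, end), flagged in zip(encoding["offset_mapping"], encoding["is_builtins"]):
    if flagged:
        print("builtin token:", repr(code[start:end]))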
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution: code_tokenizers-0.0.5.tar.gz
Built Distribution: code_tokenizers-0.0.5-py3-none-any.whl
File details
Details for the file code_tokenizers-0.0.5.tar.gz.
File metadata
- Download URL: code_tokenizers-0.0.5.tar.gz
- Upload date:
- Size: 13.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.10.8
File hashes
Algorithm | Hash digest
---|---
SHA256 | 796b0dda0555bd5aea0a87643c9a062c444f57f61c92881a326a5242b5b2cdb4
MD5 | 006549670fe1286de9c944fa8f746085
BLAKE2b-256 | b8830d9323f6ea7fe953594392c12dd6c99231ff8e17c06f98853bcb66e7115b
File details
Details for the file code_tokenizers-0.0.5-py3-none-any.whl.
File metadata
- Download URL: code_tokenizers-0.0.5-py3-none-any.whl
- Upload date:
- Size: 112.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.10.8
File hashes
Algorithm | Hash digest
---|---
SHA256 | b4ce9108c840370cc8dd2582dc9a45451722806c9c8a3226f783870bbd4a9074
MD5 | b5be15d77a1bdf4376309d092e4f4d78
BLAKE2b-256 | 1863ae91f45b305d413edd5f196658142778c5117385d4b633cb40ef8411cf2b