code-tokenize

Fast program tokenization and structural analysis in Python

Fast tokenization and structural analysis of any programming language in Python

Programming Language Processing (PLP) brings the capabilities of modern NLP systems to the world of programming languages. To achieve high performance, PLP systems often take advantage of the fully defined nature of programming languages. In particular, the syntactic structure of a program can be exploited to gain knowledge about it.

code.tokenize provides easy access to the syntactic structure of a program. The tokenizer converts a program into a sequence of program tokens ready for further end-to-end processing. By relating each token to an AST node, it is possible to extend the program representation easily with further syntactic information.
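
To illustrate what relating tokens to AST nodes means, here is a minimal sketch that pairs each token with its innermost enclosing AST node. It uses only Python's standard tokenize and ast modules, not code.tokenize or tree-sitter, so it is not the library's implementation. The function name tokens_with_ast_nodes is made up for this example, and the position comparison assumes ASCII-only source.

```python
import ast
import io
import tokenize

# Token kinds that carry no program text of interest for this sketch.
_SKIPPED = {
    tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
    tokenize.DEDENT, tokenize.ENDMARKER, tokenize.COMMENT,
}


def tokens_with_ast_nodes(source):
    """Return (token_text, ast_node_type) pairs for a Python source string.

    Hypothetical helper: each token is matched to the innermost AST node
    whose source span fully contains the token.
    """
    tree = ast.parse(source)
    # Keep only nodes that carry full position information.
    nodes = [n for n in ast.walk(tree)
             if getattr(n, "end_lineno", None) is not None]

    pairs = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in _SKIPPED:
            continue
        best, best_span = None, None
        for node in nodes:
            start = (node.lineno, node.col_offset)
            end = (node.end_lineno, node.end_col_offset)
            if not (start <= tok.start and tok.end <= end):
                continue  # node does not cover this token
            # ast.walk yields parents before children, so a covering node
            # that is nested in (or equal to) the current best is deeper.
            if best is None or (start >= best_span[0] and end <= best_span[1]):
                best, best_span = node, (start, end)
        pairs.append((tok.string, type(best).__name__ if best else None))
    return pairs


if __name__ == "__main__":
    for text, node_type in tokens_with_ast_nodes("x = foo(1)\n"):
        print(text, node_type)
```

Once each token knows its AST node, further syntactic information (parent node, statement type, and so on) can be attached to the token sequence without reparsing.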

Installation

The package is tested under Python 3. It can be installed via:

pip install code-tokenize

Usage

code.tokenize can tokenize nearly any program code in a few lines of code:

import code_tokenize as ctok

# Python
ctok.tokenize(
    '''
        def my_func():
            print("Hello World")
    ''',
lang = "python")

# Output: [def, my_func, (, ), :, #NEWLINE#, ...]

# Java
ctok.tokenize(
    '''
        public static void main(String[] args){
          System.out.println("Hello World");
        }
    ''',
lang = "java", 
syntax_error = "ignore")

# Output: [public, static, void, main, (, String, [, ], args, ), {, System, ...]

# JavaScript
ctok.tokenize(
    '''
        alert("Hello World");
    ''',
lang = "javascript", 
syntax_error = "ignore")

# Output: [alert, (, "Hello World", ), ;]

Supported languages

code.tokenize employs tree-sitter as a backend. Therefore, in principle, any language supported by tree-sitter is also supported by a tokenizer in code.tokenize.

For some languages, this library supports additional features that are not directly supported by tree-sitter. Therefore, we distinguish between three language classes and support the following language identifiers:

native: python
advanced: java
basic: javascript, go, ruby, cpp, c, swift, rust, ...

Languages in the native class support all features of this library and are extensively tested. advanced languages are tested but do not support the full feature set. Languages of the basic class are not tested and only support the feature set of the backend. They can still be used for tokenization and AST parsing.
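
The three classes can be captured in a small lookup, for example to decide how much to trust the output for a given language. This is a sketch only: support_class is a hypothetical helper, not part of code.tokenize, and it assumes that every identifier not listed as native or advanced falls into the basic class.

```python
# Identifiers taken from the class listing above.
NATIVE_LANGUAGES = {"python"}
ADVANCED_LANGUAGES = {"java"}


def support_class(lang: str) -> str:
    """Return 'native', 'advanced', or 'basic' for a language identifier.

    Hypothetical helper. Assumption: anything not explicitly listed is
    treated as 'basic', i.e. untested and limited to backend features.
    """
    lang = lang.strip().lower()
    if lang in NATIVE_LANGUAGES:
        return "native"
    if lang in ADVANCED_LANGUAGES:
        return "advanced"
    return "basic"


if __name__ == "__main__":
    for name in ("python", "java", "rust"):
        print(name, support_class(name))
```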

How to contribute

Is your language not natively supported by code.tokenize, or does the tokenization seem incorrect? Then change it!

While code.tokenize is developed mainly as a helper library for internal research projects, we welcome pull requests of any sort (whether a new feature or a bug fix).

Want to help test more languages? Our goal is to support as many languages as possible at a native level. However, languages at the basic level are completely untested. You can help by testing basic languages and reporting issues in the tokenization process!

Release history

0.2.0
- Major API redesign!
- CHANGE: AST parsing is now done by an external library: code_ast
- CHANGE: Visitor pattern instead of custom tokenizer
- CHANGE: Custom visitors for language dependent tokenization
0.1.0
- The first proper release
- CHANGE: Language specific tokenizer configuration
- CHANGE: Basic analyses of the program structure and token role
- CHANGE: Documentation
0.0.1
- Work in progress

Project Info

The goal of this project is to provide developers in the programming language processing community with easy access to program tokenization and AST parsing. It is currently developed as a helper library for internal research projects. Therefore, it will only be updated as needed.

Feel free to open an issue if anything unexpected happens.

Distributed under the MIT license. See LICENSE for more information.

This project was developed as part of our research related to:

@inproceedings{richter2022tssb,
  title={TSSB-3M: Mining single statement bugs at massive scale},
  author={Cedric Richter and Heike Wehrheim},
  booktitle={MSR},
  year={2022}
}

We thank the developers of the tree-sitter library. Without tree-sitter, this project would not be possible.

Project details

Release history

0.2.1 (this version): Jan 14, 2025
0.2.0: Jun 28, 2022
0.1.0: Jan 19, 2022
0.0.1.post1: Nov 1, 2021

Download files

Source Distribution

code_tokenize-0.2.1.tar.gz (13.9 kB)

Uploaded Jan 14, 2025 Source

Built Distribution

code_tokenize-0.2.1-py3-none-any.whl (17.4 kB)

Uploaded Jan 14, 2025 Python 3

File details

Details for the file code_tokenize-0.2.1.tar.gz.

File metadata

Download URL: code_tokenize-0.2.1.tar.gz
Upload date: Jan 14, 2025
Size: 13.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.0.1 CPython/3.10.14

File hashes

Hashes for code_tokenize-0.2.1.tar.gz
Algorithm	Hash digest
SHA256	`a563f6adade6b61a5dc82364fa0c71d86abefae40715bab249e6b30838c00c7d`
MD5	`49a58b585d6608380bbff468b9c1e3aa`
BLAKE2b-256	`69613d77da992c21126551d1d84e9525b14d6f4daf62a2e8b66cbb4d48a344f9`
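
A downloaded file can be checked against the SHA256 digest listed above with Python's standard hashlib module. The following is a minimal sketch; the helper names sha256_of_file and matches_sha256 are made up for this example, and the file path in the usage comment assumes the archive sits in the current directory.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_sha256(path, expected_hex):
    """Return True if the file's SHA256 digest equals the expected value."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Usage (assumes the source distribution was downloaded to this path):
# EXPECTED = "a563f6adade6b61a5dc82364fa0c71d86abefae40715bab249e6b30838c00c7d"
# print(matches_sha256("code_tokenize-0.2.1.tar.gz", EXPECTED))
```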

File details

Details for the file code_tokenize-0.2.1-py3-none-any.whl.

File metadata

Download URL: code_tokenize-0.2.1-py3-none-any.whl
Upload date: Jan 14, 2025
Size: 17.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.0.1 CPython/3.10.14

File hashes

Hashes for code_tokenize-0.2.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`f6ce40a6dbca0f729ad59269ca15c807cee22a769f70d664d66bac12f5c4f48f`
MD5	`2a14f9e41901730206e389d60bc7bf56`
BLAKE2b-256	`b901004614eca501f31f4bc33547e142cd51e0e1e84955694b2e043d025ccc27`
