A Source Code Tokenizer

sctokenizer

Supported languages: C, C++, Java, Python, PHP

How to install

pip install sctokenizer

How to use

Tokenize a file directly with sctokenizer:

import sctokenizer

tokens = sctokenizer.tokenize_file(filepath='tests/data/hello_world.cpp', lang='cpp')
for token in tokens:
    print(token)

Or create a new CppTokenizer:

from sctokenizer import CppTokenizer

tokenizer = CppTokenizer()  # reusable across multiple source files
with open('tests/data/hello_world.cpp') as f:
    source = f.read()

# Tokenize after the file is closed; only the read needs the open handle
tokens = tokenizer.tokenize(source)
for token in tokens:
    print(token)

Or, better, use the Source class:

from sctokenizer import Source

src = Source.from_file('tests/data/hello_world.cpp', lang='cpp')
tokens = src.tokenize()
for token in tokens:
    print(token)

The result is a list of Token objects. Each Token has four attributes: token_value, token_type, line, and column:

(#, TokenType.SPECIAL_SYMBOL, (1, 1))
(include, TokenType.KEYWORD, (1, 2))
(<, TokenType.OPERATOR, (1, 10))
(bits/stdc++.h, TokenType.IDENTIFIER, (1, 11))
(>, TokenType.OPERATOR, (1, 24))
(using, TokenType.KEYWORD, (3, 1))
(namespace, TokenType.KEYWORD, (3, 7))
(std, TokenType.IDENTIFIER, (3, 17))
(;, TokenType.SPECIAL_SYMBOL, (3, 20))
(int, TokenType.KEYWORD, (5, 1))
(main, TokenType.IDENTIFIER, (5, 5))
((, TokenType.SPECIAL_SYMBOL, (5, 9))
(), TokenType.SPECIAL_SYMBOL, (5, 10))
({, TokenType.SPECIAL_SYMBOL, (6, 1))
(cout, TokenType.IDENTIFIER, (7, 5))
(<<, TokenType.OPERATOR, (7, 11))
(", TokenType.SPECIAL_SYMBOL, (7, 13))
(Hello World, TokenType.STRING, (7, 14))
(", TokenType.SPECIAL_SYMBOL, (7, 25))
(;, TokenType.SPECIAL_SYMBOL, (7, 26))
(return, TokenType.KEYWORD, (8, 5))
(0, TokenType.CONSTANT, (8, 12))
(;, TokenType.SPECIAL_SYMBOL, (8, 13))
(}, TokenType.SPECIAL_SYMBOL, (9, 1))
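
Because each Token exposes token_value, token_type, line, and column, the result is easy to post-process. The sketch below counts tokens per type and collects the identifiers. So that it runs without the package installed, it uses a hypothetical stand-in Token (a namedtuple with the same four attribute names) and plain strings for the types; both are assumptions for illustration. With the real library you would pass the list returned by tokenize, and token_type would be a TokenType member rather than a string.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in mirroring the four attributes listed above;
# the real Token class comes from sctokenizer.
Token = namedtuple("Token", ["token_value", "token_type", "line", "column"])

# A few tokens copied from the sample output (types as plain strings here).
tokens = [
    Token("int", "KEYWORD", 5, 1),
    Token("main", "IDENTIFIER", 5, 5),
    Token("(", "SPECIAL_SYMBOL", 5, 9),
    Token(")", "SPECIAL_SYMBOL", 5, 10),
    Token("cout", "IDENTIFIER", 7, 5),
    Token("<<", "OPERATOR", 7, 11),
]


def count_by_type(tokens):
    """Return a Counter mapping each token type to its frequency."""
    return Counter(t.token_type for t in tokens)


def values_of_type(tokens, token_type):
    """Return the values of all tokens of the given type, in source order."""
    return [t.token_value for t in tokens if t.token_type == token_type]


counts = count_by_type(tokens)
identifiers = values_of_type(tokens, "IDENTIFIER")
```

The same two helpers work on any object that has token_type and token_value attributes, so they should not need changes when switching from the stand-in to real tokens.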

TODO

  • Support other languages: MATLAB, JavaScript, TypeScript, ...
  • Auto-detect the language
  • Parse source into a tree of tokens?
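
Until language auto-detection is available, one possible workaround is to guess the language from the file extension before calling the tokenizer. The following is a minimal sketch, not part of sctokenizer: guess_lang and EXTENSION_TO_LANG are hypothetical names, and only the 'cpp' value is confirmed by the usage examples above. The other lang strings are assumptions and should be checked against the package before use.

```python
import os

# Assumed mapping from file extension to a `lang` string. Only 'cpp' appears
# in the usage examples; the other values are guesses, not the package's API.
EXTENSION_TO_LANG = {
    ".c": "c",
    ".cpp": "cpp",
    ".cc": "cpp",
    ".java": "java",
    ".py": "python",
    ".php": "php",
}


def guess_lang(filepath):
    """Guess a language name from the file extension, or None if unknown."""
    _, ext = os.path.splitext(filepath)
    return EXTENSION_TO_LANG.get(ext.lower())
```

The guessed value could then be passed as the lang argument, for example sctokenizer.tokenize_file(filepath=path, lang=guess_lang(path)), after handling the None case for unrecognized extensions.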
