A Source Code Tokenizer

sctokenizer

Supported languages: C, C++, Java, Python, PHP

How to install

pip install sctokenizer

How to use

Use sctokenizer:

import sctokenizer

tokens = sctokenizer.tokenize_file(filepath='tests/data/hello_world.cpp', lang='cpp')
for token in tokens:
    print(token)

Or create a CppTokenizer directly:

from sctokenizer import CppTokenizer

tokenizer = CppTokenizer()  # one tokenizer can be reused for multiple source files
with open('tests/data/hello_world.cpp') as f:
    source = f.read()

tokens = tokenizer.tokenize(source)
for token in tokens:
    print(token)

Or, more conveniently, use Source:

from sctokenizer import Source

src = Source.from_file('tests/data/hello_world.cpp', lang='cpp')
tokens = src.tokenize()
for token in tokens:
    print(token)

The result is a list of Token objects. Each Token has four attributes: token_value, token_type, line, and column. Printing the tokens for hello_world.cpp gives:

(#, TokenType.SPECIAL_SYMBOL, (1, 1))
(include, TokenType.KEYWORD, (1, 2))
(<, TokenType.OPERATOR, (1, 10))
(bits/stdc++.h, TokenType.IDENTIFIER, (1, 11))
(>, TokenType.OPERATOR, (1, 24))
(using, TokenType.KEYWORD, (3, 1))
(namespace, TokenType.KEYWORD, (3, 7))
(std, TokenType.IDENTIFIER, (3, 17))
(;, TokenType.SPECIAL_SYMBOL, (3, 20))
(int, TokenType.KEYWORD, (5, 1))
(main, TokenType.IDENTIFIER, (5, 5))
((, TokenType.SPECIAL_SYMBOL, (5, 9))
(), TokenType.SPECIAL_SYMBOL, (5, 10))
({, TokenType.SPECIAL_SYMBOL, (6, 1))
(cout, TokenType.IDENTIFIER, (7, 5))
(<<, TokenType.OPERATOR, (7, 11))
(", TokenType.SPECIAL_SYMBOL, (7, 13))
(Hello World, TokenType.STRING, (7, 14))
(", TokenType.SPECIAL_SYMBOL, (7, 25))
(;, TokenType.SPECIAL_SYMBOL, (7, 26))
(return, TokenType.KEYWORD, (8, 5))
(0, TokenType.CONSTANT, (8, 12))
(;, TokenType.SPECIAL_SYMBOL, (8, 13))
(}, TokenType.SPECIAL_SYMBOL, (9, 1))
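
Because every token carries its type and position, a token list is easy to summarise. The sketch below is illustrative only: it defines a stand-in Token class with the four attribute names listed above so that it runs without sctokenizer installed, and it uses plain strings where the library uses TokenType members. The helpers count_by_type and tokens_on_line are hypothetical and not part of the library.

```python
from collections import Counter
from dataclasses import dataclass


# Stand-in for sctokenizer's Token (an assumption for illustration):
# it only mirrors the four attribute names described above.
@dataclass
class Token:
    token_value: str
    token_type: str  # the real library uses TokenType enum members here
    line: int
    column: int


def count_by_type(tokens):
    """Return a Counter mapping each token_type to its number of tokens."""
    return Counter(t.token_type for t in tokens)


def tokens_on_line(tokens, line):
    """Return the token values on a 1-based source line, ordered by column."""
    on_line = [t for t in tokens if t.line == line]
    return [t.token_value for t in sorted(on_line, key=lambda t: t.column)]


# A few tokens copied from the hello_world.cpp output above.
sample = [
    Token("int", "KEYWORD", 5, 1),
    Token("main", "IDENTIFIER", 5, 5),
    Token("(", "SPECIAL_SYMBOL", 5, 9),
    Token(")", "SPECIAL_SYMBOL", 5, 10),
    Token("return", "KEYWORD", 8, 5),
    Token("0", "CONSTANT", 8, 12),
    Token(";", "SPECIAL_SYMBOL", 8, 13),
]

counts = count_by_type(sample)
print(counts["KEYWORD"])
print(tokens_on_line(sample, 5))
```

With real tokens, the same helpers should work as long as the objects expose token_type, token_value, line, and column as described above.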

TODO

  • Support other languages: MATLAB, JavaScript, TypeScript, ...
  • Auto-detect the language
  • Possibly parse source into a tree of tokens
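
Until language auto-detection exists, a caller could guess the lang argument from the file extension. The following is a minimal sketch of that idea, not the library's method: guess_lang and EXTENSION_TO_LANG are hypothetical names, and only the 'cpp' value is confirmed by the examples above; the other lang strings are assumptions that may not match what sctokenizer accepts.

```python
import os

# Hypothetical extension-to-language table. Only 'cpp' appears in the
# usage examples; every other value is an assumed lang string.
EXTENSION_TO_LANG = {
    ".c": "c",
    ".cpp": "cpp",
    ".cc": "cpp",
    ".cxx": "cpp",
    ".hpp": "cpp",
    ".java": "java",
    ".py": "python",
    ".php": "php",
}


def guess_lang(filepath):
    """Return a lang string guessed from the extension, or None if unknown."""
    ext = os.path.splitext(filepath)[1].lower()
    return EXTENSION_TO_LANG.get(ext)


print(guess_lang("tests/data/hello_world.cpp"))
print(guess_lang("notes.txt"))
```

A caller could pass the returned value as lang and fall back to asking the user when the result is None. Extension-based guessing is deliberately simple; it cannot tell a C header from a C++ header, for example.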
