Skip to main content

A Source Code Tokenizer

Project description

sctokenizer

A Source Code Tokenizer

Supports those languages: C, C++, Java, Python, PHP

How to install

pip install sctokenizer

How to use

Use sctokenizer:

import sctokenizer

tokens = sctokenizer.tokenize_file(filepath='tests/data/hello_world.cpp', lang='cpp')
for token in tokens:
    print(token)

Or create new CppTokenizer:

from sctokenizer import CppTokenizer

tokenizer = CppTokenizer() # this object can be used for multiple source files
with open('tests/data/hello_world.cpp') as f:
    source = f.read()
    tokens = tokenizer.tokenize(source)
    for token in tokens:
        print(token)

Or better solution:

from sctokenizer import Source

src = Source.from_file('tests/data/hello_world.cpp', lang='cpp')
tokens = src.tokenize()
for token in tokens:
    print(token)

Result is a list of Token. Each Token has four attributes including token_value, token_type, line, column:

(#, TokenType.SPECIAL_SYMBOL, (1, 1))
(include, TokenType.KEYWORD, (1, 2))
(<, TokenType.OPERATOR, (1, 10))
(bits/stdc++.h, TokenType.IDENTIFIER, (1, 11))
(>, TokenType.OPERATOR, (1, 24))
(using, TokenType.KEYWORD, (3, 1))
(namespace, TokenType.KEYWORD, (3, 7))
(std, TokenType.IDENTIFIER, (3, 17))
(;, TokenType.SPECIAL_SYMBOL, (3, 20))
(int, TokenType.KEYWORD, (5, 1))
(main, TokenType.IDENTIFIER, (5, 5))
((, TokenType.SPECIAL_SYMBOL, (5, 9))
(), TokenType.SPECIAL_SYMBOL, (5, 10))
({, TokenType.SPECIAL_SYMBOL, (6, 1))
(cout, TokenType.IDENTIFIER, (7, 5))
(<<, TokenType.OPERATOR, (7, 11))
(", TokenType.SPECIAL_SYMBOL, (7, 13))
(Hello World, TokenType.STRING, (7, 14))
(", TokenType.SPECIAL_SYMBOL, (7, 25))
(;, TokenType.SPECIAL_SYMBOL, (7, 26))
(return, TokenType.KEYWORD, (8, 5))
(0, TokenType.CONSTANT, (8, 12))
(;, TokenType.SPECIAL_SYMBOL, (8, 13))
(}, TokenType.SPECIAL_SYMBOL, (9, 1))

TODO

  • Support other languages: Matlab, Javascript, Typescript,...
  • Auto detect language
  • Parse source to a tree of tokens???

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sctokenizer-0.0.5.tar.gz (8.1 kB view details)

Uploaded Source

Built Distribution

sctokenizer-0.0.5-py3-none-any.whl (16.2 kB view details)

Uploaded Python 3

File details

Details for the file sctokenizer-0.0.5.tar.gz.

File metadata

  • Download URL: sctokenizer-0.0.5.tar.gz
  • Upload date:
  • Size: 8.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/45.2.0 requests-toolbelt/0.9.1 tqdm/4.48.2 CPython/3.7.6

File hashes

Hashes for sctokenizer-0.0.5.tar.gz
Algorithm Hash digest
SHA256 1bc967a58e9c55e2ad9b85e810af450ed0d2b9b04000a93d9a8faf228fbaa14d
MD5 40dcb3bc79c16ea07752005afa29f3ec
BLAKE2b-256 64f7448d4a1034582d5887c5f03c19b5ad02f29b90bcffc00bb954be546e87f3

See more details on using hashes here.

File details

Details for the file sctokenizer-0.0.5-py3-none-any.whl.

File metadata

  • Download URL: sctokenizer-0.0.5-py3-none-any.whl
  • Upload date:
  • Size: 16.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/45.2.0 requests-toolbelt/0.9.1 tqdm/4.48.2 CPython/3.7.6

File hashes

Hashes for sctokenizer-0.0.5-py3-none-any.whl
Algorithm Hash digest
SHA256 165dd63c6e1a979d7b320ff94c4790a879680fd31f9ee55e0e16c9416d42fbea
MD5 da14cc35e525cf5f36a8d06da1a9256f
BLAKE2b-256 8a6a1051f25ec35a12f85b19b093dd2913198d2e674622fe0a37840977f6ec08

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page