A Source Code Tokenizer

sctokenizer

Supported languages: C, C++, Java, Python

How to install

pip install git+https://github.com/ngocjr7/sctokenizer

How to use

Call sctokenizer.tokenize_file directly:

import sctokenizer

tokens = sctokenizer.tokenize_file(filepath='tests/data/hello_world.cpp', lang='cpp')
for token in tokens:
    print(token)

Or create a new CppTokenizer:

from sctokenizer import CppTokenizer

tokenizer = CppTokenizer()  # this object can be reused for multiple source files
with open('tests/data/hello_world.cpp') as f:
    source = f.read()
# The file is closed here; tokenizing only needs the text.
tokens = tokenizer.tokenize(source)
for token in tokens:
    print(token)
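
The comment above notes that one tokenizer object can be used for multiple source files. A minimal sketch of that pattern follows. `tokenize_many` is a hypothetical helper, not part of sctokenizer; it only assumes that the object passed in exposes `tokenize(source)` as in the example above.

```python
from pathlib import Path


def tokenize_many(tokenizer, paths):
    """Tokenize several files with a single tokenizer object.

    `tokenizer` is anything with a `tokenize(source)` method, for example
    the CppTokenizer shown above. This helper is illustrative only and is
    not part of sctokenizer.
    """
    results = {}
    for path in paths:
        # Read the whole file as text, then hand the text to the tokenizer.
        source = Path(path).read_text()
        results[str(path)] = tokenizer.tokenize(source)
    return results


# Usage (assumes sctokenizer is installed and both files exist):
# from sctokenizer import CppTokenizer
# tokens_by_file = tokenize_many(CppTokenizer(), ['a.cpp', 'b.cpp'])
```

The result maps each path to its token list, so the tokenizer is constructed once however many files are processed.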

Or, better, use Source:

from sctokenizer import Source

src = Source.from_file('tests/data/hello_world.cpp', lang='cpp')
tokens = src.tokenize()
for token in tokens:
    print(token)

The result is a list of Token objects. Each Token has four attributes: token_value, token_type, line, and column. Printing the tokens of hello_world.cpp gives:

(#, TokenType.SPECIAL_SYMBOL, (1, 1))
(include, TokenType.KEYWORD, (1, 2))
(<, TokenType.OPERATOR, (1, 10))
(bits/stdc++.h, TokenType.IDENTIFIER, (1, 11))
(>, TokenType.OPERATOR, (1, 24))
(using, TokenType.KEYWORD, (3, 1))
(namespace, TokenType.KEYWORD, (3, 7))
(std, TokenType.IDENTIFIER, (3, 17))
(;, TokenType.SPECIAL_SYMBOL, (3, 20))
(int, TokenType.KEYWORD, (5, 1))
(main, TokenType.IDENTIFIER, (5, 5))
((, TokenType.SPECIAL_SYMBOL, (5, 9))
(), TokenType.SPECIAL_SYMBOL, (5, 10))
({, TokenType.SPECIAL_SYMBOL, (6, 1))
(cout, TokenType.IDENTIFIER, (7, 5))
(<<, TokenType.OPERATOR, (7, 11))
(", TokenType.SPECIAL_SYMBOL, (7, 13))
(Hello World, TokenType.STRING, (7, 14))
(", TokenType.SPECIAL_SYMBOL, (7, 25))
(;, TokenType.SPECIAL_SYMBOL, (7, 26))
(return, TokenType.KEYWORD, (8, 5))
(0, TokenType.CONSTANT, (8, 12))
(;, TokenType.SPECIAL_SYMBOL, (8, 13))
(}, TokenType.SPECIAL_SYMBOL, (9, 1))
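
Since each Token carries token_value, token_type, line, and column, the result list can be filtered or summarized with ordinary Python. The sketch below uses stand-in `Token` and `TokenType` classes so that it runs without the package installed. They mirror only the attribute names and type names shown above and are assumptions about the real classes, as are the helper functions `count_by_type` and `values_of_type`.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class TokenType(Enum):
    # Stand-in for sctokenizer's TokenType. Only the members seen in the
    # sample output are listed, and the numeric values are arbitrary.
    KEYWORD = 1
    IDENTIFIER = 2
    OPERATOR = 3
    SPECIAL_SYMBOL = 4
    STRING = 5
    CONSTANT = 6


@dataclass
class Token:
    # Stand-in mirroring the four documented attributes.
    token_value: str
    token_type: TokenType
    line: int
    column: int


def count_by_type(tokens):
    """Count how many tokens of each type appear in the list."""
    return Counter(token.token_type for token in tokens)


def values_of_type(tokens, token_type):
    """Return the values of all tokens with the given type, in order."""
    return [t.token_value for t in tokens if t.token_type == token_type]


# A few tokens copied from line 3 of the sample output above.
sample = [
    Token('using', TokenType.KEYWORD, 3, 1),
    Token('namespace', TokenType.KEYWORD, 3, 7),
    Token('std', TokenType.IDENTIFIER, 3, 17),
    Token(';', TokenType.SPECIAL_SYMBOL, 3, 20),
]

counts = count_by_type(sample)
print(counts[TokenType.KEYWORD])                      # 2
print(values_of_type(sample, TokenType.IDENTIFIER))   # ['std']
```

With the real package, the same two helpers should work on the list returned by tokenize, provided the attributes are accessible under the names listed above.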

TODO

  • Support other languages: PHP, MATLAB, JavaScript, TypeScript, ...
  • Auto-detect the language
  • Parse source into a tree of tokens?
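
Until language auto-detection is available, one possible workaround is to guess the lang argument from the file extension. This is only a sketch: `guess_lang` and the table below are hypothetical, and apart from 'cpp', which appears in the examples above, the lang strings are assumptions rather than confirmed values.

```python
from pathlib import Path

# Hypothetical extension table. 'cpp' is the value used in the examples
# above; the other lang strings are assumptions, not confirmed names.
EXTENSION_TO_LANG = {
    '.c': 'c',
    '.cpp': 'cpp',
    '.cc': 'cpp',
    '.java': 'java',
    '.py': 'python',
}


def guess_lang(filepath):
    """Return a lang string guessed from the file extension, or None."""
    return EXTENSION_TO_LANG.get(Path(filepath).suffix.lower())


print(guess_lang('tests/data/hello_world.cpp'))  # cpp
print(guess_lang('notes.txt'))                   # None
```

Extension matching is a heuristic; it cannot tell C from C++ in a .h file, so a None or ambiguous result should fall back to an explicit lang.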
