sctokenizer

A Source Code Tokenizer

Supported languages: C, C++, Java, Python, PHP

How to install

pip install sctokenizer

How to use

Call sctokenizer.tokenize_file directly:

import sctokenizer

tokens = sctokenizer.tokenize_file(filepath='tests/data/hello_world.cpp', lang='cpp')
for token in tokens:
    print(token)

Or create a new CppTokenizer:

from sctokenizer import CppTokenizer

tokenizer = CppTokenizer()  # this object can be reused for multiple source files
with open('tests/data/hello_world.cpp') as f:
    source = f.read()

tokens = tokenizer.tokenize(source)
for token in tokens:
    print(token)
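
Because one tokenizer object can be reused across source files, a small helper can run it over a batch of paths. The following is a minimal sketch, not part of sctokenizer: tokenize_many is a hypothetical helper, and it assumes only that the tokenizer exposes the tokenize(source) method shown above. A stand-in tokenizer and temporary files keep the sketch runnable without the package installed.

```python
import os
import tempfile


def tokenize_many(tokenizer, paths):
    """Tokenize several files with one tokenizer; return {path: tokens}."""
    results = {}
    for path in paths:
        with open(path, encoding="utf-8") as f:
            source = f.read()
        # Assumes only a tokenize(source) method, as in the README example.
        results[path] = tokenizer.tokenize(source)
    return results


class WhitespaceTokenizer:
    """Stand-in for illustration only; NOT sctokenizer's CppTokenizer."""

    def tokenize(self, source):
        return source.split()


# Demo on two temporary files so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    demo_paths = []
    for name, text in [("a.cpp", "int main"), ("b.cpp", "return 0 ;")]:
        path = os.path.join(tmp, name)
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
        demo_paths.append(path)
    demo_results = tokenize_many(WhitespaceTokenizer(), demo_paths)

for path, tokens in demo_results.items():
    print(os.path.basename(path), tokens)
```

With the package installed, passing CppTokenizer() in place of the stand-in should work the same way, since the helper relies only on tokenize(source).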

Or, better, use the Source class:

from sctokenizer import Source

src = Source.from_file('tests/data/hello_world.cpp', lang='cpp')
tokens = src.tokenize()
for token in tokens:
    print(token)

The result is a list of Token objects. Each Token has four attributes: token_value, token_type, line, and column:

(#, TokenType.SPECIAL_SYMBOL, (1, 1))
(include, TokenType.KEYWORD, (1, 2))
(<, TokenType.OPERATOR, (1, 10))
(bits/stdc++.h, TokenType.IDENTIFIER, (1, 11))
(>, TokenType.OPERATOR, (1, 24))
(using, TokenType.KEYWORD, (3, 1))
(namespace, TokenType.KEYWORD, (3, 7))
(std, TokenType.IDENTIFIER, (3, 17))
(;, TokenType.SPECIAL_SYMBOL, (3, 20))
(int, TokenType.KEYWORD, (5, 1))
(main, TokenType.IDENTIFIER, (5, 5))
((, TokenType.SPECIAL_SYMBOL, (5, 9))
(), TokenType.SPECIAL_SYMBOL, (5, 10))
({, TokenType.SPECIAL_SYMBOL, (6, 1))
(cout, TokenType.IDENTIFIER, (7, 5))
(<<, TokenType.OPERATOR, (7, 11))
(", TokenType.SPECIAL_SYMBOL, (7, 13))
(Hello World, TokenType.STRING, (7, 14))
(", TokenType.SPECIAL_SYMBOL, (7, 25))
(;, TokenType.SPECIAL_SYMBOL, (7, 26))
(return, TokenType.KEYWORD, (8, 5))
(0, TokenType.CONSTANT, (8, 12))
(;, TokenType.SPECIAL_SYMBOL, (8, 13))
(}, TokenType.SPECIAL_SYMBOL, (9, 1))
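
Once you have the token list, the attributes make simple post-processing easy, for example counting tokens per type or collecting all values of one type. The sketch below is illustrative and relies only on the token_value and token_type attributes named above. DemoToken and DemoTokenType are hypothetical stand-ins so the code runs without sctokenizer; count_token_types and tokens_of_type are helper names introduced here, not library functions.

```python
from collections import Counter, namedtuple
from enum import Enum


def count_token_types(tokens):
    """Count how many tokens of each token_type appear."""
    return Counter(tok.token_type for tok in tokens)


def tokens_of_type(tokens, token_type):
    """Return the token_value of every token with the given type."""
    return [tok.token_value for tok in tokens if tok.token_type == token_type]


# Stand-ins for illustration only; the real classes come from sctokenizer.
class DemoTokenType(Enum):
    KEYWORD = 1
    IDENTIFIER = 2
    SPECIAL_SYMBOL = 3


DemoToken = namedtuple("DemoToken", "token_value token_type line column")

# Mirrors the 'int main()' tokens from the sample output above.
demo_tokens = [
    DemoToken("int", DemoTokenType.KEYWORD, 5, 1),
    DemoToken("main", DemoTokenType.IDENTIFIER, 5, 5),
    DemoToken("(", DemoTokenType.SPECIAL_SYMBOL, 5, 9),
    DemoToken(")", DemoTokenType.SPECIAL_SYMBOL, 5, 10),
]

counts = count_token_types(demo_tokens)
print(counts[DemoTokenType.SPECIAL_SYMBOL])
print(tokens_of_type(demo_tokens, DemoTokenType.KEYWORD))
```

Both helpers are duck-typed, so they should accept the list returned by tokenize_file or Source.tokenize, assuming the real token_type values are hashable (as enum members normally are).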

TODO

  • Support other languages: MATLAB, JavaScript, TypeScript, ...
  • Auto-detect the language
  • Parse source into a tree of tokens (?)
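
Until language auto-detection is available, a caller can guess the lang argument from the file extension. This is a rough sketch under stated assumptions, not the library's approach: guess_lang and EXTENSION_TO_LANG are hypothetical names, and only the 'cpp' value is confirmed by the examples above. The other lang strings are guesses that should be checked against the installed version.

```python
import os

# Hypothetical mapping. Only 'cpp' appears in the README examples;
# the remaining lang values are assumptions.
EXTENSION_TO_LANG = {
    ".c": "c",
    ".cpp": "cpp",
    ".cc": "cpp",
    ".cxx": "cpp",
    ".hpp": "cpp",
    ".java": "java",
    ".py": "python",
    ".php": "php",
}


def guess_lang(filepath):
    """Return a lang string guessed from the extension, or None if unknown."""
    ext = os.path.splitext(filepath)[1].lower()
    return EXTENSION_TO_LANG.get(ext)


print(guess_lang("tests/data/hello_world.cpp"))
```

The result could then be passed as lang to tokenize_file or Source.from_file, with a fallback for the None case.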
