Flexible, ruleset-based tokenizer using regex.
lex2-py3
Simple tokenizer using regex.
lex2 is a library for lexical analysis (also called tokenization). Strings are analyzed using regular expressions (regex), as specified in user-defined rules. Mechanisms such as a dynamic ruleset stack provide some flexibility at runtime.
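To illustrate the idea of a ruleset stack, here is a minimal sketch that uses only Python's standard `re` module. It is not lex2's implementation or API: the `StackLexer` class and the names `push_ruleset`, `pop_ruleset`, `next_token`, and `tokenize` are hypothetical and exist only for this illustration. The lexer always matches against the ruleset on top of the stack, so pushing a ruleset changes which rules are active and popping restores the previous ones.

```python
import re
from typing import List, Optional, Tuple

Rule = Tuple[str, str]  # (identifier, regex pattern); hypothetical stand-in type


class StackLexer:
    """Toy lexer whose active rules are the ruleset on top of a stack."""

    def __init__(self, ruleset: List[Rule]) -> None:
        self._stack: list = []
        self.push_ruleset(ruleset)

    def push_ruleset(self, ruleset: List[Rule]) -> None:
        # Compile once on push so matching does not recompile patterns.
        self._stack.append([(name, re.compile(pat)) for name, pat in ruleset])

    def pop_ruleset(self) -> None:
        self._stack.pop()

    def next_token(self, text: str, pos: int) -> Optional[Tuple[str, str, int]]:
        """Return (identifier, matched text, new position), or None at end of input."""
        # Whitespace between tokens is skipped in this toy version.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            return None
        for name, regex in self._stack[-1]:
            m = regex.match(text, pos)
            if m and m.end() > pos:
                return name, m.group(), m.end()
        raise ValueError(f"no rule matches at offset {pos}")


DEFAULT_RULES = [("WORD", r"[a-zA-Z]+"), ("QUOTE", r'"')]
STRING_RULES = [("TEXT", r'[^"]+'), ("QUOTE", r'"')]


def tokenize(text: str) -> List[Tuple[str, str]]:
    """Switch to STRING_RULES between double quotes, then switch back."""
    lexer = StackLexer(DEFAULT_RULES)
    inside_string = False
    pos = 0
    tokens = []
    while True:
        result = lexer.next_token(text, pos)
        if result is None:
            break
        name, data, pos = result
        tokens.append((name, data))
        if name == "QUOTE":
            if inside_string:
                lexer.pop_ruleset()
            else:
                lexer.push_ruleset(STRING_RULES)
            inside_string = not inside_string
    return tokens
```

Calling `tokenize('say "hello world" now')` in this sketch yields the quoted part as a single `TEXT` token, because the `WORD` rule is not active while the string ruleset is on top of the stack.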
The library is written in platform-independent, pure Python 3. It deliberately avoids language-specific features, so that it is straightforward to port to other programming languages. Furthermore, the library is designed to let the end-user plug in any external regex engine of their choice while still offering a simple, unified interface.
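One common way to keep a lexer independent of the regex engine is to hide the engine behind a small interface. The sketch below shows that general pattern only; it is an assumption about how such a design can look, not a description of lex2's actual interface. The names `Matcher`, `StdlibMatcher`, `LiteralMatcher`, `match_at`, and `take` are hypothetical.

```python
import re
from typing import Optional, Protocol, Tuple


class Matcher(Protocol):
    """Hypothetical minimal interface a lexer could require from any engine."""

    def match_at(self, text: str, pos: int) -> Optional[Tuple[int, int]]:
        """Return the (start, end) span of a match beginning at pos, or None."""
        ...


class StdlibMatcher:
    """Adapter backed by Python's built-in re module."""

    def __init__(self, pattern: str) -> None:
        self._regex = re.compile(pattern)

    def match_at(self, text: str, pos: int) -> Optional[Tuple[int, int]]:
        m = self._regex.match(text, pos)
        return m.span() if m else None


class LiteralMatcher:
    """Adapter that uses no regex engine at all, only plain string comparison."""

    def __init__(self, literal: str) -> None:
        self._literal = literal

    def match_at(self, text: str, pos: int) -> Optional[Tuple[int, int]]:
        if text.startswith(self._literal, pos):
            return pos, pos + len(self._literal)
        return None


def take(matcher: Matcher, text: str, pos: int = 0) -> Optional[str]:
    """Calling code depends only on match_at, never on the engine behind it."""
    span = matcher.match_at(text, pos)
    return text[span[0]:span[1]] if span else None
```

Because `take` only calls `match_at`, an adapter around a third-party engine could be swapped in without changing the calling code.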
Getting Started
It is recommended to install the library from the Python Package Index (PyPI) through Python's package manager pip:
pip install lex2
However, you can also choose to manually include the library in your project by downloading a release on GitHub and copying the lex2 folder to your project's includes/libraries folder.
Usage of lex2 is relatively simple, as demonstrated by the short example below. For more in-depth examples, including how to use an external regex engine of your choice, see the documentation.
import lex2

# Define ruleset and prepare the lexer object instance
ruleset: lex2.ruleset_t = [
    #         Identifier     Regex pattern
    lex2.Rule("WORD",        r"[a-zA-Z]+"),
    lex2.Rule("NUMBER",      r"[0-9]+"),
    lex2.Rule("PUNCTUATION", r"[.,:;!?\\-]")
]
lexer: lex2.ILexer = lex2.MakeLexer(ruleset=ruleset)

# Load input data by opening a file
lexer.Open(r"C:/path/to/file.txt")
# Or by directly passing a string
lexer.Load("The quick, brown fox jumps over 2 lazy dogs. \nMr. Jock, TV quiz PhD, bags few lynx.")

# Main tokenization loop
token: lex2.Token
while True:
    # Find the next token in the textstream
    try:
        token = lexer.GetNextToken()
    except lex2.excs.EndOfData:
        break
    info = [
        "ln: {}".format(token.position.ln +1),
        "col: {}".format(token.position.col+1),
        token.id,
        token.data,
    ]
    print("{: <12} {: <15} {: <20} {: <20}".format(*info))

lexer.Close()
>>> ln: 1        col: 1          WORD                 The
>>> ln: 1        col: 5          WORD                 quick
>>> ln: 1        col: 10         PUNCTUATION          ,
>>> ln: 1        col: 12         WORD                 brown
>>> ln: 1        col: 18         WORD                 fox
>>> ln: 1        col: 22         WORD                 jumps
>>> ln: 1        col: 28         WORD                 over
>>> ln: 1        col: 33         NUMBER               2
>>> ln: 1        col: 35         WORD                 lazy
>>> ln: 1        col: 40         WORD                 dogs
>>> ln: 1        col: 44         PUNCTUATION          .
>>> ln: 2        col: 1          WORD                 Mr
>>> ln: 2        col: 3          PUNCTUATION          .
>>> ln: 2        col: 5          WORD                 Jock
>>> ln: 2        col: 9          PUNCTUATION          ,
>>> ln: 2        col: 11         WORD                 TV
>>> ln: 2        col: 14         WORD                 quiz
>>> ln: 2        col: 19         WORD                 PhD
>>> ln: 2        col: 22         PUNCTUATION          ,
>>> ln: 2        col: 24         WORD                 bags
>>> ln: 2        col: 29         WORD                 few
>>> ln: 2        col: 33         WORD                 lynx
>>> ln: 2        col: 37         PUNCTUATION          .
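To make the line and column values in the output above concrete, the following sketch reproduces the same kind of rule-based tokenization with only Python's standard `re` module. It is an illustration under stated assumptions, not lex2's implementation: the `Token` tuple and the `tokenize` function are hypothetical, whitespace is skipped between tokens, and rules are assumed never to match across a newline. Positions are zero-based, which is why the example above adds 1 before printing.

```python
import re
from typing import Iterator, List, NamedTuple, Tuple


class Token(NamedTuple):
    id: str
    data: str
    ln: int   # zero-based line index
    col: int  # zero-based column index


def tokenize(text: str, ruleset: List[Tuple[str, str]]) -> Iterator[Token]:
    """Yield tokens by trying each rule, in order, at the current position."""
    compiled = [(name, re.compile(pattern)) for name, pattern in ruleset]
    pos = ln = col = 0
    while pos < len(text):
        ch = text[pos]
        if ch == "\n":
            # A newline starts a new line and resets the column counter.
            pos, ln, col = pos + 1, ln + 1, 0
            continue
        if ch.isspace():
            pos, col = pos + 1, col + 1
            continue
        for name, regex in compiled:
            m = regex.match(text, pos)
            if m and m.end() > pos:
                yield Token(name, m.group(), ln, col)
                col += m.end() - pos
                pos = m.end()
                break
        else:
            # No rule matched at this position.
            raise ValueError(f"no rule matches at line {ln + 1}, column {col + 1}")


RULESET = [
    ("WORD", r"[a-zA-Z]+"),
    ("NUMBER", r"[0-9]+"),
    ("PUNCTUATION", r"[.,:;!?\\-]"),
]
SAMPLE = "The quick, brown fox jumps over 2 lazy dogs. \nMr. Jock, TV quiz PhD, bags few lynx."

tokens = list(tokenize(SAMPLE, RULESET))
for tok in tokens:
    print("ln: {:<4} col: {:<4} {:<12} {}".format(tok.ln + 1, tok.col + 1, tok.id, tok.data))
```

Rule order matters in this sketch: the first rule that matches at the current position wins, so more specific rules should come before more general ones.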
Contributing
The repository is hosted at deltarazero/liblex2-py3 on GitHub. Contributions are always welcome. You can contribute in one of the following ways:
- Submitting a pull request: to contribute your own changes to the repository. See "Proposing changes to your work with pull requests" for more information on pull requests using GitHub. Furthermore, please follow the guidelines below:
  - File an issue to notify the maintainers about what you're working on.
  - Fork the repo, develop and test your code changes, and add docs/unit tests (if applicable).
  - Make sure that your commit messages clearly describe the changes.
  - Send a pull request, using the available template.
  For changes that address core functionality or would require breaking changes (i.e. for a major release), it's best to open an issue to discuss your proposal beforehand.
  Maintaining your own fork of the repository is discouraged. Instead, please submit pull requests and delete your fork afterwards (if applicable). This makes it easier for end-users to know which repository is the most up-to-date.
- Submitting an issue: to report a problem with the library, request a new feature, or discuss potential changes before a pull request is created. Ensure the issue has not already been reported. Furthermore, please use one of the available issue templates if possible.
License
© 2020-2021 DeltaRazero. All rights reserved.
All included scripts, modules, etc. are licensed under the terms of the zlib license, unless stated otherwise in the respective files.