Flexible, ruleset-based tokenizer using regex.

Project description

lex2-py3


lex2 is a library for lexical analysis (also called tokenization). Strings are analyzed using regular expressions (regex) defined in user-provided rules. Additional features, such as a dynamic ruleset stack, provide a degree of flexibility at runtime.
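To illustrate the core idea, a ruleset is essentially an ordered list of (identifier, pattern) pairs tried against the input in turn. The snippet below is a plain-Python sketch of that concept using the built-in `re` module, not lex2's internals:

```python
import re

# Illustrative only: a ruleset as an ordered list of (identifier, pattern) pairs.
ruleset = [
    ("WORD",   re.compile(r"[a-zA-Z]+")),
    ("NUMBER", re.compile(r"[0-9]+")),
]

def next_token(text, pos):
    """Try each rule at position `pos`; return (identifier, matched text, new pos)."""
    for identifier, pattern in ruleset:
        m = pattern.match(text, pos)
        if m:
            return identifier, m.group(), m.end()
    return None, None, pos + 1  # no rule matched; skip one character

text = "fox 42"
tokens = []
pos = 0
while pos < len(text):
    tok_id, data, pos = next_token(text, pos)
    if tok_id:
        tokens.append((tok_id, data))
# tokens == [("WORD", "fox"), ("NUMBER", "42")]
```

Rule order matters in this sketch: earlier rules win when several patterns could match at the same position.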

The library is written in platform-independent, pure Python 3 and is portable (it uses no language-specific features), making it straightforward to port to other programming languages. Furthermore, the library is designed so that the end-user can easily integrate any external regex engine of their choice through a simple, unified interface.
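The idea behind such a unified interface can be sketched as follows. Note that the class names below are hypothetical illustrations, not lex2's actual API; each regex engine is wrapped in an adapter exposing the same two operations:

```python
import re
from abc import ABC, abstractmethod

class IMatcher(ABC):
    """Hypothetical unified interface a regex engine adapter must implement."""
    @abstractmethod
    def compile(self, pattern: str) -> None: ...
    @abstractmethod
    def match(self, text: str, pos: int):
        """Return (matched_string, end_position), or None on no match."""

class StdRegexMatcher(IMatcher):
    """Adapter backed by Python's built-in `re` module."""
    def compile(self, pattern: str) -> None:
        self._regex = re.compile(pattern)
    def match(self, text: str, pos: int):
        m = self._regex.match(text, pos)
        return (m.group(), m.end()) if m else None

matcher = StdRegexMatcher()
matcher.compile(r"[0-9]+")
result = matcher.match("42 lazy dogs", 0)
# result == ("42", 2)
```

Because the lexer only ever talks to the interface, swapping in a different engine means writing one small adapter class rather than touching the tokenization logic.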

Getting Started

As per usual, you can install the library from the Python Package Index (PyPI) through pip:

pip install lex2

You can also choose to manually include the library in your project by cloning or downloading a snapshot of the repository from GitHub and copying the lex2 folder to your project's includes/libraries folder.

Usage of the library is relatively simple, as demonstrated by the short example below. For more in-depth examples, and for using a different regex engine of your choice, see the documentation.

import lex2

# Define ruleset and prepare the lexer object instance
ruleset: lex2.RulesetType = [
    #        Identifier     Regex pattern
    lex2.Rule("WORD",        r"[a-zA-Z]+"),
    lex2.Rule("NUMBER",      r"[0-9]+"),
    lex2.Rule("PUNCTUATION", r"[.,:;!?\-]")
]
lexer: lex2.ILexer = lex2.make_lexer()(ruleset)

# Load input data by opening a file
lexer.open(r"C:/path/to/file.txt")
# Or by directly passing a string
lexer.load("The quick, brown fox jumps over 2 lazy dogs. \nMr. Jock, TV quiz PhD, bags few lynx.")

# Main tokenization loop
token: lex2.Token
while True:

    # Find the next token in the text stream
    try:
        token = lexer.get_next_token()
    except lex2.excs.EOF:
        break

    info = [
        "ln: {}".format(token.pos.ln + 1),
        "col: {}".format(token.pos.col + 1),
        token.id,
        token.data,
    ]
    print("{: <12} {: <15} {: <20} {: <20}".format(*info))

lexer.close()
>>> ln: 1        col: 1          WORD                 The
>>> ln: 1        col: 5          WORD                 quick
>>> ln: 1        col: 10         PUNCTUATION          ,
>>> ln: 1        col: 12         WORD                 brown
>>> ln: 1        col: 18         WORD                 fox
>>> ln: 1        col: 22         WORD                 jumps
>>> ln: 1        col: 28         WORD                 over
>>> ln: 1        col: 33         NUMBER               2
>>> ln: 1        col: 35         WORD                 lazy
>>> ln: 1        col: 40         WORD                 dogs
>>> ln: 1        col: 44         PUNCTUATION          .
>>> ln: 2        col: 1          WORD                 Mr
>>> ln: 2        col: 3          PUNCTUATION          .
>>> ln: 2        col: 5          WORD                 Jock
>>> ln: 2        col: 9          PUNCTUATION          ,
>>> ln: 2        col: 11         WORD                 TV
>>> ln: 2        col: 14         WORD                 quiz
>>> ln: 2        col: 19         WORD                 PhD
>>> ln: 2        col: 22         PUNCTUATION          ,
>>> ln: 2        col: 24         WORD                 bags
>>> ln: 2        col: 29         WORD                 few
>>> ln: 2        col: 33         WORD                 lynx
>>> ln: 2        col: 37         PUNCTUATION          .

Development Dependencies

For development you will need the following dependencies:

  • Python:
    • Version 3.8+
    • Packages can be installed via requirements.txt, using the following command:
      pip install -r requirements.txt
      
  • Documentation (for diagrams via PlantUML)
    • Java
    • Graphviz

Contributing

The repository is hosted at deltarazero/lex2-py3 on GitHub. Contributions are always welcome; you can contribute in one of the following ways:

  • Submitting a pull request: to contribute your own changes to the repository. See "Proposing changes to your work with pull requests" for more information on pull requests using GitHub. Furthermore, please follow the guidelines below:

    • File an issue to notify the maintainers about what you're working on.
    • Fork the repo, develop and test your code changes, add docs/unit tests (if applicable).
    • Make sure that your commit messages clearly describe the changes.
    • Send a pull request, using the available template.

    For changes that address core functionality or would require breaking changes (i.e. for a major release), it's best to open an issue to discuss your proposal beforehand.

    Maintaining your own fork of the repository is discouraged. Instead, please submit pull requests and delete your fork afterwards (if applicable). This will make it less confusing for end-users to know which repository is the most up-to-date.

  • Submitting an issue: to report a problem with the library, request a new feature, or to discuss potential changes before a pull request is created. Ensure the issue was not already reported. Furthermore, please use one of the available issue templates if possible.

License

© 2020-2022 DeltaRazero. All rights reserved.

All included scripts, modules, etc. are licensed under the terms of the zlib license, unless stated otherwise in the respective files.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

lex2-1.1.0.tar.gz (27.6 kB)


Built Distributions

lex2-1.1.0-py3.8.egg (74.7 kB)


lex2-1.1.0-py3-none-any.whl (36.4 kB)


File details

Details for the file lex2-1.1.0.tar.gz.

File metadata

  • Download URL: lex2-1.1.0.tar.gz
  • Upload date:
  • Size: 27.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.8.10

File hashes

Hashes for lex2-1.1.0.tar.gz
Algorithm Hash digest
SHA256 169dd0c16868e4809c614cc8f5a6cf6a7ff58cbd7bfe98010330d6ea1b804b4c
MD5 dc085a933103581bc9553d66d5f19dd9
BLAKE2b-256 01a8f1de18a17a16226deb6a5cedd2beca3c20700ef5aef7529930ddeafbbbf7

See more details on using hashes here.

File details

Details for the file lex2-1.1.0-py3.8.egg.

File metadata

  • Download URL: lex2-1.1.0-py3.8.egg
  • Upload date:
  • Size: 74.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.8.10

File hashes

Hashes for lex2-1.1.0-py3.8.egg
Algorithm Hash digest
SHA256 1037d99841eda7e92ac237abf780280fe4667427fe68c2d22a8f5e4dd2277ed0
MD5 74c0414a23d949a30e09e50ee1729416
BLAKE2b-256 bfecf9a72ef32d7673fa75241c3df212a76356d3ced4d7d6dd3f7124e1d0d188

See more details on using hashes here.

File details

Details for the file lex2-1.1.0-py3-none-any.whl.

File metadata

  • Download URL: lex2-1.1.0-py3-none-any.whl
  • Upload date:
  • Size: 36.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.8.10

File hashes

Hashes for lex2-1.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 c3d46de0b0aec6811c026bb7a3f2ba18b845ca8a2aebcf857466f463a441bee6
MD5 0f62f4e99e00d9c737bfd838d9cc0f7a
BLAKE2b-256 1b40d26cdc1ff2f43c9cd6c0abc41904cb0f7c557c7555fbeafe93db128806cc

See more details on using hashes here.
