
tokema

tokema - token matching parser.

It is a pure-Python, zero-dependency library for building parsers. The tokema algorithm is based on Tomita's GLR* parser.

It helps you create noise-skipping, grammar-based parsers at runtime and apply them to tasks such as entity extraction and token matching.

The tokema algorithm is an extended version of the noise-skipping GLR* parser by Alon Lavie and Masaru Tomita:

Lavie, Alon, and Masaru Tomita. "GLR* - An Efficient Noise-Skipping Parsing Algorithm for Context-Free Grammars." Recent Advances in Parsing Technology 1 (1996): 183.

tokema works with arbitrary tokens (characters, words, structures, ...), so you can write a parser for the token stream produced by the tokenizer of your choice.

GLR

In general, tokema is a GLR-based (Generalized LR) context-free parser generator, which means that a parser produced by tokema tries to find the most suitable parses (there may be more than one) for a given deterministic grammar. GLR-based algorithms evaluate multiple parser states at the same time, which makes it possible to parse with incomplete and poorly structured grammars.
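To see why "most suitable parses" can be plural, consider the classic ambiguous grammar E = E '+' E | num: the input "1 + 2 + 3" has two distinct parse trees. The sketch below (a naive enumerator, not tokema's GLR machinery) simply lists every tree, which is the ambiguity a GLR parser tracks by exploring states in parallel:

```python
def parses(tokens):
    """Enumerate all parse trees for the ambiguous grammar E = E '+' E | num.
    Each tree is a nested tuple; ambiguous input yields several trees."""
    if len(tokens) == 1:
        # A single token is a complete E (a number).
        yield tokens[0]
        return
    # Try every '+' as the top-level operator and recurse on both sides.
    for i in range(1, len(tokens) - 1):
        if tokens[i] == '+':
            for left in parses(tokens[:i]):
                for right in parses(tokens[i + 1:]):
                    yield (left, '+', right)


trees = list(parses('1 + 2 + 3'.split()))
print(len(trees))  # 2
print(trees[0])    # ('1', '+', ('2', '+', '3'))
```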

Noise-skipping

The term "noise skipping" means that the parser can skip tokens it does not understand, which is essential for natural language tasks.
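The idea can be illustrated with a self-contained sketch (a greedy pattern matcher, not tokema's implementation): consume tokens that match the pattern in order, and treat everything else as noise to be skipped.

```python
def noise_skipping_match(pattern, tokens):
    """Greedily match `pattern` (a list of predicates) against `tokens`,
    skipping tokens that do not match. Returns (matched, skipped), or
    (None, None) if the pattern cannot be completed."""
    matched, skipped = [], []
    i = 0
    for tok in tokens:
        if i < len(pattern) and pattern[i](tok):
            matched.append(tok)
            i += 1
        else:
            skipped.append(tok)  # noise: just skip it
    return (matched, skipped) if i == len(pattern) else (None, None)


def is_float(tok):
    try:
        float(tok)
        return True
    except ValueError:
        return False


tokens = 'this will be ignored 3.1415 and + this 4e-10'.split()
pattern = [is_float, lambda t: t == '+', is_float]
matched, skipped = noise_skipping_match(pattern, tokens)
print(matched)  # ['3.1415', '+', '4e-10']
```

The noisy words around the expression are discarded, while the structured part is recovered, mirroring the example in the Usage section below.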

Extensions

Instead of direct (or dictionary-based) token matching, as in most traditional parsers, tokema replaces terminals and non-terminals in the grammar with queries and resolvers: functions that match an input token against a given query.

By writing such functions you can build token queries and resolvers of your choice and integrate many useful features into your parser. For example, you can create a custom resolver that matches stemmed or lemmatized text tokens, a resolver that uses custom dictionaries, or even one that integrates a third-party search backend such as Lucene.
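A minimal sketch of the query idea, assuming a query is any object with a matches(token) method. The class names below (FloatQuery, StemQuery) are hypothetical illustrations, not tokema's actual API:

```python
import re


class FloatQuery:
    """Match any token that parses as a float."""
    def matches(self, token):
        try:
            float(token)
            return True
        except ValueError:
            return False


class StemQuery:
    """Match a token by a crude suffix-stripping stem (a stand-in for a
    real stemmer such as nltk's PorterStemmer)."""
    def __init__(self, stem):
        self.stem = stem

    def matches(self, token):
        # Strip a common English suffix from the end of the token.
        return re.sub(r'(ing|ed|s)$', '', token.lower()) == self.stem


print(FloatQuery().matches('4e-10'))         # True
print(StemQuery('pars').matches('parsing'))  # True
```

Because the grammar refers to queries rather than literal tokens, swapping in a smarter resolver (dictionary lookup, lemmatizer, search backend) does not require changing the grammar itself.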

Installation

pip install tokema

Usage

Basic usage:

from tokema import *

# Define grammar
# NOTE: you can also define rules using Python classes only, or create your own grammar syntax
rules = parse_rules_from_string("""
ROOT = <EXPR>
EXPR = {float} + {float}
""")

# Parsing table construction
table = build_text_parsing_table(rules)

# Input tokens
# NOTE: in a real application you should use a proper tokenizer (e.g. nltk)
tokens = 'this will be ignored 3.1415 and + this 4e-10'.split()

# Actual parsing using token stream and parsing table
# There may be multiple parses (not in this oversimplified scenario) so parse returns a list
results = parse(tokens, table)

for result in results:
    # Here you can access the parsed AST (abstract syntax tree)
    print(result)  # ROOT(EXPR(3.1415, +, 4e-10))

    # EXPR is the first child of ROOT
    expr = result[0]

    # Convert each value to a float and make sure that result is actually correct
    assert float(expr[0].value) + float(expr[2].value) == 3.1415 + 4e-10

For more usage scenarios, see the examples folder.

