Implements lexical analyzers as iterators yielding tokens.
Text-file tokenizers that support multiple start states and lexical position tracking, implemented as standard Python iterators.
The class RegexTokenizer implements a tokenizer using the re module to recognize tokens in the input stream. Tokens and actions are defined by rules. The tokenizer calls user action functions associated with each rule. In most cases, the user action function can simply be a @classmethod constructor of a user-provided token class.
Each rule is specified as a tuple. The first element of the tuple is a regular expression that will be compiled by re and used to match a token. The second element of the tuple is a user-provided callable that will be passed the recognized text, along with the current lexical position.
In this example, the user class Token implements constructors as @classmethod functions, and these serve as the callables in each lexical rule.
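A minimal sketch of such a token class, assuming the constructors receive the matched text and the current lexical position as described above (the attribute names kind, value, and pos are illustrative only, not part of the library)::

    class Token:
        """Hypothetical token class; one @classmethod constructor per lexical rule."""

        def __init__(self, kind, value, pos):
            self.kind = kind    # token category, e.g. 'ident', 'float', 'int'
            self.value = value  # converted value of the matched text
            self.pos = pos      # lexical position supplied by the tokenizer

        @classmethod
        def type_ident(cls, text, pos):
            return cls('ident', text, pos)

        @classmethod
        def type_float(cls, text, pos):
            return cls('float', float(text), pos)

        @classmethod
        def type_int(cls, text, pos):
            return cls('int', int(text), pos)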
Rules are specified in the class variable spec, a list of rule tuples::
    import tokenizertools as tt

    class MyTokenizer(tt.RegexTokenizer):
        spec = [
            (r'[a-zA-Z][a-zA-Z0-9_]*', Token.type_ident),  # idents and keywords
            (r'[0-9]+\.[0-9]+', Token.type_float),         # floats
            (r'[0-9]+', Token.type_int),                   # ints
            (r'\s*', None),                                 # ignore white space
        ]
Nothing else needs to be defined; all methods are inherited. Instantiate a lexer and commence parsing. The specification rules are compiled and cached when the first instance is created::
    tokenizer = MyTokenizer()
    with open('foo.bar') as f:
        token_stream = Lookahead(tokenizer.lex(f, f.name))
        compiled_stuff = my_parser.parse(token_stream)
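Because the tokenizer yields tokens as a standard Python iterator, the stream can also be consumed directly without a parser. A sketch, assuming the hypothetical Token class above and the same lex(file, name) call shown in the example::

    tokenizer = MyTokenizer()
    with open('foo.bar') as f:
        for token in tokenizer.lex(f, f.name):
            print(token.kind, token.value, token.pos)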
1.0
    Time to put some mileage on it. Cleanups, PEP8-ification. Improved exception handling for a bad begin() state.

0.1a
    Initial alpha.