
A lightweight regex-based lexical scanner library.

Project description

Reflex: A lightweight lexical scanner library.

Reflex supports regular expressions, rule actions, multiple scanner states,
tracking of line/column numbers, and customizable token classes.

Reflex is not a "scanner generator" in the sense of generating source code.
Instead, it builds a scanner object dynamically from the set of input
rules specified. The rules themselves are ordinary Python regular
expressions, combined with rule actions, which are simply Python functions.

Example use:

import reflex

# Create a scanner. The "start" parameter specifies the name of the
# starting state. Note: the state argument can be any hashable Python
# type.
scanner = reflex.scanner( "start" )

# Add some rules.
# The whitespace rule has no actions, so whitespace will be skipped
scanner.rule( "\s+" )

# Rules for identifiers and numbers.
TOKEN_IDENT = 1
TOKEN_NUMBER = 2
scanner.rule( "[a-zA-Z_][\w_]*", token=TOKEN_IDENT )
scanner.rule( "0x[\da-fA-F]+|\d+", token=TOKEN_NUMBER )

# The "string" rule kicks us into the string state
TOKEN_STRING = 3
scanner.rule( "\"", tostate="string" )

# Define the string state. "string_escape" and "string_text" are
# action functions which handle the scanned characters and escape
# sequences and append them to a buffer. Once the closing quotation
# mark is encountered, we set the token type to TOKEN_STRING
# and return to the start state.
scanner.state( "string" )
scanner.rule( "\"", tostate="start", token=TOKEN_STRING )
scanner.rule( "\\\\.", string_escape )
scanner.rule( "[^\"\\\\]+", string_text )

Invoking the scanner: the scanner can be called as a function, passing a
reference to a stream (such as a file object) which iterates over input
lines. The "context" argument is for application use. The result is an
iterator which produces a series of tokens. The same scanner can be used
to parse multiple input files by creating a new stream for each file.

# Create a token iterator over the input stream.
token_iter = scanner( istream, context )
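
For example, re-using the same scanner across several input files might look
like the sketch below; the file names and the handle() function are
placeholders for application code, not part of reflex.

# Hypothetical: scan several files with the same scanner, creating a
# new stream and token iterator for each one.
for filename in ( "one.src", "two.src" ):
    istream = open( filename )
    for token in scanner( istream, context ):
        handle( token )   # placeholder for application-specific handling
    istream.close()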

Getting the tokens: here is a simple example of looping through the
input tokens. A real-world use would most likely compare against the
type (id) of the current token.

# token.id is the token type (the same as the token= argument in the rule)
# token.value is the actual characters that make up the token.
# token.line is the line number on which the token was encountered.
# token.pos is the column number of the first character of the token.
for token in token_iter:
    print token.id, token.value, token.line, token.pos
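
For instance, a loop that branches on the token type might look like the
following sketch, using the token constants defined earlier:

# Sketch: dispatch on the type of each token produced by the rules above.
for token in token_iter:
    if token.id == TOKEN_IDENT:
        print "identifier:", token.value
    elif token.id == TOKEN_NUMBER:
        print "number:", token.value
    elif token.id == TOKEN_STRING:
        print "string ending at line", token.line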

Action functions are Python functions which take a single argument:
the token stream instance.

string_data = ""    # buffer that accumulates the scanned string

# Action function to handle string text.
# Appends the value of the current token to the string data.
def string_text( token_stream ):
    global string_data
    string_data += token_stream.token.value
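
The "string_escape" action referenced earlier could be sketched along the
same lines; the particular set of escape sequences handled here is an
illustrative assumption, not something defined by reflex.

# Sketch of the string_escape action. The token value is the two-character
# sequence matched by the "\\\\." rule, e.g. a backslash followed by "n".
_ESCAPES = { "n": "\n", "t": "\t", "\\": "\\", "\"": "\"" }

def string_escape( token_stream ):
    global string_data
    ch = token_stream.token.value[ 1 ]
    string_data += _ESCAPES.get( ch, ch )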

The token_stream object has a number of useful attributes:

states: dictionary of scanner states
state: the current state
stream: the input line stream
context: the context object that was passed to the scanner
token: the current token
line: the line number of the current parse position
pos: the column number of the current parse position
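
For instance, an action function might keep the string buffer on the context
object rather than in a global, or report the current parse position. This is
only a sketch; the "parts" attribute on the context is an arbitrary
application-defined name, not part of reflex.

# Sketch: an alternative string_text action that appends to a list held on
# the application's context object instead of a global buffer.
def string_text( token_stream ):
    token_stream.context.parts.append( token_stream.token.value )

# Sketch: report the current parse position using the line/pos attributes.
def report_position( token_stream ):
    print "at line", token_stream.line, "column", token_stream.pos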

Note: reflex currently has a limit of 99 rules per state. (That is the
maximum number of capturing groups allowed in a Python regular expression.)



Release history

This version

0.1

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

reflex-0.1.tar.gz (4.9 kB)

Built Distribution

reflex-0.1-py2.4.egg (9.6 kB)
