A lightweight regex-based lexical scanner library.

## Project description

Reflex: A lightweight lexical scanner library.

Reflex supports regular expressions, rule actions, multiple scanner states,
tracking of line/column numbers, and customizable token classes.

Reflex is not a "scanner generator" in the sense of generating source code.
Instead, it generates a scanner object dynamically based on the set of
input rules specified. The rules themselves are ordinary Python regular
expressions, combined with rule actions, which are simply Python functions.

Example use:

    import reflex

    # Create a scanner. The "start" parameter specifies the name of the
    # starting state. Note: The state argument can be any hashable Python
    # type.
    scanner = reflex.scanner( "start" )

    # The whitespace rule has no actions, so whitespace will be skipped
    scanner.rule( r"\s+" )

    # Rules for identifiers and numbers.
    TOKEN_IDENT = 1
    TOKEN_NUMBER = 2
    scanner.rule( r"[a-zA-Z_][\w_]*", token=TOKEN_IDENT )
    scanner.rule( r"0x[\da-fA-F]+|\d+", token=TOKEN_NUMBER )

    # The "string" rule kicks us into the string state
    TOKEN_STRING = 3
    scanner.rule( "\"", tostate="string" )

    # Define the string state. "string_escape" and "string_text" are
    # action functions which handle the parsed characters and escape
    # sequences and append them to a buffer (they must be defined before
    # these rules are added). Once a quotation mark is encountered, we
    # set the token type to be TOKEN_STRING
    scanner.state( "string" )
    scanner.rule( "\"", tostate="start", token=TOKEN_STRING )
    scanner.rule( "\\\\.", string_escape )
    scanner.rule( "[^\"\\\\]+", string_text )

Invoking the scanner: The scanner can be called as a function which
takes a reference to a stream (such as a file object) that iterates
over input lines. The "context" argument is for application use.
The result is an iterator which produces a series of tokens.
The same scanner can be used to parse multiple input files by
creating a new stream for each file.

    # Create a token iterator for this input stream.
    token_iter = scanner( istream, context )
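
Because the scanner only needs an object that iterates over input lines, an in-memory string can stand in for a file, which is handy for tests. The following is a minimal sketch in Python 3, assuming that any line iterator is accepted as described above. `make_stream` is a hypothetical helper and not part of Reflex, and the scanner call is shown only as a comment because it needs a configured scanner.

```python
import io

def make_stream(text):
    """Wrap a string in a file-like object that iterates over lines."""
    return io.StringIO(text)

istream = make_stream("x = 0x1F\ny = 42\n")

# Iterating a file-like object yields one line at a time, newline included.
lines = list(istream)

# With a configured scanner, a fresh stream would be passed like this:
#     token_iter = scanner( make_stream(text), context )
```

Note that a stream is consumed as it is read, so each scan needs a new stream.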

Getting the tokens: Here is a simple example of looping through the
input tokens. A real-world use would most likely involve checking
the type of the current token.

    # token.id is the token type (the same as the token= argument in the rule)
    # token.value is the actual characters that make up the token.
    # token.line is the line number on which the token was encountered.
    # token.pos is the column number of the first character of the token.
    for token in token_iter:
        print( "%s %s %s %s" % ( token.id, token.value, token.line, token.pos ) )
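
Checking the token type usually means branching on `token.id`. The sketch below (Python 3) shows one way to do that. It does not use Reflex itself: `Token` is a hypothetical stand-in with the four attributes listed above, `collect_numbers` is an illustrative name, and the line and column values in the sample are arbitrary.

```python
from collections import namedtuple

# Hypothetical stand-in for a Reflex token (id, value, line, pos).
Token = namedtuple("Token", "id value line pos")

TOKEN_IDENT = 1
TOKEN_NUMBER = 2

def parse_number(text):
    """Convert the text of a number token (decimal or 0x-prefixed hex)."""
    if text.startswith("0x"):
        return int(text, 16)
    return int(text)

def collect_numbers(token_iter):
    """Return the numeric value of every TOKEN_NUMBER token, skipping others."""
    numbers = []
    for token in token_iter:
        if token.id == TOKEN_NUMBER:
            numbers.append(parse_number(token.value))
    return numbers

sample = [
    Token(TOKEN_IDENT, "x", 1, 0),
    Token(TOKEN_NUMBER, "0x1F", 1, 2),
    Token(TOKEN_NUMBER, "42", 2, 0),
]
result = collect_numbers(sample)
```

In real use, `sample` would be replaced by the iterator returned from the scanner.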

Action functions are Python functions which take a single argument, which
is the token stream instance.

    # Action function to handle string text.
    # Appends the value of the current token to the string data.
    # string_data is assumed to be a module-level string buffer.
    def string_text( token_stream ):
        global string_data
        string_data += token_stream.token.value
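
An alternative to a module-level buffer is to keep the buffer on the application context, which the token stream exposes as its `context` attribute. This is a sketch only (Python 3): `ParseContext` is a hypothetical application class, and `fake_stream` is a stand-in that mimics just the `context` and `token` attributes so the action can be exercised without Reflex.

```python
from types import SimpleNamespace

class ParseContext:
    """Hypothetical application context holding the string buffer."""
    def __init__(self):
        self.string_data = ""

def string_text(token_stream):
    # Append the matched characters to the buffer kept on the context.
    token_stream.context.string_data += token_stream.token.value

# Stand-in for the token stream object that would be passed to the action.
fake_stream = SimpleNamespace(
    context=ParseContext(),
    token=SimpleNamespace(value="hello"),
)
string_text(fake_stream)
string_text(fake_stream)
```

Keeping the buffer on the context avoids global state, so the same scanner can process several inputs with separate contexts.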

The token_stream object has a number of useful attributes:

    states: dictionary of scanner states
    state: the current state
    stream: the input line stream
    context: the context pointer that was passed to the scanner
    token: the current token
    line: the line number of the current parse position
    pos: the column number of the current parse position

Note: Reflex currently has a limit of 99 rules for each state. (That is
the maximum number of capturing groups allowed in a Python regular expression.)
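
A limit tied to capturing groups is what one would expect if all rules of a state are joined into a single regular expression with one group per rule. The sketch below (Python 3, standard `re` module only) illustrates that general technique. It is an assumption about how such scanners are commonly built, not a copy of Reflex's source, and `build_scanner` is a hypothetical name. It also assumes the rule patterns contain no capturing groups of their own.

```python
import re

def build_scanner(rules):
    """rules: list of (pattern, token_id) pairs; token_id None means skip."""
    # Wrap each rule in its own capturing group and join with alternation.
    combined = re.compile("|".join("(%s)" % pattern for pattern, _ in rules))

    def scan(text):
        pos = 0
        while pos < len(text):
            m = combined.match(text, pos)
            if m is None or m.end() == pos:
                raise ValueError("no rule matches at offset %d" % pos)
            # lastindex is the 1-based number of the group that matched,
            # which identifies the rule.
            token_id = rules[m.lastindex - 1][1]
            if token_id is not None:
                yield token_id, m.group()
            pos = m.end()

    return scan

rules = [
    (r"\s+", None),                 # whitespace: no token, skipped
    (r"[a-zA-Z_]\w*", 1),           # identifier
    (r"0x[\da-fA-F]+|\d+", 2),      # number
]
scan = build_scanner(rules)
tokens = list(scan("x1 0x1F 42"))
```

With this design, each rule costs one capturing group, so a cap on groups becomes a cap on rules per state.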


