Reflex: a lightweight regex-based lexical scanner library.
Reflex supports regular expressions, rule actions, multiple scanner states,
tracking of line/column numbers, and customizable token classes.

Reflex is not a "scanner generator" in the sense of generating source code.
Instead, it builds a scanner object dynamically from the set of input rules
specified. The rules themselves are ordinary Python regular expressions,
combined with rule actions, which are simply Python functions.
Example use:
import reflex

# Create a scanner. The "start" parameter specifies the name of the
# starting state. Note: the state argument can be any hashable Python
# type.
scanner = reflex.scanner( "start" )
# Add some rules.
# The whitespace rule has no action, so whitespace will be skipped.
scanner.rule( r"\s+" )

# Rules for identifiers and numbers.
TOKEN_IDENT = 1
TOKEN_NUMBER = 2
scanner.rule( r"[a-zA-Z_]\w*", token=TOKEN_IDENT )
scanner.rule( r"0x[\da-fA-F]+|\d+", token=TOKEN_NUMBER )
# The "string" rule kicks us into the string state
TOKEN_STRING = 3
scanner.rule( "\"", tostate="string" )
# Define the string state. "string_escape" and "string_text" are
# action functions which handle the parsed characters and escape
# sequences and append them to a buffer (see the sketches below).
# Once a closing quotation mark is encountered, we set the token type
# to TOKEN_STRING and return to the start state.
scanner.state( "string" )
scanner.rule( "\"", tostate="start", token=TOKEN_STRING )
scanner.rule( r"\\.", string_escape )
scanner.rule( r"[^\"\\]+", string_text )
Invoking the scanner: The scanner can be called as a function, passing it a
stream (such as a file object) which iterates over input lines. The "context"
argument is for application use; it is made available to action functions
through the token stream. The result is an iterator which produces a series
of tokens. The same scanner can be used to parse multiple input files, by
creating a new stream for each file.
# Calling the scanner returns a token iterator over the input stream.
token_iter = scanner( istream, context )
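For example, to reuse one scanner across several files (a minimal sketch;
the filenames and the handle_token callback are hypothetical):

for filename in [ "one.src", "two.src" ]:
    istream = open( filename )
    # Each call to the scanner creates a fresh token iterator.
    for token in scanner( istream, context ):
        handle_token( token )   # hypothetical application callback
    istream.close()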
Getting the tokens: Here is a simple example of looping through the input
tokens. A real-world use would most likely dispatch on the type of the
current token, as in the sketch after the loop below.
# token.id is the token type (the same as the token= argument in the rule)
# token.value is the actual characters that make up the token.
# token.line is the line number on which the token was encountered.
# token.pos is the column number of the first character of the token.
for token in token_iter:
    print token.id, token.value, token.line, token.pos
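A dispatch loop might look like the following (a minimal sketch; the
declare_name and push_value handlers are hypothetical):

for token in token_iter:
    if token.id == TOKEN_IDENT:
        declare_name( token.value )          # hypothetical handler
    elif token.id == TOKEN_NUMBER:
        # Base 0 lets int() accept both the "0x..." and decimal forms.
        push_value( int( token.value, 0 ) )  # hypothetical handler
    elif token.id == TOKEN_STRING:
        push_value( string_data )            # buffer filled by the actions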
Action functions are Python functions which take a single argument, which
is the token stream instance.

# Action function to handle string text. Appends the value of the
# current token to the string buffer.
def string_text( token_stream ):
    global string_data
    string_data += token_stream.token.value
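A matching handler for the escape-sequence rule might look like this (a
minimal sketch; the table of recognized escapes is illustrative):

# Hypothetical companion action for the r"\\." rule: translate the
# escape sequence and append the result to the string buffer.
ESCAPES = { "n": "\n", "t": "\t", "\\": "\\", "\"": "\"" }

def string_escape( token_stream ):
    global string_data
    ch = token_stream.token.value[1]   # the character after the backslash
    string_data += ESCAPES.get( ch, ch )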
The token_stream object has a number of useful attributes:

    states:  dictionary of scanner states
    state:   the current state
    stream:  the input line stream
    context: the context value that was passed to the scanner
    token:   the current token
    line:    the line number of the current parse position
    pos:     the column number of the current parse position
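For example, an action function can use the position-tracking attributes
to report errors (a minimal sketch; the message format is illustrative):

import sys

# Hypothetical action that reports an unexpected character, using the
# token stream's position and state attributes.
def bad_char( token_stream ):
    sys.stderr.write( "line %d, col %d: unexpected %r in state %r\n" % (
        token_stream.line, token_stream.pos,
        token_stream.token.value, token_stream.state ) )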
Note: Reflex currently has a limit of 99 rules per state. (That is the
maximum number of capturing groups allowed in a Python regular expression,
since each rule is compiled as one capturing group.)