Skip to main content

A module containing scanner (regular expression) and parser (BNF) compilers as well as a base generator, which provides protection and validation, from which all target language generators must inherit from. A script is also included which reads the respective specification(s) from file and outputs the resulting code to disk.

Project description

SPaG (Scanner, Parser, and Generator)



SPaG is an extensible purely Python framework, with no external dependencies, capable of compiling input specifications of scanners (regular grammars) and/or parsers (LL(1) BNF context free grammars) into DFA's and LL(1) parse table's. These results can then be converted into valid scanner/parser programs following the options configured within the framework during generation. The processes used to transform the input specifications into DFA's and LL(1) parse tables are very formal and well defined. Thorough table driven testing is also utilized to ensure correctness of implementation. The program generator(s) comprise the extensible portion of the framework allowing new targeted languages to be easily pluged-in. Included in the framework is a script which lightly wraps the core library functionality to accepts user input file(s) to help drive the compilation/generation process from the command line. The overview below describes in detail what each component sets out to do, how it accomplished those intended goals, and the accepted input.


Installation is supported through multiple methods, all of which are listed below. Any required dependencies along with the steps needed for a proper install are listed. Installation installs the module containing scanner (RegularGrammar), parser (ContextFreeGrammar), and generator (Generator) objects as well as a command line script for easily generating scanners and/or parsers from input specification(s) and optional configuration file. It also installs any supported core language generators which may grow over time.


Install from source for the latest up to date code. This may include unreleased bug fixes, feature extensions, or newly supported language child generators. Since the code is going to built from source it is generally a good idea to also test it before installation or package distribution. For this prupose a Makefile is included to automate testing, linting, virtual environment installation/cleanup, etc. While the framework is pure, dependent free, Python the testing is not and some prerequisites are required. All required python packages are listed in the requirments. However, since these packages are installed in a virtual environment with the help of Make the only requirements the user needs to worry are those which must be installed on the machine itself and include:

  • git (>= v2.1.3)
  • make (>= v4.1)
  • Python (>= v2.7)
  • Pip (>= v9.0.3)
  • setuptools (>= v40.4.3)
  • virtualenv (>= v16.0.0)
# Obtain the source code.
$ git clone

# Run sanity.
# 1. Constructs a virtual environment with SPaG installed for testing.
# 2. Enter the newly built virtual environment.
# 3. Lint the code to ensure standards are followed.
# 4. Test the code and generate a report to ensure a working package.
# 5. Leave the virtual environment.
# 6. Clean up.
$ make sanity

# Open the generated report in a web-browser and inspect the results.
$ chrome test_report.html

# If the results look good then install SPaG from the source.
$ make install


Install a specific prebuilt pacakge distribution from the PyPI repository. Since the distributions are thoroughly tested and linted before publishing there is no need for that step in this method of installation, unlike the source method. The only requirements needed on the host machine for this method to work include:

  • Python (>= v2.7)
  • Pip (>= v9.0.3)
# Only step required!
$ pip install SPaG


The below sections provides a quick overview of each individual core component. It briefly describes what the component consists of and how it accomplished it's tasks and goals.


An object exposing only initialazation and read only properties. The scanner attempts to transform a collection of named patterns (regular expression as token/type pairs) into a unique minimal DFA accepting any input specified while also containing an errors sink state for all invalid input (if required). The transformation begins by first checking the expression for validity while internalizing it. Next, the use of an augmented version of Dijkstra's shunting yard algorithm transforms the expression into prefix notation. From here Thompson's construction is utilized to produce an NFA with epsilon productions. The NFA is then directly converted into a minimal DFA with respect to reachable states using e-closure conversions which are cached. Finally, the minimal DFA is made total, if not already, so it can be further minimized with respect to nondistinguishable states using Hopcroft's algorithm. This results in the smallest possible total DFA which recognizes the given input. The input itself (regular expressions) must be specified following these guidelines:

  • only printable ascii characters (uppercase, lowercase, digits, punctuation, whitespace) are supported.
  • supported core operators (and extensions) include:
    • '|' (union -> choice -> either or)
    • '.' (concatenation -> combine)
    • '?' (question -> choice -> 1 or none)
    • '*' (kleene star -> repitition >= 0)
    • '+' (kleene plus -> repitition >= 1)
    • () (grouping -> disambiguation -> any expression)
    • [ab] (character class -> choice -> any specified character)
    • [a-c] (character range -> choice -> any char between the two)
    • [^ab] (character negation -> choice -> all but the specified)
  • other things to keep in mind (potential gotcha's):
    • character classes and ranges can be combined in the same set of brackets, possibly multiple times (e.g. [abc-pqrs-z]).
    • character ranges can be specified as forward or backwards (e.g. [a-c] or [c-a]).
    • '^' is required to come first after the bracket for negation to properly work. In fact, '^' directly following a bracket is always interpreted as negation.
    • '^' may be used with character classes or ranges.
    • if '^' is alone in the brackets (e.g. [^]) it is translated as any possible input character (i.e. a 'wildcard').
    • if a literal '-' if required in a character class or range it must be specified last so as to not be interpreted as a range.
    • for literal right brackets inside character ranges or classes it must be escaped.
    • concatenation can be either implicit or explicit in the given input expression(s).
    • operator literals (i.e. '|.?*+()[]') can be specified through escape sequences using a backslash (i.e. '').


The parser attempts to transform a collection of BNF production rules into a parse table. The first step in constructing the resulting table is determining the terminal and nonterminal sets, which is very trivial. From here, the sets are used to construct the first and follow sets which identify what characters can be expected when parsing corresponding production rules. Subsequently, the first and follow sets are used to construct the predict sets which are in turn used in the table construction. Finally, the table is verified by checking for conflicts. While any BNF is accepted only LL(1) grammars will produce a valid parse table containing no conflicts. Furthermore, only left factored LL(1) grammars will produce a parse table with a single member per entry. To properly specify a LL(1) grammar the following requirements must be met:

  • No left recursion (direct or indirect)
  • Must be left factored
  • No ambiguity

If any of the above requirements are violated, or the grammar specified is not LL(1), there will be a conflict in the table. This means with the given set of input productions and with the aid of a single look ahead token it cannot be known what rule to choose in order to successfully produce a parse without backtracking.


The base generator is the object which all generators must inherit from. It is responsible for allowing easy getting and setting of the same set of options across all generators as well as providing basic protections against bad option combinations, program input, and generator output. It also allows easy reuse of language generators to compile many specifications into the same output language while allowing easy configuration option changes between generation.


The generators are wrappers on top of the scanner/parser objects and are responsible for translating/compiling the intermediate representations from the output of the scanner/parser into whatever context desired with the given options configured at the time of generation. Most of the time this will be a useable program in some choosen programming language, which should obey the scanner's function type as an input character stream to a token stream while the parser should take that same token stream to an abstract syntax tree (AST). However, this is not always the case since it is possible to write other generators, say for example cowsay. This illustrates the fact that the generators comprise the extensible portion of the package. This plug-in segment of SPaG will grow over time to subsume more generators into the core implementation. These implementation will be for programming languages only. While the newly subsumed generators will be immediately available if building from source, they will only be only available in the pip repository as a newer release of the SPaG package. Below is the status of all the generators currently tracked in the repository, linked to there implementation, and any applicable notes which may apply.


Generator Status Notes
C DEVELOPING Direct scanner generator complete; Parser in progress


A script is included in the package installation to allow easy generation of scanners and/or parsers (and possibly some combination of the two) into any number of output programming languages from the command line. It is a very light weight wrapper on the core functionality in the SPaG package. It also allows easy configuration of genration options and input specifications from command line or configuration file. Simply stated the CLI program drives the generation of the cross product of langauges x scanners x parser. Below shows various examples for the generator to produce scanner's and/or parser's for different token sets and LL(1) languages which may be found under the examples directory. Shown below are some basic invocation's for help, scanner, parser, and configuration file generation.

# Generate your scanner and/or parser! ...but first ask for help to see all
# available command line options.
$ spag_cli --help

# Generate any number of scanners (or parsers) for any set of languages at once!
$ spag_cli -s examples/INI/scanner.txt examples/JSON/scanner.txt -g c go
$ spag_cli -p examples/INI/parser.txt examples/JSON/parser.txt -g c go

# Generate a scanner/parser combo (possibly for a languages front-end reader).
$ spag_cli -s examples/INI/scanner.txt -p examples/INI/parser.txt -g c

# Generate a default configuration file for easier runtime configuration.
$ spag_cli --generate-rcfile .spagrc

# Control the generation [in/out]put with the specified configuration values.
$ spag_cli -c .spagrc

# NOTE: It is possible to override configuration file values with command line
# flag values if they are specified AFTER the configuration flag.
$ spag_cli -c .spagrc # Provide flag(s) for overwriting values here.


Creating a new generator is extremely simple, as it only requires the addition of a single file. This file must contain a class definition following the proper naming conventions set forth for the new output language to be picked up by the pacakge script. Furthermore, all python package prerequisites are listed in the requirments file. The below code block serves as a template for createing new generators which should be placed under spag/generators. The filename (denoted between curly brackets in the template below) should be named after the language being compiled to. Notice should also be taken between the capital and lowercase 'F'.

A scanner/parser generator targeting {filename}.
from spag import Generator

class {Filename}(Generator):
    A simple object for compiling scanner's and/or parser's to {filename} programs.

    def _translate(self):
        Override this method in subclasses which should translate the (IR)
        intermediate representation of the given scanner and/or parser to the
        targeted language. It should return the files and there content in a
        dictionary allowing multiple files to be returned for a given language.
        This is the only required function a child generators must implement.

          dict[str, str]: Generated file names/paths and associated content.
        raise NotImplementedError('{filename} generator in development')

Once you have completed implementing the new generator it needs to be added to the list of supported generators. The tests should also be expanded to ensure the new generator can be properly imported as well. Additionally, the code must be linted, ensuring it follows the proper conventions set forth and therefore produces a similiar look and feel as the rest of the code in the repository. Linting, testing and installing can be automated through the use of the Makefile as shown below.

# Add the new generator to the __all__ list using your favorite editor.
$ vim spag/generators/

# Ask the Makefile what it can do to help in linting, testing and installing.
$ make help

# Run sanity, to ensure no breakage occurs, which lints and tests the code
# against an installed version of the source in a virtual environment.
$ make sanity

Project details

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Files for SPaG, version 1.0.0a0
Filename, size File type Python version Upload date Hashes
Filename, size SPaG-1.0.0a0-py2-none-any.whl (31.7 kB) File type Wheel Python version py2 Upload date Hashes View
Filename, size SPaG-1.0.0a0.tar.gz (35.5 kB) File type Source Python version None Upload date Hashes View

Supported by

AWS AWS Cloud computing Datadog Datadog Monitoring DigiCert DigiCert EV certificate Facebook / Instagram Facebook / Instagram PSF Sponsor Fastly Fastly CDN Google Google Object Storage and Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Salesforce Salesforce PSF Sponsor Sentry Sentry Error logging StatusPage StatusPage Status page