Lexographer: A Pure Python Lexer, Tokenizer & Parser Base Library
The Lexographer library provides pure Python lexer, tokenizer and parser base classes for building custom lexers, tokenizers and parsers.
Requirements
The Lexographer library has been tested with Python 3.10, 3.11, 3.12 and 3.13. The library is not compatible with Python 3.9 or earlier.
Installation
The Lexographer library is available from PyPI, so it may be added to a project's
dependencies via its requirements.txt file or similar by referencing the Lexographer
library's name, lexographer, or the library may be installed directly into your local
runtime environment using pip via the pip install command by entering the following
into your shell:
$ pip install lexographer
Methods & Properties
The Lexographer library provides classes and base classes which can be used and extended to build custom lexers, tokenizers and parsers for structured text such as query strings or programming language code, as well as natural language text.
The classes and their methods and properties are listed below:
Lexer Class
The Lexer class supports lexing the provided text, and provides standard methods for
moving through structured text as are commonly needed in lexing operations, such as methods
to read, peek at and consume one or more characters from the current cursor position, as
well as to look ahead and look behind from the current cursor position for matching text.
These operations can be built upon within custom Tokenizer and Parser subclasses to
develop tokenizers and parsers for structured text and natural language text inputs.
The Lexer class constructor Lexer(...) takes the following arguments:
- `text` (str) – The optional `text` argument sets the text to lex.
- `file` (str) – The optional `file` argument sets the file path of the file to lex.
Either the text argument or the file argument must be specified with a valid value when
instantiating the Lexer class. If neither argument is specified, or the specified value is
invalid, an exception will be raised.
The Lexer class provides the following methods:
- `read(length: int = 1)` (str) – The `read()` method supports reading the specified length of the input from the current cursor (index) position. By default the method will read and return a single character from the current cursor position, or a custom length as specified by the optional `length` (int) parameter.
If the specified value for the length parameter exceeds the number of available characters
remaining before the end of the specified text string or the end of the file, only the
remaining characters will be returned.
The read() method advances the cursor position after completing the read, so that on
the next read, the new cursor position is used.
- `peek(length: int = 1)` (str) – The `peek()` method supports reading the specified length of the input from the current cursor (index) position. By default the method will read and return a single character from the current cursor position, or a custom length as specified by the optional `length` (int) parameter.
If the specified value for the length parameter exceeds the number of available characters
remaining after the current cursor position, a LexerError exception will be raised.
The peek() method does not advance the cursor position after completing the read, so
on the next read, the same, unmodified cursor position is used. This allows for one or
more characters from the current cursor position to be read and checked without affecting
the cursor.
- `previous(length: int = 1)` (str) – The `previous()` method supports reading the specified length of the input prior to the current cursor (index) position. By default the method will read and return a single character prior to the current cursor position, or a custom length as specified by the optional `length` (int) parameter.
If the specified value for the length parameter exceeds the number of available characters
remaining before the current cursor position, a LexerError exception will be raised.
The previous() method does not advance the cursor position after completing the read, so
on the next read, the same, unmodified cursor position is used. This allows for one or
more characters prior to the current cursor position to be read and checked without
affecting the cursor.
- `consume(length: int = 1)` (None) – The `consume()` method supports moving the current cursor position forwards according to the specified length. By default the method will move the current cursor position a single character forwards from the current cursor position, or the custom length as specified by the optional `length` (int) parameter, adjusting the current cursor position by the relevant number of characters.
The method does not return a value; it simply moves the current cursor position.
If the specified value for the length parameter exceeds the number of available characters
remaining after the current cursor position, a LexerError exception will be raised.
- `push(length: int = 1)` (None) – The `push()` method supports moving the current cursor position backwards according to the specified length. By default the method will move the current cursor position a single character backwards from the current cursor position, or the custom length as specified by the optional `length` (int) parameter, adjusting the current cursor position by the relevant number of characters.
The method does not return a value; it simply moves the current cursor position.
If the specified value for the length parameter exceeds the number of available characters
remaining before the current cursor position, a LexerError exception will be raised.
- `lookbehind(text: str, length: int = 1)` (bool) – The `lookbehind()` method supports checking whether the specified text exactly matches the text of the specified length prior to the current cursor position. If the specified `text` value exactly matches the text prior to the current cursor position, the method returns `True`; otherwise the method returns `False`. By default the method will compare the provided `text` string with a single character of text prior to the current cursor position, or the custom length as specified by the optional `length` (int) parameter.
- `lookahead(text: str, length: int = 1)` (bool) – The `lookahead()` method supports checking whether the specified text exactly matches the text of the specified length from the current cursor position. If the specified `text` value exactly matches the text from the current cursor position, the method returns `True`; otherwise the method returns `False`. By default the method will compare the provided `text` string with a single character of text from the current cursor position, or the custom length as specified by the optional `length` (int) parameter.
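Taken together, the cursor methods above can be illustrated with a small standalone sketch. The `MiniCursor` class below is not part of the lexographer library; it simply mimics the documented `read()`, `peek()`, `consume()` and `lookahead()` semantics (using `ValueError` as a stand-in for the library's `LexerError`):

```python
# Standalone sketch of the documented cursor semantics; NOT the lexographer
# implementation. ValueError stands in for the library's LexerError.

class MiniCursor:
    def __init__(self, text: str):
        self.text = text
        self.index = 0  # zero-indexed cursor position

    def read(self, length: int = 1) -> str:
        # Returns up to `length` characters and advances the cursor;
        # truncates at end of input, as read() is documented to do.
        chunk = self.text[self.index:self.index + length]
        self.index += len(chunk)
        return chunk

    def peek(self, length: int = 1) -> str:
        # Reads without advancing; raises if not enough characters remain.
        if self.index + length > len(self.text):
            raise ValueError("not enough characters remaining")
        return self.text[self.index:self.index + length]

    def consume(self, length: int = 1) -> None:
        # Advances the cursor without returning a value.
        if self.index + length > len(self.text):
            raise ValueError("not enough characters remaining")
        self.index += length

    def lookahead(self, text: str) -> bool:
        # True if `text` matches exactly at the current cursor position.
        return self.text.startswith(text, self.index)

cursor = MiniCursor("abcdef")
print(cursor.read(2))          # "ab" – cursor advances to index 2
print(cursor.peek(2))          # "cd" – cursor stays at index 2
cursor.consume(1)              # cursor moves to index 3
print(cursor.lookahead("de"))  # True
```

The key design point the library's API reflects is that `read()` and `consume()` mutate the cursor while `peek()`, `lookahead()` and `lookbehind()` do not, which is what makes backtracking-free single-pass lexing loops straightforward to write.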
The Lexer class provides the following properties:
- `text` (str) – The `text` property provides access to the text string that the `Lexer` class was instantiated with, either via the initializer's `text` argument or by reading the file specified via the initializer's `file` argument.
- `file` (str | None) – The `file` property provides access to the file path that the `Lexer` class was instantiated with, via the initializer's `file` argument, if one was specified during class initialization.
- `length` (int) – The `length` property provides access to the length of the text string that the `Lexer` class was instantiated with, either via the initializer's `text` argument or by reading the file specified via the initializer's `file` argument.
- `index` (int) – The `index` property provides access to the current cursor index position within the text string that the `Lexer` is processing. During class initialization the index is set to its default value of `0` and will advance from there, or be moved backwards as necessary, according to the method calls made to the `Lexer`. The value will never be less than `0` nor more than the length of the text string being processed less one, as the `index` is a zero-indexed position rather than a one-indexed position. This same value is also available via the `Position` instance's `index` property.
- `position` (Position) – The `position` property provides access to the current cursor position via an instance of the `Position` class, which provides access to the cursor's current zero-indexed position, as well as the relevant line and column numbers. See the information below regarding the `Position` class and its properties.
- `line` (int) – The `line` property provides access to the line number corresponding with the cursor's current position in the text string being processed. This same value is also available via the `Position` instance's `line` property.
- `column` (int) – The `column` property provides access to the column number corresponding with the cursor's current position in the text string being processed. This same value is also available via the `Position` instance's `column` property.
- `characters` (str) – The `characters` property provides access to the most recently read character or characters, read via the `read()` method. The length of the returned string will depend on whether the `read()` method was called with a custom `length` value. The value returned by the `characters` property is only affected by calls to the `read()` method, not by any other `Lexer` class method such as `peek()`.
Position Class
The Position class supports reporting the Lexer class' current cursor position within
the text being lexed, including the current character index position, and the corresponding
line and column numbers.
The Position class offers the following methods:
- `copy()` (Position) – The `copy()` method creates an exact copy of the current `Position` class instance and returns it.
- `adjust(offset: int, line: int = None, column: int = None)` (Position) – The `adjust()` method supports modifying the attributes of the current `Position` class instance, such as changing the current `index` relative to the specified `offset` – if the `offset` has a positive value, it will be added to the current `index` value, and if it has a negative value it will be subtracted from the current `index` value; the `line` and `column` values can be set to new values if specified; if the `line` and `column` values are not specified they will default to `0`, indicating that no `line` or `column` value is available, as both `line` and `column` numbering starts at `1`.
The Position class offers the following properties:
- `index` (int) – The `index` property provides access to the current cursor index position within the text string that the `Lexer` is processing. During class initialization the index is set to its default value of `0` and will advance from there, or be moved backwards as necessary, according to the method calls made to the `Lexer`. The value will never be less than `0` nor more than the length of the text string being processed less one, as the `index` is a zero-indexed position rather than a one-indexed position.
- `line` (int) – The `line` property provides access to the line number corresponding with the cursor's current position in the text string being processed.
- `column` (int) – The `column` property provides access to the column number corresponding with the cursor's current position in the text string being processed.
Instances of the Position class should not need to be created manually; rather, instances are
returned whenever the Lexer class' position property is accessed.
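The relationship between the zero-indexed cursor `index` and the one-indexed `line` and `column` numbers can be sketched with a small standalone helper. This is illustrative only; it is not the library's `Position` implementation:

```python
# Illustrative sketch: derive one-indexed line and column numbers from a
# zero-indexed cursor index, as described for the Position class above.
# This is not the lexographer implementation, only the documented semantics.

def line_and_column(text: str, index: int) -> tuple[int, int]:
    processed = text[:index]            # everything before the cursor
    line = processed.count("\n") + 1    # line numbering starts at 1
    last_newline = processed.rfind("\n")
    column = index - last_newline       # column numbering also starts at 1
    return line, column

text = "abc\ndef"
print(line_and_column(text, 0))  # (1, 1) – start of the text
print(line_and_column(text, 5))  # (2, 2) – the "e" on the second line
```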
Tokenizer Class
The Tokenizer class provides support for translating the provided text into a series
of Token class instances which represent all or part of the lexed text.
The Tokenizer class offers the following methods:
- `__len__()` (int) – The `__len__()` method provides access to the current length of the `Tokenizer` class' internal list of tokenized tokens.
- `__iter__()` (Tokenizer) – The `__iter__()` method provides support for iterating over the `Tokenizer` class' internal list of tokenized tokens using standard Python iterator patterns.
- `__next__()` (Token) – The `__next__()` method provides support for iterating over the `Tokenizer` class' internal list of tokenized tokens using standard Python iterator patterns.
- `next(offset: int = 0)` (Token | None) – The `next()` method provides support for obtaining the next `Token` from the `Tokenizer` class' internal list of tokenized tokens, either at the current cursor position, or at the current cursor position plus the offset specified via the optional `offset` argument. The `next()` method will also advance the current cursor position within the `Tokenizer` class' internal list of tokenized tokens. If the specified offset is out of the bounds of the list, the method will return `None`.
- `seek(index: int = 0)` (Token | None) – The `seek()` method provides support for seeking to the specified index within the `Tokenizer` class' internal list of tokenized tokens. The `seek()` method will seek back to the beginning of the list by default, unless a different index position is specified. The specified `index` position must be between `0` and the current length of the token list less one, as the value is zero-indexed.
- `peek(offset: int = 0)` (Token | None) – The `peek()` method provides support for obtaining the next `Token` from the `Tokenizer` class' internal list of tokenized tokens, either at the current cursor position, or at the current cursor position plus the offset specified via the optional `offset` argument. The `peek()` method will not advance the current cursor position within the `Tokenizer` class' internal list of tokenized tokens, so it allows a token to be checked without affecting the current list position. If the specified offset is out of the bounds of the list, the method will return `None`.
- `previous(offset: int = 0)` (Token | None) – The `previous()` method provides support for obtaining the previous `Token` from the `Tokenizer` class' internal list of tokenized tokens, prior to the current cursor position, or at the current cursor position plus the offset specified via the optional `offset` argument. The `previous()` method will modify the current cursor position within the `Tokenizer` class' internal list of tokenized tokens to the position of the token being retrieved. If the specified offset is out of the bounds of the list, the method will return `None`.
- `parse()` (None) – The `parse()` abstract method must be implemented in custom subclass implementations of the `Tokenizer` base class in order to tokenize the provided source text into one or more `Token` class instances. See the documentation and the test suite for examples of how to implement a custom `Tokenizer` subclass and override the `parse()` method.
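The general shape of a `parse()` implementation – walk the input character by character and emit typed tokens – can be sketched with a standalone example. Note that this sketch does not use the lexographer base classes; the `MiniToken` dataclass and the token type names (`"word"`, `"number"`, `"punctuation"`) are stand-ins for illustration only:

```python
# Standalone sketch of the tokenizing pattern described above: walk the
# source character by character and emit typed tokens. MiniToken is a
# stand-in for the library's Token class, not the real implementation.

from dataclasses import dataclass

@dataclass
class MiniToken:
    type: str   # e.g. "word", "number", "punctuation"
    text: str

def parse(source: str) -> list[MiniToken]:
    tokens, index = [], 0
    while index < len(source):
        char = source[index]
        if char.isspace():
            index += 1  # skip whitespace between tokens
        elif char.isalpha():
            start = index
            while index < len(source) and source[index].isalpha():
                index += 1
            tokens.append(MiniToken("word", source[start:index]))
        elif char.isdigit():
            start = index
            while index < len(source) and source[index].isdigit():
                index += 1
            tokens.append(MiniToken("number", source[start:index]))
        else:
            tokens.append(MiniToken("punctuation", char))
            index += 1
    return tokens

for token in parse("count = 42;"):
    print(token.type, repr(token.text))
```

In a real `Tokenizer` subclass, the equivalent loop would drive the `Lexer` cursor methods described earlier (`read()`, `peek()`, `consume()`) rather than indexing the string directly.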
The Tokenizer class offers the following properties:
- `lexer` (Lexer) – The `lexer` property provides access to the current `Lexer` class instance associated with and being used by the `Tokenizer` class instance.
- `text` (str) – The `text` property provides access to the current text being tokenized by the `Tokenizer` instance, where the text is either text that was directly supplied to the `Tokenizer` class instance or sourced from the file that was specified during instantiation.
- `file` (str) – The `file` property provides access to the current file being tokenized by the `Tokenizer` instance, if one was specified during instantiation.
- `tokens` (list[Token]) – The `tokens` property provides access to the list of `Token` class instances which have been tokenized by the `Tokenizer` instance from the provided source text.
- `context` (Context) – The `context` property provides access to the current tokenizer context, a custom property that can be used to keep track of where within the source text the `Tokenizer` is currently working. The property returns the `Context` enumeration value that was assigned within the custom `Tokenizer` subclass' `parse()` method, which can be one of the default `Context` enumeration values provided by the library, or a custom `Context` enumeration value that has been added to the `Context` enumeration class. See the `Context` enumeration documentation below for more details.
- `index` (int) – The `index` property provides access to the current position within the list of tokens that have been tokenized from the provided text or file; this property keeps track of the current position for the `Tokenizer` class' iterator support.
- `length` (int) – The `length` property provides access to the current length of the list of tokens that have been tokenized from the provided text or file.
- `level` (int) – The `level` property provides access to the current level of depth within the source text that the `Tokenizer` is processing; this property can be used in the custom `parse()` method implementation within a custom `Tokenizer` subclass to keep track of tokenizing within multi-level structured text, such as indentation levels within code or queries. The `level` property value is not directly managed by the `Tokenizer` base class implementation, and if not used will simply return `0`.
- `character` (str) – The `character` property provides access to the current character within the source text that the `Tokenizer` is processing.
- `token` (Token) – The `token` property is provided as a convenience for appending the most recently tokenized `Token` instance to the internal list of tokens held by the `Tokenizer` class instance, and for updating the token length counter. The `token` property does not provide a getter implementation, only a setter implementation.
Token Class
The Token class provides support for representing a tokenized piece of lexed text,
including attributes such as the position at which the piece of lexed text originated in the
source text, and other assigned attributes such as its type or purpose within the source
text, for example whether it represents punctuation or a word.
The Token class offers the following methods:
- `__str__()` (str) – The `__str__()` method returns a string representation of the token for logging or debugging purposes.
- `__repr__()` (str) – The `__repr__()` method returns a string representation of the token for debugging purposes, with more detail than the string representation provided by `__str__()`.
The Token class offers the following properties:
- `tokenizer` (Tokenizer) – The `tokenizer` property provides access to the current `Tokenizer` class instance that is associated with the `Token` class instance.
- `lexer` (Lexer) – The `lexer` property provides access to the current `Lexer` class instance that is associated with the `Token` class instance.
- `type` (Type) – The `type` property provides access to the `Token` class' assigned `type` value, which is a `Type` enumeration value specifying what type of token is represented by the `Token` class instance, such as a punctuation character, a word or a number. The `type` value will be one of the values provided by the `Type` enumeration class by default, or a custom enumeration option value added to the `Type` class during customization.
- `name` (str) – The `name` property provides access to the name of the type of the `Token`, as assigned to the `Token` class' `type` property.
- `position` (Position) – The `position` property provides access to the `Token` class' position within the source text.
- `length` (int) – The `length` property provides access to the length of the `Token` class' text value, which is the substring of one or more characters from the source text represented by this token.
- `text` (str) – The `text` property provides access to the `Token` class' text value, which is the substring of one or more characters from the source text represented by this token.
- `level` (int) – The `level` property provides access to the `Token` class' level value, which, if assigned during tokenization, will be the relevant level within the source text that the token was obtained from. This property is not managed by the library itself, but can be set in custom tokenization code to record the relevant level of the source text that a token exists at; this is particularly useful when tokenizing multi-level structured text, such as code where there may be different levels of indentation.
- `printable` (str) – The `printable` property provides access to a printable version of the `Token` class' text value, where a subset of special characters, such as tabs, spaces, new lines and carriage returns, are translated to a printable character for debugging purposes, and all remaining characters are returned as-is.
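A printable translation of this kind can be sketched in a few lines. The replacement characters used below are assumptions chosen for illustration; the library's actual substitutions may differ:

```python
# Illustrative sketch of a "printable" translation like the one described
# above: map a subset of whitespace characters to visible stand-ins for
# debugging. The replacement characters here are assumptions, not
# necessarily the library's actual choices.

PRINTABLE = {" ": "␠", "\t": "␉", "\n": "␤", "\r": "␍"}

def printable(text: str) -> str:
    # Translate known whitespace characters; return all others as-is.
    return "".join(PRINTABLE.get(char, char) for char in text)

print(printable("a\tb\n"))  # a␉b␤
```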
Tokens Class
The Tokens class provides support for creating collections of one or more Token class
instances, which can be moved through like lists using standard Python iterator functionality.
The Tokens class offers the following methods:
- `clear()` (None) – The `clear()` method provides support for clearing the current list of `Token` class instances from the internal list maintained by the `Tokens` class.
- `add(token: Token)` (None) – The `add()` method provides support for adding a `Token` to the internal list of `Token` class instances maintained by the `Tokens` class.
- `__len__()` (int) – The `__len__()` method provides access to the current length of the internal list of `Token` class instances maintained by the `Tokens` class instance.
- `__iter__()` (Tokens) – The `__iter__()` method provides support for iterating over the `Tokens` class' internal list of `Token` class instances using standard Python iteration patterns.
- `__next__()` (Token) – The `__next__()` method provides support for iterating over the `Tokens` class' internal list of `Token` class instances using standard Python iteration patterns.
The Tokens class offers the following properties:
- `context` (Context) – The `context` property provides access to the context value that was assigned to the `Tokens` class instance at the time of instantiation.
- `name` (str) – The `name` property provides access to the name value that was assigned to the `Tokens` class instance at the time of instantiation.
- `note` (str) – The `note` property provides access to the note value that was assigned to the `Tokens` class instance at the time of instantiation.
- `level` (int) – The `level` property provides access to the level value obtained from the tokens held by the `Tokens` class instance. It is expected that all of the `Token` class instances held by a `Tokens` class instance will have originated from the same level within the source text, and as such will all have been assigned the same level value, or that their level value was not set and defaulted to `0`. If one or more of the `Token` class instances held by a `Tokens` class instance are found to have originated from more than one level of the source text, a `TokenizerError` exception will be raised.
- `token` (Token) – The `token` property is provided as a convenience for adding a `Token` class instance to the internal list of `Token` class instances held by the `Tokens` class instance. The `token` property only defines the setter method to support this convenience, and does not provide a getter method implementation.
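The iterable-collection behavior described above follows the standard Python iterator protocol, which can be sketched with a standalone stand-in (this is not the library's Tokens class, and it supports only one active iteration at a time):

```python
# Standalone sketch of an iterable token collection like the Tokens class
# described above: __len__, add()/clear(), and standard iteration support.
# This mirrors the documented behavior, not the library's implementation,
# and supports only one active iteration at a time.

class MiniTokens:
    def __init__(self):
        self._tokens = []
        self._index = 0

    def add(self, token) -> None:
        self._tokens.append(token)

    def clear(self) -> None:
        self._tokens.clear()

    def __len__(self) -> int:
        return len(self._tokens)

    def __iter__(self):
        self._index = 0  # rewind so each for-loop starts at the beginning
        return self

    def __next__(self):
        if self._index >= len(self._tokens):
            raise StopIteration
        token = self._tokens[self._index]
        self._index += 1
        return token

tokens = MiniTokens()
tokens.add("alpha")
tokens.add("beta")
print(len(tokens))              # 2
print([token for token in tokens])  # ['alpha', 'beta']
```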
Parser Class
The Parser class provides support for creating custom Parser subclasses that can be
used to parse through the tokenized text and to generate custom output.
The Parser class offers the following methods:
- `parse()` (None) – The `parse()` abstract method must be implemented in custom subclass implementations of the `Parser` base class in order to parse the provided tokens into the desired output. See the documentation and the test suite for examples of how to implement a custom `Parser` subclass and override the `parse()` method.
The Parser class offers the following properties:
- `text` (str) – The `text` property provides access to the text that has been tokenized by the custom `Tokenizer` subclass into the tokens being used by the parser.
- `encoding` (str | None) – The `encoding` property provides access to the specified encoding of the source text, if any, that was set at the time of instantiation.
- `tokenizer` (Tokenizer) – The `tokenizer` property provides access to the current `Tokenizer` class instance associated with the current `Parser` class instance.
- `context` (Context | None) – The `context` property provides access to the specified context of the `Parser`, if any, returned as a `Context` enumeration class value. This value is not maintained by the library, but rather can be used in custom subclass implementations of the `Parser` class to keep track of context state.
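The abstract-base pattern described above can be sketched with a standalone example. The `MiniParser` base class and `JoiningParser` subclass below are illustrative stand-ins, not the lexographer classes, and the token list here is a plain list of strings rather than Token instances:

```python
# Standalone sketch of the parsing pattern described above: a base class
# with an abstract parse() that subclasses override to walk tokens and
# build a custom output. Illustrative only; not the lexographer classes.

from abc import ABC, abstractmethod

class MiniParser(ABC):
    def __init__(self, tokens: list[str]):
        self.tokens = tokens

    @abstractmethod
    def parse(self):
        """Subclasses must implement parse() to produce the desired output."""

class JoiningParser(MiniParser):
    def parse(self) -> str:
        # A trivial "output": join the word tokens with single spaces.
        return " ".join(self.tokens)

parser = JoiningParser(["hello", "lexer", "world"])
print(parser.parse())  # hello lexer world
```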
Example Usage
See the test suite for example usage, including examples of custom Tokenizer and Parser
subclasses used to parse raw structured text into tokens and downstream outputs.
Unit Tests
The Lexographer library includes a suite of comprehensive unit tests which ensure that
the library functionality operates as expected. The unit tests were developed with and
are run via pytest.
To ensure that the unit tests are run within a predictable runtime environment where all
of the necessary dependencies are available, a Docker image is
created within which the tests are run. To run the unit tests, ensure Docker and Docker
Compose are installed, and run the following
commands, which will build the Docker image via docker compose build and then run the
tests via docker compose run – the output of running the tests will be displayed:
$ docker compose build
$ docker compose run tests
To run the unit tests with optional command line arguments being passed to pytest, append
the relevant arguments to the docker compose run tests command, as follows, for example
passing -vv to enable verbose output:
$ docker compose run tests -vv
See the pytest documentation regarding available optional command line arguments.
Copyright & License Information
Copyright © 2026 Daniel Sissman; licensed under the MIT License.