pygmars

Craft simple regex-based small language lexers and parsers. Build parsers from grammars and accept Pygments lexers as an input. Derived from NLTK.

Project description

https://github.com/aboutcode-org/pygmars

pygmars is a simple lexing and parsing library designed to craft lightweight lexers and parsers using regular expressions.

pygmars allows you to craft simple lexers that recognize words based on regular expressions and to identify sequences of words using lightweight grammars to obtain a parse tree.

The lexing task transforms a sequence of words or strings (e.g. text already split into words) into a sequence of Token objects, assigning a label to each word and tracking its position and line number.
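The lexing step described above can be illustrated with a minimal standard-library sketch. It does not use the pygmars API: SketchToken, MATCHERS and lex_lines are hypothetical names chosen for this illustration, and the label set is invented. It only shows the general idea of labeling each word with the first matching regular expression while recording its line number and position.

```python
import re
from dataclasses import dataclass


@dataclass
class SketchToken:
    """Hypothetical stand-in for a token; this is not the pygmars Token class."""
    value: str
    label: str
    start_line: int
    pos: int


# Ordered (regex, label) pairs: the first regex that matches a whole word wins.
# These patterns and labels are assumptions made for this example only.
MATCHERS = [
    (re.compile(r"for|while|if"), "KEYWORD"),
    (re.compile(r"\d+"), "INT"),
    (re.compile(r"="), "EQUAL"),
    (re.compile(r"[A-Za-z_]\w*"), "VARNAME"),
]


def lex_lines(text):
    """Yield one SketchToken per whitespace-separated word of text."""
    pos = 0
    for line_num, line in enumerate(text.splitlines(), start=1):
        for word in line.split():
            label = "UNKNOWN"
            for regex, candidate in MATCHERS:
                if regex.fullmatch(word):
                    label = candidate
                    break
            yield SketchToken(word, label, line_num, pos)
            pos += 1


tokens = list(lex_lines("x = 42\nfor y"))
for token in tokens:
    print(token)
```

Because the matchers are tried in order, "for" is labeled KEYWORD before the more general VARNAME pattern gets a chance to match it.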

In particular, the lexing output is designed to be compatible with the output of Pygments lexers. It becomes possible to build simple grammars on top of existing Pygments lexers to perform lightweight parsing of the many (130+) programming languages supported by Pygments.

The parsing task transforms a sequence of Tokens into a parse Tree where each node in the tree is recognized and assigned a label. Parsing uses regular expression-based grammar rules to recognize Token sequences.

These rules are evaluated sequentially and not recursively: this keeps things simple and works very well in practice. This approach and the rules syntax have been battle-tested with NLTK, from which pygmars is derived.
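To make the sequential evaluation concrete, here is a small standard-library sketch. It is an illustration under stated assumptions, not the pygmars implementation: the RULES list, the labels, and the apply_rule and parse helpers are all hypothetical. Each rule is a regular expression over a string of encoded labels such as "<COPY><YEAR><NN>", and each rule is applied exactly once, in order, to the output of the previous rule.

```python
import re

# Hypothetical rules, applied in order: (new node label, regex over "<LABEL>" items).
RULES = [
    ("NAME", r"(?:<NN>)+"),
    ("COPYRIGHT", r"<COPY><YEAR><NAME>"),
]


def apply_rule(nodes, label, pattern):
    """Group each run of nodes whose labels match pattern under a new node."""
    # Encode the label sequence as one string, e.g. "<COPY><YEAR><NN><NN>".
    encoded = "".join(f"<{node_label}>" for node_label, _ in nodes)
    # Map the character offset where each encoded label starts to its node index.
    starts = {}
    offset = 0
    for index, (node_label, _) in enumerate(nodes):
        starts[offset] = index
        offset += len(node_label) + 2
    starts[offset] = len(nodes)

    result = []
    cursor = 0
    for match in re.finditer(pattern, encoded):
        if match.start() == match.end():
            continue  # ignore empty matches
        begin, end = starts[match.start()], starts[match.end()]
        result.extend(nodes[cursor:begin])
        result.append((label, nodes[begin:end]))
        cursor = end
    result.extend(nodes[cursor:])
    return result


def parse(tokens, rules=RULES):
    """Apply every rule once, in order; there is no recursion."""
    nodes = list(tokens)
    for label, pattern in rules:
        nodes = apply_rule(nodes, label, pattern)
    return nodes


tokens = [("COPY", "Copyright"), ("YEAR", "2021"), ("NN", "Jane"), ("NN", "Doe")]
tree = parse(tokens)
print(tree)
```

Rule order matters in this sketch: the COPYRIGHT rule can only match because the NAME rule has already run. With the two rules reversed, no COPYRIGHT node would be produced.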

What about the name?

“pygmars” is a portmanteau of Pyg-ments and Gram-mars.

Origins

This library is based on heavily modified, simplified and remixed original code from the NLTK regex POS tagger (renamed lexer) and regex chunker (renamed parser). The original usage of NLTK was designed by @savinos to parse copyright statements in ScanCode Toolkit.

Users

pygmars is used by ScanCode Toolkit for copyright detection and for lightweight programming language parsing.

Why pygmars?

Why create this seemingly redundant library? Why not use NLTK directly?

NLTK has a specific focus on NLP, and lexing/tagging and parsing using regexes is a tiny part of its overall feature set. These are part of a rich set of taggers and parsers that implement a common API. We do not need these richer APIs, and they make evolving the API and refactoring the code difficult.
In particular, NLTK POS tagging and chunking have been the engine used in ScanCode Toolkit copyright and author detection, and there are some improvements, simplifications and optimizations that would be difficult to implement in NLTK directly and unlikely to be accepted upstream. For instance, simplifying the code subset used for copyright detection enabled a big boost in performance. Improvements to track the Token lines and positions may not have been possible within the NLTK API.
Newer versions of NLTK have several extra required dependencies that we do not need. This in turn makes every tool heavier and more complex when it only uses this limited NLTK subset. By stripping unused NLTK code, we get a small and focused library with no dependencies.
ScanCode Toolkit also needs lightweight parsing of several programming languages to extract metadata (such as dependencies) from package manifests. Some parsers have been built by hand (such as gemfileparser), or use the Python ast module (for Python setup.py), or use existing Pygments lexers as a base. A goal of this library is to enable building lightweight parsers that reuse a Pygments lexer output as the input for a grammar. This is fairly different from NLP in terms of goals.

Theory of operations

A pygmars.lex.Lexer creates a sequence of pygmars.Token objects such as:

Token(value="for", label="KEYWORD", start_line=12, pos=4)

where the label is a symbol name assigned to this token.

A Token is a terminal symbol. The grammar is composed of rules where the left hand side is a label (a non-terminal symbol) and the right hand side is a regular expression-like pattern over labels.

See https://en.wikipedia.org/wiki/Terminal_and_nonterminal_symbols

A pygmars.parse.Parser is built from a pygmars.parse.Grammar and calling its parse function transforms a sequence of Tokens into a pygmars.tree.Tree parse tree.

The grammar is composed of Rules and loaded from a text with one rule per line such as:

ASSIGNMENT: {<VARNAME> <EQUAL> <STRING|INT|FLOAT>} # variable assignment

Here the left hand side “ASSIGNMENT” label is produced when the right hand side sequence of Token labels “<VARNAME> <EQUAL> <STRING|INT|FLOAT>” is matched. “# variable assignment” is kept as a description for this rule.
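As a rough illustration of how such a rule line can be read, the following standard-library sketch splits the rule into its label, pattern and description, then converts the pattern into a plain Python regular expression over encoded labels. This is a simplified assumption about the technique, not the pygmars parser: parse_rule, pattern_to_regex and RULE_LINE are hypothetical names, and the real rule syntax supports more than this sketch handles.

```python
import re

# Hypothetical rule-line shape: LABEL: {pattern} # optional description
RULE_LINE = re.compile(
    r"^\s*(?P<label>\w+)\s*:\s*\{(?P<pattern>.*)\}\s*(?:#\s*(?P<description>.*))?$"
)


def parse_rule(line):
    """Split a rule line into (label, pattern, description)."""
    match = RULE_LINE.match(line)
    if not match:
        raise ValueError(f"Not a rule: {line!r}")
    description = (match["description"] or "").strip()
    return match["label"], match["pattern"].strip(), description


def pattern_to_regex(pattern):
    """Turn "<A> <B|C>" into a regex over an encoded label string like "<A><B>"."""
    # Spaces between items are only there for readability.
    compact = re.sub(r"\s+", "", pattern)
    # Keep each alternation local to its item: <B|C> becomes <(?:B|C)>.
    return re.sub(r"<([^<>]+)>", r"<(?:\1)>", compact)


label, pattern, description = parse_rule(
    "ASSIGNMENT: {<VARNAME> <EQUAL> <STRING|INT|FLOAT>} # variable assignment"
)
regex = re.compile(pattern_to_regex(pattern))

# Encode a sequence of token labels the same way and test it against the rule.
labels = ["VARNAME", "EQUAL", "INT"]
encoded = "".join(f"<{name}>" for name in labels)
matched = regex.fullmatch(encoded) is not None
print(label, description, matched)
```

Wrapping each item in a non-capturing group keeps the "|" alternation scoped to a single label position, so "<STRING|INT|FLOAT>" accepts exactly one of the three labels.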

License

SPDX-License-Identifier: Apache-2.0

Based on a substantially modified subset of the Natural Language Toolkit (NLTK) http://nltk.org/

Project details

Release history

1.0.0 (this version): Jul 16, 2025
0.9.0: Sep 4, 2024
0.8.1: Aug 14, 2024
0.8.0: Nov 25, 2022
0.7.0: Jul 26, 2021
0.7.0b3 (pre-release): Jul 14, 2021
0.7.0b2 (pre-release): Jul 11, 2021
0.6.0b1 (pre-release): Jul 7, 2021
0.5.0: Jun 15, 2021

Download files

Source Distribution

pygmars-1.0.0.tar.gz (86.8 kB)

Uploaded Jul 16, 2025 Source

Built Distribution

pygmars-1.0.0-py3-none-any.whl (32.8 kB)

Uploaded Jul 16, 2025 Python 3

File details

Details for the file pygmars-1.0.0.tar.gz.

File metadata

Download URL: pygmars-1.0.0.tar.gz
Upload date: Jul 16, 2025
Size: 86.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for pygmars-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`de5c6673941eb4c5965f219e64b6638d08237ed76aa7d412ee29819c90a93936`
MD5	`a63a1f3e8705b06555376d94ef5371f2`
BLAKE2b-256	`51e284b8d7329ae869f2c7ed8b9e50c035806e50b3b1977ae232fc1d78404644`

File details

Details for the file pygmars-1.0.0-py3-none-any.whl.

File metadata

Download URL: pygmars-1.0.0-py3-none-any.whl
Upload date: Jul 16, 2025
Size: 32.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for pygmars-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`69b75840f28ff5489de69b2604f100b0550a6ceee4e6aaefd60bc5d4b0025728`
MD5	`6afde30e0f1d81e440656e4aa2d9a78e`
BLAKE2b-256	`bb4b463043f8c9967cbeaa3e712c7d62e6d81a8bc66c43eb5f2e4bee28cdedb8`

