
Utilities for regex and grammar parsing and constraining

Project description

Grammar utilities

This repository contains Python utilities (backed by Rust) to parse and constrain text with regular expressions and LR(1) context-free grammars. Parsing is supported for both prefixes and full strings.

Context-free grammars already included in this repository are:

  • JSON
  • SPARQL

Installation

You can install the Python package from PyPI:

pip install grammar-utils

Only Linux is currently supported when installing from PyPI. Windows builds are not yet available because of issues in CI.

Alternatively, you can clone this repository and build the package yourself:

git clone https://github.com/bastiscode/grammar-utils
cd grammar-utils
pip install "maturin[patchelf]"
maturin develop --release

Usage

Two use cases are supported by this library: parsing and constraining.

Parsing

Given a context-free grammar, parse a string and return the corresponding parse tree.

from grammar_utils.parse import load_lr1_parser

parser = load_lr1_parser("json")
tree = parser.parse('{"key": "value"}')
print(tree)
# you can also get a pruned parse tree, skipping empty nodes or collapsing single-child nodes
pruned_tree = parser.parse('{"key": "value"}', skip_empty=True, collapse_single=True)
print(pruned_tree)

Parsing is also supported for prefixes, in which case the input should be bytes rather than a string. A tree covering the terminals that are already fixed is returned, together with the remaining suffix of the input for which the next terminal is not yet determined.

from grammar_utils.parse import load_lr1_parser

parser = load_lr1_parser("json")
tree, rest = parser.prefix_parse(b'{"key"')
print(tree)
print(rest)
# pruning is also supported here
pruned_tree, rest = parser.prefix_parse(b'{"key"', skip_empty=True, collapse_single=True)
print(pruned_tree)
print(rest)

You can also use your own grammars.

from grammar_utils import load_byte_vocab
from grammar_utils.parse import LR1Parser

vocab = load_byte_vocab()

# define your own grammar and lexer
grammar = "..."
lexer = "..."
parser = LR1Parser(grammar, lexer, vocab)

Constraining

Constraints are used to check what symbols from the vocabulary can follow the current prefix such that the regular expression or context-free grammar can still be satisfied.

import random
from grammar_utils import load_byte_vocab
from grammar_utils.constrain import load_lr1_constraint, load_regex_constraint

vocab = load_byte_vocab()
constraint = load_lr1_constraint("json", vocab)
# reset constraint to a given prefix, default is an empty prefix
constraint.reset(b'{"key"')
# get the next possible symbols
next_indices = constraint.get()
# the indices refer to the vocabulary (decode only for human-readable strings)
print(f"allowed continuations: {[bytes(vocab[i]).decode() for i in next_indices]}")
# you can forward the constraint with a valid index
constraint.next(random.choice(next_indices))
# check if constraint is satisfied (should be False)
print(constraint.is_match())

# same for regular expressions
constraint = load_regex_constraint("boolean", vocab)
constraint.reset(b"tr")
next_indices = constraint.get()
# should only be 'u'
print(f"allowed continuations: {[bytes(vocab[i]).decode() for i in next_indices]}")
constraint.next(next_indices[0])
print(constraint.is_match())
next_indices = constraint.get()
# should only be 'e'
print(f"allowed continuations: {[bytes(vocab[i]).decode() for i in next_indices]}")
constraint.next(next_indices[0])
# should be True
print(constraint.is_match())
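To build intuition for what `reset`, `get`, and `next` are doing, here is a stdlib-only sketch (an illustration, not the grammar_utils API) of prefix constraining for the boolean pattern `true|false` over a single-byte vocabulary:

```python
# Illustration only, independent of grammar_utils: a constraint keeps exactly
# those vocabulary entries whose concatenation with the current prefix can
# still be extended to a full match.
WORDS = (b"true", b"false")  # the language of the "boolean" regex

def allowed(prefix: bytes, vocab: list[bytes]) -> list[int]:
    """Return indices of vocab entries that keep the prefix viable."""
    return [
        i for i, tok in enumerate(vocab)
        if any(w.startswith(prefix + tok) for w in WORDS)
    ]

vocab = [bytes([b]) for b in range(256)]  # one entry per byte
print([vocab[i] for i in allowed(b"tr", vocab)])  # only b"u" remains
```

The real constraints implement the same idea incrementally (and for arbitrary regexes and grammars), so checking the allowed set does not require rescanning the whole prefix on every step.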

You can also use your own grammars and regexes.

from grammar_utils import load_byte_vocab
from grammar_utils.constrain import LR1Constraint, RegexConstraint

vocab = load_byte_vocab()

# define your own grammar and lexer
grammar = "..."
lexer = "..."
constraint = LR1Constraint(grammar, lexer, vocab)

# define your own regex
regex = "..."
constraint = RegexConstraint(regex, vocab)

Use cases

Forcing a language model to generate structured text

The following example shows how to use a regex constraint to force GPT2 to output either "true" or "false" after a given prompt.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from grammar_utils.constrain import load_regex_constraint

gpt2 = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
vocab = [
    token.replace("Ġ", " ").encode()
    for token, _ in sorted(tokenizer.get_vocab().items(), key=lambda x: x[1])
]
constraint = load_regex_constraint("boolean", vocab)
prefix = "Constrained decoding is cool: "
input_ids = tokenizer.encode(prefix)
while not (constraint.is_match() or constraint.is_invalid()):
    input_tensor = torch.tensor([input_ids])
    with torch.no_grad():
        logits = gpt2(input_tensor).logits
    valid_indices = torch.from_numpy(constraint.get())
    valid_logits = logits[0, -1, valid_indices]
    index = valid_indices[torch.argmax(valid_logits)].item()
    constraint.next(index)
    input_ids.append(index)
    print(tokenizer.decode(input_ids))
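Greedy argmax, as above, always picks the highest-scoring allowed token. To sample instead, restrict the softmax to the allowed indices. A small stdlib-only helper (hypothetical, not part of grammar_utils) shows the idea:

```python
import math
import random

def sample_constrained(logits, valid_indices, temperature=1.0):
    """Sample one index from valid_indices, weighted by softmax of its logit."""
    vals = [logits[i] / temperature for i in valid_indices]
    m = max(vals)  # subtract the max for numerical stability
    weights = [math.exp(v - m) for v in vals]
    return random.choices(valid_indices, weights=weights, k=1)[0]
```

In the loop above, the argmax line would become something like `index = sample_constrained(logits[0, -1].tolist(), valid_indices.tolist())`.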

Project details


Download files

Download the file for your platform.

Source Distribution

grammar_utils-0.1.6.tar.gz (181.1 kB)

Uploaded Source

Built Distributions


grammar_utils-0.1.6-cp310-abi3-win_amd64.whl (1.1 MB)

Uploaded CPython 3.10+, Windows x86-64

grammar_utils-0.1.6-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.4 MB)

Uploaded CPython 3.10+, manylinux: glibc 2.17+, x86-64

grammar_utils-0.1.6-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (1.4 MB)

Uploaded CPython 3.10+, manylinux: glibc 2.17+, ARM64

grammar_utils-0.1.6-cp310-abi3-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl (2.5 MB)

Uploaded CPython 3.10+, macOS 10.12+ universal2 (ARM64, x86-64), macOS 10.12+ x86-64, macOS 11.0+ ARM64

File details

Details for the file grammar_utils-0.1.6.tar.gz.

File metadata

  • Download URL: grammar_utils-0.1.6.tar.gz
  • Size: 181.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for grammar_utils-0.1.6.tar.gz
  • SHA256: 3029680eaaf29c97a56a072b73c086d34419d0da469f517975c5971003cf35ca
  • MD5: 20fab8c3ff1a5504df0ff81e60cf62d5
  • BLAKE2b-256: f32ab5d9c8680fcfec3ac3293fa47535463b9e0396664a2616bcc528454dae6b


Provenance

The following attestation bundles were made for grammar_utils-0.1.6.tar.gz:

Publisher: release.yml on bastiscode/grammar-utils

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file grammar_utils-0.1.6-cp310-abi3-win_amd64.whl.

File metadata

File hashes

Hashes for grammar_utils-0.1.6-cp310-abi3-win_amd64.whl
  • SHA256: f92f0ba1d6bdda0a47658ec3cdcf9a0315b602b2a588cb23bd431ce7d2fc64d6
  • MD5: 80e6120103b311711f802c3e3f19be25
  • BLAKE2b-256: 0fa3d1e654b213c180ae6798483c9c67810e4a84afdb098a1d40d4eaf6d9534d


Provenance

The following attestation bundles were made for grammar_utils-0.1.6-cp310-abi3-win_amd64.whl:

Publisher: release.yml on bastiscode/grammar-utils

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file grammar_utils-0.1.6-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for grammar_utils-0.1.6-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
  • SHA256: 609dcef5b6170d256d1306f0c24a76e4bd124966ff9b24b5f5f5491ce2e0bd6f
  • MD5: c659b75cb9f11d98aaf0c99ccbee6bf9
  • BLAKE2b-256: fe0fdc993869986e1c6ee14bda263bd281a1e701552d0acdfaf858b78b141f1b


Provenance

The following attestation bundles were made for grammar_utils-0.1.6-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

Publisher: release.yml on bastiscode/grammar-utils

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file grammar_utils-0.1.6-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for grammar_utils-0.1.6-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
  • SHA256: 1d54b443a89d53985f3bd3f2cbfb9960c170aa6e36e683f5aed7545790f6d572
  • MD5: 5ce0b764036e2808a9bb849f29dad528
  • BLAKE2b-256: 6649519819141cd275aa19c88f772cb10a24ab44e2947a6001de05de4478f707


Provenance

The following attestation bundles were made for grammar_utils-0.1.6-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl:

Publisher: release.yml on bastiscode/grammar-utils

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file grammar_utils-0.1.6-cp310-abi3-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl.

File metadata

File hashes

Hashes for grammar_utils-0.1.6-cp310-abi3-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl
  • SHA256: 6e07505182a1d6f8c0a125d9c6a318dab4a34a8c1a0e7a36ec72b56b8363e46f
  • MD5: c1c62ed5f0384fdd9998f70ee1e6e45a
  • BLAKE2b-256: e43868e41e21aa7973a9891469277b8a2799625d45d2b3c3e24f565c963ca92c


Provenance

The following attestation bundles were made for grammar_utils-0.1.6-cp310-abi3-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl:

Publisher: release.yml on bastiscode/grammar-utils

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
