
Utilities for regex and grammar parsing and constraining

Project description

Grammar utilities

This repository contains Python utilities (backed by Rust) to parse and constrain text with regular expressions and context-free grammars (LR(1)). Parsing is supported both for prefixes and full strings.

Context-free grammars already included in this repository are:

  • JSON
  • SPARQL

Installation

You can install the Python package from PyPI:

pip install grammar-utils

Only Linux is currently supported when installing from PyPI. Windows is causing some issues in CI, so builds for this platform are not yet available.

Alternatively, you can clone this repository and build the package yourself:

git clone https://github.com/bastiscode/grammar-utils
cd grammar-utils
pip install "maturin[patchelf]"
maturin develop --release

Usage

Two use cases are supported by this library: parsing and constraining.

Parsing

Given a context-free grammar, parse a string and return the corresponding parse tree.

from grammar_utils.parse import load_lr1_parser

parser = load_lr1_parser("json")
tree = parser.parse('{"key": "value"}')
print(tree)
# you can also get a pruned parse tree by skipping empty nodes and collapsing single-child nodes
pruned_tree = parser.parse('{"key": "value"}', skip_empty=True, collapse_single=True)
print(pruned_tree)

Parsing is also supported for prefixes, in which case the input should be a list of bytes rather than a string. A parse tree covering the terminals that are already fixed is returned, together with the remaining suffix of the input for which the next terminal is not yet determined.

from grammar_utils.parse import load_lr1_parser

parser = load_lr1_parser("json")
tree, rest = parser.prefix_parse(b'{"key"')
print(tree)
print(rest)
# pruning is also supported here
pruned_tree, rest = parser.prefix_parse(b'{"key"', skip_empty=True, collapse_single=True)
print(pruned_tree)
print(rest)
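
To build intuition for the terminals/rest split (independent of this library's LR(1) machinery), here is a toy sketch: it greedily consumes complete JSON terminals from the front of a byte prefix and returns whatever trailing bytes do not yet form a complete terminal. The function `toy_prefix_lex` and its terminal set are illustrative assumptions, not part of grammar-utils, and the real `prefix_parse` returns a tree rather than a flat token list.

```python
import re

# toy sketch, NOT this library's implementation: consume complete JSON
# terminals from the front of a byte prefix; bytes that do not yet form a
# complete terminal are returned as the unconsumed rest
TERMINAL = re.compile(
    rb'[{}\[\]:,]'          # structural characters
    rb'|"(?:[^"\\]|\\.)*"'  # a fully closed string literal
    rb'|true|false|null'    # keyword literals
)

def toy_prefix_lex(data: bytes) -> tuple[list[bytes], bytes]:
    terminals, pos = [], 0
    while pos < len(data):
        if data[pos:pos + 1].isspace():
            pos += 1
            continue
        match = TERMINAL.match(data, pos)
        if match is None:
            break  # the next terminal is still incomplete
        terminals.append(match.group())
        pos = match.end()
    return terminals, data[pos:]

print(toy_prefix_lex(b'{"key"'))  # ([b'{', b'"key"'], b'')
print(toy_prefix_lex(b'{"ke'))    # ([b'{'], b'"ke')
```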

You can also use your own grammars.

from grammar_utils.parse import LR1Parser

# define your own grammar and lexer
grammar = "..."
lexer = "..."
parser = LR1Parser(grammar, lexer)

Constraining

Constraints are used to check what symbols from the vocabulary can follow the current prefix such that the regular expression or context-free grammar can still be satisfied.

import random
from grammar_utils import load_byte_vocab
from grammar_utils.constrain import load_lr1_constraint, load_regex_constraint

vocab = load_byte_vocab()
constraint = load_lr1_constraint("json", vocab)
# reset constraint to a given prefix, default is an empty prefix
constraint.reset(b'{"key"')
# get the next possible symbols
next_indices = constraint.get()
# the indices refer to the vocabulary (decode only for human-readable strings)
print(f"allowed continuations: {[bytes(vocab[i]).decode() for i in next_indices]}")
# you can forward the constraint with a valid index
constraint.next(random.choice(next_indices))
# check if constraint is satisfied (should be False)
print(constraint.is_match())

# same for regular expressions
constraint = load_regex_constraint("boolean", vocab)
constraint.reset(b"tr")
next_indices = constraint.get()
# should only be 'u'
print(f"allowed continuations: {[bytes(vocab[i]).decode() for i in next_indices]}")
constraint.next(next_indices[0])
print(constraint.is_match())
next_indices = constraint.get()
# should only be 'e'
print(f"allowed continuations: {[bytes(vocab[i]).decode() for i in next_indices]}")
constraint.next(next_indices[0])
# should be True
print(constraint.is_match())
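
Conceptually, a constraint answers one question: which vocabulary entries can extend the current prefix such that some string of the language can still be completed? The following pure-Python sketch mirrors the reset/get/next/is_match interface for the trivial two-word language {true, false}; the class `ToyLiteralConstraint` is a hypothetical illustration, not part of grammar-utils.

```python
# hypothetical sketch, NOT part of grammar-utils: a constraint over a
# finite set of literal strings with a byte-level vocabulary
class ToyLiteralConstraint:
    def __init__(self, literals: list[bytes], vocab: list[bytes]):
        self.literals = literals
        self.vocab = vocab
        self.reset()

    def reset(self, prefix: bytes = b"") -> None:
        self.prefix = prefix

    def get(self) -> list[int]:
        # an index is allowed if prefix + token is still a prefix of a literal
        return [
            i for i, token in enumerate(self.vocab)
            if any(lit.startswith(self.prefix + token) for lit in self.literals)
        ]

    def next(self, index: int) -> None:
        self.prefix += self.vocab[index]

    def is_match(self) -> bool:
        return self.prefix in self.literals

vocab = [bytes([b]) for b in range(256)]  # one token per byte
toy_constraint = ToyLiteralConstraint([b"true", b"false"], vocab)
toy_constraint.reset(b"tr")
allowed = toy_constraint.get()
print([vocab[i] for i in allowed])  # [b'u']
toy_constraint.next(allowed[0])
toy_constraint.next(toy_constraint.get()[0])  # forward with the only option, 'e'
print(toy_constraint.is_match())  # True
```

A real LR(1) or regex constraint does the same bookkeeping against automaton states instead of literal string comparisons, which is what makes it tractable for large grammars and vocabularies.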

You can also use your own grammars and regexes.

from grammar_utils import load_byte_vocab
from grammar_utils.constrain import LR1Constraint, RegexConstraint

vocab = load_byte_vocab()

# define your own grammar and lexer
grammar = "..."
lexer = "..."
constraint = LR1Constraint(grammar, lexer, vocab)

# define your own regex
regex = "..."
constraint = RegexConstraint(regex, vocab)

Use cases

Forcing a language model to generate structured text

The following example shows how to use a regex constraint to force GPT2 to output either "true" or "false" after a given prompt.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from grammar_utils.constrain import load_regex_constraint

gpt2 = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
vocab = [
    token.replace("Ġ", " ").encode()
    for token, _ in sorted(tokenizer.get_vocab().items(), key=lambda x: x[1])
]
constraint = load_regex_constraint("boolean", vocab)
prefix = "Constrained decoding is cool: "
input_ids = tokenizer.encode(prefix)
while not (constraint.is_match() or constraint.is_invalid()):
    input_tensor = torch.tensor([input_ids])
    with torch.no_grad():
        logits = gpt2(input_tensor).logits
    # restrict the next-token distribution to the allowed indices
    valid_indices = torch.from_numpy(constraint.get())
    valid_logits = logits[0, -1, valid_indices]
    index = valid_indices[torch.argmax(valid_logits)].item()
    constraint.next(index)
    input_ids.append(index)
    print(tokenizer.decode(input_ids))
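
The loop above can also be sketched without torch or transformers. Here a fixed mock scorer stands in for the model logits, and the tiny vocabulary and target literals are illustrative assumptions; the point is only the pattern: score the vocabulary, intersect with the constraint's allowed indices, take the best, extend the prefix.

```python
# dependency-free sketch of constrained greedy decoding; the vocabulary,
# literals, and scorer are mock stand-ins, not real model components
mock_vocab = [b"tr", b"ue", b"fal", b"se", b"yes", b"no"]
mock_literals = [b"true", b"false"]

def mock_scores(prefix: bytes) -> list[float]:
    # stand-in for model logits: mildly prefers shorter tokens
    return [1.0 / len(token) for token in mock_vocab]

decoded = b""
while decoded not in mock_literals:
    # indices whose token keeps some literal completable (cf. constraint.get())
    allowed_indices = [
        i for i, token in enumerate(mock_vocab)
        if any(lit.startswith(decoded + token) for lit in mock_literals)
    ]
    best = max(allowed_indices, key=lambda i: mock_scores(decoded)[i])
    decoded += mock_vocab[best]
    print(decoded)  # b'tr', then b'true'
```

With the mock scorer above, the unconstrained argmax would happily pick b"ue" or b"yes"; the intersection with the allowed indices is what forces the output into the language.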

Project details


Download files

Download the file for your platform.

Source Distribution

grammar_utils-0.1.4.tar.gz (180.5 kB)

Uploaded: Source

Built Distributions


grammar_utils-0.1.4-cp310-abi3-win_amd64.whl (1.1 MB)

Uploaded: CPython 3.10+, Windows x86-64

grammar_utils-0.1.4-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.4 MB)

Uploaded: CPython 3.10+, manylinux (glibc 2.17+), x86-64

grammar_utils-0.1.4-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (1.4 MB)

Uploaded: CPython 3.10+, manylinux (glibc 2.17+), ARM64

grammar_utils-0.1.4-cp310-abi3-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl (2.5 MB)

Uploaded: CPython 3.10+, macOS 10.12+ universal2 (ARM64, x86-64), macOS 10.12+ x86-64, macOS 11.0+ ARM64

File details

Details for the file grammar_utils-0.1.4.tar.gz.

File metadata

  • Download URL: grammar_utils-0.1.4.tar.gz
  • Upload date:
  • Size: 180.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for grammar_utils-0.1.4.tar.gz
  • SHA256: 677b3b9ddd836f6dced5a43e0a7b2ac7205ff2a6d75aa4617afcadca4ac1e244
  • MD5: bc898f69996442ef48f816a510a9fcd9
  • BLAKE2b-256: 1e3b44273e2a14db965970989a3413016f463f1f26cf654f3b84d185e1edb7b5

See more details on using hashes here.

Provenance

The following attestation bundles were made for grammar_utils-0.1.4.tar.gz:

Publisher: release.yml on bastiscode/grammar-utils

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file grammar_utils-0.1.4-cp310-abi3-win_amd64.whl.

File hashes

Hashes for grammar_utils-0.1.4-cp310-abi3-win_amd64.whl
  • SHA256: 7515b8e52d957476e29a4a46c0334b2ee60ccd4b51175e894548a19570ad1203
  • MD5: 01ba2f80720603e0bd581249b88eef98
  • BLAKE2b-256: a23b4aea0a73d159bebe80f299860a5a8bbc4fc89891f5ed37dba1a7f3681b57


Provenance

The following attestation bundles were made for grammar_utils-0.1.4-cp310-abi3-win_amd64.whl:

Publisher: release.yml on bastiscode/grammar-utils


File details

Details for the file grammar_utils-0.1.4-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for grammar_utils-0.1.4-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
  • SHA256: ecc83b93907f6c103739c7fe173384dceb71d741f590f79a2a7c6175a6c60760
  • MD5: a7b4c4eaeba2e26005006f8dcd92a2ce
  • BLAKE2b-256: b33ffef97fe1862e799571e2ec0256d4b37b137b84d19ebb64abc3aafd224cb3


Provenance

The following attestation bundles were made for grammar_utils-0.1.4-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

Publisher: release.yml on bastiscode/grammar-utils


File details

Details for the file grammar_utils-0.1.4-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File hashes

Hashes for grammar_utils-0.1.4-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
  • SHA256: dc82becc933e384e28b67738f702b85ee7be925f8e3c911791a9bdea132841b2
  • MD5: d92a1ef3980305f0b2189d48bf1baafa
  • BLAKE2b-256: dcad7e6db6b4bb16ac5dadb3bb22cea815b9116de00bba81e2a24c24772a38c2


Provenance

The following attestation bundles were made for grammar_utils-0.1.4-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl:

Publisher: release.yml on bastiscode/grammar-utils


File details

Details for the file grammar_utils-0.1.4-cp310-abi3-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl.

File hashes

Hashes for grammar_utils-0.1.4-cp310-abi3-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl
  • SHA256: dacd27154eb5dc5cc4f9b1d83f211bf7f45bdec6af5eb4ef8acb37e98578286a
  • MD5: c5051c31965ce2093e0aec05bb173cb4
  • BLAKE2b-256: b2522fc94b11d4c27c16e63f6ccb868e2103b0abea2e50e7dd86eb58f5536223


Provenance

The following attestation bundles were made for grammar_utils-0.1.4-cp310-abi3-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl:

Publisher: release.yml on bastiscode/grammar-utils

