Utilities for regex and grammar parsing and constraining
Grammar utilities
This repository contains Python utilities (backed by Rust) to parse and constrain text with regular expressions and context-free grammars (LR(1)). Parsing is supported both for prefixes and full strings.
Context-free grammars already included in this repository are:
- JSON
- SPARQL
Installation
You can install the Python package from PyPI:
```bash
pip install grammar-utils
```
Windows (x64) and Linux are currently supported when installing from PyPI.
Alternatively, you can clone this repository and build the package yourself:
```bash
git clone https://github.com/bastiscode/grammar-utils
cd grammar-utils
pip install maturin[patchelf]
maturin develop --release
```
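If you built the package yourself, a quick check (a minimal sanity test, not part of the official instructions) confirms that the extension loads and the bundled JSON grammar parses:

```python
# minimal sanity check after building from source
from grammar_utils.parse import load_lr1_parser

parser = load_lr1_parser("json")
print(parser.parse("{}"))
```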
Usage
Two use cases are supported by this library: parsing and constraining.
Parsing
Given a context-free grammar, parse a string and return the corresponding parse tree.
```python
from grammar_utils.parse import load_lr1_parser

parser = load_lr1_parser("json")
tree = parser.parse('{"key": "value"}')
print(tree)

# you can also get a pruned parse tree, skipping empty nodes
# and collapsing single-child nodes
pruned_tree = parser.parse('{"key": "value"}', skip_empty=True, collapse_single=True)
print(pruned_tree)
```
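The bundled SPARQL grammar can be loaded the same way; the "sparql" name here is assumed to follow the same convention as "json":

```python
from grammar_utils.parse import load_lr1_parser

# load the bundled SPARQL grammar by name (assumed analogous to "json")
parser = load_lr1_parser("sparql")
tree = parser.parse("SELECT ?s WHERE { ?s ?p ?o }")
print(tree)
```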
Parsing is also supported for prefixes, in which case the input should be bytes rather than a string. A tree for the terminals that are already fixed is returned, together with the remaining suffix of the input for which the next terminal is not yet determined.
```python
from grammar_utils.parse import load_lr1_parser

parser = load_lr1_parser("json")
tree, rest = parser.prefix_parse(b'{"key"')
print(tree)
print(rest)

# pruning is also supported here
pruned_tree, rest = parser.prefix_parse(b'{"key"', skip_empty=True, collapse_single=True)
print(pruned_tree)
print(rest)
```
You can also use your own grammars.
```python
from grammar_utils.parse import LR1Parser

# define your own grammar and lexer
grammar = "..."
lexer = "..."
parser = LR1Parser(grammar, lexer)
```
Constraining
Constraints are used to check what symbols from the vocabulary can follow the current prefix such that the regular expression or context-free grammar can still be satisfied.
```python
import random

from grammar_utils import load_byte_vocab
from grammar_utils.constrain import load_lr1_constraint, load_regex_constraint

vocab = load_byte_vocab()
constraint = load_lr1_constraint("json", vocab)
# reset the constraint to a given prefix, default is an empty prefix
constraint.reset(b'{"key"')
# get the next possible symbols
next_indices = constraint.get()
# the indices refer to the vocabulary (decode only for human-readable output)
print(f"allowed continuations: {[bytes(vocab[i]).decode() for i in next_indices]}")
# you can forward the constraint with a valid index
constraint.next(random.choice(next_indices))
# check if the constraint is satisfied (should be False)
print(constraint.is_match())
```
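To see the constraint driven further than a single step, the sketch below (illustrative only, using just the calls shown above) repeatedly extends the prefix with random valid continuations until it forms valid JSON, becomes invalid, or a step budget runs out:

```python
import random

from grammar_utils import load_byte_vocab
from grammar_utils.constrain import load_lr1_constraint

vocab = load_byte_vocab()
constraint = load_lr1_constraint("json", vocab)
constraint.reset(b'{"key"')
prefix = bytearray(b'{"key"')
for _ in range(128):  # step budget; a random walk may not terminate quickly
    if constraint.is_match() or constraint.is_invalid():
        break
    index = int(random.choice(constraint.get()))
    constraint.next(index)
    prefix += bytes(vocab[index])
print(prefix.decode(errors="replace"))
```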
The same works for regular expressions, continuing the snippet above:

```python
constraint = load_regex_constraint("boolean", vocab)
constraint.reset(b"tr")
next_indices = constraint.get()
# should only be 'u'
print(f"allowed continuations: {[bytes(vocab[i]).decode() for i in next_indices]}")
constraint.next(next_indices[0])
# should be False, "tru" is not yet a full match
print(constraint.is_match())
next_indices = constraint.get()
# should only be 'e'
print(f"allowed continuations: {[bytes(vocab[i]).decode() for i in next_indices]}")
constraint.next(next_indices[0])
# should be True
print(constraint.is_match())
```
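Starting from the empty prefix instead (a small extension of the example above, relying on the documented default of reset), both 't' and 'f' should be allowed as first bytes:

```python
from grammar_utils import load_byte_vocab
from grammar_utils.constrain import load_regex_constraint

vocab = load_byte_vocab()
constraint = load_regex_constraint("boolean", vocab)
constraint.reset()  # default: empty prefix
next_indices = constraint.get()
# should be 't' and 'f', the first bytes of "true" and "false"
print(f"allowed continuations: {[bytes(vocab[i]).decode() for i in next_indices]}")
```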
You can also use your own grammars and regexes.
```python
from grammar_utils import load_byte_vocab
from grammar_utils.constrain import LR1Constraint, RegexConstraint

vocab = load_byte_vocab()

# define your own grammar and lexer
grammar = "..."
lexer = "..."
constraint = LR1Constraint(grammar, lexer, vocab)

# define your own regex
regex = "..."
constraint = RegexConstraint(regex, vocab)
```
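As a concrete illustration, here is a constraint built from a hypothetical stand-in pattern (r"[0-9]{4}", exactly four digits); after two digits, only digit bytes should remain allowed:

```python
from grammar_utils import load_byte_vocab
from grammar_utils.constrain import RegexConstraint

vocab = load_byte_vocab()
# hypothetical pattern: exactly four digits
constraint = RegexConstraint(r"[0-9]{4}", vocab)
constraint.reset(b"12")
next_indices = constraint.get()
# should only be the digit bytes '0' through '9'
print(f"allowed continuations: {[bytes(vocab[i]).decode() for i in next_indices]}")
```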
Use cases
Forcing a language model to generate structured text
The following example shows how to use a regex constraint to force GPT2 to output either "true" or "false" after a given prompt.
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

from grammar_utils.constrain import load_regex_constraint

gpt2 = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
# build a byte-level vocabulary from the GPT2 tokenizer, ordered by token id
vocab = [
    token.replace("Ġ", " ").encode()
    for token, _ in sorted(tokenizer.get_vocab().items(), key=lambda x: x[1])
]
constraint = load_regex_constraint("boolean", vocab)
prefix = "Constrained decoding is cool: "
input_ids = tokenizer.encode(prefix)
while not (constraint.is_match() or constraint.is_invalid()):
    input_tensor = torch.tensor([input_ids])
    logits = gpt2(input_tensor).logits
    # restrict the next token to the indices allowed by the constraint
    valid_indices = torch.from_numpy(constraint.get())
    valid_logits = logits[0, -1, valid_indices]
    index = valid_indices[torch.argmax(valid_logits)]
    constraint.next(index)
    input_ids.append(index)
print(tokenizer.decode(input_ids))
```
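The loop above always picks the highest-scoring valid token. If you prefer sampling, one straightforward variation (a sketch, not part of the library) renormalizes the logits over the valid indices and samples from them:

```python
# replaces the argmax selection inside the loop above
probs = torch.softmax(valid_logits, dim=-1)
choice = torch.multinomial(probs, num_samples=1).item()
index = valid_indices[choice]
constraint.next(index)
input_ids.append(index)
```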