Enforce the output format (JSON Schema, Regex etc) of a language model

lm-format-enforcer

Solution at a glance

Language models can generate text, but when a precise output format is required, they do not always follow instructions. Various prompt engineering techniques have been introduced to improve the robustness of the generated text, but they are not always sufficient. This project addresses the problem by filtering the tokens that the language model is allowed to generate at every timestep, ensuring that the output format is respected while restricting the language model as little as possible.

Installation

pip install lm-format-enforcer

Simple example

from pydantic import BaseModel
from lmformatenforcer import JsonSchemaParser, generate_enforced

class AnswerFormat(BaseModel):
    first_name: str
    last_name: str
    year_of_birth: int
    num_seasons_in_nba: int

question = f'Please give me information about Michael Jordan. You MUST answer using the following json schema: {AnswerFormat.schema_json()}'
parser = JsonSchemaParser(AnswerFormat.schema())

# Assumes `model`, `tokenizer` and `device` were already set up with the transformers library.
# Call generate_enforced(model, tokenizer, parser, ...) instead of model.generate(...):
inputs = tokenizer([question], return_tensors='pt', add_special_tokens=False, return_token_type_ids=False).to(device)
result = generate_enforced(model, tokenizer, parser, inputs=inputs)
print(result)
# {'first_name': 'Michael', 'last_name': 'Jordan', 'year_of_birth': 1963, 'num_seasons_in_nba': 15}

Capabilities / Advantages

Works with any language model and tokenizer (currently works with transformers, and can be adapted to any Python language model framework)
Supports both JSON Schema (strong) and Regular Expression (limited) formats
Supports both required and optional fields in JSON schemas
Supports nested fields, arrays and dictionaries in JSON schemas
Gives the language model freedom to control whitespace and field ordering in JSON schemas, reducing hallucinations

Detailed example

We created a Google Colab Notebook which contains a full example of how to use this library to enforce the output format of llama2, including interpreting the intermediate results. The notebook can run on a free GPU-backed runtime in Colab.

You can also view the notebook in GitHub.

How does it work?

The library works by combining a character level parser and a tokenizer prefix tree into a smart token filtering mechanism.

An example of the character level parser and tokenizer prefix tree in a certain timestep
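The filtering step itself can be pictured as masking the model's scores before a token is picked. The snippet below is a minimal sketch of that idea only, not the library's implementation; the function name pick_token, the score values and the token ids are all hypothetical.

```python
import math


def pick_token(scores, allowed_token_ids):
    """Return (enforced_choice, unconstrained_choice) for one timestep.

    scores maps token id -> score; allowed_token_ids is the set of ids
    that the required format permits at this position.
    """
    # The token the model would pick if nothing were enforced.
    leading = max(scores, key=scores.get)
    # Push every disallowed token to -inf so it can never win.
    masked = {
        tok: (score if tok in allowed_token_ids else -math.inf)
        for tok, score in scores.items()
    }
    chosen = max(masked, key=masked.get)
    return chosen, leading


# Hypothetical scores for three token ids; the format only allows ids 1 and 2.
scores = {0: 2.5, 1: 0.3, 2: 1.1}
chosen, leading = pick_token(scores, allowed_token_ids={1, 2})
print(chosen, leading)  # 2 0
```

The model's favourite token (id 0) is not allowed, so the best allowed token (id 2) is selected instead. The rest of this section is about how the allowed set is computed.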

Character Level Parser

Parsing a string according to any kind of format can be looked at as an implicit tree structure: at any moment in the parsing process there is a set of allowed next characters, and once one of them is selected, there is a new set of allowed next characters, and so on.

CharacterLevelParser is an interface for parsing according to this implicit structure. add_character() and get_allowed_characters() can be seen as tree traversal methods.

There are several implementations of this interface:

JsonSchemaParser - parses according to a json schema.
StringParser - forces an exact string (used mainly for diagnostics)
RegexParser - parses according to a regular expression. Note that this cannot use the built-in Python regex module and uses a manually implemented one (https://github.com/xysun/regex), so it has very limited capabilities.
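To make the tree-traversal view concrete, here is a toy parser that accepts exactly one string. It is an illustrative sketch and not the library's StringParser: the class name ExactStringToyParser, the can_end() helper, and the choice to have add_character() return a new parser object are assumptions made for this example.

```python
class ExactStringToyParser:
    """Toy character level parser that accepts exactly one target string."""

    def __init__(self, remaining):
        # The part of the target string that has not been consumed yet.
        self.remaining = remaining

    def get_allowed_characters(self):
        # Only the next character of the target is allowed (none at the end).
        return list(self.remaining[:1])

    def add_character(self, ch):
        # Move one step down the implicit tree and return the new node.
        if ch not in self.get_allowed_characters():
            raise ValueError(f"character {ch!r} is not allowed here")
        return ExactStringToyParser(self.remaining[1:])

    def can_end(self):
        # Hypothetical helper: True once the whole target was consumed.
        return self.remaining == ""


parser = ExactStringToyParser("ok")
print(parser.get_allowed_characters())  # ['o']
parser = parser.add_character("o")
print(parser.get_allowed_characters())  # ['k']
parser = parser.add_character("k")
print(parser.can_end())  # True
```

A JSON schema or regular expression parser follows the same pattern, except that several characters are usually allowed at each step.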

Tokenizer Prefix Tree

Given a tokenizer used by a certain language model, we can build a prefix tree of all the tokens that the language model can generate. This is done by going over every token in the tokenizer's vocabulary and adding its character sequence to the tree. See TokenizerPrefixTree
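A prefix tree over a vocabulary can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the library's TokenizerPrefixTree: the names TrieNode and build_prefix_tree are hypothetical, and the five-token vocabulary is made up for the example.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # next character -> TrieNode
        self.tokens = []    # ids of tokens whose text ends exactly at this node


def build_prefix_tree(vocab):
    """Build a prefix tree from a mapping of token id -> decoded token text."""
    root = TrieNode()
    for token_id, text in vocab.items():
        node = root
        for ch in text:
            # Reuse the existing child for this character or create a new one.
            node = node.children.setdefault(ch, TrieNode())
        node.tokens.append(token_id)
    return root


# Made-up vocabulary; real tokenizers have tens of thousands of tokens.
toy_vocab = {0: "{", 1: '"', 2: '",', 3: '","', 4: "a"}
root = build_prefix_tree(toy_vocab)

print(sorted(root.children))  # ['"', 'a', '{']
print(root.children['"'].tokens)  # [1]
print(root.children['"'].children[","].children['"'].tokens)  # [3]
```

Tokens that share a prefix share a path in the tree, so one walk from the root visits every token that starts with a given character sequence.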

Combining the two

Given a character level parser and a tokenizer prefix tree, we can elegantly and efficiently filter the tokens that the language model is allowed to generate at the next timestep: we only traverse the characters that are in BOTH the character level parsing node and the tokenizer prefix tree node. We do this recursively on both trees and return all of the allowed tokens. This allows us to find every allowed token (including complex subword tokens such as "," which are critical in JSON parsing). When the language model generates a token, we advance the character level parser by that token's characters, so that it is ready to filter the next timestep.
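The joint traversal can be sketched as a short recursive function. The code below is a self-contained toy under stated assumptions, not the library's code: TrieNode, build_prefix_tree, ExactStringToyParser and allowed_tokens are hypothetical names, the vocabulary is made up, and the parser simply forces one exact string.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # next character -> TrieNode
        self.tokens = []    # ids of tokens whose text ends at this node


def build_prefix_tree(vocab):
    root = TrieNode()
    for token_id, text in vocab.items():
        node = root
        for ch in text:
            node = node.children.setdefault(ch, TrieNode())
        node.tokens.append(token_id)
    return root


class ExactStringToyParser:
    """Toy character level parser that accepts exactly one target string."""

    def __init__(self, remaining):
        self.remaining = remaining

    def get_allowed_characters(self):
        return list(self.remaining[:1])

    def add_character(self, ch):
        if ch not in self.get_allowed_characters():
            raise ValueError(f"character {ch!r} is not allowed here")
        return ExactStringToyParser(self.remaining[1:])


def allowed_tokens(parser, node):
    """Collect ids of all tokens whose every character the parser accepts."""
    found = list(node.tokens)  # tokens that end at the current tree node
    for ch in parser.get_allowed_characters():
        child = node.children.get(ch)
        if child is not None:  # the character is allowed in BOTH trees
            found.extend(allowed_tokens(parser.add_character(ch), child))
    return found


toy_vocab = {0: "{", 1: '{"', 2: '"', 3: "a", 4: '{"b', 5: '"a"'}
root = build_prefix_tree(toy_vocab)
parser = ExactStringToyParser('{"a"')

first_step = sorted(allowed_tokens(parser, root))
print(first_step)  # [0, 1]

# Suppose the model picked token 1; advance the parser by its characters.
for ch in toy_vocab[1]:
    parser = parser.add_character(ch)

second_step = sorted(allowed_tokens(parser, root))
print(second_step)  # [3]
```

In the first step both the single-character token and the two-character token are allowed, while the three-character token with id 4 is rejected because its last character leaves the format. Branches that the parser rejects are never explored, which is what keeps the search small.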

Diagnostics - Will I always get good results?

Using this library guarantees that the output will match the format, but it does not guarantee that the output will be semantically correct. Forcing the language model to conform to a certain output may lead to increased hallucinations. Guiding the model via prompt engineering is still likely to improve results.

To help you gauge how aggressively the format enforcement is steering the model, pass output_scores=True and return_dict_in_generate=True in the kwargs to generate_enforced() (these are existing optional parameters in the transformers library). You will then also get a token-by-token dataframe showing which token was selected, its score, and which token would have been chosen if the format enforcement had not been applied. If you see that the format enforcer forced the language model to select tokens with very low scores, that is a likely contributor to poor results. Try modifying the prompt so that the model's preferred output is closer to the required format and the enforcer has to intervene less.

Example using the regular expression format Michael Jordan was Born in (\d)+.

   generated_token  generated_token_idx  generated_score leading_token  leading_token_idx  leading_score
0                ▁                29871         1.000000             ▁              29871       1.000000
1          Michael                24083         0.000027         ▁Sure              18585       0.959473
2          ▁Jordan                18284         1.000000       ▁Jordan              18284       1.000000
3             ▁was                  471         1.000000          ▁was                471       1.000000
4            ▁Born                19298         0.000008         ▁born               6345       1.000000
5              ▁in                  297         0.994629           ▁in                297       0.994629
6                ▁                29871         0.982422             ▁              29871       0.982422
7                1                29896         1.000000             1              29896       1.000000
8                9                29929         1.000000             9              29929       1.000000
9                6                29953         1.000000             6              29953       1.000000
10               3                29941         1.000000             3              29941       1.000000
11               .                29889         0.999512             .              29889       0.999512
12            </s>                    2         0.981445          </s>                  2       0.981445

You can see that the model "wanted" to start the answer using Sure, but the format enforcer forced it to use Michael - there was a big gap in token 1. A similar gap appears in token 4, where the model preferred born but the expression requires the capitalized Born. In all other steps the leading token is within the allowed token set, meaning the model likely did not hallucinate due to the token forcing.
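When the output is long, scanning such a table by eye gets tedious. The snippet below is a small sketch of how the steps with a large gap could be picked out automatically. It copies the values from the table above into plain tuples rather than using the dataframe returned by the library; the function name forced_steps and the min_gap threshold of 0.5 are hypothetical choices for this example.

```python
# (step, generated_token, generated_score, leading_token, leading_score),
# copied from the table above.
rows = [
    (0, "▁", 1.000000, "▁", 1.000000),
    (1, "Michael", 0.000027, "▁Sure", 0.959473),
    (2, "▁Jordan", 1.000000, "▁Jordan", 1.000000),
    (3, "▁was", 1.000000, "▁was", 1.000000),
    (4, "▁Born", 0.000008, "▁born", 1.000000),
    (5, "▁in", 0.994629, "▁in", 0.994629),
    (6, "▁", 0.982422, "▁", 0.982422),
    (7, "1", 1.000000, "1", 1.000000),
    (8, "9", 1.000000, "9", 1.000000),
    (9, "6", 1.000000, "6", 1.000000),
    (10, "3", 1.000000, "3", 1.000000),
    (11, ".", 0.999512, ".", 0.999512),
    (12, "</s>", 0.981445, "</s>", 0.981445),
]


def forced_steps(rows, min_gap=0.5):
    """Return the steps where the enforced token scored far below the leader."""
    return [
        (step, generated, leading)
        for step, generated, gen_score, leading, lead_score in rows
        if lead_score - gen_score > min_gap
    ]


for step, generated, leading in forced_steps(rows):
    print(f"step {step}: forced {generated!r}, model preferred {leading!r}")
# step 1: forced 'Michael', model preferred '▁Sure'
# step 4: forced '▁Born', model preferred '▁born'
```

A prompt that already asks for the exact wording of the expression would be one way to try to close these gaps.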

Download files

Source Distribution

lm_format_enforcer-0.1.8.tar.gz (20.6 kB)

Uploaded Oct 3, 2023 Source

Built Distribution

lm_format_enforcer-0.1.8-py3-none-any.whl (23.8 kB)

Uploaded Oct 3, 2023 Python 3

File details

Details for the file lm_format_enforcer-0.1.8.tar.gz.

File metadata

Download URL: lm_format_enforcer-0.1.8.tar.gz
Upload date: Oct 3, 2023
Size: 20.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: poetry/1.6.1 CPython/3.10.12 Linux/5.15.90.1-microsoft-standard-WSL2

File hashes

Hashes for lm_format_enforcer-0.1.8.tar.gz
Algorithm	Hash digest
SHA256	`ac4786b040a25df813e4f34bde6f85027482f26a089ab40686ca0885913ae7b4`
MD5	`8b8636bee7d61a2da8a93db679687bfc`
BLAKE2b-256	`efc76ced84e427c95220d2b7e5a509ea7bfa2786018666e370982292d8eba113`

File details

Details for the file lm_format_enforcer-0.1.8-py3-none-any.whl.

File metadata

Download URL: lm_format_enforcer-0.1.8-py3-none-any.whl
Upload date: Oct 3, 2023
Size: 23.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: poetry/1.6.1 CPython/3.10.12 Linux/5.15.90.1-microsoft-standard-WSL2

File hashes

Hashes for lm_format_enforcer-0.1.8-py3-none-any.whl
Algorithm	Hash digest
SHA256	`df0f2fd7f448871b60dc43cdc46e2e867e430f278036c2c7cbf80582779e0d9a`
MD5	`d22fd6dbdb70cc704e67a9d31087fde2`
BLAKE2b-256	`ec817da38fe9fb9c24c626f78d49269ad8556005a5af5ae4fd42d13de16c47cd`