Efficient, multi-language generation from context-free or context-sensitive grammars (CFG/CSG)

gramforge ⚒️

gramforge is a Pythonic library for random, depth-first generation from declarative context-sensitive grammars (context-free grammars are also supported), intended for synthetic data creation. A distinctive feature is the option to generate in multiple languages in parallel (for example, TPTP and pseudo-English).

Installation:

pip install gramforge

Example with the LogicNLI grammar, written in the gramforge DSL:

from gramforge import init_grammar, generate

def LogicNLI():
    ADJECTIVES = ['rich', 'quiet', 'old', 'tall', 'kind', 'brave', 'wise',
                  'happy', 'strong', 'curious', 'patient', 'funny', 'generous', 'humble']
    # (We selected adjectives with no clear semantic interference)
    NAMES = ['mary', 'paul', 'fred', 'alice', 'john', 'susan', 'lucy']

    R = init_grammar(['tptp','eng'])
    R('start(' + ','.join(['rule']*16) + ',' + ','.join(['fact']*8) + ')',
      '&\n'.join([f'({i})' for i in range(24)]),
      '\n'.join([f'{i}' for i in range(24)]))

    R('hypothesis(person,a)', '1(0)', '0 is 1')
    for a in ADJECTIVES:
        R('adj', a)
        R('adj', f'~{a}', f'not {a}', weight=0.2)

    R('property(adj,adj)', '(0(?)&1(?))', 'both 0 and 1')
    R('property(adj,adj)', '(0(?)|1(?))', '0 or 1')
    R('property(adj,adj)', '(0(?)<~>1(?))', 'either 0 or 1', weight=0.5)
    R('property(adj)', '0(?)', '0')

    R('rule(property,property)', '![X]:(0[?←X]=>1[?←X])',
      'everyone who is 0 is 1')
    R('rule(property,property)', '![X]:(0[?←X]<=>1[?←X])',
      'everyone who is 0 is 1 and vice versa')

    for p in NAMES:
        R('person', p)

    R('fact(person,property)', '1[?←0]', '0 is 1')
    R('fact(property)', '?[X]:(0[?←X])', 'someone is 0', weight=0.2)
    R('rule(fact,fact)', '(0)=>(1)', 'if 0 then 1')
    R('rule(fact,fact)', '(0)<=>(1)', 'if 0 then 1 and vice versa')
    return R

eng, tptp = "eng", "tptp"
grammar = LogicNLI()
x = generate(grammar)
print(x@eng)   # pseudo-English rendering
print(x@tptp)  # TPTP rendering of the same sample
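
To make the template convention used above more concrete, here is a minimal, library-independent sketch of how a single derivation tree can be rendered in two languages in parallel. It is not gramforge's implementation: the `Node` class and the `render` function are hypothetical, and the sketch only handles plain positional placeholders (`0`, `1`, ...), not the `?` and `[?←X]` context-sensitive substitutions.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical derivation-tree node holding one template per language."""
    templates: dict                      # e.g. {"tptp": "1(0)", "eng": "0 is 1"}
    children: list = field(default_factory=list)


def render(node, lang):
    """Render `node` in `lang`, replacing digit i with the rendering of child i."""
    template = node.templates[lang]
    if not node.children:
        return template
    rendered = [render(child, lang) for child in node.children]
    # A single regex pass, so digits inside substituted text are never re-expanded.
    return re.sub(r"\d", lambda m: rendered[int(m.group())], template)


# One tree, two aligned surface forms.
mary = Node({"tptp": "mary", "eng": "mary"})
not_rich = Node({"tptp": "~rich", "eng": "not rich"})
fact = Node({"tptp": "1(0)", "eng": "0 is 1"}, [mary, not_rich])

print(render(fact, "eng"))   # mary is not rich
print(render(fact, "tptp"))  # ~rich(mary)
```

Because both renderings are read off the same tree, the two outputs stay aligned by construction, which is the property the `x@eng` / `x@tptp` calls rely on.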

Pre-loaded grammars

We feature pre-written grammars including:

  • tinypy_grammar: reproduces tinypy, a synthetic toy grammar of Python for LLM training/evaluation
  • pygram_grammar: an advanced, state-of-the-art Python grammar, with valid functions, recursion, types, etc.
  • FOL_grammar: a sophisticated controlled grammar for first-order logic (TPTP) aligned with simplified English
  • simple_english_grammar: a subset of English
  • arith_grammar: a simple grammar for arithmetic
  • regex_grammar: a grammar generating regular expressions
  • dyck_grammar: nested parentheses

Example:

from gramforge.grammars import FOL_grammar, tinypy_grammar
from gramforge import generate

g = tinypy_grammar()
x = generate(g)
print(x@'py')  # select the 'py' rendering of the generated sample

Abstract syntax trees

Generated expressions (such as x in the examples above) behave like anytree trees and fully expose the abstract syntax tree, which can be helpful for debugging, visualization, or analysis of the generated examples.
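
As an illustration of the kind of analysis a tree interface allows, the sketch below walks a tree using only a `children` attribute, which anytree nodes provide. The `Tree` class is a hypothetical stand-in so that the snippet runs without gramforge installed; the attributes actually exposed by generated expressions should be checked against the library.

```python
class Tree:
    """Stand-in for a generated expression; only `children` is assumed."""

    def __init__(self, label, children=()):
        self.label = label
        self.children = tuple(children)


def height(node):
    """Number of edges on the longest path from `node` down to a leaf."""
    if not node.children:
        return 0
    return 1 + max(height(child) for child in node.children)


def size(node):
    """Total number of nodes in the subtree rooted at `node`."""
    return 1 + sum(size(child) for child in node.children)


def leaf_count(node):
    """Number of leaves in the subtree rooted at `node`."""
    if not node.children:
        return 1
    return sum(leaf_count(child) for child in node.children)


# A small tree shaped like fact(person, property(adj, adj)).
tree = Tree("fact", [
    Tree("person"),
    Tree("property", [Tree("adj"), Tree("adj")]),
])

print(height(tree), size(tree), leaf_count(tree))  # 2 5 3
```

Statistics of this kind (height, size, leaf count) are typical ways to bucket synthetic examples by complexity.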

Depth constraints

Generating synthetic data requires control over complexity. gramforge efficiently enforces min_depth and max_depth constraints, and provides a "bushiness" knob (default 0.7) that prevents generated expressions from forming "spikes", that is, trees that merely satisfy the minimum depth requirement along one branch.
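
The idea can be sketched on a toy arithmetic grammar, independently of gramforge. This is an illustrative assumption, not the library's algorithm: the function `gen`, its parameters, and in particular the way `bushiness` is interpreted here (the probability that a sibling also inherits the minimum-depth requirement) are hypothetical.

```python
import random


def gen(max_depth, min_depth=0, bushiness=0.7, rng=random):
    """Return a random expression: an int leaf or a tuple (op, left, right).

    Depth counts operator levels, so a bare number has depth 0.
    """
    if min_depth > max_depth:
        raise ValueError("min_depth must not exceed max_depth")
    must_stop = max_depth == 0
    must_grow = min_depth > 0
    if must_stop or (not must_grow and rng.random() < 0.5):
        return rng.randint(0, 9)
    op = rng.choice("+*")
    # One child always inherits the minimum-depth requirement; the other
    # inherits it only with probability `bushiness` (toy interpretation).
    mins = [min_depth - 1, min_depth - 1 if rng.random() < bushiness else 0]
    rng.shuffle(mins)
    left, right = (gen(max_depth - 1, max(m, 0), bushiness, rng) for m in mins)
    return (op, left, right)


def depth(tree):
    """Depth of a tree produced by `gen`."""
    if isinstance(tree, int):
        return 0
    return 1 + max(depth(tree[1]), depth(tree[2]))


def leaf_count(tree):
    """Number of integer leaves in a tree produced by `gen`."""
    if isinstance(tree, int):
        return 1
    return leaf_count(tree[1]) + leaf_count(tree[2])


sample = gen(max_depth=5, min_depth=3, rng=random.Random(0))
print(sample, depth(sample))
```

In this sketch, `bushiness=0.0` lets a single branch carry the whole minimum-depth requirement (a "spike"), while `bushiness=1.0` forces every branch to reach the minimum depth.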

Citation for the gramforge framework:

@inproceedings{sileo-2024-scaling,
    title = "Scaling Synthetic Logical Reasoning Datasets with Context-Sensitive Declarative Grammars",
    author = "Sileo, Damien",
    booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.emnlp-main.301/",
    doi = "10.18653/v1/2024.emnlp-main.301",
    pages = "5275--5283",
}

