
Official Implementation of "COLLIE: Systematic Construction of Constrained Text Generation Tasks"


COLLIE: Systematic Construction of Constrained Text Generation Tasks (Website)


We propose the COLLIE framework for easy constraint structure specification, example extraction, instruction rendering, and model evaluation.

Install

We recommend using Python 3.9 (Python 3.10 may currently be incompatible with some dependencies).

To install in development mode (in cloned project directory):

pip install -e .

After installation you can access the functionalities through import collie.

We will add COLLIE to PyPI soon.

Overview

There are two main ways to use COLLIE:

  1. Use the dataset we constructed to compare performance of your prompting/modeling methods to the ones reported in the paper
  2. Write your own constraints (harder, more compositional, etc.) by following the steps of the COLLIE framework, to explore the limits of models and probe failure cases

Dataset

The dataset used in the paper is at data/all_data.dill and can be loaded by

import dill

# Load the released dataset (run from the project root).
with open("data/all_data.dill", "rb") as f:
    all_data = dill.load(f)

all_data will be a dictionary with keys as the data source and constraint type, and values as a list of constraints. For example, all_data['wiki_c07'][0] is

{
    'example': 'Black market prices for weapons and ammunition in the Palestinian Authority-controlled areas have been rising, necessitating outside funding for the operation.', 
    'targets': ['have', 'rising', 'the'], 
    'constraint': ..., 
    'prompt': "Please generate a sentence containing the word 'have', 'rising', 'the'.", 
    ...
}
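
To get a feel for this structure, the sketch below counts the examples stored under each key and collects their prompts. It runs on a small hand-written stand-in (`toy_data`, a hypothetical name) that mirrors only the fields shown above; with the real file you would pass the loaded `all_data` instead. The helpers `summarize` and `prompts_for` are illustrations and not part of COLLIE.

```python
from typing import Dict, List


def summarize(all_data: Dict[str, List[dict]]) -> Dict[str, int]:
    """Return the number of examples stored under each key."""
    return {key: len(items) for key, items in all_data.items()}


def prompts_for(all_data: Dict[str, List[dict]], key: str) -> List[str]:
    """Return the rendered prompt of every example under one key."""
    return [item["prompt"] for item in all_data[key]]


# Stand-in for the loaded dataset; it copies only the fields shown above.
toy_data = {
    "wiki_c07": [
        {
            "example": (
                "Black market prices for weapons and ammunition in the "
                "Palestinian Authority-controlled areas have been rising, "
                "necessitating outside funding for the operation."
            ),
            "targets": ["have", "rising", "the"],
            "prompt": "Please generate a sentence containing the word "
                      "'have', 'rising', 'the'.",
        }
    ]
}

print(summarize(toy_data))
print(prompts_for(toy_data, "wiki_c07"))
```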

Reproducing the results reported in the paper:

  • Our model results can be found in logs/ folder
  • To plot the figures/tables in the paper, check out scripts/analysis.ipynb
  • To run the models to reproduce the results, run python scripts/run_api_models.py and python scripts/run_gpu_models.py

The COLLIE Framework

The framework follows a 4-step process:

  1. Constraint Specification
  2. Extraction
  3. Rendering
  4. Evaluation

Step 1: Constraint Specification (Complete Guide)

To specify a constraint, you need the following concepts defined as classes in collie/constraints.py:

  1. Level: base class of InputLevel (the basic unit of the input) and TargetLevel (the level at which values are compared to the target); levels include 'character', 'word', 'sentence', etc.
  2. Transformation: defines how the input text is transformed into values that can be compared against the provided target value; derived classes include Count, Position, ForEach, etc.
  3. Logic: And, Or, and All, which can be used to combine constraints
  4. Relation: a relation such as '==' or 'in' for comparing against the target value
  5. Reduction: when the target has multiple values, specifies how the transformed values from the input are reduced, e.g. 'all', 'any', 'at least'
  6. Constraint: the main class for combining all the above specifications

To specify a constraint, you need to provide at least the TargetLevel, Transformation, and Relation. They are going to be wrapped in the c = Constraint(...) initialization. Once the constraint is specified, you can use c.check(input_text, target_value) to verify any given text and target tuple.

Below is an example of specifying a "number of words" constraint.

>>> from collie.constraints import Constraint, TargetLevel, Count, Relation

>>> # A very simple "number of words" constraint.
>>> c = Constraint(
...     target_level=TargetLevel('word'),
...     transformation=Count(),
...     relation=Relation('=='),
... )
>>> print(c)
Constraint(
    InputLevel(None),
    TargetLevel(word),
    Transformation('Count()'),
    Relation(==),
    Reduction(None)
)
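
As a rough mental model of how these pieces fit together, the following standalone sketch mimics the level, transformation, and relation steps in plain Python. It is not COLLIE's implementation: `SPLITTERS`, `RELATIONS`, and `toy_check` are hypothetical names, the word splitter is a simple regular expression, and only the Count transformation is modeled.

```python
import operator
import re

# Hypothetical stand-ins; COLLIE's real classes live in collie/constraints.py.
SPLITTERS = {
    "character": lambda text: list(text),
    "word": lambda text: re.findall(r"\w+", text),
}
RELATIONS = {
    "==": operator.eq,
    "in": lambda value, target: value in target,
}


def toy_check(text, target, level="word", relation="=="):
    """Count the units of `text` at `level` and compare the count to `target`."""
    units = SPLITTERS[level](text)             # level: choose the unit of text
    value = len(units)                         # transformation: a plain count
    return RELATIONS[relation](value, target)  # relation: compare to target


print(toy_check("This is a good sentence.", 5))
print(toy_check("This is a good sentence.", [4, 5], relation="in"))
```

Keeping the three steps in separate lookup tables mirrors the idea that each part of a constraint can be swapped independently.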

Check out the guide to explore more examples.

Step 2: Extraction (Complete Guide)

Once the constraints are defined, you can extract examples that satisfy them from the data sources (e.g., Gutenberg, Wikipedia).

To download the necessary data files, including the "Gutenberg, dammit" corpus, to the data folder, run from the project root directory:

bash download.sh

Run extraction:

python -m collie.examples.extract

This sweeps over all constraints and data sources defined in collie/examples/. To add more examples, add them to the appropriate Python files. Extracted examples are written to the folder sample_data, in files named {source}_{level}.dill. The data/all_data.dill file is simply a concatenation of all these source-level dill files.
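
One way to picture that concatenation is the sketch below, which gathers every `*.dill` file in a folder into a single dictionary keyed by the file stem. This is an assumption about the layout, not the project's actual merge script: `merge_extracted` is a hypothetical helper, and it defaults to the standard-library `pickle.load` only so that it runs without extra dependencies.

```python
import pickle
from pathlib import Path


def merge_extracted(folder, load=pickle.load):
    """Collect every `*.dill` file in `folder` into one dict keyed by file stem."""
    merged = {}
    for path in sorted(Path(folder).glob("*.dill")):
        with open(path, "rb") as f:
            # Assumed mapping: "wiki_c07.dill" is stored under key "wiki_c07".
            merged[path.stem] = load(f)
    return merged
```

For files written by the extraction step, you would likely call `merge_extracted("sample_data", load=dill.load)` after `import dill`, since the files are dill files.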

Step 3: Rendering

To render a constraint, simply run:

>>> from collie.constraint_renderer import ConstraintRenderer
>>> renderer = ConstraintRenderer(
...     constraint=c,  # Defined in step one
...     constraint_value=5
... )
>>> print(renderer.prompt)
Please generate a sentence with exactly 5 words.

Step 4: Evaluation

To check constraint satisfaction, simply run:

>>> text = 'This is a good sentence.'
>>> print(c.check(text, 5))
True
>>> print(c.check(text, 4))
False
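
When scoring a batch of model outputs, a single satisfaction rate is often more convenient than individual booleans. The sketch below computes one from any function with the same `(text, target)` call shape as `c.check`. Both `satisfaction_rate` and `word_count_check` are hypothetical helpers: the latter is a whitespace-based stand-in so the example runs without COLLIE installed, and it may count words differently from the library.

```python
from typing import Callable, Iterable


def satisfaction_rate(check: Callable, texts: Iterable[str], target) -> float:
    """Fraction of `texts` for which `check(text, target)` is truthy."""
    texts = list(texts)
    if not texts:
        return 0.0  # avoid dividing by zero on an empty batch
    return sum(bool(check(text, target)) for text in texts) / len(texts)


def word_count_check(text: str, target: int) -> bool:
    """Stand-in for `c.check`: whitespace-separated word count equals target."""
    return len(text.split()) == target


generations = [
    "This is a good sentence.",
    "Too short.",
    "One two three four five",
]
print(satisfaction_rate(word_count_check, generations, 5))
```

With COLLIE installed, passing `c.check` in place of `word_count_check` should score the same batch against the constraint defined in step one.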

Citation

Please cite our paper if you use COLLIE in your work:

@inproceedings{yao2023collie,
    title = {COLLIE: Systematic Construction of Constrained Text Generation Tasks},
    author = {Yao, Shunyu and Chen, Howard and Wang, Austin and Yang, Runzhe and Narasimhan, Karthik},
    booktitle = {ArXiv},
    year = {2023},
    html = {}
}

License

MIT. Note that this is the license for our code; each data source retains its own license.
