
Synthesizing realistic and diverse text-datasets from augmented LLMs.


SynthesizRR: Generating Diverse Datasets with Retrieval Augmentation

This repository contains the implementation of the paper "SynthesizRR: Generating Diverse Datasets with Retrieval Augmentation" (https://arxiv.org/abs/2405.10040)

Our proposed approach is shown below. Refer to Algorithm 1 in the paper for details: https://arxiv.org/abs/2405.10040

[Figure: SynthesizRR high-level diagram]

Installing dependencies

We recommend installing required dependencies in a new Conda environment using the commands below.

These commands were tested on the AWS Deep Learning AMI GPU PyTorch 1.13.1 (Amazon Linux 2) 20230221.

Install dependencies:

conda create -n synthesizrr python=3.11.8 --yes  
conda activate synthesizrr 
pip install uv   ## For super-fast installation

uv pip install -r requirements.txt

uv pip install "spacy==3.7.4" "spacy-transformers==1.3.5"
uv pip install "setuptools==69.5.1"

python -m spacy download en_core_web_lg
python -c "import nltk; nltk.download('punkt');"
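
After running the commands above, a quick check can confirm that the packages are visible to the Python interpreter in the new environment. The snippet below is a minimal sketch, not part of the repository: the helper name check_installed is hypothetical, and it assumes the import names spacy, spacy_transformers, nltk and en_core_web_lg (the spaCy model installs as a package of that name). It only checks that each package can be found, not that it loads without errors.

```python
import importlib.util


def check_installed(modules):
    """Return {module_name: bool}, True when the top-level module can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    # Import names assumed from the install commands above.
    required = ["spacy", "spacy_transformers", "nltk", "en_core_web_lg"]
    for name, found in check_installed(required).items():
        print(f"{name}: {'OK' if found else 'MISSING'}")
```

Any package reported as MISSING most likely means the corresponding install command failed or was run outside the synthesizrr Conda environment.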

Code structure

synthesizrr/base/ contains utility functions and classes.

synthesizrr/expts/ contains code to reproduce the experiments.

Running the code

  1. Set up DATA_DIR:

    • Download the datasets into a local folder DATA_DIR.
    • Inside synthesizrr/expts/data.py, set the variable DATA_DIR (marked TODO) to the above folder.
  2. Set up CORPUS_DIR:

    • Download the corpora into a folder CORPUS_DIR.
    • We recommend using S3 for this since the corpora are large.
    • Inside synthesizrr/expts/corpus.py, set the variable CORPUS_DIR (marked TODO) to the above folder.
  3. Set up RESULTS_DIR:

    • Inside synthesizrr/expts/common.py, set the variable RESULTS_DIR (marked TODO) to a different folder. Intermediate datasets and metrics will be saved here.
    • We recommend using S3 for this since the file-paths are long.
  4. Start a Ray cluster:

    • On the Ray head node, run: ray start --head
    • On the Ray worker nodes, run: ray start --address='<head node IP address>:6379'
    • At the top of the files data.py, corpus.py and main.py, add the following to connect to the Ray cluster:
import synthesizrr
import ray
from ray.util.dask import ray_dask_get, enable_dask_on_ray, disable_dask_on_ray
from pprint import pprint

## Connect to the running cluster through the Ray Client port (10001 by default).
pprint(ray.init(
    address='ray://<head node IP address>:10001',  ## MODIFY THIS
    ignore_reinit_error=True,
    _temp_dir='/tmp/ray/',
    runtime_env={"py_modules": [
        synthesizrr,  ## Ships the local synthesizrr package to the workers.
    ]},
))
enable_dask_on_ray()  ## Run Dask computations on the Ray cluster.
pprint(ray.cluster_resources())  ## Shows the number of CPUs and GPUs, to confirm the cluster is set up properly.
  5. After modifying the code to set DATA_DIR, CORPUS_DIR and RESULTS_DIR, and starting the Ray cluster, run the following:
    • First, run cd synthesizrr/expts/ && python3 data.py to create the datasets. (You will need to download certain datasets to the DATA_DIR folder beforehand.)
    • Next, run cd synthesizrr/expts/ && python3 corpus.py to create the corpora. Warning: this step needs a lot of compute! Make sure you set up the Ray cluster and use a big machine with at least a few hundred GB of RAM as the head node.
    • Finally, run cd synthesizrr/expts/ && python3 main.py to reproduce the experiments.
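
The three scripts in the last step must run in a fixed order (datasets, then corpora, then experiments), and a later script should not start if an earlier one failed. The snippet below is a minimal sketch of a wrapper that enforces this. It is not part of the repository: the names PIPELINE_SCRIPTS, build_commands and run_pipeline are hypothetical, and it assumes the scripts are launched with the current Python interpreter rather than the python3 command shown above.

```python
import subprocess
import sys

# Order matters: datasets first, then corpora, then the experiments.
PIPELINE_SCRIPTS = ["data.py", "corpus.py", "main.py"]


def build_commands(python_exe=sys.executable, scripts=PIPELINE_SCRIPTS):
    """Return one argv list per script, in the order the scripts must run."""
    return [[python_exe, script] for script in scripts]


def run_pipeline(expts_dir, runner=subprocess.run):
    """Run each script from inside `expts_dir`, stopping at the first failure."""
    for argv in build_commands():
        print("Running:", " ".join(argv))
        # check=True makes subprocess.run raise CalledProcessError on a
        # non-zero exit code, so later scripts never start after a failure.
        runner(argv, cwd=str(expts_dir), check=True)


# Example, run from the repository root (assumes the synthesizrr/expts/ layout):
# run_pipeline("synthesizrr/expts")
```

The runner parameter exists only so the ordering logic can be exercised without launching the real, compute-heavy scripts.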

Security

See CONTRIBUTING for more information.

License

This project is licensed under the Apache-2.0 License.

Citing

If you use or refer to this code in another publication, please cite it using the BibTeX entry below:

@misc{divekar2024synthesizrr,
      title={SynthesizRR: Generating Diverse Datasets with Retrieval Augmentation}, 
      author={Abhishek Divekar and Greg Durrett},
      year={2024},
      eprint={2405.10040},
      archivePrefix={arXiv}
}

Acknowledgements

The compute infrastructure used for these experiments was financially supported by the Amazon Central Machine Learning department.

The following people contributed to the design or implemented smaller components in this codebase:
