Synthesizing realistic and diverse text datasets from augmented LLMs.
SynthesizRR: Generating Diverse Datasets with Retrieval Augmentation
This repository contains the implementation of the paper "SynthesizRR: Generating Diverse Datasets with Retrieval Augmentation" (https://arxiv.org/abs/2405.10040).
An overview of our proposed approach is below; refer to Algorithm 1 in the paper for details.
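To give a feel for the approach, here is a minimal, illustrative sketch of the retrieve-then-rewrite idea: retrieve documents from a large corpus, then ask an LLM to rewrite each retrieved document into a synthetic example for a target class. The function and the `retrieve`/`llm_rewrite` callables below are hypothetical stand-ins for exposition, not this repository's API; see Algorithm 1 in the paper for the actual procedure.

```python
# Illustrative sketch of retrieval-augmented dataset synthesis.
# `retrieve` and `llm_rewrite` are hypothetical stand-ins, not the repo's API.
from typing import Callable, List

def synthesize_examples(
    class_label: str,
    queries: List[str],
    retrieve: Callable[[str], List[str]],    # query -> retrieved corpus documents
    llm_rewrite: Callable[[str, str], str],  # (document, class_label) -> synthetic example
    per_query: int = 1,
) -> List[str]:
    """For each query, retrieve documents from a large corpus, then have an
    LLM rewrite each retrieved document into an example for `class_label`."""
    dataset: List[str] = []
    for query in queries:
        for doc in retrieve(query)[:per_query]:
            dataset.append(llm_rewrite(doc, class_label))
    return dataset

# Toy usage with stub retriever and "LLM":
docs = {"q1": ["news article A"], "q2": ["news article B"]}
out = synthesize_examples(
    "sports", ["q1", "q2"],
    retrieve=lambda q: docs[q],
    llm_rewrite=lambda d, y: f"[{y}] rewritten: {d}",
)
```

Because each synthetic example is grounded in a different retrieved document, the resulting dataset tends to be more diverse than examples generated from the class label alone.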
Installing dependencies
We recommend installing required dependencies in a new Conda environment using the commands below.
These commands were tested to work on Deep Learning AMI GPU PyTorch 1.13.1 (Amazon Linux 2) 20230221 from AWS.
Install dependencies:
```shell
conda create -n synthesizrr python=3.11.8 --yes
conda activate synthesizrr
pip install uv  ## For super-fast installation
uv pip install -r requirements.txt
uv pip install "spacy==3.7.4" "spacy-transformers==1.3.5"
uv pip install "setuptools==69.5.1"
python -m spacy download en_core_web_lg
python -c "import nltk; nltk.download('punkt');"
```
Code structure
- `synthesizrr/base/` contains utility functions and classes.
- `synthesizrr/expts/` contains code to reproduce the experiments.
Running the code
- Setup `DATA_DIR`:
  - Download the datasets into a local folder `DATA_DIR`.
  - Inside `synthesizrr/expts/data.py`, set the variable `DATA_DIR` (marked TODO) to the above folder.
- Setup `CORPUS_DIR`:
  - Download the corpora into a folder `CORPUS_DIR`. We recommend using S3 for this since the corpora are large.
  - Inside `synthesizrr/expts/corpus.py`, set the variable `CORPUS_DIR` (marked TODO) to the above folder.
- Setup `RESULTS_DIR`:
  - Inside `synthesizrr/expts/common.py`, set the variable `RESULTS_DIR` (marked TODO) to a different folder. Intermediate datasets and metrics will be saved here. We recommend using S3 for this since the file paths are long.
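For illustration, the three TODO variables might end up looking like the sketch below. The S3 paths are placeholders you must replace with your own locations; they are not real buckets.

```python
# Hypothetical values for the TODO variables; substitute your own paths.
DATA_DIR = "s3://my-bucket/synthesizrr/data/"        # in synthesizrr/expts/data.py
CORPUS_DIR = "s3://my-bucket/synthesizrr/corpus/"    # in synthesizrr/expts/corpus.py
RESULTS_DIR = "s3://my-bucket/synthesizrr/results/"  # in synthesizrr/expts/common.py
```

Local folders also work for `DATA_DIR`, but S3 is recommended for the corpora and results as noted above.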
- Start a Ray cluster:
  - On the Ray head node, run: `ray start --head`
  - On the Ray worker nodes, run: `ray start --address='<head node IP address>:6379'`
  - At the top of the files `data.py`, `corpus.py` and `main.py`, add the following to connect to the Ray cluster:
```python
import synthesizrr
import ray
from ray.util.dask import ray_dask_get, enable_dask_on_ray, disable_dask_on_ray
from pprint import pprint

pprint(ray.init(
    address='ray://<head node IP address>:10001',  ## MODIFY THIS
    ignore_reinit_error=True,
    _temp_dir=str('/tmp/ray/'),
    runtime_env={"py_modules": [
        synthesizrr,
    ]},
))
enable_dask_on_ray()
pprint(ray.cluster_resources())  ## Shows the number of CPUs and GPUs, to check the cluster is set up properly.
```
- After modifying the code to set `DATA_DIR`, `CORPUS_DIR` and `RESULTS_DIR`, and starting the Ray cluster, run the following:
  - First, run `cd synthesizrr/expts/ && python3 data.py` to create the datasets. (You will need to download certain datasets to the `DATA_DIR` folder beforehand.)
  - Next, run `cd synthesizrr/expts/ && python3 corpus.py` to create the corpora. (Warning: this step needs a lot of compute! Make sure you set up the Ray cluster and use a big machine with at least a few hundred GB of RAM as the head node.)
  - Finally, run `cd synthesizrr/expts/ && python3 main.py` to reproduce the experiments.
Security
See CONTRIBUTING for more information.
License
This project is licensed under the Apache-2.0 License.
Citing
If you use or refer to this code in another publication, please cite it using the BibTeX below:
```bibtex
@misc{divekar2024synthesizrr,
      title={SynthesizRR: Generating Diverse Datasets with Retrieval Augmentation},
      author={Abhishek Divekar and Greg Durrett},
      year={2024},
      eprint={2405.10040},
      archivePrefix={arXiv}
}
```
Acknowledgements
The compute infrastructure used for these experiments was financially supported by the Amazon Central Machine Learning department.
The following people contributed to the design or implemented smaller components in this codebase: