Library for synthesizing data with LLMs

🏛️ Agora 🏛️

⚡ A repository for generating synthetic data with LLMs & evaluating LLMs' data generation capabilities 🚀 ⚡

Latest News 🔥

[2024/12] We released Agora and AgoraBench!
- AgoraBench covers 9 settings, measuring data generation capabilities across 3 domains and 3 data generation methods.
- Agora is an easily customizable framework for data generation with LLMs.
- Check out our dataset, checkpoints, leaderboard, and the code!

What does Agora mean?

In ancient Athens, the Agora was a public space where citizens would gather to debate, share news, learn from each other, and listen to famous philosophers.

We draw an analogy between data generators and teachers: in AgoraBench, different generators teach student models using synthetic data!

🔧 Installation

Installation with pip:

pip install data-agora

Project Structure 📁

Root Directory

.
├── agora_scripts/           # Scripts for converting and handling data formats
│   ├── prompts/            # Various prompt templates
│   └── run.py             # Main execution script
├── assets/                 # Project images and visual assets
├── libs/                   # Core libraries
│   └── data-agora/        # Main data processing library
│       ├── data_agora/    # Core data agora implementation
│       │   ├── core/      # Core functionality (LLMs, parsers, validators)
├── train/                  # Training related code (based on llama-recipes)
└── LICENSE

data-agora Library (`libs/data-agora/`)

Core implementation for data processing and handling
Includes LLM integrations (OpenAI, vLLM, etc.)
Parsers and validators for data processing
Serving capabilities for deployment

Agora Scripts (`agora_scripts/`)

Tools for data format conversion
Collection of prompt templates for different use cases
Main execution script for running the pipeline

Training (`train/`)

Based on Meta's llama-recipes repository
Contains training configurations and utilities

Usage Guide 🚀

Our library serves two types of users:

Testing an LM's Data Generation Capability with AgoraBench: Using the pre-built pipeline, you can measure the data generation capabilities of different LLMs.
Custom Usage: You can customize the pipeline for your own tasks to generate large amounts of synthetic data.

Testing an LM's Data Generation Capability with AgoraBench

Step 1: Generate Data with Pre-built Pipeline

Run the following script:

cd "./agora_scripts"

python3 run.py --method "instance_generation" --domain "math" --model_name "gpt-4o-mini-2024-07-18" --max_tokens 4096 --temperature 1.0 --num_instances 10000 --num_threads 4 --api_key ""

method should be one of "instance_generation", "response_generation", or "quality_enhancement".
domain should be one of "math", "general", or "code".
model_name should exactly match the name you use to call the model through the OpenAI API, LiteLLM, or vLLM.
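Since AgoraBench covers every combination of the three methods and three domains, the nine invocations can be enumerated programmatically. The following is a minimal sketch, not part of the library: `build_commands` is a hypothetical helper that only assembles command strings from the flags shown above, and it deliberately leaves out `--api_key`, which you would append yourself.

```python
import itertools
import shlex

# Values taken from the option descriptions above.
METHODS = ["instance_generation", "response_generation", "quality_enhancement"]
DOMAINS = ["math", "general", "code"]


def build_commands(model_name, num_instances=10000, num_threads=4):
    """Return one run.py command string per (method, domain) pair.

    Hypothetical helper for illustration; it does not execute anything.
    """
    commands = []
    for method, domain in itertools.product(METHODS, DOMAINS):
        args = [
            "python3", "run.py",
            "--method", method,
            "--domain", domain,
            "--model_name", model_name,
            "--max_tokens", "4096",
            "--temperature", "1.0",
            "--num_instances", str(num_instances),
            "--num_threads", str(num_threads),
        ]
        # shlex.join quotes each argument safely for a POSIX shell.
        commands.append(shlex.join(args))
    return commands


if __name__ == "__main__":
    for command in build_commands("gpt-4o-mini-2024-07-18"):
        print(command)
```

Printing the commands rather than running them keeps the sketch side-effect free; you could pipe the output into a shell script or a job scheduler.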
The resulting dataset should look as follows:

[
   {
      "config": "",
      "instruction": "",
      "response": ""
   },
   [...]
]
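Before uploading, it can help to confirm that the generated file really has this shape. The sketch below is an illustrative check under the assumption that the output is a JSON list of objects with the three keys shown above; `check_records` and `load_dataset_file` are hypothetical names, not library functions.

```python
import json

REQUIRED_KEYS = ("config", "instruction", "response")


def check_records(records):
    """Return the indices of records that do not match the expected shape.

    A record is flagged if any required key is missing, or if its
    instruction or response is not a non-empty string.
    """
    bad = []
    for i, record in enumerate(records):
        missing = any(key not in record for key in REQUIRED_KEYS)
        empty = any(
            not isinstance(record.get(key), str) or not record[key].strip()
            for key in ("instruction", "response")
        )
        if missing or empty:
            bad.append(i)
    return bad


def load_dataset_file(path):
    """Load the JSON list written by the generation step."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)
```

For example, `check_records(load_dataset_file(path))` returns an empty list when every record is well formed.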

Step 2: Upload the dataset to Hugging Face

You can use the following function:

from datasets import Dataset, DatasetDict

def upload_to_huggingface(data, dataset_name, hf_key):
    # data is the list of dicts produced in Step 1
    dataset = Dataset.from_list(data)
    dataset_dict = DatasetDict({"train": dataset})
    # Push as a private dataset repository; hf_key is your Hugging Face access token
    dataset_dict.push_to_hub(dataset_name, token=hf_key, private=True)

Step 3: Train Student Models with Synthetic Data

The training code is adapted from Meta's llama-recipes.

First, install the required packages:

cd ./llama-recipes
pip3 install -r requirements.txt
pip3 install -e .
pip3 install wandb
wandb login
huggingface-cli login

Then, launch the following script:

# Shell variable assignments must not have spaces around "="
gpu=4
lr=1e-5
num_epochs=""
checkpoint_dir=""
hf_cache_dir=""
hf_dataset_name=""

torchrun --nnodes 1 --nproc_per_node $gpu \
        src/llama_recipes/finetuning.py \
        --model_name meta-llama/Meta-Llama-3.1-8B \
        --dist_checkpoint_root_folder "${checkpoint_dir}" \
        --dist_checkpoint_folder "${hf_dataset_name}" \
        --hf_cache_dir "${hf_cache_dir}" \
        --dataset "$hf_dataset_name" \
        --run_validation True \
        --context_length 4096 \
        --gradient_accumulation_steps 8 \
        --batching_strategy "packing" \
        --use_fast_kernels \
        --enable_fsdp \
        --pure_bf16 \
        --low_cpu_fsdp \
        --batch_size_training 2 \
        --num_epochs $num_epochs \
        --lr $lr \
        --weight_decay 0.01 \
        --use_wandb

You have to fill in:
- checkpoint_dir (where the checkpoint is saved)
- hf_cache_dir (where the Hugging Face cache is saved)
- hf_dataset_name (the dataset you uploaded to Hugging Face in Step 2)
- num_epochs (the number of training epochs)
For uploading the checkpoint to Hugging Face, you could refer to this code.

Step 4: Evaluate Student Models and Measure Performance Gap Recovered (PGR)

For evaluating the trained student models, we used the following libraries:

AlpacaEval 2.0 (Instruction-following): link
Arena-Hard (Instruction-following): link
MBPP (Code): link
Human-Eval (Code): link

For GSM8K (Math) and MATH (Math), we implemented our custom code: TO BE ADDED
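This README does not spell out how PGR is computed. The sketch below assumes the usual reading of the metric: the improvement of the student trained on synthetic data over its base model, expressed as a percentage of the improvement a reference model achieves over that same base model. Please check the paper for the exact definition and for which models serve as base and reference. The function name and the numbers in the usage line are illustrative only.

```python
def performance_gap_recovered(base_score, trained_score, reference_score):
    """Return PGR in percent, under the assumed definition described above.

    base_score:      benchmark score of the student before training
    trained_score:   score of the student after training on synthetic data
    reference_score: score of the reference model
    """
    gap = reference_score - base_score
    if gap == 0:
        raise ValueError("reference and base scores are equal; PGR is undefined")
    return 100.0 * (trained_score - base_score) / gap


if __name__ == "__main__":
    # Illustrative numbers, not results from the benchmark.
    print(performance_gap_recovered(20.0, 35.0, 50.0))
```

A value above 100 would mean the trained student outperforms the reference model, and a negative value would mean training hurt the base model.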

Custom Usage

For custom usage with different pipelines, parsing mechanisms, and validation logic, Agora supports convenient customization through abstract classes.

Prompt Loader: A class that prepares the meta-prompt passed to the data generator.

from typing import Dict, List, Optional

class CustomPromptLoader(InstanceGenerationPromptLoader):
    def __init__(self, prompt_template: str, seed_data: List[Dict], num_fewshot: int, placeholder_formats: Optional[Dict[str, str]] = None, num_sample_from_seed_data: Optional[int] = None, [...]):
        super().__init__(prompt_template, seed_data, num_fewshot, placeholder_formats, num_sample_from_seed_data)
        [...]

    def prepare(self) -> PromptResult:
        # Build the meta-prompt and any metadata you want to keep with it
        [...]
        return PromptResult(prompt=prompt, metadata=metadata)
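To make the role of `prepare` more concrete, here is a standalone sketch of one way a few-shot meta-prompt could be assembled. It does not use the library's classes. It assumes that the `@` in a demonstration placeholder such as `<input@>` stands for the 1-based demonstration index, and that seed records carry `instruction` and `response` keys; both are assumptions for illustration, and `fill_fewshot_prompt` is a hypothetical helper.

```python
import random
from typing import Dict, List, Optional


def fill_fewshot_prompt(
    template: str,
    seed_data: List[Dict[str, str]],
    num_fewshot: int,
    placeholder_formats: Dict[str, str],
    rng: Optional[random.Random] = None,
) -> Dict[str, object]:
    """Sample demonstrations and substitute them into the template."""
    rng = rng or random.Random()
    demos = rng.sample(seed_data, num_fewshot)
    prompt = template
    for i, demo in enumerate(demos, start=1):
        # Assumption: "<input@>" becomes "<input1>", "<input2>", ...
        input_slot = placeholder_formats["demonstration_input_placeholder"].replace("@", str(i))
        output_slot = placeholder_formats["demonstration_output_placeholder"].replace("@", str(i))
        prompt = prompt.replace(input_slot, demo["instruction"])
        prompt = prompt.replace(output_slot, demo["response"])
    # Keep the sampled demonstrations as metadata for later inspection.
    return {"prompt": prompt, "metadata": {"demonstrations": demos}}
```

Passing a seeded `random.Random` makes the sampled demonstrations reproducible, which is convenient when debugging a prompt template.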

Parser: A class that separates the instruction and response from the data generator's output.

class CustomParser(Parser):
    def parse(self, prompt, teacher_model_output, placeholder_formats, [...]):
        # Extract the instruction and response from the raw generator output
        [...]
        return {"instruction": instruction, "response": response}
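As an illustration of the parsing step, the following standalone sketch splits a raw generation into an instruction and a response using trigger strings and a stop phrase. The defaults ("INPUT:", "OUTPUT:", "[END]") mirror the placeholder formats used in the "All together" script, but the function itself is hypothetical and independent of the library's `Parser` class.

```python
from typing import Dict, Optional


def parse_generation(
    text: str,
    input_trigger: str = "INPUT:",
    output_trigger: str = "OUTPUT:",
    stop_phrase: str = "[END]",
) -> Optional[Dict[str, str]]:
    """Return {"instruction", "response"} or None if the text cannot be split."""
    # Discard anything generated after the stop phrase.
    text = text.split(stop_phrase, 1)[0]
    if output_trigger not in text:
        return None
    instruction_part, response_part = text.split(output_trigger, 1)
    # The input trigger may be absent when the prompt already ended with it.
    if input_trigger in instruction_part:
        instruction_part = instruction_part.split(input_trigger, 1)[1]
    instruction = instruction_part.strip()
    response = response_part.strip()
    if not instruction or not response:
        return None
    return {"instruction": instruction, "response": response}
```

Returning `None` for unparseable output lets the caller discard the sample and request a new one instead of raising an exception.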

Validator: A class that determines if the output is valid or not.

class CustomValidator(Validator):
    def validate(self, instruction: str, response: str, [...]):
        [...]
        if [...]:
            return True
        else:
            return False
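A validator can be as simple as a few sanity checks. The sketch below is one hypothetical example, written as a plain function rather than a `Validator` subclass; the length bounds and the rule that rejects identical instruction and response are illustrative choices, not the library's defaults.

```python
def is_valid_pair(instruction, response, min_chars=1, max_chars=8000):
    """Return True if the pair passes basic sanity checks."""
    if not isinstance(instruction, str) or not isinstance(response, str):
        return False
    instruction, response = instruction.strip(), response.strip()
    # Reject empty or overly long fields.
    if not (min_chars <= len(instruction) <= max_chars):
        return False
    if not (min_chars <= len(response) <= max_chars):
        return False
    # Reject degenerate outputs where the response just echoes the instruction.
    if instruction == response:
        return False
    return True
```

In practice you would likely add task-specific checks here, such as verifying that a math response contains a final answer.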

All together

Then, you can write a script that uses the custom classes to generate data.

import json

# Modify the placeholder formats to match your prompt template.
# Demonstration-related placeholders are only used for instance generation.
# "input_theme" is an example of a custom placeholder.

placeholder_formats = {
    "demonstration_input_placeholder": "<input@>",
    "demonstration_output_placeholder": "<output@>",
    "test_input_placeholder": "<input>",
    "test_output_placeholder": "<output>",
    "test_input_trigger": "INPUT:",
    "test_output_trigger": "OUTPUT:",
    "stop_phrase": "[END]",
    "input_theme": "<input_theme>",
}


with open("", "r") as f:
    seed_data = json.load(f)

with open("", "r") as f:
    prompt_template = f.read()

llm = OpenAILLM(model_name="gpt-4o-mini-2024-07-18", api_key="")

prompt_loader = CustomPromptLoader(prompt_template=prompt_template, seed_data=seed_data, num_fewshot=3, placeholder_formats=placeholder_formats, num_sample_from_seed_data=2)
parser = CustomParser()
validator = CustomValidator()


sampling_params = {
    "max_tokens": args.max_tokens,
    "temperature": args.temperature,
    "top_p": 0.9,
    "stop": placeholder_formats["stop_phrase"]
}

agora = Agora(
    llm=llm,
    placeholder_formats=placeholder_formats,
    prompt_loader=prompt_loader,
    parser=parser,
    validator=validator,
    sampling_params=sampling_params
)

# Use cache_file to resume from previous results: the Agora class automatically creates a cache file (for example, "final_result.jsonl")
result = agora.run(num_instances=10000, num_threads=16, output_file="./results/final_result.json")
print(result[0])

Citation

If you find our work useful, please consider citing our paper!

@misc{kim2024evaluating,
      title={Evaluating Language Models as Synthetic Data Generators}, 
      author={Seungone Kim and Juyoung Suk and Xiang Yue and Vijay Viswanathan and Seongyun Lee and Yizhong Wang and Kiril Gashteovski and Carolin Lawrence and Sean Welleck and Graham Neubig},
      year={2024},
      eprint={2412.03679},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.03679}, 
}

Release history

0.1.2 (Dec 6, 2024)
0.1.1 (Dec 6, 2024, this version)

