🏛️ Agora 🏛️

⚡ A repository for generating synthetic data with LLMs & evaluating LLMs' data generation capabilities 🚀 ⚡

Latest News 🔥

  • [2024/12] We release Agora and AgoraBench!
    • AgoraBench covers 9 settings, measuring data generation capabilities across 3 domains and 3 data generation methods.
    • Agora is an easily customizable framework for data generation with LLMs.
    • Check out our dataset, checkpoints, leaderboard, and the code!

What does Agora mean?

In ancient Athens, the Agora was a public space where citizens would gather to debate, share news, learn from each other, and listen to famous philosophers.

AgoraBench draws an analogy between data generators and teachers: different generators teach student models using synthetic data!

🔧 Installation

Installation with pip:

pip install data-agora

Project Structure 📁

Root Directory

.
├── agora_scripts/           # Scripts for converting and handling data formats
│   ├── prompts/            # Various prompt templates
│   └── run.py             # Main execution script
├── assets/                 # Project images and visual assets
├── libs/                   # Core libraries
│   └── data-agora/        # Main data processing library
│       ├── data_agora/    # Core data agora implementation
│       │   ├── core/      # Core functionality (LLMs, parsers, validators)
├── train/                  # Training related code (based on llama-recipes)
└── LICENSE

data-agora Library (libs/data-agora/)

  • Core implementation for data processing and handling
  • Includes LLM integrations (OpenAI, vLLM, etc.)
  • Parsers and validators for data processing
  • Serving capabilities for deployment

Agora Scripts (agora_scripts/)

  • Tools for data format conversion
  • Collection of prompt templates for different use cases
  • Main execution script for running the pipeline

Training (train/)

  • Based on Meta's llama-recipes repository
  • Contains training configurations and utilities

Usage Guide 🚀

The library supports two use cases:

  1. Testing an LM's Data Generation Capability with AgoraBench: Using the pre-built pipeline, you can measure and compare the data generation capabilities of different LLMs.
  2. Custom Usage: You can customize the pipeline for your own tasks to generate large amounts of synthetic data.

Testing an LM's Data Generation Capability with AgoraBench

Step 1: Generate Data with Pre-built Pipeline

Run the following script:

cd "./agora_scripts"

python3 run.py --method "instance_generation" --domain "math" --model_name "gpt-4o-mini-2024-07-18" --max_tokens 4096 --temperature 1.0 --num_instances 10000 --num_threads 4 --api_key ""

  • method should be one of "instance_generation", "response_generation", or "quality_enhancement".

  • domain should be one of "math", "general", or "code".

  • model_name should exactly match the name you use to call the model through the OpenAI API, LiteLLM, or vLLM.

  • The resulting dataset should look as follows:

[
   {
      "config": "",
      "instruction": "",
      "response": ""
   },
   [...]
]
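
Before uploading, it can help to check that every record follows the format above. The following is a minimal sketch rather than part of the library: the helper name find_malformed and the rule that instruction and response must be non-empty are assumptions made for illustration.

```python
import json

# Keys taken from the dataset format shown above.
REQUIRED_KEYS = ("config", "instruction", "response")


def find_malformed(records):
    """Return the indices of records that lack a required key or
    have an empty instruction or response (hypothetical helper)."""
    bad = []
    for i, rec in enumerate(records):
        if not isinstance(rec, dict) or any(k not in rec for k in REQUIRED_KEYS):
            bad.append(i)
        elif not str(rec["instruction"]).strip() or not str(rec["response"]).strip():
            bad.append(i)
    return bad


# Toy data standing in for the generated file.
sample = json.loads(
    '[{"config": "", "instruction": "What is 2 + 3?", "response": "5"},'
    ' {"config": "", "instruction": "", "response": "x"},'
    ' {"instruction": "no config key"}]'
)
malformed = find_malformed(sample)
print(malformed)  # [1, 2]
```

In practice you would load the generated file with json.load and drop or inspect the reported indices before moving on to Step 2.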

Step 2: Upload the Dataset to Hugging Face

You can use the following function:

from datasets import Dataset, DatasetDict

def upload_to_huggingface(data, dataset_name, hf_key):
    # data: list of dicts with "config", "instruction", and "response" keys
    dataset = Dataset.from_list(data)
    dataset_dict = DatasetDict({"train": dataset})
    # Push as a private dataset; hf_key is a Hugging Face access token with write permission
    dataset_dict.push_to_hub(dataset_name, token=hf_key, private=True)

Step 3: Train Student Models with Synthetic Data

The training code is adapted from Meta's llama-recipes.

First, install the required packages:

cd ./llama-recipes
pip3 install -r requirements.txt
pip3 install -e .
pip3 install wandb
wandb login
huggingface-cli login

Then, launch the following code:

gpu=4
lr=1e-5
num_epochs=""
checkpoint_dir=""
hf_cache_dir=""
hf_dataset_name=""

torchrun --nnodes 1 --nproc_per_node $gpu \
        src/llama_recipes/finetuning.py \
        --model_name meta-llama/Meta-Llama-3.1-8B \
        --dist_checkpoint_root_folder "${checkpoint_dir}" \
        --dist_checkpoint_folder "${hf_dataset_name}" \
        --hf_cache_dir "${hf_cache_dir}" \
        --dataset "$hf_dataset_name" \
        --run_validation True \
        --context_length 4096 \
        --gradient_accumulation_steps 8 \
        --batching_strategy "packing" \
        --use_fast_kernels \
        --enable_fsdp \
        --pure_bf16 \
        --low_cpu_fsdp \
        --batch_size_training 2 \
        --num_epochs $num_epochs \
        --lr $lr \
        --weight_decay 0.01 \
        --use_wandb

  • You have to fill in:

    • checkpoint_dir (where the checkpoint is saved)
    • hf_cache_dir (where the Hugging Face cache is stored)
    • hf_dataset_name (the dataset you uploaded to Hugging Face in Step 2)
    • num_epochs (the number of training epochs)
  • To upload the checkpoint to Hugging Face, you can refer to this code.
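
To get a feel for the training configuration above, the effective batch size can be worked out from the flags. This is a rough sketch that assumes batch_size_training is a per-GPU value and that the "packing" strategy fills every sequence to context_length tokens; neither is stated in this README.

```python
# Values copied from the launch command above.
num_gpus = 4                      # --nproc_per_node
batch_size_training = 2           # assumed to be per GPU
gradient_accumulation_steps = 8
context_length = 4096

# Sequences that contribute to a single optimizer step.
effective_batch_size = num_gpus * batch_size_training * gradient_accumulation_steps

# With packing, each sequence is assumed to be filled to context_length tokens.
tokens_per_step = effective_batch_size * context_length

print(effective_batch_size)  # 64
print(tokens_per_step)       # 262144
```

If you change the number of GPUs, adjusting gradient_accumulation_steps so that the product stays the same keeps the optimization setup comparable.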

Step 4: Evaluate Student Models and Measure Performance Gap Recovered (PGR)

For evaluating the trained student models, we used the following libraries:

  • AlpacaEval 2.0 (Instruction-following): link
  • Arena-Hard (Instruction-following): link
  • MBPP (Code): link
  • Human-Eval (Code): link

For GSM8K (Math) and MATH (Math), we implemented our own evaluation code: TO BE ADDED
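
Once you have benchmark scores, PGR can be computed with a few lines of code. This README does not spell out the formula, so the sketch below assumes the definition from the accompanying paper: the improvement of the trained student over the base model, relative to the improvement of a reference model over the same base model, times 100. The function name and the example scores are made up for illustration.

```python
def performance_gap_recovered(trained_score, base_score, reference_score):
    """PGR in percent, under the assumed definition:
    (trained - base) / (reference - base) * 100."""
    gap = reference_score - base_score
    if gap == 0:
        raise ValueError("reference and base scores are equal; PGR is undefined")
    return (trained_score - base_score) / gap * 100.0


# Made-up scores: the student recovers half of the gap to the reference model.
pgr = performance_gap_recovered(trained_score=50.0, base_score=40.0, reference_score=60.0)
print(pgr)  # 50.0
```

A negative value means the student trained on synthetic data scored below the base model on that benchmark.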

Custom Usage

For custom usage with different pipelines, parsing mechanisms, and validation logic, Agora supports convenient customization through abstract classes.

Prompt Loader: A class that prepares the meta-prompt passed to the data generator.

class CustomPromptLoader(InstanceGenerationPromptLoader):
    def __init__(self, prompt_template: str, seed_data: List[Dict], num_fewshot: int, placeholder_formats: Dict[str, str] = None, num_sample_from_seed_data: Optional[int] = None, [...]):
        super().__init__(prompt_template, seed_data, num_fewshot, placeholder_formats, num_sample_from_seed_data)
        [...]

    def prepare(self) -> PromptResult:
        [...]
        return PromptResult(prompt=prompt, metadata=metadata)

Parser: A class that separates the instruction and response from the data generator's output.

class CustomParser(Parser):
    def parse(self, prompt, teacher_model_output, placeholder_formats, [...]):
        [...]
        return {"instruction": instruction, "response": response}
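
As an illustration of what the body of parse could do, here is a minimal standalone sketch. It is not the library's implementation: the function name is hypothetical, it assumes the generator writes the instruction after "INPUT:" and the response after "OUTPUT:", ending with "[END]", and returning None on failure is a choice made for this example.

```python
def split_instruction_response(teacher_model_output, placeholder_formats):
    """Split raw generator output into an instruction and a response.
    Returns None when the expected triggers are missing (an assumption)."""
    in_trigger = placeholder_formats["test_input_trigger"]
    out_trigger = placeholder_formats["test_output_trigger"]
    stop_phrase = placeholder_formats["stop_phrase"]

    # Discard everything after the stop phrase, if present.
    text = teacher_model_output.split(stop_phrase, 1)[0]
    if out_trigger not in text:
        return None

    before, after = text.split(out_trigger, 1)
    # Keep only the text that follows the input trigger.
    if in_trigger in before:
        before = before.split(in_trigger, 1)[1]

    instruction, response = before.strip(), after.strip()
    if not instruction or not response:
        return None
    return {"instruction": instruction, "response": response}


# Example formats for this sketch only.
formats = {
    "test_input_trigger": "INPUT:",
    "test_output_trigger": "OUTPUT:",
    "stop_phrase": "[END]",
}
parsed = split_instruction_response("INPUT: What is 2 + 3?\nOUTPUT: 5 [END] extra text", formats)
print(parsed)  # {'instruction': 'What is 2 + 3?', 'response': '5'}
```

Inside a CustomParser, the same logic would go in parse, using the teacher_model_output and placeholder_formats arguments it receives.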

Validator: A class that determines if the output is valid or not.

class CustomValidator(Validator):
   def validate(self, instruction: str, response: str, [...]):
      [...]
      if [...]:
        return True
      else:
        return False
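
The checks inside validate depend on your task. As one possible sketch (not the library's built-in logic), the standalone function below rejects empty fields, overly long pairs, and repeated instructions. The function name, the max_chars threshold, and the seen_instructions set are all assumptions made for this example.

```python
def is_valid_pair(instruction, response, seen_instructions, max_chars=8000):
    """Return True if the pair passes three simple checks (hypothetical helper)."""
    # Normalize case and whitespace so near-identical instructions collide.
    normalized = " ".join(instruction.lower().split())

    # 1. Both fields must contain text.
    if not normalized or not response.strip():
        return False
    # 2. The pair must fit within a length budget.
    if len(instruction) + len(response) > max_chars:
        return False
    # 3. The instruction must not have been seen before.
    if normalized in seen_instructions:
        return False

    seen_instructions.add(normalized)
    return True


seen = set()
first = is_valid_pair("What is 2 + 3?", "5", seen)
duplicate = is_valid_pair("what is  2 + 3?", "5", seen)
empty = is_valid_pair("Explain recursion.", "   ", seen)
print(first, duplicate, empty)  # True False False
```

A CustomValidator could hold the seen set as an attribute and call such a function from validate. With num_threads greater than 1, access to the shared set may need a lock.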

All together

You can then write a script that uses the custom classes to generate data.

import json

# Modify the placeholder formats based on your prompt template.
# Demonstration-related placeholders are only used for instance generation.
# The input_theme placeholder is an example of a custom placeholder.

placeholder_formats = {
    "demonstration_input_placeholder": "<input@>",
    "demonstration_output_placeholder": "<output@>",
    "test_input_placeholder": "<input>",
    "test_output_placeholder": "<output>",
    "test_input_trigger": "INPUT:",
    "test_output_trigger": "OUTPUT:",
    "stop_phrase": "[END]",
    "input_theme": "<input_theme>",
}


with open("", "r") as f:
    seed_data = json.load(f)

with open("", "r") as f:
    prompt_template = f.read()

llm = OpenAILLM(model_name="gpt-4o-mini-2024-07-18", api_key="")

prompt_loader = CustomPromptLoader(prompt_template=prompt_template, seed_data=seed_data, num_fewshot=3, placeholder_formats=placeholder_formats, num_sample_from_seed_data=2)
parser = CustomParser()
validator = CustomValidator()


sampling_params = {
    "max_tokens": args.max_tokens,
    "temperature": args.temperature,
    "top_p": 0.9,
    "stop": placeholder_formats["stop_phrase"]
}

agora = Agora(
    llm=llm,
    placeholder_formats=placeholder_formats,
    prompt_loader=prompt_loader,
    parser=parser,
    validator=validator,
    sampling_params=sampling_params
)

# Use cache_file to resume from previous results: the Agora class will automatically make a cache file, for example "final_result.jsonl"
result = agora.run(num_instances=10000, num_threads=16, output_file="./results/final_result.json")
print(result[0])

Citation

If you find our work useful, please consider citing our paper!

@misc{kim2024evaluating,
      title={Evaluating Language Models as Synthetic Data Generators}, 
      author={Seungone Kim and Juyoung Suk and Xiang Yue and Vijay Viswanathan and Seongyun Lee and Yizhong Wang and Kiril Gashteovski and Carolin Lawrence and Sean Welleck and Graham Neubig},
      year={2024},
      eprint={2412.03679},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.03679}, 
}
