Skip to main content

A tool for generating data for SFT using LLM

Project description

SFT Data Generator

This is a tool for generating data for Supervised Fine-Tuning (SFT) using large language models like GPT-3.5. It can read data from CSV or Excel files and generate corresponding output data using OpenAI's API based on a given prompt.

Features

  • Supports reading CSV and Excel files
  • Uses OpenAI's Chat Completion API to generate data
  • Supports customizable prompts and models
  • Supports batch generation with customizable batch size
  • Supports setting the number of generation epochs, i.e., generating multiple rounds for the entire dataset
  • Supports JSON format output
  • Can set OpenAI API URL and key via command line arguments or environment variables

Usage

  1. Install dependencies:

    pip install -r requirements.txt
    
  2. Prepare the input data file (CSV or Excel) and prompt template.

  3. Run the command:

    python data_generator.py --file_path <input_file_path> --prompt <prompt_template> --model <model_name> [other optional arguments]
    

    Required arguments:

    • --file_path: Path to the input data file, must be a CSV or Excel file.
    • --prompt: Prompt template used for generating data.
    • --model: Name of the OpenAI model used for generating data, e.g., gpt-3.5-turbo.

    Optional arguments:

    • --output_file: Output file path, defaults to input file path + .output.jsonl.
    • --batch_size: Batch size, i.e., the number of samples per concurrent request, defaults to 1.
    • --generate_epoch: Number of generation epochs, i.e., the number of rounds to generate for the entire dataset, defaults to 1.
    • --openai_base_url: Base URL for the OpenAI API, if not given, will use the environment variable OPENAI_API_BASE.
    • --openai_api_key: OpenAI API key, if not given, will use the environment variable OPENAI_API_KEY.
    • --json_output: Whether to use JSON format output, defaults to False.
  4. The generated data will be saved in the specified output file, with each line being a JSON object.

Example

python data_generator.py --file_path data.csv --prompt "Please generate a question-answer pair based on the given data:" --model gpt-3.5-turbo --batch_size 10 --generate_epoch 3

This will read the data.csv file, use the prompt "Please generate a question-answer pair based on the given data:", call the gpt-3.5-turbo model with a batch size of 10, and generate 3 epochs of data. The output will be saved in the data.csv.output.jsonl file.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sft-data-generator-0.0.2.tar.gz (6.2 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

sft_data_generator-0.0.2-py3-none-any.whl (6.8 kB view details)

Uploaded Python 3

File details

Details for the file sft-data-generator-0.0.2.tar.gz.

File metadata

  • Download URL: sft-data-generator-0.0.2.tar.gz
  • Upload date:
  • Size: 6.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.10.13

File hashes

Hashes for sft-data-generator-0.0.2.tar.gz
Algorithm Hash digest
SHA256 dab2b0b3922ea58edbcfd317037135a4fdbd4016a788a1b3fbc71c5a79c5e2c9
MD5 14e59887c44955e3687846991a112e0a
BLAKE2b-256 c2a883be7a9cf2dac7531b9bacb8b9368fd5c53eb2ae884848244255085e821b

See more details on using hashes here.

File details

Details for the file sft_data_generator-0.0.2-py3-none-any.whl.

File metadata

File hashes

Hashes for sft_data_generator-0.0.2-py3-none-any.whl
Algorithm Hash digest
SHA256 4453d3efde8d52169f3c21812930ab242f600fd45f64798458d0851c02cb3bfc
MD5 c37e1434f4afc158e43722db0cf85f4f
BLAKE2b-256 8d4338ae1ea1c84da3f327527e5a17133a56770c6cc51fd8199da691e4dfa15d

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page