
A tool for generating data for SFT using LLMs

Project description

SFT Data Generator

This is a tool for generating data for Supervised Fine-Tuning (SFT) using large language models (LLMs) such as GPT-3.5. It reads data from CSV or Excel files and, given a prompt template, generates corresponding output data via OpenAI's API.

Features

  • Supports reading CSV and Excel files
  • Uses OpenAI's Chat Completion API to generate data
  • Supports customizable prompts and models
  • Supports batch generation with customizable batch size
  • Supports setting the number of generation epochs, i.e., generating multiple rounds for the entire dataset
  • Supports JSON format output
  • Can set OpenAI API URL and key via command line arguments or environment variables
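The interaction between batch size and generation epochs can be illustrated with a small sketch. The `iter_batches` helper below is purely illustrative and not part of the tool itself:

```python
# Illustrative sketch of how batch size and generation epochs relate;
# `iter_batches` is a hypothetical helper, not the tool's actual code.
def iter_batches(rows, batch_size, generate_epoch):
    """Yield lists of at most `batch_size` rows, repeating the
    whole dataset `generate_epoch` times."""
    for _ in range(generate_epoch):
        for start in range(0, len(rows), batch_size):
            yield rows[start:start + batch_size]

batches = list(iter_batches(["r1", "r2", "r3"], batch_size=2, generate_epoch=2))
print(len(batches))  # 3 rows at batch size 2 -> 2 batches per epoch, 4 in total
```

Each yielded batch would correspond to one concurrent request against the API.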

Usage

  1. Install dependencies:

    pip install -r requirements.txt
    
  2. Prepare the input data file (CSV or Excel) and prompt template.

  3. Run the command:

    sft-data-generator --file_path <input_file_path> --prompt <prompt_template> --model <model_name> [other optional arguments]
    

    or

    python data_generator.py --file_path <input_file_path> --prompt <prompt_template> --model <model_name> [other optional arguments]
    

    Required arguments:

    • --file_path: Path to the input data file, must be a CSV or Excel file.
    • --prompt: Prompt template used for generating data.
    • --model: Name of the OpenAI model used for generating data, e.g., gpt-3.5-turbo.

    Optional arguments:

    • --output_file: Output file path; defaults to the input file path with .output.jsonl appended.
    • --batch_size: Batch size, i.e., the number of samples per concurrent request; defaults to 1.
    • --generate_epoch: Number of generation epochs, i.e., the number of rounds to generate over the entire dataset; defaults to 1.
    • --openai_base_url: Base URL for the OpenAI API; if not given, the environment variable OPENAI_API_BASE is used.
    • --openai_api_key: OpenAI API key; if not given, the environment variable OPENAI_API_KEY is used.
    • --json_output: Whether to output in JSON format; defaults to False.
  4. The generated data will be saved in the specified output file, with each line being a JSON object.
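As a rough sketch of what steps 2–4 amount to for a single sample: the prompt template and one input row are combined into a chat message, and each result is stored as one JSON object per line (JSON Lines). The helper names and message layout below are assumptions for illustration, not the tool's actual internals:

```python
import json

# Hypothetical per-sample flow (helper names and message layout are
# assumptions, not the tool's actual code).
def build_messages(prompt, row_text):
    # Combine the prompt template with one row of input data.
    return [{"role": "user", "content": f"{prompt}\n{row_text}"}]

def append_jsonl(path, record):
    # Append one JSON object as a single line (JSON Lines format).
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

messages = build_messages(
    "Please generate a question-answer pair based on the given data:",
    "Python is a high-level programming language.",
)
# In the real tool the output would come from the Chat Completion API;
# a placeholder string stands in for the model reply here.
append_jsonl("data.csv.output.jsonl", {"input": messages[0]["content"], "output": "..."})
```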

Example

sft-data-generator --file_path data.csv --prompt "Please generate a question-answer pair based on the given data:" --model gpt-3.5-turbo --batch_size 10 --generate_epoch 3

or

python data_generator.py --file_path data.csv --prompt "Please generate a question-answer pair based on the given data:" --model gpt-3.5-turbo --batch_size 10 --generate_epoch 3

This will read the data.csv file, use the prompt "Please generate a question-answer pair based on the given data:", call the gpt-3.5-turbo model with a batch size of 10, and generate 3 epochs of data. The output will be saved in the data.csv.output.jsonl file.
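Since each line of the output file is an independent JSON object, the results can be loaded line by line. The two sample records below are hypothetical and are written first only so the loading loop runs stand-alone:

```python
import json

# Two hypothetical records stand in for the model's generations so the
# loading loop below is runnable on its own.
with open("data.csv.output.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps({"output": "Q: What is Python? A: A programming language."}) + "\n")
    f.write(json.dumps({"output": "Q: What is HTTP? A: A protocol for the web."}) + "\n")

# Load the JSONL output: one JSON object per line.
with open("data.csv.output.jsonl", encoding="utf-8") as f:
    records = [json.loads(line) for line in f]

print(len(records))  # 2
```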


Download files


Source Distribution

sft-data-generator-0.0.3.tar.gz (6.2 kB)

Uploaded Source

Built Distribution


sft_data_generator-0.0.3-py3-none-any.whl (6.8 kB)

Uploaded Python 3

File details

Details for the file sft-data-generator-0.0.3.tar.gz.

File metadata

  • Download URL: sft-data-generator-0.0.3.tar.gz
  • Upload date:
  • Size: 6.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.10.13

File hashes

Hashes for sft-data-generator-0.0.3.tar.gz:

  • SHA256: 0067840f1f69b02d616146216c32fe807aa797b3828d44d9c01dc0ca9979bde9
  • MD5: 42cc6fa1b37a4712662ff9ec8dae40d3
  • BLAKE2b-256: e133253dda90dae6d2867392b26857129a610260e1a134d82fe43635d9240599


File details

Details for the file sft_data_generator-0.0.3-py3-none-any.whl.

File hashes

Hashes for sft_data_generator-0.0.3-py3-none-any.whl:

  • SHA256: b9c961ed392a199e7c15ca8ee67c062a218e724934117f70f2028ef6c06cd7f6
  • MD5: a2a97cc6a654f48778323c45dc88d490
  • BLAKE2b-256: 4fcadc182544055f0110bb628e26280143e2c2069bc3f341dca18ee94c1410a3

