
A tool for generating SFT data with an LLM


SFT Data Generator

This is a tool for generating data for Supervised Fine-Tuning (SFT) using large language models such as GPT-3.5. It reads data from CSV or Excel files and, based on a given prompt, generates the corresponding output data through OpenAI's API.

Features

  • Supports reading CSV and Excel files
  • Uses OpenAI's Chat Completion API to generate data
  • Supports customizable prompts and models
  • Supports batch generation with customizable batch size
  • Supports setting the number of generation epochs, i.e., generating multiple rounds for the entire dataset
  • Supports JSON format output
  • Accepts the OpenAI API base URL and key via command-line arguments or environment variables
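
The batch size and epoch features can be pictured with a short sketch. This is only an illustration under assumptions: the helper name `iter_batches` is hypothetical, and the tool's actual batching code is not shown in this README. It splits the dataset into batches of `batch_size` rows and repeats the whole pass once per epoch.

```python
from typing import Iterator, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def iter_batches(
    rows: Sequence[T], batch_size: int = 1, epochs: int = 1
) -> Iterator[Tuple[int, List[T]]]:
    """Yield (epoch, batch) pairs so that every row is covered once per epoch.

    Hypothetical helper, not taken from the package itself.
    """
    if batch_size < 1 or epochs < 1:
        raise ValueError("batch_size and epochs must be positive integers")
    for epoch in range(epochs):
        # Walk the dataset in steps of batch_size; the last batch may be shorter.
        for start in range(0, len(rows), batch_size):
            yield epoch, list(rows[start:start + batch_size])


rows = ["row-a", "row-b", "row-c"]
batches = list(iter_batches(rows, batch_size=2, epochs=2))
for epoch, batch in batches:
    print(epoch, batch)
```

With three rows, a batch size of 2, and 2 epochs, the loop yields four batches: two per epoch, the second of which holds the single leftover row.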

Usage

  1. Install dependencies:

    pip install -r requirements.txt
    
  2. Prepare the input data file (CSV or Excel) and prompt template.

  3. Run the command:

    python data_generator.py --file_path <input_file_path> --prompt <prompt_template> --model <model_name> [other optional arguments]
    

    Required arguments:

    • --file_path: Path to the input data file, must be a CSV or Excel file.
    • --prompt: Prompt template used for generating data.
    • --model: Name of the OpenAI model used for generating data, e.g., gpt-3.5-turbo.

    Optional arguments:

    • --output_file: Output file path, defaults to the input file path with .output.jsonl appended.
    • --batch_size: Batch size, i.e., the number of samples per concurrent request, defaults to 1.
    • --generate_epoch: Number of generation epochs, i.e., the number of rounds to generate for the entire dataset, defaults to 1.
    • --openai_base_url: Base URL for the OpenAI API. If not given, the environment variable OPENAI_API_BASE is used.
    • --openai_api_key: OpenAI API key. If not given, the environment variable OPENAI_API_KEY is used.
    • --json_output: Whether to use JSON format output, defaults to False.
  4. The generated data will be saved in the specified output file, with each line being a JSON object.
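
The argument handling described above can be sketched with the standard library's argparse. This is a minimal illustration, not the code in data_generator.py: the function name `parse_args` is hypothetical, and treating --json_output as an on/off switch is an assumption, since the README only says it defaults to False.

```python
import argparse
import os
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the documented flags and apply the documented defaults (sketch)."""
    parser = argparse.ArgumentParser(description="Sketch of the documented CLI")
    # Required arguments
    parser.add_argument("--file_path", required=True)
    parser.add_argument("--prompt", required=True)
    parser.add_argument("--model", required=True)
    # Optional arguments
    parser.add_argument("--output_file", default=None)
    parser.add_argument("--batch_size", type=int, default=1)
    parser.add_argument("--generate_epoch", type=int, default=1)
    parser.add_argument("--openai_base_url", default=None)
    parser.add_argument("--openai_api_key", default=None)
    # Assumption: a boolean switch; the real tool may expect an explicit value.
    parser.add_argument("--json_output", action="store_true")
    args = parser.parse_args(argv)

    # Default output path: input file path + ".output.jsonl"
    if args.output_file is None:
        args.output_file = args.file_path + ".output.jsonl"
    # Fall back to environment variables when the flags are omitted.
    if args.openai_base_url is None:
        args.openai_base_url = os.environ.get("OPENAI_API_BASE")
    if args.openai_api_key is None:
        args.openai_api_key = os.environ.get("OPENAI_API_KEY")
    return args


args = parse_args(
    ["--file_path", "data.csv", "--prompt", "Summarize:", "--model", "gpt-3.5-turbo"]
)
print(args.output_file, args.batch_size, args.generate_epoch)
```

Resolving the fallbacks after parsing keeps the precedence explicit: a value given on the command line always wins over the environment variable.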

Example

python data_generator.py --file_path data.csv --prompt "Please generate a question-answer pair based on the given data:" --model gpt-3.5-turbo --batch_size 10 --generate_epoch 3

This will read the data.csv file, use the prompt "Please generate a question-answer pair based on the given data:", call the gpt-3.5-turbo model with a batch size of 10, and generate 3 epochs of data. The output will be saved in the data.csv.output.jsonl file.
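
Because each line of the output file is a JSON object, the results can be loaded with a few lines of Python. The sketch below assumes only that one-object-per-line layout; the field names inside each object are not documented here, so the code does not rely on any of them, and `load_jsonl` is a hypothetical helper name.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def load_jsonl(path: str) -> List[Dict[str, Any]]:
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    records: List[Dict[str, Any]] = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Output path from the example command above; only read it if it exists.
output_path = Path("data.csv.output.jsonl")
if output_path.exists():
    records = load_jsonl(str(output_path))
    print(f"{len(records)} records loaded")
```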
