A tool for generating data for SFT using LLM
Project description
SFT Data Generator
This is a tool for generating data for Supervised Fine-Tuning (SFT) using large language models like GPT-3.5. It can read data from CSV or Excel files and generate corresponding output data using OpenAI's API based on a given prompt.
Features
- Supports reading CSV and Excel files
- Uses OpenAI's Chat Completion API to generate data
- Supports customizable prompts and models
- Supports batch generation with customizable batch size
- Supports setting the number of generation epochs, i.e., generating multiple rounds for the entire dataset
- Supports JSON format output
- Accepts the OpenAI API base URL and key via command-line arguments or environment variables
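The interaction between batch size and generation epochs can be sketched as follows. This is only an illustration of the behaviour described in the feature list, not the tool's actual code; the function name `iter_batches` and its parameters are hypothetical.

```python
from typing import Iterator, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def iter_batches(
    rows: Sequence[T], batch_size: int = 1, epochs: int = 1
) -> Iterator[Tuple[int, List[T]]]:
    """Yield (epoch, batch) pairs so that every row is visited once per epoch.

    Hypothetical helper: the real tool may schedule its requests differently.
    """
    if batch_size < 1 or epochs < 1:
        raise ValueError("batch_size and epochs must be positive integers")
    for epoch in range(epochs):
        # Walk the dataset in consecutive slices of at most batch_size rows.
        for start in range(0, len(rows), batch_size):
            yield epoch, list(rows[start:start + batch_size])


if __name__ == "__main__":
    sample_rows = ["row-a", "row-b", "row-c"]
    for epoch, batch in iter_batches(sample_rows, batch_size=2, epochs=2):
        print(epoch, batch)
```

With three rows, a batch size of 2, and 2 epochs, the sketch produces four batches: two per pass over the dataset, the second of each pass holding the single leftover row.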
Usage
- Install dependencies:
    pip install -r requirements.txt
- Prepare the input data file (CSV or Excel) and the prompt template.
- Run the command:
    python data_generator.py --file_path <input_file_path> --prompt <prompt_template> --model <model_name> [other optional arguments]
  Required arguments:
  - --file_path: Path to the input data file; must be a CSV or Excel file.
  - --prompt: Prompt template used for generating data.
  - --model: Name of the OpenAI model used for generating data, e.g., gpt-3.5-turbo.
  Optional arguments:
  - --output_file: Output file path; defaults to the input file path + .output.jsonl.
  - --batch_size: Batch size, i.e., the number of samples per concurrent request; defaults to 1.
  - --generate_epoch: Number of generation epochs, i.e., the number of rounds to generate for the entire dataset; defaults to 1.
  - --openai_base_url: Base URL for the OpenAI API; if not given, the environment variable OPENAI_API_BASE is used.
  - --openai_api_key: OpenAI API key; if not given, the environment variable OPENAI_API_KEY is used.
  - --json_output: Whether to use JSON format output; defaults to False.
- The generated data is saved to the specified output file, with each line being a JSON object.
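The arguments and defaults listed above could be modelled with `argparse` roughly as shown here. This is a sketch under assumptions, not the contents of data_generator.py: the function names are hypothetical, and treating --json_output as a boolean switch is a guess based on its documented default of False.

```python
import argparse
import os
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented flags (hypothetical reconstruction)."""
    parser = argparse.ArgumentParser(description="Generate SFT data with an LLM")
    # Required arguments
    parser.add_argument("--file_path", required=True)
    parser.add_argument("--prompt", required=True)
    parser.add_argument("--model", required=True)
    # Optional arguments with the documented defaults
    parser.add_argument("--output_file", default=None)
    parser.add_argument("--batch_size", type=int, default=1)
    parser.add_argument("--generate_epoch", type=int, default=1)
    parser.add_argument("--openai_base_url", default=None)
    parser.add_argument("--openai_api_key", default=None)
    # Assumption: a plain on/off switch; the real flag may take a value.
    parser.add_argument("--json_output", action="store_true")
    return parser


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse arguments, then apply the documented fallbacks."""
    args = build_parser().parse_args(argv)
    if args.output_file is None:
        # Default output path: input file path + ".output.jsonl"
        args.output_file = args.file_path + ".output.jsonl"
    if args.openai_base_url is None:
        args.openai_base_url = os.environ.get("OPENAI_API_BASE")
    if args.openai_api_key is None:
        args.openai_api_key = os.environ.get("OPENAI_API_KEY")
    return args
```

In this sketch a value given on the command line always takes precedence, and the environment variable is consulted only when the flag is omitted.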
Example
python data_generator.py --file_path data.csv --prompt "Please generate a question-answer pair based on the given data:" --model gpt-3.5-turbo --batch_size 10 --generate_epoch 3
This will read the data.csv file, use the prompt "Please generate a question-answer pair based on the given data:", call the gpt-3.5-turbo model with a batch size of 10, and generate 3 epochs of data. The output will be saved in the data.csv.output.jsonl file.
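Because the output file holds one JSON object per line, it can be loaded back with a few lines of standard-library Python. The sketch below makes no assumption about which fields each object contains, since the README does not specify them; the helper name `load_jsonl` is hypothetical.

```python
import json
from pathlib import Path
from typing import Any, List, Union


def load_jsonl(path: Union[str, Path]) -> List[Any]:
    """Read a JSON Lines file into a list, skipping blank lines."""
    records: List[Any] = []
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as exc:
                # Report the offending line so a partial run is easy to inspect.
                raise ValueError(f"{path}:{line_number}: invalid JSON") from exc
    return records


if __name__ == "__main__":
    # Path taken from the example above; adjust it to your own run.
    output_path = Path("data.csv.output.jsonl")
    if output_path.exists():
        print(f"Loaded {len(load_jsonl(output_path))} records")
```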
File details
Details for the file sft-data-generator-0.0.2.tar.gz.
File metadata
- Download URL: sft-data-generator-0.0.2.tar.gz
- Upload date:
- Size: 6.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.10.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | dab2b0b3922ea58edbcfd317037135a4fdbd4016a788a1b3fbc71c5a79c5e2c9 |
| MD5 | 14e59887c44955e3687846991a112e0a |
| BLAKE2b-256 | c2a883be7a9cf2dac7531b9bacb8b9368fd5c53eb2ae884848244255085e821b |
File details
Details for the file sft_data_generator-0.0.2-py3-none-any.whl.
File metadata
- Download URL: sft_data_generator-0.0.2-py3-none-any.whl
- Upload date:
- Size: 6.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.10.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 4453d3efde8d52169f3c21812930ab242f600fd45f64798458d0851c02cb3bfc |
| MD5 | c37e1434f4afc158e43722db0cf85f4f |
| BLAKE2b-256 | 8d4338ae1ea1c84da3f327527e5a17133a56770c6cc51fd8199da691e4dfa15d |