
A tool for generating data for SFT using LLMs

Project description

SFT Data Generator

This is a tool for generating data for Supervised Fine-Tuning (SFT) using large language models (LLMs) such as GPT-3.5. It reads data from CSV or Excel files and, given a prompt template, generates corresponding output data via OpenAI's API.

Features

  • Supports reading CSV and Excel files
  • Uses OpenAI's Chat Completion API to generate data
  • Supports customizable prompts and models
  • Supports batch generation with customizable batch size
  • Supports setting the number of generation epochs, i.e., generating multiple rounds for the entire dataset
  • Supports JSON format output
  • Can set OpenAI API URL and key via command line arguments or environment variables
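The interaction between batch size and generation epochs can be illustrated with a small sketch. The `iter_batches` helper below is purely illustrative and not part of the tool itself:

```python
# Illustrative sketch of how batch size and generation epochs relate;
# `iter_batches` is a hypothetical helper, not the tool's actual code.
def iter_batches(rows, batch_size, generate_epoch):
    """Yield lists of at most `batch_size` rows, repeating the
    whole dataset `generate_epoch` times."""
    for _ in range(generate_epoch):
        for start in range(0, len(rows), batch_size):
            yield rows[start:start + batch_size]

batches = list(iter_batches(["r1", "r2", "r3"], batch_size=2, generate_epoch=2))
print(len(batches))  # 3 rows at batch size 2 -> 2 batches per epoch, 4 in total
```

Each yielded batch would correspond to one concurrent request against the API.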

Usage

  1. Install dependencies:

    pip install -r requirements.txt
    
  2. Prepare the input data file (CSV or Excel) and prompt template.

  3. Run the command:

    sft-data-generator --file_path <input_file_path> --prompt <prompt_template> --model <model_name> [other optional arguments]
    

    or

    python data_generator.py --file_path <input_file_path> --prompt <prompt_template> --model <model_name> [other optional arguments]
    

    Required arguments:

    • --file_path: Path to the input data file, must be a CSV or Excel file.
    • --prompt: Prompt template used for generating data.
    • --model: Name of the OpenAI model used for generating data, e.g., gpt-3.5-turbo.

    Optional arguments:

    • --output_file: Output file path; defaults to the input file path with .output.jsonl appended.
    • --batch_size: Batch size, i.e., the number of samples per concurrent request; defaults to 1.
    • --generate_epoch: Number of generation epochs, i.e., the number of rounds to generate over the entire dataset; defaults to 1.
    • --openai_base_url: Base URL for the OpenAI API; if not given, the environment variable OPENAI_API_BASE is used.
    • --openai_api_key: OpenAI API key; if not given, the environment variable OPENAI_API_KEY is used.
    • --json_output: Whether to output in JSON format; defaults to False.
  4. The generated data will be saved in the specified output file, with each line being a JSON object.
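As a rough sketch of what steps 2–4 amount to for a single sample: the prompt template and one input row are combined into a chat message, and each result is stored as one JSON object per line (JSON Lines). The helper names and message layout below are assumptions for illustration, not the tool's actual internals:

```python
import json

# Hypothetical per-sample flow (helper names and message layout are
# assumptions, not the tool's actual code).
def build_messages(prompt, row_text):
    # Combine the prompt template with one row of input data.
    return [{"role": "user", "content": f"{prompt}\n{row_text}"}]

def append_jsonl(path, record):
    # Append one JSON object as a single line (JSON Lines format).
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

messages = build_messages(
    "Please generate a question-answer pair based on the given data:",
    "Python is a high-level programming language.",
)
# In the real tool the output would come from the Chat Completion API;
# a placeholder string stands in for the model reply here.
append_jsonl("data.csv.output.jsonl", {"input": messages[0]["content"], "output": "..."})
```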

Example

sft-data-generator --file_path data.csv --prompt "Please generate a question-answer pair based on the given data:" --model gpt-3.5-turbo --batch_size 10 --generate_epoch 3

or

python data_generator.py --file_path data.csv --prompt "Please generate a question-answer pair based on the given data:" --model gpt-3.5-turbo --batch_size 10 --generate_epoch 3

This will read the data.csv file, use the prompt "Please generate a question-answer pair based on the given data:", call the gpt-3.5-turbo model with a batch size of 10, and generate 3 epochs of data. The output will be saved in the data.csv.output.jsonl file.
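Since each line of the output file is an independent JSON object, the results can be loaded line by line. The two sample records below are hypothetical and are written first only so the loading loop runs stand-alone:

```python
import json

# Two hypothetical records stand in for the model's generations so the
# loading loop below is runnable on its own.
with open("data.csv.output.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps({"output": "Q: What is Python? A: A programming language."}) + "\n")
    f.write(json.dumps({"output": "Q: What is HTTP? A: A protocol for the web."}) + "\n")

# Load the JSONL output: one JSON object per line.
with open("data.csv.output.jsonl", encoding="utf-8") as f:
    records = [json.loads(line) for line in f]

print(len(records))  # 2
```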


Download files


Source Distribution

sft-data-generator-0.0.3.tar.gz (6.2 kB)

Uploaded Source

Built Distribution


sft_data_generator-0.0.3-py3-none-any.whl (6.8 kB)

Uploaded Python 3

File details

Details for the file sft-data-generator-0.0.3.tar.gz.

File metadata

  • Download URL: sft-data-generator-0.0.3.tar.gz
  • Upload date:
  • Size: 6.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.10.13

File hashes

Hashes for sft-data-generator-0.0.3.tar.gz:

  • SHA256: 0067840f1f69b02d616146216c32fe807aa797b3828d44d9c01dc0ca9979bde9
  • MD5: 42cc6fa1b37a4712662ff9ec8dae40d3
  • BLAKE2b-256: e133253dda90dae6d2867392b26857129a610260e1a134d82fe43635d9240599


File details

Details for the file sft_data_generator-0.0.3-py3-none-any.whl.

File hashes

Hashes for sft_data_generator-0.0.3-py3-none-any.whl:

  • SHA256: b9c961ed392a199e7c15ca8ee67c062a218e724934117f70f2028ef6c06cd7f6
  • MD5: a2a97cc6a654f48778323c45dc88d490
  • BLAKE2b-256: 4fcadc182544055f0110bb628e26280143e2c2069bc3f341dca18ee94c1410a3

