Fine-Tune GPT-3 Data Generator is a python package that generate fake datasets.
Project description
GPT3DataGen
GPT3DataGen is a python package that generates fake data for fine-tuning your openai
models.
_ ___ _ _
( )_ /'_ ) ( ) ( )_
__ _ _ | ,_)(_)_) | _| | _ _ | ,_) _ _ __ __ ___
/'_ `\( '_`\ | | _(_ < /'_` | /'_` )| | /'_` ) /'_ `\ /'__`\/' _ `\
( (_) || (_) )| |_ ( )_) |( (_| |( (_| || |_ ( (_| |( (_) |( ___/| ( ) |
`\__ || ,__/'`\__)`\____)`\__,_)`\__,_)`\__)`\__,_)`\__ |`\____)(_) (_)v1.0.3
( )_) || | ( )_) |
\___/'(_) \___/'
Install with pip. See Install & Usage Guide
pip install -U gpt3datagen
Alternatively, the following command will pull and install the latest commit from this repository, along with its Python dependencies:
pip install git+https://github.com/donwany/gpt3datagen.git --use-pep517
Or git clone repository:
git clone https://github.com/donwany/gpt3datagen.git
cd gpt3datagen
make install && pip install -e .
To update the package to the latest version of this repository, please run:
pip install --upgrade --no-deps --force-reinstall git+https://github.com/donwany/gpt3datagen.git
Command-Line Usage
Run the following to view all available options:
gpt3datagen --help
gpt3datagen --version
Output formats: jsonl
, json
, csv
, tsv
, xlsx
gpt3datagen \
--num_samples 500 \
--max_length 2048 \
--sample_type "classification" \
--output_format "jsonl" \
--output_dir .
gpt3datagen \
--num_samples 500 \
--max_length 2048 \
--sample_type completion \
--output_format csv \
--output_dir .
gpt3datagen \
--sample_type completion \
--output_format jsonl \
--output_dir .
gpt3datagen --sample_type completion -o . -f jsonl
gpt3datagen --sample_type news -o . -f jsonl
Data Format
{"prompt": "<prompt text> \n\n###\n\n", "completion": " <ideal generated text> END"}
{"prompt": "<prompt text> \n\n###\n\n", "completion": " <ideal generated text> END"}
{"prompt": "<prompt text> \n\n###\n\n", "completion": " <ideal generated text> END"}
...
Basic Usage
Only useful if you clone the repository
python prepare.py \
--num_samples 500 \
--max_length 2048 \
--sample_type "classification" \
--output_format "jsonl" \
--output_dir .
python prepare.py \
--num_samples 500 \
--max_length 2048 \
--sample_type "completion" \
--output_format "csv" \
--output_dir .
python prepare.py \
--num_samples 500 \
--max_length 2048 \
--sample_type "completion" \
--output_format "json" \
--output_dir /Users/<tsiameh>/Desktop
Validate Sample Data
pip install --upgrade openai
export OPENAI_API_KEY="<OPENAI_API_KEY>"
# validate sample datasets generated
openai tools fine_tunes.prepare_data -f <SAMPLE_DATA>.jsonl
openai tools fine_tunes.prepare_data -f <SAMPLE_DATA>.csv
openai tools fine_tunes.prepare_data -f <SAMPLE_DATA>.tsv
openai tools fine_tunes.prepare_data -f <SAMPLE_DATA>.json
openai tools fine_tunes.prepare_data -f <SAMPLE_DATA>.xlsx
openai tools fine_tunes.prepare_data -f /Users/<tsiameh>/Desktop/data_prepared.jsonl
# fine-tune
openai api fine_tunes.create \
-t <DATA_PREPARED>.jsonl \
-m <BASE_MODEL: davinci, curie, ada, babbage>
# List all created fine-tunes
openai api fine_tunes.list
Test Runs
# For multiclass classification
openai api fine_tunes.create \
-t <TRAIN_FILE_ID_OR_PATH> \
-v <VALIDATION_FILE_OR_PATH> \
-m <MODEL> \
--compute_classification_metrics \
--classification_n_classes <N_CLASSES>
# For binary classification
openai api fine_tunes.create \
-t <TRAIN_FILE_ID_OR_PATH> \
-v <VALIDATION_FILE_OR_PATH> \
-m <MODEL> \
--compute_classification_metrics \
--classification_n_classes 2 \
--classification_positive_class <POSITIVE_CLASS_FROM_DATASET>
Contribute
Please see CONTRIBUTING.
License
GPT3DataGen is released under the MIT License. See the bundled LICENSE file for details.
Credits
Theophilus Siameh
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for gpt3datagen-0.1.0-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 8d2ae55b75b65f37fa7b16d80c3c7a62e131cfa59ae537289102069803e2b9e1 |
|
MD5 | 516b01178050e543593c3b0a0e305c8e |
|
BLAKE2b-256 | 1917e5110f73b80d45c3dd50a02e7a62fb19b247f5370729680aa09b3e7ce721 |