GPT3DataGen
GPT3DataGen is a Python package that generates fake data for fine-tuning your OpenAI models.
(ASCII-art banner: gpt3datagen v1.0.3)
Install with pip (see the Install & Usage Guide):
pip install -U gpt3datagen
Alternatively, the following command will pull and install the latest commit from this repository, along with its Python dependencies:
pip install git+https://github.com/donwany/gpt3datagen.git --use-pep517
Or clone the repository and install from source:
git clone https://github.com/donwany/gpt3datagen.git
cd gpt3datagen
make install && pip install -e .
To update the package to the latest version of this repository, please run:
pip install --upgrade --no-deps --force-reinstall git+https://github.com/donwany/gpt3datagen.git
Command-Line Usage
Run the following to view all available options:
gpt3datagen --help
gpt3datagen --version
Output formats: jsonl, json, csv, tsv, xlsx
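To illustrate how the same prompt/completion records look in two of the supported output formats, here is a rough standard-library sketch; the records below are hypothetical, not output produced by the package itself:

```python
import csv
import io
import json

# Hypothetical records in the package's prompt/completion shape.
records = [
    {"prompt": "Classify the sentiment. \n\n###\n\n", "completion": " positive END"},
    {"prompt": "Classify the sentiment. \n\n###\n\n", "completion": " negative END"},
]

# JSONL: one JSON object per line (newlines inside fields are escaped).
jsonl_text = "\n".join(json.dumps(r) for r in records)

# CSV: the same records under a header row; the csv module quotes
# fields that contain embedded newlines.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["prompt", "completion"])
writer.writeheader()
writer.writerows(records)
csv_text = buf.getvalue()
```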
gpt3datagen \
--num_samples 500 \
--max_length 2048 \
--sample_type "classification" \
--output_format "jsonl" \
--output_dir .
gpt3datagen \
--num_samples 500 \
--max_length 2048 \
--sample_type completion \
--output_format csv \
--output_dir .
gpt3datagen \
--sample_type completion \
--output_format jsonl \
--output_dir .
gpt3datagen --sample_type completion -o . -f jsonl
gpt3datagen --sample_type news -o . -f jsonl
Data Format
{"prompt": "<prompt text> \n\n###\n\n", "completion": " <ideal generated text> END"}
{"prompt": "<prompt text> \n\n###\n\n", "completion": " <ideal generated text> END"}
{"prompt": "<prompt text> \n\n###\n\n", "completion": " <ideal generated text> END"}
...
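The separator (` \n\n###\n\n`) marks the end of each prompt, and ` END` acts as the stop sequence for each completion. A minimal sketch of building one such record (the helper function and example strings are hypothetical, not part of the package's API):

```python
import json

SEPARATOR = " \n\n###\n\n"  # marks the end of the prompt
STOP = " END"               # stop sequence appended to each completion

def make_record(prompt_text, completion_text):
    """Build one fine-tuning record in the prompt/completion format above."""
    return {
        "prompt": prompt_text + SEPARATOR,
        # Completions start with a leading space by convention.
        "completion": " " + completion_text + STOP,
    }

line = json.dumps(make_record("Translate to French: Hello", "Bonjour"))
```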
Basic Usage
These commands apply only if you have cloned the repository:
python prepare.py \
--num_samples 500 \
--max_length 2048 \
--sample_type "classification" \
--output_format "jsonl" \
--output_dir .
python prepare.py \
--num_samples 500 \
--max_length 2048 \
--sample_type "completion" \
--output_format "csv" \
--output_dir .
python prepare.py \
--num_samples 500 \
--max_length 2048 \
--sample_type "completion" \
--output_format "json" \
--output_dir /Users/<tsiameh>/Desktop
Validate Sample Data
pip install --upgrade openai
export OPENAI_API_KEY="<OPENAI_API_KEY>"
# validate the generated sample datasets
openai tools fine_tunes.prepare_data -f <SAMPLE_DATA>.jsonl
openai tools fine_tunes.prepare_data -f <SAMPLE_DATA>.csv
openai tools fine_tunes.prepare_data -f <SAMPLE_DATA>.tsv
openai tools fine_tunes.prepare_data -f <SAMPLE_DATA>.json
openai tools fine_tunes.prepare_data -f <SAMPLE_DATA>.xlsx
openai tools fine_tunes.prepare_data -f /Users/<tsiameh>/Desktop/data_prepared.jsonl
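Before handing a file to `openai tools fine_tunes.prepare_data`, a quick local sanity check can catch malformed lines early. This is a hypothetical helper, not part of gpt3datagen or the openai CLI:

```python
import json

def check_jsonl(lines):
    """Flag lines that are not valid JSON or lack the expected keys."""
    problems = []
    for i, line in enumerate(lines, start=1):
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            problems.append((i, "not valid JSON"))
            continue
        if set(rec) != {"prompt", "completion"}:
            problems.append((i, "unexpected keys"))
    return problems

sample = [
    '{"prompt": "Q: 2+2? \\n\\n###\\n\\n", "completion": " 4 END"}',
    '{"prompt": "missing completion"}',
]
issues = check_jsonl(sample)
```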
# fine-tune
openai api fine_tunes.create \
-t <DATA_PREPARED>.jsonl \
-m <BASE_MODEL: davinci, curie, ada, babbage>
# List all created fine-tunes
openai api fine_tunes.list
Test Runs
# For multiclass classification
openai api fine_tunes.create \
-t <TRAIN_FILE_ID_OR_PATH> \
-v <VALIDATION_FILE_OR_PATH> \
-m <MODEL> \
--compute_classification_metrics \
--classification_n_classes <N_CLASSES>
# For binary classification
openai api fine_tunes.create \
-t <TRAIN_FILE_ID_OR_PATH> \
-v <VALIDATION_FILE_OR_PATH> \
-m <MODEL> \
--compute_classification_metrics \
--classification_n_classes 2 \
--classification_positive_class <POSITIVE_CLASS_FROM_DATASET>
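For binary classification, the completion holds the class label, and the value passed to `--classification_positive_class` must match one of those labels exactly, including the leading space. A sketch with hypothetical examples:

```python
import json

# Hypothetical labeled examples; " positive" (with its leading space)
# would be the value passed as --classification_positive_class.
examples = [
    ("Great product, would buy again", "positive"),
    ("Arrived broken and late", "negative"),
]

records = [
    {"prompt": text + " \n\n###\n\n", "completion": " " + label}
    for text, label in examples
]
labels = {r["completion"] for r in records}
jsonl = "\n".join(json.dumps(r) for r in records)
```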
Contribute
Please see CONTRIBUTING.
License
GPT3DataGen is released under the MIT License. See the bundled LICENSE file for details.
Credits
Theophilus Siameh
File details: gpt3datagen-0.1.0.tar.gz
- Size: 469.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.1

File hashes
Algorithm | Hash digest
---|---
SHA256 | 662e0ef4e84bb360b1a47942897a90986294820cad27f70b835cd485ef1aacbd
MD5 | e6aa95b8bbc1c77c5f21d948b4450ee9
BLAKE2b-256 | 0ee5781b09c3e1d85953defb0953301e925eb97b871dd6fe5b8703f16fa5a800
File details: gpt3datagen-0.1.0-py3-none-any.whl
- Size: 471.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.1

File hashes
Algorithm | Hash digest
---|---
SHA256 | 8d2ae55b75b65f37fa7b16d80c3c7a62e131cfa59ae537289102069803e2b9e1
MD5 | 516b01178050e543593c3b0a0e305c8e
BLAKE2b-256 | 1917e5110f73b80d45c3dd50a02e7a62fb19b247f5370729680aa09b3e7ce721