A Python Library for Automated Generator Prompting and Dataset Generation
Project description
GeneratorPromptKit
A Python Library and Framework for Automated Generator Prompting and Dataset Generation
This is a Python library and framework for automated generator prompting and dataset generation using large language models (LLMs). Inspired by the work of Chen et al. in their paper "GenQA: Generating Millions of Instructions from a Handful of Prompts", this library provides a structured approach to creating diverse and high-quality datasets using a combination of generator prompts, topic extraction, and subtopic exploration.
Overview
The key idea behind GeneratorPromptKit is to leverage the power of LLMs to generate diverse and relevant questions and answers based on a given input domain. By iteratively extracting topics and subtopics from the domain and using carefully crafted generator prompts, the library enables the creation of large-scale datasets with minimal human intervention.
graph TD
A[Input Domain] --> B[Extract Topics]
B --> C[Iterate over Topics]
C --> D[Extract Subtopics]
D --> E[Generate Questions and Answers]
E --> F[Store Generated Dataset]
F --> G[Output Dataset]
subgraph Topic Extraction
B
end
subgraph Subtopic Extraction
D
end
subgraph Question and Answer Generation
E
end
style Topic Extraction fill:#f9d,stroke:#333,stroke-width:2px
style Subtopic Extraction fill:#dbf,stroke:#333,stroke-width:2px
style Question and Answer Generation fill:#ffd,stroke:#333,stroke-width:2px
Features
- Automated Topic and Subtopic Extraction: GeneratorPromptKit automatically extracts topics and subtopics from the input domain using LLMs, enabling efficient exploration of the domain space.
- Generator Prompts: The library provides a set of carefully designed generator prompts that encourage diversity, creativity, and relevance in the generated questions and answers.
- Customizable Prompts: Users can easily modify and extend the existing prompts or add their own prompts to suit their specific requirements.
- Randomness and Diversity: GeneratorPromptKit incorporates randomness boosters and indirect question generators to enhance the diversity of the generated dataset.
- Integration with OpenAI API: The library seamlessly integrates with the OpenAI API, allowing users to leverage their language models for dataset generation.
Installation
To install GeneratorPromptKit, simply use pip:
pip install generator-prompt-kit
Usage
Here's a basic example of how to use GeneratorPromptKit to generate a dataset:
from generator_prompt_kit import GeneratorPromptKit
# Initialize the GeneratorPromptKit
gpk = GeneratorPromptKit(api_key='your_api_key')
# Set the input domain
input_domain = "Computer Science"
# Generate the dataset
dataset = gpk.generate_dataset(input_domain, num_topics=10, num_subtopics=5, num_questions=100)
# Save the dataset to a file
dataset.save('computer_science_dataset.json')
Performance
For detailed benchmarks and experimental results, please refer to the original paper "GenQA: Generating Millions of Instructions from a Handful of Prompts" by Chen et al. GeneratorPromptKit was created as an inspiration from their work and aims to provide a practical implementation of the concepts and techniques discussed in the paper.
Cite our Work
@inproceedings{generator-prompt-kit,
title = {GeneratorPromptKit: A Python Library for Automated Generator Prompting and Dataset Generation},
author = {Aman Priyanshu, Supriti Vijay},
year = {2024},
publisher = {{GitHub}},
url = {https://github.com/AmanPriyanshu/GeneratorPromptKit}
}
References
- Chen et al. "GenQA: Generating Millions of Instructions from a Handful of Prompts". arXiv preprint arXiv:2406.10323, 2024.
Contributing
Contributions to GeneratorPromptKit are welcome! If you encounter any issues, have suggestions for improvements, or want to add new features, please open an issue or submit a pull request on the GitHub repository
License
GeneratorPromptKit is released under the MIT License.
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for generator_prompt_kit-1.0.0.tar.gz
Algorithm | Hash digest | |
---|---|---|
SHA256 | 6e0b2b6c514f836f6fdf69e6fcd4a31892fd834dd0db38ea502fc608f7df7fd2 |
|
MD5 | 937b5ce1223f1df8fb52a8ca80e78809 |
|
BLAKE2b-256 | 15f3200b81f0f08f18af212c510e792461b10491af6ddd927bf27db08842c76d |
Hashes for generator_prompt_kit-1.0.0-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 53a1208221ef770a898a92ac2914037d6dce3c335c297b2567fc2f455d945a71 |
|
MD5 | 3714965c2552e40a21371c46e430eddc |
|
BLAKE2b-256 | 571b3cb6c333c878641a478735efca8ce2b9697143160e471d7737a96d7af3ea |