Groq QA automates question-answer generation from text with LLaMA 3 and Hugging Face to fine-tune large language models (LLMs).
Groq QA Generator

Q: “Would you tell me, please, which way I ought to go from here?”
A: “That depends a good deal on where you want to get to,” said the Cat.
– Alice's Adventures in Wonderland
Groq QA is a Python library that automates creating question-answer pairs from text for fine-tuning large language models (LLMs). Powered by Groq and extended with Hugging Face, it uses models such as LLaMA 3 and LLaMA 3.1 70B (70 billion parameters, with up to a 128K-token context) to generate high-quality QA pairs. The tool streamlines dataset preparation and enables easy fine-tuning in research, education, and domain-specific applications.

Note: This project is not affiliated with or endorsed by Groq, Inc.
Features

| Feature | Description |
|---|---|
| CLI | Use the CLI tool with the `groq-qa` command. |
| Python Library | Import the package directly into your own Python project and extend your code. |
| Advanced Models | Supports large models such as LLaMA 3.1 70B via the Groq API. |
| Automated QA Generation | Generate question-answer pairs from input text. |
| Hugging Face Datasets | Convert QA pairs to Hugging Face datasets, then export or upload them to Hugging Face. |
| Custom Split Ratios | Define custom train/test split ratios when generating QA pairs. |
| Prompt Templates | Flexible question generation through prompt templates. |
| Customizable Configurations | Configure via the CLI, `config.json`, or directly in Python code. |
About the Developer

Hey there! I'm Jordan, a Canadian network engineer with over a decade of experience, especially from my time in the fast-paced world of California startups. My focus has been on automating test systems aligned with company KPIs, making it easier for teams to make data-driven decisions.

Whether it's tackling tough challenges, improving codebases, or working on innovative ideas, I'm always up for the task. Let's connect on LinkedIn and make things happen!
Table of Contents

- Groq QA Generator
- Features
- About the Developer
- Quick Start
- Upgrading
- Documentation
- Using groq-qa
- Hugging Face Datasets
- Configuration
- Customizing the Configuration
- Input Data
- Models
- Python Library
- Technical Overview
- Testing Overview
- How to Contribute
- FAQ
- License
Quick Start

- Install the package via pip:

  `pip install groq-qa-generator`

- Set up the API key:

  `export GROQ_API_KEY=your_api_key_here`

- Run the `groq-qa` command with the default settings (`~/.groq_qa/config.json`):

  `groq-qa`

- View the results in `~/.groq_qa/qa_output.txt`.
Upgrading

To ensure that you have the latest features, bug fixes, and improvements, it is recommended to periodically upgrade the `groq_qa_generator` package. You can update it to the latest version by running:

`pip install --upgrade groq-qa-generator`
Documentation

You can access the full HTML documentation here: Groq QA Generator Documentation

Documentation Generation

The documentation is automatically generated using Sphinx, a documentation generation tool for Python projects. Every change made to the documentation directory (`docs/`) triggers a GitHub Actions workflow that builds the HTML files and deploys them to GitHub Pages. This ensures that the documentation stays up to date with the latest project changes.
Using groq-qa

Set Up the API Keys
First, you need to acquire both a Groq API key and a Hugging Face token.
- To get the Groq API key, sign up at the Groq website and follow their instructions for generating a key.
- To obtain the Hugging Face token, create an account on Hugging Face and generate a token in your account settings.
Setting the Environment Variables

Option 1: Using a `.env` File (Recommended)

- Create a `.env` file in your home directory (`~/`) with the following content:

  ```
  export GROQ_API_KEY=gsk_example_1234567890abcdef
  export HF_TOKEN=hf_example_1234567890abcdef
  ```

- Source the `.env` file to load the environment variables:

  `source ~/.env`
Option 2: Exporting Directly in the Terminal
Alternatively, you can export the key and token directly in your terminal.
export GROQ_API_KEY=gsk_example_1234567890abcdef
export HF_TOKEN=hf_example_1234567890abcdef
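Either way, both variables must be visible to the process that runs the generator. As a quick sanity check (a minimal sketch using only the Python standard library, with the same variable names shown above), you can confirm they are set:

```python
import os

# Both variables should be present before running groq-qa or the Python API.
for var in ("GROQ_API_KEY", "HF_TOKEN"):
    print(f"{var}: {'set' if os.environ.get(var) else 'MISSING'}")
```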
Command-Line Interface (CLI)

Once installed, the command `groq-qa` becomes available. By default, this command reads from the default configuration located at `~/.groq_qa/config.json`.

Here are a few examples of how to use the `groq-qa` command:
# Run with default config.json:
groq-qa
# Output results in JSON format:
groq-qa --json
# Run with model and temperature overrides:
groq-qa --model llama3-70b-8192 --temperature 0.9
# Run with questions override:
groq-qa --questions 1
# Run with a custom train/test split of 70% training data and JSON output:
groq-qa --split 0.7 --json
# Upload generated QA pairs to Hugging Face:
groq-qa --upload example-username/example-dataset-name
CLI Options:

- `--model`: The model used for generating QA pairs. The default is defined in `config.json` and is set to `llama3-70b-8192`.
- `--temperature`: Controls the randomness of the model's output. Lower values produce more deterministic, focused output, while higher values make the output more random. The default is `0.1`.
- `--questions`: Specifies the exact number of question-answer pairs to generate per chunk of text. For example, `1` forces the system to generate one QA pair for each chunk, regardless of chunk size or token limits.
- `--json`: If this flag is included, the output is saved in JSON format. By default, the output is stored as a plain text file (the flag defaults to `False`).
- `--upload`: Hugging Face repository path for uploading the QA dataset, for example `example-username/example-dataset-name`.
- `--split`: Fraction of the dataset to be used for training. For example, `0.8` allocates 80% of the data for training and 20% for testing.
Note: You can print the full list of available CLI options and arguments with the `--help` option:

`groq-qa --help`
Hugging Face Datasets

`groq-qa-generator` allows you to export the generated question-answer pairs in JSON format and upload them to Hugging Face as a dataset. This makes it easy to integrate the generated QA pairs into machine learning pipelines, ready for fine-tuning models.

Export and Upload:

- Generate the QA pairs and export them in JSON format using the `--json` flag:

  `groq-qa --json`

- Use the `--upload` flag to specify the Hugging Face repository where the dataset should be uploaded:

  `groq-qa --json --upload your-huggingface-username/your-dataset-repo`

- Optionally, specify a train/test split ratio using the `--split` argument. The default splits the data 80% for training and 20% for testing:

  `groq-qa --json --upload your-huggingface-username/your-dataset-repo --split 0.8`

The dataset is uploaded to Hugging Face in the `DatasetDict` format, and you can view or further process it in your Hugging Face account.
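Once the upload finishes, the QA pairs can be pulled back down with the `datasets` library like any other Hub dataset. A minimal sketch, assuming the placeholder repository name from the examples above and the standard train/test `DatasetDict` layout:

```python
from datasets import load_dataset

# Load the uploaded QA dataset from the Hugging Face Hub
# (the repository name is the placeholder used in the examples above).
qa_dataset = load_dataset("your-huggingface-username/your-dataset-repo")

print(qa_dataset)              # shows the train/test splits
print(qa_dataset["train"][0])  # inspect the first training example
```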
Configuration

When you run the `groq-qa` command for the first time, a user-specific configuration directory (`~/.groq_qa/`) is automatically created. This directory contains all the necessary configuration files and templates for customizing input, prompts, and output.

Directory Structure

```
~/.groq_qa/
├── config.json
├── data
│   ├── alices_adventures_in_wonderland.txt
│   ├── sample_input_data.txt
│   └── prompts
│       ├── sample_question.txt
│       └── system_prompt.txt
├── qa_output_training.json
├── qa_output_test.json
├── qa_output_training.txt
└── qa_output_test.txt
```
Default `config.json`

```json
{
    "system_prompt": "system_prompt.txt",
    "sample_question": "sample_question.txt",
    "input_data": "sample_input_data.txt",
    "output_file": "qa_output.txt",
    "split_ratio": 0.8,
    "huggingface_repo": "username/dataset",
    "model": "llama3-70b-8192",
    "chunk_size": 512,
    "tokens_per_question": 60,
    "temperature": 0.1,
    "max_tokens": 1024
}
```
Customizing the Configuration

The `~/.groq_qa` directory contains essential files that can be customized to suit your specific needs. This directory includes the following components:

- `config.json`: The main configuration file where you set the parameters for the QA generation process. You can customize settings such as:
  - system_prompt: Path to a custom system prompt file that defines how the model should behave.
  - sample_question: Path to a custom sample question file that helps guide question generation.
  - input_data: Path to your own text file from which to generate question-answer pairs.
  - output_file: Path where the generated QA pairs will be saved.
  - split_ratio: Fraction of the dataset to use for training. The default is `0.8` (80% training, 20% testing).
  - huggingface_repo: Hugging Face repository where the dataset will be uploaded.

Other configurable options include:

- model: The model used for generating QA pairs (e.g., `llama3-70b-8192`).
- chunk_size: The number of tokens in each text chunk (e.g., `512`).
- tokens_per_question: The number of tokens allocated for each question (e.g., `60`).
- temperature: Controls the randomness of the model's output (e.g., `0.1`).
- max_tokens: The maximum number of tokens the model can generate in a response (e.g., `1024`).
By adjusting these files and settings, you can create a personalized environment for generating question-answer pairs that align with your specific use case.
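If you prefer to adjust these settings from a script rather than by hand, the file can be edited with the standard library. A minimal sketch, assuming the default `~/.groq_qa/config.json` location described above and touching only keys that appear in the default config:

```python
import json
from pathlib import Path

config_path = Path.home() / ".groq_qa" / "config.json"

# Read the configuration created on the first run of groq-qa.
config = json.loads(config_path.read_text())

# Point the generator at your own input text and raise the temperature slightly.
config["input_data"] = "my_input_data.txt"
config["temperature"] = 0.3

config_path.write_text(json.dumps(config, indent=4))
```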
Input Data

This project uses text data from Alice's Adventures in Wonderland by Lewis Carroll, sourced from Project Gutenberg. The full text is available in the included `data/alices_adventures_in_wonderland.txt` file.

Sample Input Data

For demonstration purposes, a smaller sample of the full text is included in `data/sample_input_data.txt`. This file contains a portion of the main text, used to quickly test and generate question-answer pairs without processing the entire book.
Models

`groq_qa_generator` currently supports the following models via the Groq API:

| Model Name | Model ID | Developer | Context Window | Description | Documentation |
|---|---|---|---|---|---|
| LLaMA 3 70B | `llama3-70b-8192` | Meta | 8,192 tokens | A large language model with 70 billion parameters, suitable for high-quality QA pair generation. | Model Card |
| LLaMA 3.1 70B | `llama-3.1-70b-versatile` | Meta | 128K tokens | A versatile large language model suitable for diverse applications. | Model Card |

Note: For optimal QA pair generation, it is recommended to use a larger model such as a 70B variant, since the extra capacity helps ensure higher-quality output. See Groq's supported models documentation for all options.
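For reference, these are the same model IDs you would pass to the Groq chat completions API directly. A minimal sketch using the official `groq` Python client; the prompt and parameter values are illustrative and are not the prompts the library uses internally:

```python
import os
from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

# Ask the model for a single QA pair about a short passage (illustrative prompt only).
response = client.chat.completions.create(
    model="llama3-70b-8192",
    temperature=0.1,
    max_tokens=1024,
    messages=[
        {"role": "system", "content": "You generate question-answer pairs from text."},
        {"role": "user", "content": "Generate one QA pair from: 'Alice was beginning to get very tired...'"},
    ],
)

print(response.choices[0].message.content)
```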
Python Library

In addition to CLI usage, `groq_qa_generator` can be used directly in your Python project. Below is an example of how to configure and execute the question-answer generation process using a custom configuration:

Example

```python
# main.py
from groq_qa_generator import groq_qa


def main():
    # Define a custom configuration
    custom_config = {
        "system_prompt": "custom_system_prompt.txt",
        "sample_question": "custom_sample_question.txt",
        "input_data": "custom_input_data.txt",
        "output_file": "qa_output.txt",
        "split_ratio": 0.8,
        "huggingface_repo": "username/dataset",
        "model": "llama3-70b-8192",
        "chunk_size": 512,
        "tokens_per_question": 60,
        "temperature": 0.1,
        "max_tokens": 1024,
    }

    # Generate question-answer pairs
    train_dataset, test_dataset = groq_qa.generate(custom_config)

    # Print both train and test datasets
    print(f"Train Dataset: {train_dataset}"
          f"\n\nTest Dataset: {test_dataset}")


if __name__ == "__main__":
    main()
```

This makes it easy to integrate the functionality into any Python application.
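The Hugging Face Datasets section above notes that the generated QA pairs are converted into Hugging Face datasets. Assuming the two returned objects follow that interface, a short follow-up to the example (reusing the `custom_config` dictionary defined above) could persist both splits for a later fine-tuning run:

```python
from groq_qa_generator import groq_qa

train_dataset, test_dataset = groq_qa.generate(custom_config)  # custom_config as defined above

# Persist both splits locally so a fine-tuning job can load them later
# with datasets.load_from_disk(). Assumes the returned objects are
# Hugging Face Dataset instances.
train_dataset.save_to_disk("qa_train")
test_dataset.save_to_disk("qa_test")
```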
Technical Overview

- API Interaction:
  - API Keys: API keys and security tokens are retrieved from environment variables to ensure safe access to the Groq and Hugging Face APIs.
  - Groq API: A Groq client is initialized to enable communication, providing access to powerful models like LLaMA 70B.
  - Hugging Face API: The Hugging Face Hub is used to upload the generated QA datasets directly to a repository, making it easy to integrate the datasets into the Hugging Face ecosystem for model fine-tuning and further dataset management.

- Text Processing:
  - Loading Prompts and Questions: The library includes methods to load sample questions and system prompts from specified file paths. These prompts are essential for guiding LLaMA 70B's responses.
  - Generating Full Prompts: The system prompt and sample question are combined into a complete prompt for the Groq API.

- QA Pair Generation: The core process takes a list of text chunks and generates question-answer pairs by prompting LLaMA 70B through the Groq API (see the sketch after this list). It:
  - Loads the system prompt and sample question.
  - Iterates through each text chunk, creating a full prompt for LLaMA 70B.
  - Retrieves the completion from the Groq API, and in turn from the model.
  - Streams the completion response and converts it into question-answer pairs.
  - Splits the QA pairs into separate train and test files based on the split ratio (default 80% train, 20% test).
  - Converts the generated QA pairs into a Hugging Face dataset and optionally uploads it to a repository.
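The sketch below illustrates the overall flow described above (chunk, prompt, complete, parse, split). It is a simplified illustration rather than the library's actual internals: the helper name, the expected "Question:/Answer:" output format, and the non-streaming call are all assumptions.

```python
import os
import random

from groq import Groq


def generate_qa_pairs(chunks, system_prompt, sample_question,
                      model="llama3-70b-8192", temperature=0.1, max_tokens=1024):
    """Illustrative pipeline: prompt the model once per chunk and parse QA pairs."""
    client = Groq(api_key=os.environ["GROQ_API_KEY"])
    pairs = []
    for chunk in chunks:
        # Combine the sample question and the text chunk into the user prompt.
        user_prompt = f"{sample_question}\n\nText:\n{chunk}"
        response = client.chat.completions.create(
            model=model,
            temperature=temperature,
            max_tokens=max_tokens,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt},
            ],
        )
        text = response.choices[0].message.content
        # Assume the model answers in "Question: ... / Answer: ..." lines.
        question, answer = None, None
        for line in text.splitlines():
            if line.startswith("Question:"):
                question = line.removeprefix("Question:").strip()
            elif line.startswith("Answer:"):
                answer = line.removeprefix("Answer:").strip()
        if question and answer:
            pairs.append({"question": question, "answer": answer})

    # Split into train/test using the default 80/20 ratio.
    random.shuffle(pairs)
    split = int(len(pairs) * 0.8)
    return pairs[:split], pairs[split:]
```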
Testing Overview

`groq-qa-generator` includes comprehensive tests that ensure its core functionality is reliable and efficient:

- API Interactions: Tests like `test_groq_api.py` mock API calls (e.g., `get_groq_client()`) to verify that external APIs such as Groq and Hugging Face are used correctly, ensuring robust behavior across environments. A minimal sketch of this mocking style follows the list.
- Configuration Handling: Tests in `test_config.py` check proper loading of user-defined and default settings, ensuring flexibility and consistency across deployments.
- Tokenization & Data Processing: `test_tokenizer.py` and `test_dataset_processor.py` verify accurate text tokenization and the conversion of QA pairs into usable datasets, which is crucial for generating high-quality outputs.
- QA Generation: Core tests in `test_qa_generation.py` ensure reliable generation of question-answer pairs, the main functionality of the library.
- Logging: `test_logging_setup.py` ensures proper logging configuration, aiding debugging and performance tracking.
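As a rough illustration of that mocking approach (not the project's actual test code; the helper function below is hypothetical), a test can stub out the Groq client so no real API call is made:

```python
from unittest.mock import MagicMock


# Hypothetical helper under test; the real tests live in files such as test_groq_api.py.
def ask_model(client, prompt):
    response = client.chat.completions.create(
        model="llama3-70b-8192",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


def test_ask_model_returns_mocked_completion():
    # Stand-in for the Groq client: no network access, canned response.
    fake_client = MagicMock()
    fake_client.chat.completions.create.return_value.choices = [
        MagicMock(message=MagicMock(content="Question: ...?\nAnswer: ..."))
    ]
    assert "Answer" in ask_model(fake_client, "Generate one QA pair.")
```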
Running Tests

To run the project's tests, you can use Poetry and `pytest`. Follow these steps:

- Install Poetry: If you haven't already, install Poetry using pip.

  `pip install poetry`

- Install Dependencies: Navigate to the project directory and install the dependencies.

  ```
  cd groq-qa-generator
  poetry install
  ```

- Confirm the Environment: Verify that the virtual environment has been correctly set up and activated.

  `poetry shell`

- Run Tests: Use pytest to run the tests.

  `poetry run pytest`
How to Contribute

- Fork the Repository: Click the "Fork" button at the top-right of the repository page to create your copy.

- Clone Your Fork: Clone the forked repository to your local machine.

  `git clone https://github.com/your-username/groq-qa-generator.git`

- Create a Branch: Use a descriptive name for your branch to clearly indicate the purpose of your changes. This helps maintain organization and clarity in the project.

  `git checkout -b feature/your-feature-name`

- Set Up the Environment: Use Poetry to install the project dependencies.

  ```
  cd groq-qa-generator
  poetry install
  ```

- Confirm the Environment: Verify that the virtual environment has been correctly set up and activated.

  `poetry shell`

- List Installed Packages: Ensure that the dependencies have been installed correctly by listing the installed packages.

  `poetry show`

- Pick an Existing Issue or Suggest a New One:
  - Check out the issues page to see if there's an open issue you'd like to work on. If you find one, just drop a comment to let everyone know you're taking it on.
  - If you don't see anything related to what you're working on, feel free to create a new issue describing the bug, feature, or improvement you have in mind.

- Commit and Push: After making your changes, commit them with a clear message and push your branch to your forked repository.

  ```
  git add .
  git commit -m "Add a concise description of your changes"
  git push origin feature/your-feature-name
  ```
FAQ

Where can I find the generated QA pairs?

The generated QA pairs are saved to the `output_file` defined in your `config.json` file. By default, the output is saved to `qa_output.txt`, located in the `.groq_qa` folder in your home directory (`~/.groq_qa/qa_output.txt`).

To change the output file name, edit the `output_file` field in your `config.json` file:

```json
{
    "output_file": "qa_custom_output.txt"
}
```
Can I modify the sample question or system prompt templates?

Yes, both the system prompt and the sample question can be modified. These templates are located in the prompts directory inside the `~/.groq_qa/` folder:

- `system_prompt.txt`: Defines how the model should behave and guides the generation process.
- `sample_question.txt`: Defines how sample questions should be structured.

Feel free to edit these templates to suit your needs.
How do I override default configuration settings?

You can override the default configuration settings in two ways:

- Edit the `config.json` file located in the `~/.groq_qa/` directory.
- Pass command-line arguments to override specific settings, for example:

  `groq-qa --model llama3-70b-8192 --temperature 0.9 --json`
How can I increase the randomness of the output?

Increase the `temperature` value in the configuration or pass it as a command-line argument (e.g., `--temperature 0.9`).
How do I upload my QA pairs to Hugging Face?

The tool allows you to upload QA pairs directly to Hugging Face. You can configure the Hugging Face repository in the `config.json` file under the `huggingface_repo` field, or pass it as a command-line argument using the `--upload` option.
How can I split the dataset into training and test sets?

You can split your QA pairs dataset using the Hugging Face integration by specifying a split ratio in the configuration file or by passing the `--split` argument on the command line. For example, to use 80% of the data for training and 20% for testing, use `--split 0.8`.
Can I use this tool in a larger Python project?

Yes, `groq_qa_generator` can be used as a Python library within your project. Simply import the module and configure it to generate QA pairs programmatically.
How can I contribute to the project?

If you'd like to contribute, feel free to browse the issues page to find something to work on or propose a new issue. Fork the repository, create a new branch, and submit a pull request once your changes are ready!
License

This project is licensed under the MIT License; see the LICENSE file for details.