GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation
Table of Contents
- What is GraphGen?
- Quick Start
- Latest Updates
- Key Features
- System Architecture
- Configurations
- Roadmap
- Cost Analysis
What is GraphGen?
GraphGen is a framework for synthetic data generation guided by knowledge graphs. Here is our paper.
It begins by constructing a fine-grained knowledge graph from the source text, then identifies knowledge gaps in LLMs using the expected calibration error metric, prioritizing the generation of QA pairs that target high-value, long-tail knowledge. Furthermore, GraphGen incorporates multi-hop neighborhood sampling to capture complex relational information and employs style-controlled generation to diversify the resulting QA data.
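To make the calibration idea concrete, here is a minimal sketch of the standard binned expected calibration error (ECE). It follows the textbook definition and is not necessarily how GraphGen computes the metric internally; the function name, the equal-width binning, and the default of 10 bins are assumptions made for illustration.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Binned ECE: weighted average of |accuracy - mean confidence| per bin.

    `confidences` are model probabilities in [0, 1]; `correct` marks whether
    each corresponding answer was right. Names and defaults are illustrative.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0

    conf_sum = [0.0] * n_bins  # sum of confidences per bin
    hits = [0] * n_bins        # number of correct answers per bin
    counts = [0] * n_bins      # number of samples per bin

    for conf, ok in zip(confidences, correct):
        # Equal-width bins; a confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        conf_sum[idx] += conf
        hits[idx] += int(ok)
        counts[idx] += 1

    ece = 0.0
    for s, h, k in zip(conf_sum, hits, counts):
        if k == 0:
            continue  # empty bins contribute nothing
        accuracy = h / k
        mean_conf = s / k
        ece += (k / total) * abs(accuracy - mean_conf)
    return ece
```

A model that is confident but often wrong (or hesitant but usually right) gets a higher score, which is the kind of knowledge gap the description above says GraphGen prioritizes.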
Quick Start
Experience it on the OpenXLab Application Center
Gradio Demo
python webui/app.py
Run from PyPI
- Install GraphGen
pip install graphg
- Run in CLI
SYNTHESIZER_MODEL=your_synthesizer_model_name \
SYNTHESIZER_BASE_URL=your_base_url_for_synthesizer_model \
SYNTHESIZER_API_KEY=your_api_key_for_synthesizer_model \
TRAINEE_MODEL=your_trainee_model_name \
TRAINEE_BASE_URL=your_base_url_for_trainee_model \
TRAINEE_API_KEY=your_api_key_for_trainee_model \
graphg --output_dir cache
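If you prefer to launch the CLI from a Python script, the same invocation can be sketched with the standard library. The helper names (`build_env`, `build_command`, `run_graphg`) are hypothetical and not part of GraphGen; only the six environment variables and the `graphg --output_dir` command come from the instructions above.

```python
import os
import subprocess
from typing import Dict, List, Mapping, Optional

# Environment variables named in the CLI instructions above.
REQUIRED_VARS = (
    "SYNTHESIZER_MODEL",
    "SYNTHESIZER_BASE_URL",
    "SYNTHESIZER_API_KEY",
    "TRAINEE_MODEL",
    "TRAINEE_BASE_URL",
    "TRAINEE_API_KEY",
)


def build_env(
    settings: Mapping[str, str],
    base_env: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Return a copy of the environment with the six GraphGen variables set."""
    missing = [name for name in REQUIRED_VARS if not settings.get(name)]
    if missing:
        raise ValueError("missing settings: " + ", ".join(missing))
    env = dict(os.environ if base_env is None else base_env)
    env.update({name: settings[name] for name in REQUIRED_VARS})
    return env


def build_command(output_dir: str = "cache") -> List[str]:
    """Argument list matching the shell command shown above."""
    return ["graphg", "--output_dir", output_dir]


def run_graphg(settings: Mapping[str, str], output_dir: str = "cache") -> int:
    """Run the CLI and return its exit code (assumes `graphg` is on PATH)."""
    completed = subprocess.run(
        build_command(output_dir), env=build_env(settings), check=False
    )
    return completed.returncode
```

Calling `run_graphg` with a dictionary holding the six values is meant to mirror the shell command; it fails early with a `ValueError` if any value is missing.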
Run from Source
- Install dependencies
pip install -r requirements.txt
- Configure the environment
  - Create an .env file in the root directory
cp .env.example .env
  - Set the following environment variables:
# Synthesizer is the model used to construct the KG and generate data
SYNTHESIZER_MODEL=your_synthesizer_model_name
SYNTHESIZER_BASE_URL=your_base_url_for_synthesizer_model
SYNTHESIZER_API_KEY=your_api_key_for_synthesizer_model
# Trainee is the model to be trained on the generated data
TRAINEE_MODEL=your_trainee_model_name
TRAINEE_BASE_URL=your_base_url_for_trainee_model
TRAINEE_API_KEY=your_api_key_for_trainee_model
- (Optional) To change the default generation settings, edit configs/graphgen_config.yaml.
# configs/graphgen_config.yaml
# Example configuration
data_type: "raw"
input_file: "resources/examples/raw_demo.jsonl"
# more configurations...
- Run the generation script
bash scripts/generate.sh
- Get the generated data
ls cache/data/graphgen
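Before running the generation script, it can help to confirm that the .env file no longer contains the placeholder values. The following is a minimal sketch of such a check, not a GraphGen utility: the function names are hypothetical, the parser only handles plain KEY=VALUE lines (no `export` prefix, inline comments, or multi-line values), and treating values that start with `your_` as placeholders is an assumption based on the example above.

```python
from typing import Dict, Iterable, List, Mapping

# Variable names taken from the configuration step above.
REQUIRED_VARS = (
    "SYNTHESIZER_MODEL",
    "SYNTHESIZER_BASE_URL",
    "SYNTHESIZER_API_KEY",
    "TRAINEE_MODEL",
    "TRAINEE_BASE_URL",
    "TRAINEE_API_KEY",
)


def parse_env_lines(lines: Iterable[str]) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and one layer of matching-style quotes.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def unset_or_placeholder(values: Mapping[str, str]) -> List[str]:
    """Return required variables that are missing, empty, or still placeholders."""
    return [
        name
        for name in REQUIRED_VARS
        if not values.get(name) or values[name].startswith("your_")
    ]
```

For example, `unset_or_placeholder(parse_env_lines(open(".env", encoding="utf-8")))` returns the names that still need real values; an empty list means all six variables are set.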
System Architecture
Directory Structure
├── baselines/           # baseline methods
├── cache/               # cache files
│   ├── data/            # generated data
│   └── logs/            # log files
├── configs/             # configuration files
├── graphgen/            # GraphGen implementation
│   ├── operators/       # operators
│   └── graphgen.py      # main file
├── models/              # base classes
├── resources/           # static files and examples
├── scripts/             # scripts for running experiments
├── templates/           # prompt templates
├── utils/               # utility functions
├── webui/               # web interface
└── README.md
Workflow
Acknowledgements
- SiliconCloud: abundant LLM APIs, some models are free
- LightRAG: a simple and efficient graph retrieval solution
- ROGRAG: A Robustly Optimized GraphRAG Framework