
GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation



📝 What is GraphGen?

GraphGen is a framework for synthetic data generation guided by knowledge graphs. See our paper for details.

It first constructs a fine-grained knowledge graph from the source text, then identifies knowledge gaps in the LLM using the expected calibration error (ECE) metric, and prioritizes generating QA pairs that target high-value, long-tail knowledge. GraphGen also uses multi-hop neighborhood sampling to capture complex relational information and style-controlled generation to diversify the resulting QA data.
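
For readers unfamiliar with the metric, expected calibration error compares a model's stated confidence with its observed accuracy. The sketch below uses the common equal-width binning formulation. It is an illustration only, not GraphGen's implementation; the function name, the bin count, and the sample values are assumptions.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Equal-width-bin ECE: sum over bins of (bin weight) * |accuracy - mean confidence|."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        return 0.0

    # Each bin holds [sum of confidences, number correct, count].
    bins = [[0.0, 0, 0] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        idx = min(max(int(conf * n_bins), 0), n_bins - 1)
        bins[idx][0] += conf
        bins[idx][1] += int(ok)
        bins[idx][2] += 1

    total = len(confidences)
    ece = 0.0
    for conf_sum, n_correct, count in bins:
        if count == 0:
            continue
        gap = abs(n_correct / count - conf_sum / count)
        ece += (count / total) * gap
    return ece


# Hypothetical data: the model is overconfident on the last two statements.
sample_confidences = [0.95, 0.9, 0.9, 0.6]
sample_correct = [True, True, False, False]
print(expected_calibration_error(sample_confidences, sample_correct))
```

A larger value means the model's confidence is a poorer guide to whether it actually knows a fact, which is the kind of gap the generated QA pairs are meant to target.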

🚀 Quick Start

Experience it on the OpenXLab Application Center

Gradio Demo

python webui/app.py


Run from PyPI

  1. Install GraphGen

    pip install graphg
    
  2. Run in CLI

    SYNTHESIZER_MODEL=your_synthesizer_model_name \
    SYNTHESIZER_BASE_URL=your_base_url_for_synthesizer_model \
    SYNTHESIZER_API_KEY=your_api_key_for_synthesizer_model \
    TRAINEE_MODEL=your_trainee_model_name \
    TRAINEE_BASE_URL=your_base_url_for_trainee_model \
    TRAINEE_API_KEY=your_api_key_for_trainee_model \
    graphg --output_dir cache
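
The same invocation can be scripted from Python. The sketch below builds the environment and argument list for the `graphg` command shown above and passes them to `subprocess.run`. Only the six variable names and the `--output_dir` flag come from that command; the helper names `build_graphg_call` and `run_graphg` are hypothetical.

```python
import os
import subprocess
from typing import Dict, List, Mapping, Tuple

# Variable names taken from the CLI example above.
REQUIRED_VARS = (
    "SYNTHESIZER_MODEL",
    "SYNTHESIZER_BASE_URL",
    "SYNTHESIZER_API_KEY",
    "TRAINEE_MODEL",
    "TRAINEE_BASE_URL",
    "TRAINEE_API_KEY",
)


def build_graphg_call(
    settings: Mapping[str, str], output_dir: str = "cache"
) -> Tuple[List[str], Dict[str, str]]:
    """Return (argv, env) for the graphg command, or raise if a setting is missing."""
    missing = [name for name in REQUIRED_VARS if not settings.get(name)]
    if missing:
        raise ValueError("missing settings: " + ", ".join(missing))
    # Start from the current environment so PATH and similar variables survive.
    env = dict(os.environ)
    env.update({name: settings[name] for name in REQUIRED_VARS})
    return ["graphg", "--output_dir", output_dir], env


def run_graphg(settings: Mapping[str, str], output_dir: str = "cache") -> int:
    """Run graphg and return its exit code (requires the package to be installed)."""
    argv, env = build_graphg_call(settings, output_dir)
    return subprocess.run(argv, env=env, check=False).returncode


# Placeholder values only; this prints the command without running it.
demo_settings = {name: "placeholder" for name in REQUIRED_VARS}
demo_argv, _ = build_graphg_call(demo_settings)
print(" ".join(demo_argv))
```

Keeping command construction separate from execution makes it possible to check the settings before any API calls are made.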
    

Run from Source

  1. Install dependencies
    pip install -r requirements.txt
    
  2. Configure the environment
    • Create an .env file in the root directory
      cp .env.example .env
      
    • Set the following environment variables:
      # Synthesizer is the model used to construct KG and generate data
      SYNTHESIZER_MODEL=your_synthesizer_model_name
      SYNTHESIZER_BASE_URL=your_base_url_for_synthesizer_model
      SYNTHESIZER_API_KEY=your_api_key_for_synthesizer_model
      # Trainee is the model used to train with the generated data
      TRAINEE_MODEL=your_trainee_model_name
      TRAINEE_BASE_URL=your_base_url_for_trainee_model
      TRAINEE_API_KEY=your_api_key_for_trainee_model
      
  3. (Optional) To change the default generation settings, edit configs/graphgen_config.yaml.
    # configs/graphgen_config.yaml
    # Example configuration
    data_type: "raw"
    input_file: "resources/examples/raw_demo.jsonl"
    # more configurations...
    
  4. Run the generation script
    bash scripts/generate.sh
    
  5. Get the generated data
    ls cache/data/graphgen
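
Before running the generation script, it can help to confirm that the .env file defines every variable listed in step 2. The following is a minimal sketch using only the standard library. It assumes plain KEY=VALUE lines and # comments, which may differ from how GraphGen itself loads the file, and the function names are hypothetical.

```python
from pathlib import Path
from typing import Dict, List

# Variable names taken from step 2 above.
REQUIRED_VARS = (
    "SYNTHESIZER_MODEL",
    "SYNTHESIZER_BASE_URL",
    "SYNTHESIZER_API_KEY",
    "TRAINEE_MODEL",
    "TRAINEE_BASE_URL",
    "TRAINEE_API_KEY",
)


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_vars(values: Dict[str, str]) -> List[str]:
    """Return the required variable names that are absent or empty."""
    return [name for name in REQUIRED_VARS if not values.get(name)]


def check_env_file(path: str = ".env") -> List[str]:
    """Read an .env file from disk and report missing variables."""
    return missing_vars(parse_env_text(Path(path).read_text(encoding="utf-8")))


# Example with only the synthesizer configured; the trainee variables are reported.
example = """
# Synthesizer settings
SYNTHESIZER_MODEL=your_synthesizer_model_name
SYNTHESIZER_BASE_URL=your_base_url_for_synthesizer_model
SYNTHESIZER_API_KEY=your_api_key_for_synthesizer_model
"""
print(missing_vars(parse_env_text(example)))
```

Calling `check_env_file()` from the repository root returns an empty list when all six variables are set. Note that the check does not detect placeholder values left over from .env.example.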
    

🏗️ System Architecture

Directory Structure

├── baselines/           # baseline methods
├── cache/               # cache files
│   ├── data/            # generated data
│   ├── logs/            # log files
├── configs/             # configuration files
├── graphgen/            # GraphGen implementation
│   ├── operators/       # operators
│   ├── graphgen.py      # main file
├── models/              # base classes
├── resources/           # static files and examples
├── scripts/             # scripts for running experiments
├── templates/           # prompt templates
├── utils/               # utility functions
├── webui/               # web interface
└── README.md

Workflow

[Figure: GraphGen workflow diagram]

🍀 Acknowledgements

  • SiliconCloud: abundant LLM APIs, some models are free
  • LightRAG: simple and efficient graph retrieval solution
  • ROGRAG: A Robustly Optimized GraphRAG Framework
