
A package for generating synthetic data for LLMs


Phinity

Phinity is a synthetic data generation SDK designed to create high-quality, verifiable datasets for LLM development and evaluation.

One of the most difficult aspects of synthetic data generation at scale is diversity. Instruction-evolution methods such as WizardLM's Evol-Instruct address this by continuously creating new prompts from a user-provided seed set, "evolving" them within the domain. Think of a never-ending family tree: each generation of prompts produces new prompts with added mutations, so a handful of seeds can grow into millions of distinct instructions. Evol-Instruct is used by frontier AI labs to generate code SFT (supervised fine-tuning) datasets for LLMs.

We extend this approach to support custom domain-specific dataset generation, ensuring high-quality data that aligns with your rules and context.
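The evolution loop described above can be sketched generically. This is a minimal illustration, not Phinity's implementation: `mutate` stands in for an LLM call that rewrites a prompt according to some evolution strategy.

```python
import random

def evolve(seeds, mutate, target_samples, max_generations, rng=None):
    """Grow a pool of instructions Evol-Instruct style: each generation
    picks a parent from the pool and adds a mutated child."""
    rng = rng or random.Random(0)
    pool = list(seeds)
    for _ in range(max_generations):
        if len(pool) >= target_samples:
            break
        parent = rng.choice(pool)
        child = mutate(parent)
        if child and child not in pool:  # keep only novel offspring
            pool.append(child)
    return pool

# Toy mutation: in practice this is an LLM prompted with a strategy
# such as "deepening" or "concretizing".
pool = evolve(
    ["What is ML?"],
    lambda p: p + " Explain with an example.",
    target_samples=4,
    max_generations=10,
)
```

The real pipeline replaces the lambda with model calls and adds verification before a child enters the pool, but the tree-growth structure is the same.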

🎯 Key Features

Instruction Evolution Framework

Phinity enables structured prompt evolution with multiple built-in strategies:

  • Deepening – Makes instructions more detailed and specific.
  • Concretizing – Adds concrete examples or scenarios.
  • Reasoning – Enhances reasoning or step-by-step explanations.
  • Comparative – Transforms prompts to include comparative elements.

Users can add custom evolution strategies, define domain-specific constraints, and integrate supporting documents for more controlled instruction generation.
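A common way to express such strategies is as meta-prompt templates. The sketch below is illustrative only; the names and registry shape are hypothetical, not Phinity's API.

```python
# Hypothetical registry: each strategy maps an instruction to the
# meta-prompt an LLM would receive to produce the evolved version.
STRATEGIES = {
    "deepening": lambda i: f"Rewrite this instruction to require more detail and specificity:\n{i}",
    "concretizing": lambda i: f"Rewrite this instruction to include a concrete example or scenario:\n{i}",
    "reasoning": lambda i: f"Rewrite this instruction to demand step-by-step reasoning:\n{i}",
    "comparative": lambda i: f"Rewrite this instruction to compare two or more alternatives:\n{i}",
}

def build_evolution_prompt(strategy: str, instruction: str) -> str:
    """Look up a strategy and render the meta-prompt for one evolution step."""
    return STRATEGIES[strategy](instruction)

prompt = build_evolution_prompt("deepening", "What is machine learning?")
```

Adding a custom strategy is then just registering another template, optionally one that injects domain constraints or document excerpts.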

Document/Knowledge Base Support

  • Instruction Verification and Repair: Phinity verifies that evolved instructions remain answerable from, and relevant to, the provided source documents. An instruction repair pipeline (_repair_instruction and _simplify_instruction) detects and corrects instructions that drift from the document context. It supports both strict and partial answerability checks and integrates with vector databases such as ChromaDB.
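A partial answerability check can be approximated with a simple lexical-overlap heuristic. This is a toy stand-in for the embedding-based retrieval a vector database like ChromaDB enables; the stopword list and threshold are illustrative assumptions.

```python
import re

def answerability(instruction: str, docs: list[str], threshold: float = 0.5) -> bool:
    """Return True if enough of the instruction's content words appear in the docs.
    A 'strict' check would require every content word; this is the 'partial' variant."""
    stop = {"what", "is", "the", "a", "an", "how", "do", "does", "of", "to", "in"}
    words = {w for w in re.findall(r"[a-z]+", instruction.lower()) if w not in stop}
    if not words:
        return False
    corpus = " ".join(docs).lower()
    covered = sum(1 for w in words if w in corpus)
    return covered / len(words) >= threshold

docs = ["Machine learning is a subset of artificial intelligence."]
ok = answerability("What is machine learning?", docs)
```

Instructions that fail such a check are candidates for repair (anchoring them back to the documents) or simplification (dropping the unanswerable part).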

RAG Benchmark Generation

Phinity also supports creating multi-hop RAG benchmarks by constructing knowledge graphs from documents and generating synthetic QA pairs.
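Multi-hop question generation from a knowledge graph amounts to chaining relations across documents. The toy sketch below illustrates the idea with a dict-based graph and a fixed template; the graph shape and phrasing are illustrative, not Phinity's internals.

```python
# Toy knowledge graph: entity -> list of (relation, entity) edges,
# as might be extracted from two different documents.
GRAPH = {
    "neural networks": [("are a technique in", "machine learning")],
    "machine learning": [("is a subset of", "artificial intelligence")],
}

def two_hop_qa(graph):
    """Chain edges A -r1-> B and B -r2-> C into QA pairs whose answers
    require evidence from both hops."""
    pairs = []
    for a, edges in graph.items():
        for r1, b in edges:
            for r2, c in graph.get(b, []):
                pairs.append({
                    "question": f"How are {a} related to {c}?",
                    "answer": f"{a} {r1} {b}, and {b} {r2} {c}.",
                })
    return pairs

pairs = two_hop_qa(GRAPH)
```

Because each answer spans two edges, a retriever must surface both source passages to answer correctly, which is what makes such pairs useful as a multi-hop RAG benchmark.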

Quick Start

Installation

pip install phinitydata

Basic Usage (Step-by-Step)

Follow these steps to generate evolved instructions with Phinity:

1. Import the necessary modules

from phinitydata.testset.sft_generator import SFTGenerator
import os

2. Set up your OpenAI API key

# Option 1: Set environment variable
# export OPENAI_API_KEY='your-api-key-here'
    
# Option 2: Pass directly to generator
api_key = os.getenv("OPENAI_API_KEY") or "your-api-key-here"

3. Initialize the generator

generator = SFTGenerator(
    api_key=api_key,
    llm_model="gpt-4o-mini",  # or your preferred model
    temperature=0.7
)

4. Define your seed instructions

seed_instructions = [
    "What is machine learning?",
    "Explain how neural networks work"
]

5. Generate instruction-response pairs

results = generator.generate_evolved_instructions(
    seed_instructions=seed_instructions,
    target_samples=5,  # Number of samples to generate
    max_generations=10,  # Maximum evolution generations
    domain_context="artificial intelligence and machine learning",
    generate_responses=True,  # Set False to generate instructions only
    export_format="jsonl",
    export_path="evolved_instructions.jsonl"
)

6. Access the generated data

print("\n=== Generated Samples ===")
for i, sample in enumerate(results['samples'], 1):
    print(f"\nSample {i}:")
    print(f"Instruction: {sample['instruction']}")
    if 'response' in sample:
        print(f"Response: {sample['response'][:100]}...")  # Show first 100 chars
    print(f"Generation: {sample['metadata']['generation']}")
    print(f"Strategy: {sample['metadata']['strategy']}")

7. Print metrics

print("\n=== Generation Metrics ===")
print(f"Total generations: {results['metrics']['generations']}")
print(f"Time taken: {results['metrics']['total_time']:.2f} seconds")
print(f"Samples generated: {results['metrics']['samples_generated']}")
print(f"Samples verified: {results['metrics']['samples_verified']}")

8. Document-grounded generation (optional)

documents = [
    "Machine learning is a subset of artificial intelligence...",
    "Neural networks are composed of layers of interconnected nodes..."
]

grounded_results = generator.generate_evolved_instructions(
    seed_instructions=seed_instructions,
    target_samples=5,
    max_generations=10,
    docs=documents,  # Provide documents for grounding
    domain_context="artificial intelligence",
    export_path="grounded_instructions.jsonl"
)

Documentation and Roadmap

For comprehensive documentation, visit: https://phinity.gitbook.io/phinity/use-cases/in-domain-sft
