
A package for generating synthetic data for LLMs


Phinity

Phinity is a synthetic data generation SDK designed to create high-quality, verifiable datasets for LLM development and evaluation.

One of the most difficult aspects of synthetic data generation at scale is diversity. Instruction-evolution methods such as WizardLM's Evol-Instruct address this by continuously creating new prompts from a user-provided seed set, "evolving" them within the domain. Think of a never-ending family tree: each generation of prompts produces new prompts with added mutations, so a handful of seeds can grow into millions of distinct instructions. Evol-Instruct is used by frontier AI labs to generate code SFT (supervised fine-tuning) datasets for LLMs.

We extend this approach to support custom domain-specific dataset generation, ensuring high-quality data that aligns with your rules and context.
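The evolution loop described above can be sketched generically. This is a minimal illustration, not Phinity's implementation: `mutate` stands in for an LLM call that rewrites a prompt according to some evolution strategy.

```python
import random

def evolve(seeds, mutate, target_samples, max_generations, rng=None):
    """Grow a pool of instructions Evol-Instruct style: each generation
    picks a parent from the pool and adds a mutated child."""
    rng = rng or random.Random(0)
    pool = list(seeds)
    for _ in range(max_generations):
        if len(pool) >= target_samples:
            break
        parent = rng.choice(pool)
        child = mutate(parent)
        if child and child not in pool:  # keep only novel offspring
            pool.append(child)
    return pool

# Toy mutation: in practice this is an LLM prompted with a strategy
# such as "deepening" or "concretizing".
pool = evolve(
    ["What is ML?"],
    lambda p: p + " Explain with an example.",
    target_samples=4,
    max_generations=10,
)
```

The real pipeline replaces the lambda with model calls and adds verification before a child enters the pool, but the tree-growth structure is the same.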

🎯 Key Features

Instruction Evolution Framework

Phinity enables structured prompt evolution with multiple built-in strategies:

  • Deepening – Makes instructions more detailed and specific.
  • Concretizing – Adds concrete examples or scenarios.
  • Reasoning – Enhances reasoning or step-by-step explanations.
  • Comparative – Transforms prompts to include comparative elements.

Users can add custom evolution strategies, define domain-specific constraints, and integrate supporting documents for more controlled instruction generation.
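A common way to express such strategies is as meta-prompt templates. The sketch below is illustrative only; the names and registry shape are hypothetical, not Phinity's API.

```python
# Hypothetical registry: each strategy maps an instruction to the
# meta-prompt an LLM would receive to produce the evolved version.
STRATEGIES = {
    "deepening": lambda i: f"Rewrite this instruction to require more detail and specificity:\n{i}",
    "concretizing": lambda i: f"Rewrite this instruction to include a concrete example or scenario:\n{i}",
    "reasoning": lambda i: f"Rewrite this instruction to demand step-by-step reasoning:\n{i}",
    "comparative": lambda i: f"Rewrite this instruction to compare two or more alternatives:\n{i}",
}

def build_evolution_prompt(strategy: str, instruction: str) -> str:
    """Look up a strategy and render the meta-prompt for one evolution step."""
    return STRATEGIES[strategy](instruction)

prompt = build_evolution_prompt("deepening", "What is machine learning?")
```

Adding a custom strategy is then just registering another template, optionally one that injects domain constraints or document excerpts.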

Document/Knowledge Base Support

  • Instruction Verification and Repair: Phinity verifies that evolved instructions remain answerable from, and relevant to, the provided source documents. An instruction repair pipeline (_repair_instruction and _simplify_instruction) detects and corrects instructions that drift from the document context. It supports both strict and partial answerability checks and integrates with vector databases such as ChromaDB.
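A partial answerability check can be approximated with a simple lexical-overlap heuristic. This is a toy stand-in for the embedding-based retrieval a vector database like ChromaDB enables; the stopword list and threshold are illustrative assumptions.

```python
import re

def answerability(instruction: str, docs: list[str], threshold: float = 0.5) -> bool:
    """Return True if enough of the instruction's content words appear in the docs.
    A 'strict' check would require every content word; this is the 'partial' variant."""
    stop = {"what", "is", "the", "a", "an", "how", "do", "does", "of", "to", "in"}
    words = {w for w in re.findall(r"[a-z]+", instruction.lower()) if w not in stop}
    if not words:
        return False
    corpus = " ".join(docs).lower()
    covered = sum(1 for w in words if w in corpus)
    return covered / len(words) >= threshold

docs = ["Machine learning is a subset of artificial intelligence."]
ok = answerability("What is machine learning?", docs)
```

Instructions that fail such a check are candidates for repair (anchoring them back to the documents) or simplification (dropping the unanswerable part).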

RAG Benchmark Generation

Phinity also supports creating multi-hop RAG benchmarks by constructing knowledge graphs from documents and generating synthetic QA pairs.
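Multi-hop question generation from a knowledge graph amounts to chaining relations across documents. The toy sketch below illustrates the idea with a dict-based graph and a fixed template; the graph shape and phrasing are illustrative, not Phinity's internals.

```python
# Toy knowledge graph: entity -> list of (relation, entity) edges,
# as might be extracted from two different documents.
GRAPH = {
    "neural networks": [("are a technique in", "machine learning")],
    "machine learning": [("is a subset of", "artificial intelligence")],
}

def two_hop_qa(graph):
    """Chain edges A -r1-> B and B -r2-> C into QA pairs whose answers
    require evidence from both hops."""
    pairs = []
    for a, edges in graph.items():
        for r1, b in edges:
            for r2, c in graph.get(b, []):
                pairs.append({
                    "question": f"How are {a} related to {c}?",
                    "answer": f"{a} {r1} {b}, and {b} {r2} {c}.",
                })
    return pairs

pairs = two_hop_qa(GRAPH)
```

Because each answer spans two edges, a retriever must surface both source passages to answer correctly, which is what makes such pairs useful as a multi-hop RAG benchmark.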

Quick Start

Installation

pip install phinitydata

Basic Usage (Step-by-Step)

Follow these steps to generate evolved instructions with Phinity:

1. Import the necessary modules

from phinitydata.testset.sft_generator import SFTGenerator
import os

2. Set up your OpenAI API key

# Option 1: Set environment variable
# export OPENAI_API_KEY='your-api-key-here'
    
# Option 2: Pass directly to generator
api_key = os.getenv("OPENAI_API_KEY") or "your-api-key-here"

3. Initialize the generator

generator = SFTGenerator(
    api_key=api_key,
    llm_model="gpt-4o-mini",  # or your preferred model
    temperature=0.7
)

4. Define your seed instructions

seed_instructions = [
    "What is machine learning?",
    "Explain how neural networks work"
]

5. Generate instruction-response pairs

results = generator.generate_evolved_instructions(
    seed_instructions=seed_instructions,
    target_samples=5,  # Number of samples to generate
    max_generations=10,  # Maximum evolution generations
    domain_context="artificial intelligence and machine learning",
    generate_responses=True,  # Set False to generate instructions only
    export_format="jsonl",
    export_path="evolved_instructions.jsonl"
)

6. Access the generated data

print("\n=== Generated Samples ===")
for i, sample in enumerate(results['samples'], 1):
    print(f"\nSample {i}:")
    print(f"Instruction: {sample['instruction']}")
    if 'response' in sample:
        print(f"Response: {sample['response'][:100]}...")  # Show first 100 chars
    print(f"Generation: {sample['metadata']['generation']}")
    print(f"Strategy: {sample['metadata']['strategy']}")

7. Print metrics

print("\n=== Generation Metrics ===")
print(f"Total generations: {results['metrics']['generations']}")
print(f"Time taken: {results['metrics']['total_time']:.2f} seconds")
print(f"Samples generated: {results['metrics']['samples_generated']}")
print(f"Samples verified: {results['metrics']['samples_verified']}")

8. Document-grounded generation (optional)

documents = [
    "Machine learning is a subset of artificial intelligence...",
    "Neural networks are composed of layers of interconnected nodes..."
]

grounded_results = generator.generate_evolved_instructions(
    seed_instructions=seed_instructions,
    target_samples=5,
    max_generations=10,
    docs=documents,  # Provide documents for grounding
    domain_context="artificial intelligence",
    export_path="grounded_instructions.jsonl"
)

Documentation and Roadmap

For comprehensive documentation, visit: https://phinity.gitbook.io/phinity/use-cases/in-domain-sft
