Starfish

Synthetic Data Generation Made Easy

Overview

Starfish is a Python library that helps you build synthetic data your way. It adapts to your workflow, not the other way around. By combining structured LLM outputs with efficient parallel processing, Starfish lets you define exactly how your data should look and scale from experiments to production.

⭐ Star us on GitHub if you find this project useful!

Key Features:

Structured Outputs: First-class support for structured data through JSON schemas or Pydantic models.
Model Flexibility: Use any LLM provider (local models, OpenAI, Anthropic, or your own implementation) via LiteLLM.
Dynamic Prompts: Build prompts from variables with built-in Jinja2 templates.
Easy Scaling: Transform any function to run in parallel across thousands of inputs with a single decorator.
Resilient Pipeline: Automatic retries, error handling, and job resumption, so you can pause and continue data generation at any time.
Complete Control: Share state across your pipeline and extend functionality with custom hooks.

Official Website: starfishdata.ai - We offer both self-service and managed solutions. Visit our website to explore our services or contact us for more options!

Installation

pip install starfish-core

Optional Dependencies

Starfish supports optional dependencies for specific file parsers. Install only what you need:

# Install specific parsers
pip install "starfish-core[pdf]"       # PDF support
pip install "starfish-core[docx]"      # Word document support
pip install "starfish-core[ppt]"       # PowerPoint support
pip install "starfish-core[excel]"     # Excel support
pip install "starfish-core[youtube]"   # YouTube support

# Install all parser dependencies
pip install "starfish-core[all]"

Configuration

Starfish uses environment variables for configuration. We provide a .env.template file to help you get started quickly:

# Copy the template to .env
cp .env.template .env

# Edit with your API keys and configuration
nano .env  # or use your preferred editor

The template includes settings for API keys, model configurations, and other runtime parameters.
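
As a rough illustration of what the .env file provides, the sketch below parses KEY=VALUE lines with the standard library and reports which required keys are still unset. The helper names and the OPENAI_API_KEY requirement are assumptions made for this example; check .env.template for the settings Starfish actually reads.

```python
import os
from pathlib import Path


def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks, # comments, and malformed lines."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: dict[str, str], required: list[str]) -> list[str]:
    """Return required keys that are empty in both the file values and os.environ."""
    return [k for k in required if not values.get(k) and not os.environ.get(k)]


if __name__ == "__main__":
    env_path = Path(".env")
    values = parse_env_text(env_path.read_text()) if env_path.exists() else {}
    # Hypothetical requirement list; adjust it to the providers you use.
    print("Missing:", missing_keys(values, ["OPENAI_API_KEY"]))
```

Running a check like this before a long generation job can surface a missing key early instead of partway through the run.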

Quick Start

Structured LLM - Type-Safe Outputs from Any Model

# 1. Define structured outputs with schema
from starfish import StructuredLLM
from pydantic import BaseModel

# Option A: Use Pydantic for type safety
class QnASchema(BaseModel):
    question: str
    answer: str

# Option B: Or use simple JSON schema
json_schema = [
    {'name': 'question', 'type': 'str'},
    {'name': 'answer', 'type': 'str'}, 
]

# 2. Create a structured LLM with your preferred output format
qna_llm = StructuredLLM(
    model_name="openai/gpt-4o-mini",
    prompt="Generate facts about {{city}}",
    output_schema=QnASchema  # or json_schema
)

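# 3. Get structured responses
# (top-level await works in a notebook or async REPL; in a script, call this
# from an async function and drive it with asyncio.run)
response = await qna_llm.run(city="San Francisco")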

# Access typed data
print(response.data)
# [{'question': 'What is the iconic symbol of San Francisco?',
#   'answer': 'The Golden Gate Bridge is the iconic symbol of San Francisco, completed in 1937.'}]

# Access raw API response for complete flexibility
print(response.raw)  # Full API object with function calls, reasoning tokens, etc.
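
To make the simple JSON schema format from Option B more concrete, here is a small standard-library sketch that checks a list of records against it. The `validate_records` helper and the type-name mapping are hypothetical and are not part of Starfish; the library's own validation may behave differently.

```python
# Hypothetical mapping from the schema's type names to Python types.
TYPE_MAP = {"str": str, "int": int, "float": float, "bool": bool}


def validate_records(records, schema):
    """Return a list of error strings; an empty list means every record matches."""
    errors = []
    for i, record in enumerate(records):
        for field in schema:
            name, expected = field["name"], TYPE_MAP[field["type"]]
            if name not in record:
                errors.append(f"record {i}: missing field '{name}'")
            elif not isinstance(record[name], expected):
                errors.append(f"record {i}: field '{name}' should be {field['type']}")
    return errors


json_schema = [
    {"name": "question", "type": "str"},
    {"name": "answer", "type": "str"},
]

good = [{"question": "What bridge is San Francisco known for?",
         "answer": "The Golden Gate Bridge."}]
bad = [{"question": "Incomplete record"}]

print(validate_records(good, json_schema))  # []
print(validate_records(bad, json_schema))   # ["record 0: missing field 'answer'"]
```

A Pydantic model such as QnASchema performs this kind of check, plus type coercion, when the model is instantiated.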

Data Factory - Scale Any Workflow with One Decorator

# Turn any function into a scalable data pipeline
from starfish import data_factory

# Works with any function - simple or complex workflows
@data_factory(max_concurrency=50)
async def parallel_qna_llm(city):
    # This could be any arbitrary complex workflow:
    # - Pre-processing
    # - Multiple LLM calls
    # - Post-processing
    # - Error handling
    response = await qna_llm.run(city=city)
    return response.data

# Process 100 cities with 50 concurrent workers - finishes in seconds
cities = ["San Francisco", "New York", "Tokyo", "Paris", "London"] * 20
results = parallel_qna_llm.run(city=cities)

# Dry run to test the workflow and data
results = parallel_qna_llm.dry_run(city=cities)

# Resume a job so it picks up from where it left off
results = parallel_qna_llm.resume()
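
The idea behind a concurrency limit such as max_concurrency can be sketched with the standard library alone. This is an illustration under stated assumptions, not Starfish's implementation: `run_bounded` and `fake_llm_call` are hypothetical names, and the real decorator also covers retries, dry runs, and resumption.

```python
import asyncio


async def fake_llm_call(city: str) -> dict:
    """Stand-in for a real LLM request; sleeps briefly instead of calling an API."""
    await asyncio.sleep(0.01)
    return {"city": city, "question": f"What is {city} known for?"}


async def run_bounded(func, inputs, max_concurrency: int):
    """Run func over inputs with at most max_concurrency calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(item):
        # Each call waits here until one of the max_concurrency slots is free.
        async with semaphore:
            return await func(item)

    # gather preserves input order; return_exceptions=True returns failures
    # as values instead of raising on the first one.
    return await asyncio.gather(*(guarded(i) for i in inputs), return_exceptions=True)


if __name__ == "__main__":
    cities = ["San Francisco", "New York", "Tokyo", "Paris", "London"] * 20
    results = asyncio.run(run_bounded(fake_llm_call, cities, max_concurrency=50))
    print(len(results), results[0]["city"])
```

A semaphore keeps the number of simultaneous requests under a provider's rate limit while still letting slow calls overlap.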

Examples

Check out our example notebooks for detailed walkthroughs:

Documentation

Comprehensive documentation is on the way!

Contributing

We'd love your help making Starfish better! Whether you're fixing bugs, adding features, or improving documentation, your contributions are welcome.

1. Fork the repository
2. Create your feature branch (git checkout -b feature/amazing-feature)
3. Commit your changes (git commit -m 'Add some amazing feature')
4. Push to the branch (git push origin feature/amazing-feature)
5. Open a Pull Request

Contribution guidelines coming soon!

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Contact

If you have any questions or feedback, feel free to reach out to us at founders@starfishdata.ai.

Want to discuss your use case directly? Schedule a meeting with our team.

Telemetry

Starfish collects minimal and anonymous telemetry data to help improve the library. Participation is optional and you can opt out by setting TELEMETRY_ENABLED=false in your environment variables.
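
If you would rather opt out from code than from the shell, a minimal sketch is to set the variable before the library is imported. Whether Starfish reads the variable at import time or on first use is an assumption here; setting it in .env or in your shell remains the documented route.

```python
import os

# Set the opt-out flag before importing starfish, on the assumption that
# the library reads TELEMETRY_ENABLED when it is imported or first used.
os.environ["TELEMETRY_ENABLED"] = "false"

# import starfish  # import only after the variable is set
print(os.environ["TELEMETRY_ENABLED"])  # false
```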

Citation

If you use Starfish in your research, please consider citing us!

@software{starfish,
  author = {Wendao and John and Ayush},
  title = {{Starfish: A Tool for Synthetic Data Generation}},
  year = {2025},
  url = {https://github.com/starfishdata/starfish},
}

Release history

0.1.3 (this version): May 28, 2025
0.1.2: May 8, 2025
0.1.1: Apr 25, 2025
0.1.0: Apr 17, 2025

Download files

Source Distribution

starfish_core-0.1.3.tar.gz (104.1 kB)

Uploaded May 28, 2025 Source

Built Distribution

starfish_core-0.1.3-py3-none-any.whl (139.3 kB)

Uploaded May 28, 2025 Python 3

File details

Details for the file starfish_core-0.1.3.tar.gz.

File metadata

Download URL: starfish_core-0.1.3.tar.gz
Upload date: May 28, 2025
Size: 104.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: poetry/2.1.1 CPython/3.13.2 Darwin/24.4.0

File hashes

Hashes for starfish_core-0.1.3.tar.gz
Algorithm	Hash digest
SHA256	`2b825aa681d650e253c395e9965c17b70616deb1249270f3b24090cdff02fcf9`
MD5	`d7194a0e88152ecd370e9da2790a3d85`
BLAKE2b-256	`b5602373269d679474f5a877fc7f2194bf4bc28c559b30bfb2dd54cdfe0f464b`

File details

Details for the file starfish_core-0.1.3-py3-none-any.whl.

File metadata

Download URL: starfish_core-0.1.3-py3-none-any.whl
Upload date: May 28, 2025
Size: 139.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: poetry/2.1.1 CPython/3.13.2 Darwin/24.4.0

File hashes

Hashes for starfish_core-0.1.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`1bc77b05e5c9f8d6697a5e8ae88e3106a56e9e92ae342173dc780879804c44fc`
MD5	`19ec984113543cbba6630282b228b83a`
BLAKE2b-256	`27c10443bf5152ccba8986a4c92fc02ddb8bc127033f1471780a65cf0c2fffe7`
