Large Scale Topic based Synthetic Data Generation


Promptwright - Synthetic Dataset Generation

Model Distillation, Agent / Model Evaluations, and Statistical Research

Promptwright is a Python library designed for generating large synthetic datasets.

The library offers a flexible, easy-to-use set of interfaces for generating prompt-led synthetic datasets. This makes it suitable for a wide range of applications, from training machine learning models to creating realistic user simulations.

Features

Multiple Providers Support: Works with most LLM service providers and with local LLMs via Ollama, vLLM, etc.
Configurable Instructions and Prompts: Define custom instructions and system prompts to craft distillation methods.
YAML Configuration: Define your generation tasks in YAML configuration files, or use Promptwright as a library.
Command Line Interface: Run generation tasks directly from the command line.
Push to Hugging Face: Push the generated dataset to the Hugging Face Hub with automatic dataset cards and tags.

Topic Graphs (Experimental)

Promptwright now includes an experimental Topic Graph feature that extends beyond traditional hierarchical topic trees to support cross-connections between topics.

The Topic Graph uses a directed acyclic graph (DAG) in place of the Topic Tree. This allows more complex and realistic relationships between topics: a topic can have multiple parent topics, and the graph can be more densely connected. The feature is experimental and designed to co-exist with the current TopicTree implementation, allowing for a gradual transition and comparative analysis.
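To make the difference from a tree concrete, here is a minimal sketch of a topic DAG as a plain adjacency mapping. It is not Promptwright's internal representation: the topic names and the `parents_of` and `paths_from_root` helpers are hypothetical. It only illustrates how a node with two parents yields more than one root-to-leaf path.

```python
# Illustrative only: a topic DAG as an adjacency mapping (parent -> children).
# This is NOT Promptwright's internal data structure; all names are hypothetical.
from collections import defaultdict

edges = {
    "Machine Learning": ["Optimization", "Statistics"],
    "Optimization": ["Gradient Descent"],
    "Statistics": ["Gradient Descent"],  # second parent: not possible in a tree
    "Gradient Descent": [],
}


def parents_of(graph):
    """Invert the adjacency mapping to list each node's parents."""
    parents = defaultdict(list)
    for parent, children in graph.items():
        for child in children:
            parents[child].append(parent)
    return dict(parents)


def paths_from_root(graph, node, prefix=()):
    """Return every path from `node` down to a leaf (the graph must be acyclic)."""
    path = prefix + (node,)
    children = graph.get(node, [])
    if not children:
        return [list(path)]
    result = []
    for child in children:
        result.extend(paths_from_root(graph, child, path))
    return result


topic_parents = parents_of(edges)
topic_paths = paths_from_root(edges, "Machine Learning")
```

In this sketch "Gradient Descent" is reachable through both "Optimization" and "Statistics", so it appears at the end of two distinct paths, whereas a tree would give it exactly one.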

Usage

YAML Configuration:

# Enable graph mode
topic_generator: graph

topic_graph:
  args:
    root_prompt: "Modern Software Architecture"
    provider: "ollama"
    model: "llama3"
    temperature: 0.7
    graph_degree: 3    # Subtopics per node
    graph_depth: 3     # Graph depth
  save_as: "software_graph.json"

Programmatic Usage:

from promptwright.topic_graph import TopicGraph, TopicGraphArguments

graph = TopicGraph(
    args=TopicGraphArguments(
        root_prompt="Machine Learning Fundamentals",
        model_name="ollama/llama3",
        temperature=0.7,
        graph_degree=3,
        graph_depth=2,
    )
)

graph.build()
graph.save("ml_graph.json")

# Optional: Generate visualization
graph.visualize("ml_graph")  # Creates ml_graph.svg

Getting Started

Prerequisites

Python 3.11+
uv (for dependency management)
(Optional) Hugging Face account and API token for dataset upload

Installation

pip

You can install Promptwright using pip:

pip install promptwright

Development Installation

To install the prerequisites, you can use the following commands:

# Install uv if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install promptwright and its dependencies
git clone https://github.com/lukehinds/promptwright.git
cd promptwright
uv sync --all-extras

Usage

Promptwright offers two ways to define and run your generation tasks:

1. Using YAML Configuration (Recommended)

Create a YAML file defining your generation task:

system_prompt: "You are a helpful assistant. You provide clear and concise answers to user questions."

topic_tree:
  args:
    root_prompt: "Capital Cities of the World."
    model_system_prompt: "<system_prompt_placeholder>"
    tree_degree: 3
    tree_depth: 2
    temperature: 0.7
    model_name: "ollama/mistral:latest"
  save_as: "basic_prompt_topictree.jsonl"

data_engine:
  args:
    instructions: "Please provide training examples with questions about capital cities."
    system_prompt: "<system_prompt_placeholder>"
    model_name: "ollama/mistral:latest"
    temperature: 0.9
    max_retries: 2

dataset:
  creation:
    num_steps: 5
    batch_size: 1
    model_name: "ollama/mistral:latest"
    sys_msg: true  # Include system message in dataset (default: true)
  save_as: "basic_prompt_dataset.jsonl"

# Optional Hugging Face Hub configuration
huggingface:
  # Repository in format "username/dataset-name"
  repository: "your-username/your-dataset-name"
  # Token can also be provided via HF_TOKEN environment variable or --hf-token CLI option
  token: "your-hf-token"
  # Additional tags for the dataset (optional)
  # "promptwright" and "synthetic" tags are added automatically
  tags:
    - "promptwright-generated-dataset"
    - "geography"

Run using the CLI:

promptwright start config.yaml

The CLI supports options that override configuration values. For example, --sys-msg controls whether the system message is included in the dataset (default: true):

promptwright start config.yaml \
  --topic-tree-save-as output_tree.jsonl \
  --dataset-save-as output_dataset.jsonl \
  --model-name ollama/llama3 \
  --temperature 0.8 \
  --tree-degree 4 \
  --tree-depth 3 \
  --num-steps 10 \
  --batch-size 2 \
  --sys-msg true \
  --hf-repo username/dataset-name \
  --hf-token your-token \
  --hf-tags tag1 --hf-tags tag2
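Overrides such as --tree-degree, --tree-depth, --num-steps, and --batch-size change the size of a run. The following back-of-the-envelope sketch estimates that size under two assumptions that the text above does not confirm: that the tree is full (every node gets tree-degree subtopics at each of tree-depth levels below the root), and that each step produces batch-size samples. Both function names are hypothetical.

```python
# Rough sizing sketch. Assumptions (not confirmed by the documentation):
#   - the topic tree is full: `degree` subtopics per node, `depth` levels below the root
#   - each generation step yields `batch_size` samples


def leaf_topic_count(degree: int, depth: int) -> int:
    """Number of leaf topics in a full tree of the given degree and depth."""
    return degree ** depth


def sample_count(num_steps: int, batch_size: int) -> int:
    """Number of samples requested across all steps."""
    return num_steps * batch_size


# Values taken from the CLI example above.
leaves = leaf_topic_count(degree=4, depth=3)
samples = sample_count(num_steps=10, batch_size=2)
print(f"{leaves} leaf topics, {samples} samples requested")
```

Because the leaf count grows exponentially with depth, small increases in --tree-depth can multiply the number of model calls needed to build the tree.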

Provider Integration

Promptwright uses LiteLLM to interface with LLM providers. Specify the provider and model with the provider and model fields in your config or code:

provider: "openai"  # LLM provider
model: "gpt-4-1106-preview"  # Model name

Choose any of the providers listed in the LiteLLM documentation and follow the same naming convention.

For example, the LiteLLM convention for Google Gemini is:

from litellm import completion
import os

os.environ['GEMINI_API_KEY'] = ""
response = completion(
    model="gemini/gemini-pro", 
    messages=[{"role": "user", "content": "write code for saying hi from LiteLLM"}]
)

In Promptwright, you would specify the provider as gemini and the model as gemini-pro.

provider: "gemini"  # LLM provider
model: "gemini-pro"  # Model name

For Ollama, you would specify the provider as ollama and the model as mistral:latest, and so on.

provider: "ollama"  # LLM provider
model: "mistral:latest"  # Model name
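The provider and model pairs above correspond to the "provider/model" strings used elsewhere in this document, such as "gemini/gemini-pro" and "ollama/mistral:latest". As a minimal sketch of that mapping, the hypothetical helper below joins the two fields; it is not part of Promptwright or LiteLLM.

```python
# Illustrative helper (hypothetical, not a Promptwright or LiteLLM function):
# join a provider and a model into the "provider/model" naming convention.


def litellm_model_name(provider: str, model: str) -> str:
    """Return the combined "provider/model" string, e.g. "gemini/gemini-pro"."""
    provider = provider.strip().lower()
    model = model.strip()
    if not provider or not model:
        raise ValueError("both provider and model are required")
    return f"{provider}/{model}"


gemini_name = litellm_model_name("gemini", "gemini-pro")
ollama_name = litellm_model_name("ollama", "mistral:latest")
```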

API Keys

Set the API key for your provider in an environment variable named PROVIDER_API_KEY, where PROVIDER is the provider name in upper case. For example, for OpenAI, set OPENAI_API_KEY:

export OPENAI_API_KEY="your-api-key"

Again, refer to the LiteLLM documentation for more information on setting up the API keys.
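Before starting a long run, it can help to check that the expected variable is set. The sketch below derives the variable name from the PROVIDER_API_KEY pattern described above and looks it up in the environment. Both helpers are hypothetical, and some providers may use a different variable name, so treat the LiteLLM documentation as authoritative.

```python
# Illustrative pre-flight check (hypothetical helpers, not part of Promptwright).
# Assumption: the key lives in <PROVIDER>_API_KEY, as in the OPENAI_API_KEY example.
import os


def api_key_env_var(provider: str) -> str:
    """Derive the environment variable name for a provider."""
    return f"{provider.upper()}_API_KEY"


def has_api_key(provider: str, environ=os.environ) -> bool:
    """Return True if a non-empty key is set for the provider."""
    return bool(environ.get(api_key_env_var(provider)))


if not has_api_key("openai"):
    print(f"Warning: {api_key_env_var('openai')} is not set")
```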

Hugging Face Hub Integration

Promptwright supports automatic dataset upload to the Hugging Face Hub with the following features:

Dataset Upload: Upload your generated dataset directly to Hugging Face Hub
Dataset Cards: Automatically creates and updates dataset cards
Automatic Tags: Adds "promptwright" and "synthetic" tags automatically
Custom Tags: Support for additional custom tags
Flexible Authentication: HF token can be provided via:
- CLI option: --hf-token your-token
- Environment variable: export HF_TOKEN=your-token
- YAML configuration: huggingface.token

Example using environment variable:

export HF_TOKEN=your-token
promptwright start config.yaml --hf-repo username/dataset-name

Or pass it in as a CLI option:

promptwright start config.yaml --hf-repo username/dataset-name --hf-token your-token
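With three possible sources for the token, the lookup order matters. The sketch below shows one plausible resolution order (CLI option first, then the HF_TOKEN environment variable, then the YAML value). This precedence is an assumption for illustration, not documented Promptwright behavior, and `resolve_hf_token` is a hypothetical helper.

```python
# Illustrative only: one plausible way to resolve the Hugging Face token.
# Assumption: CLI option > HF_TOKEN environment variable > YAML `huggingface.token`.
import os


def resolve_hf_token(cli_token=None, config=None, environ=os.environ):
    """Return the first token found, or None if no source provides one."""
    if cli_token:
        return cli_token
    env_token = environ.get("HF_TOKEN")
    if env_token:
        return env_token
    return (config or {}).get("huggingface", {}).get("token")


yaml_config = {"huggingface": {"repository": "username/dataset-name", "token": "from-yaml"}}
token = resolve_hf_token(cli_token=None, config=yaml_config, environ={"HF_TOKEN": "from-env"})
```

Preferring the environment variable or the CLI option over the YAML value also keeps the token out of configuration files that may be committed to version control.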

2. Using Python Code

You can also create generation tasks programmatically using Python code. There are several examples in the examples directory that demonstrate this approach.

Example Python usage:

from promptwright import DataEngine, EngineArguments, TopicTree, TopicTreeArguments

# System prompt shared by the topic tree and the data engine
system_prompt = "You are a creative writing instructor providing writing prompts and example responses."

# Topic tree: 5 subtopics per node, 4 levels deep
tree = TopicTree(
    args=TopicTreeArguments(
        root_prompt="Creative Writing Prompts",
        model_system_prompt=system_prompt,
        tree_degree=5,
        tree_depth=4,
        temperature=0.9,
        model_name="ollama/llama3"
    )
)

# Data engine: generates samples that follow the instructions
engine = DataEngine(
    args=EngineArguments(
        instructions="Generate creative writing prompts and example responses.",
        system_prompt=system_prompt,
        model_name="ollama/llama3",
        temperature=0.9,
        max_retries=2,
        sys_msg=True,  # Include system message in dataset (default: True)
    )
)

Development

The project uses uv for dependency management. Here are some common development commands:

# Install dependencies including development dependencies
make install

# Format code
make format

# Run linting
make lint

# Run tests
make test

# Run security checks
make security

# Build the package
make build

# Run all checks and build
make all

Prompt Output Examples

With sys_msg=true (default):

{
  "messages": [
    {
      "role": "system",
      "content": "You are tasked with designing an immersive virtual reality experience that transports users to a fantastical world of wonder."
    },
    {
      "role": "user",
      "content": "Create a descriptive passage about a character discovering their hidden talents."
    },
    {
      "role": "assistant",
      "content": "As she stared at the canvas, Emma's fingers hovered above the paintbrushes, as if hesitant to unleash the colors that had been locked within her. The strokes began with bold abandon, swirling blues and greens merging into a mesmerizing dance of light and shadow. With each passing moment, she felt herself becoming the art – her very essence seeping onto the canvas like watercolors in a spring storm. The world around her melted away, leaving only the vibrant symphony of color and creation."
    }
  ]
}

With sys_msg=false:

{
  "messages": [
    {
      "role": "user",
      "content": "Create a descriptive passage about a character discovering their hidden talents."
    },
    {
      "role": "assistant",
      "content": "As she stared at the canvas, Emma's fingers hovered above the paintbrushes, as if hesitant to unleash the colors that had been locked within her. The strokes began with bold abandon, swirling blues and greens merging into a mesmerizing dance of light and shadow. With each passing moment, she felt herself becoming the art – her very essence seeping onto the canvas like watercolors in a spring storm. The world around her melted away, leaving only the vibrant symphony of color and creation."
    }
  ]
}
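The only difference between the two forms above is the leading system message. If you already have a dataset generated with sys_msg=true, a small post-processing step can produce the second form. This is a minimal sketch using only the standard library; `drop_system_messages` is a hypothetical helper, and the shortened record below merely mirrors the structure of the examples.

```python
# Illustrative post-processing (hypothetical helper, not part of Promptwright):
# convert a record with a system message into the sys_msg=false form.
import json


def drop_system_messages(record: dict) -> dict:
    """Return a copy of the record without any "system" role messages."""
    kept = [m for m in record["messages"] if m.get("role") != "system"]
    return {**record, "messages": kept}


# One JSONL line, shortened from the example above.
line = json.dumps(
    {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Create a descriptive passage."},
            {"role": "assistant", "content": "As she stared at the canvas..."},
        ]
    }
)

record = json.loads(line)
stripped = drop_system_messages(record)
roles = [m["role"] for m in stripped["messages"]]
```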

Unpredictable Behavior

The library is designed to generate synthetic data based on the prompts and instructions provided. The quality of the generated data is dependent on the quality of the prompts and the model used. The library does not guarantee the quality of the generated data.

Large Language Models can sometimes generate unpredictable or inappropriate content and the authors of this library are not responsible for the content generated by the models. We recommend reviewing the generated data before using it in any production environment.

Large Language Models may also ignore the JSON formatting requested in the prompt and generate invalid JSON. This is a known limitation of the underlying models, not of the library. Promptwright handles these errors by retrying generation and filtering out invalid JSON. The failure rate is low, and each failure is reported in a final summary.
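The retry-and-filter approach can be sketched in a few lines. This is not Promptwright's implementation: `generate_valid_samples` is a hypothetical function, and the `generate` callable stands in for a model call. It shows the general pattern of retrying up to max_retries times, keeping only output that parses as JSON, and counting failures for a summary.

```python
# Illustrative retry-and-filter loop (NOT Promptwright's actual implementation).
import json


def generate_valid_samples(generate, num_samples, max_retries=2):
    """Call `generate()` for each sample, retrying when the output is not valid JSON.

    Returns the parsed samples and the total number of failed attempts.
    """
    samples, failures = [], 0
    for _ in range(num_samples):
        for _attempt in range(max_retries + 1):
            raw = generate()
            try:
                samples.append(json.loads(raw))
                break  # valid JSON: move on to the next sample
            except json.JSONDecodeError:
                failures += 1  # invalid JSON: retry, or give up on this sample
    return samples, failures


# Stand-in for a model that sometimes returns malformed output.
responses = iter(
    ['{"messages": []}', "not json", '{"messages": [1]}', "oops", "still bad", "nope"]
)
samples, failures = generate_valid_samples(lambda: next(responses), num_samples=3, max_retries=2)
print(f"{len(samples)} valid samples, {failures} failed attempts")
```

A sample that still fails after the last retry is dropped, so the final dataset can contain fewer records than requested.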

Contributing

If something here could be improved, please open an issue or submit a pull request.

Inspiration

Promptwright was inspired by redotvideo/pluto. It started as a fork but ended up being largely a rewrite.

License

This project is licensed under the Apache 2 License. See the LICENSE file for more details.

Project details

Release history

1.5.0 (this version): Sep 11, 2025
1.4.1: Aug 29, 2025
1.3.1: Nov 25, 2024
1.2.1: Nov 25, 2024
1.1.1: Nov 24, 2024
1.0.1: Nov 8, 2024
1.0.0: Nov 8, 2024
0.1.5: Oct 27, 2024
0.1.4: Oct 27, 2024
0.1.3: Oct 27, 2024
0.1.2: Oct 26, 2024
0.1.0: Oct 26, 2024

Download files


Source Distribution

promptwright-1.5.0.tar.gz (481.8 kB)

Uploaded Sep 11, 2025 Source

Built Distribution


promptwright-1.5.0-py3-none-any.whl (39.4 kB)

Uploaded Sep 11, 2025 Python 3

File details

Details for the file promptwright-1.5.0.tar.gz.

File metadata

Download URL: promptwright-1.5.0.tar.gz
Upload date: Sep 11, 2025
Size: 481.8 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for promptwright-1.5.0.tar.gz
Algorithm	Hash digest
SHA256	`3e343725c55e4b9e3f569cb65fe49d98530f636421d6af4c8c5b9385a1344ef1`
MD5	`9c574fb96907a65656403b5fce61f1ce`
BLAKE2b-256	`1cdafc4f67250fc6c948ca36b5d0ac871f9c0310cfa24ad15774777e2e3a3d03`


Provenance

The following attestation bundles were made for promptwright-1.5.0.tar.gz:

Publisher: publish.yml on lukehinds/promptwright

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: promptwright-1.5.0.tar.gz
- Subject digest: 3e343725c55e4b9e3f569cb65fe49d98530f636421d6af4c8c5b9385a1344ef1
- Sigstore transparency entry: 500839104
- Sigstore integration time: Sep 11, 2025
Source repository:
- Permalink: lukehinds/promptwright@509f91b3ebffae0818a8d9a049160d67465984d6
- Branch / Tag: refs/tags/v1.5.0
- Owner: https://github.com/lukehinds
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@509f91b3ebffae0818a8d9a049160d67465984d6
- Trigger Event: release

File details

Details for the file promptwright-1.5.0-py3-none-any.whl.

File metadata

Download URL: promptwright-1.5.0-py3-none-any.whl
Upload date: Sep 11, 2025
Size: 39.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for promptwright-1.5.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`1a23553eba43c8d92f691ecda2a99aa7ede6bffda15608c5a0b6bd0c3134c48e`
MD5	`1fc5da429b3fbdfb2d692a0b688db0cf`
BLAKE2b-256	`59443e859974755a11b097ac705c1446855a93e82f6ebe92600ceb3b3c8ca56f`


Provenance

The following attestation bundles were made for promptwright-1.5.0-py3-none-any.whl:

Publisher: publish.yml on lukehinds/promptwright

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: promptwright-1.5.0-py3-none-any.whl
- Subject digest: 1a23553eba43c8d92f691ecda2a99aa7ede6bffda15608c5a0b6bd0c3134c48e
- Sigstore transparency entry: 500839116
- Sigstore integration time: Sep 11, 2025
Source repository:
- Permalink: lukehinds/promptwright@509f91b3ebffae0818a8d9a049160d67465984d6
- Branch / Tag: refs/tags/v1.5.0
- Owner: https://github.com/lukehinds
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@509f91b3ebffae0818a8d9a049160d67465984d6
- Trigger Event: release
