Optimize Pydantic model field descriptions using DSPy

DSPydantic

Automatically optimize Pydantic model field descriptions and prompts using DSPy. Get better structured data extraction from LLMs with less manual tuning.

What is DSPydantic?

When building LLM applications that extract structured data, the quality of the field descriptions and prompts strongly affects the results. Instead of tweaking descriptions by hand, DSPydantic uses DSPy to search for descriptions and prompts that score better on the examples you provide.

Quick Start

from pydantic import BaseModel, Field
from dspydantic import PydanticOptimizer, Example, create_optimized_model

# Define your Pydantic model
class User(BaseModel):
    name: str = Field(description="User name")
    age: int = Field(description="User age")
    email: str = Field(description="Email address")

# Provide examples with text input and expected Pydantic models
examples = [
    Example(
        text="John Doe, 30 years old, john@example.com",
        expected_output=User(name="John Doe", age=30, email="john@example.com")
    ),
    Example(
        text="Jane Smith, 25, jane.smith@email.com",
        expected_output=User(name="Jane Smith", age=25, email="jane.smith@email.com")
    ),
]

# Optimize field descriptions
optimizer = PydanticOptimizer(
    model=User,
    examples=examples,
    evaluate_fn="exact",  # Built-in exact matching
    model_id="gpt-4o",
    api_key="your-api-key",  # Or set OPENAI_API_KEY env var
)

result = optimizer.optimize()

# View optimized descriptions
print("Optimized descriptions:")
for field, description in result.optimized_descriptions.items():
    print(f"  {field}: {description}")

# Create optimized model with updated descriptions
OptimizedUser = create_optimized_model(User, result.optimized_descriptions)

# Use the optimized model directly
user = OptimizedUser(name="John Doe", age=30, email="john@example.com")
print(f"Model schema: {OptimizedUser.model_json_schema()}")

Installation

pip install dspydantic

Or using uv:

uv pip install dspydantic

Basic Usage

1. Define Your Pydantic Model

from pydantic import BaseModel, Field

class Invoice(BaseModel):
    invoice_number: str = Field(description="Invoice ID")
    total_amount: float = Field(description="Total amount")
    date: str = Field(description="Invoice date")

2. Create Examples

Use plain text and Pydantic model instances:

from dspydantic import Example

examples = [
    Example(
        text="Invoice #INV-2024-001, Total: $1,234.56, Date: 2024-01-15",
        expected_output=Invoice(
            invoice_number="INV-2024-001",
            total_amount=1234.56,
            date="2024-01-15"
        )
    ),
    Example(
        text="Invoice #INV-2024-002, Total: $567.89, Date: 2024-01-20",
        expected_output=Invoice(
            invoice_number="INV-2024-002",
            total_amount=567.89,
            date="2024-01-20"
        )
    ),
]

3. Optimize

from dspydantic import PydanticOptimizer

optimizer = PydanticOptimizer(
    model=Invoice,
    examples=examples,
    instruction_prompt="Extract the invoice data from the text.",
    system_prompt="You are a helpful assistant that extracts invoice data from text.",
    evaluate_fn="exact",
    model_id="gpt-4o",
    verbose=True
)

result = optimizer.optimize()

4. Use Optimized Descriptions

You can create a new optimized model class with the optimized descriptions applied directly:

from dspydantic import create_optimized_model
from openai import OpenAI

# Create optimized model class with updated Field descriptions
OptimizedInvoice = create_optimized_model(Invoice, result.optimized_descriptions)

# Use the optimized model directly - it has optimized descriptions in Field definitions
# The optimized model works exactly like the original, but with better descriptions
optimized_schema = OptimizedInvoice.model_json_schema()

# Use with OpenAI structured outputs
# Include optimized system and instruction prompts if they were optimized
client = OpenAI()
messages = []
if result.optimized_system_prompt:
    messages.append({"role": "system", "content": result.optimized_system_prompt})

user_content = "Extract: INV-2024-001, $1,234.56, 2024-01-15"
if result.optimized_instruction_prompt:
    user_content = f"{result.optimized_instruction_prompt}\n\n{user_content}"
messages.append({"role": "user", "content": user_content})

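response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    # chat.completions.create expects a response_format dict, not a model class
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": OptimizedInvoice.__name__,
            "schema": optimized_schema,
            "strict": True
        }
    }
)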

# Parse response using the optimized model
invoice = OptimizedInvoice.model_validate_json(response.choices[0].message.content)

Alternatively, you can use apply_optimized_descriptions to get just the JSON schema without creating a new model class (useful for one-off schema generation).

Working with Images

from pydantic import BaseModel, Field
from typing import Literal
from dspydantic import Example, PydanticOptimizer, create_optimized_model

class DigitClassification(BaseModel):
    digit: Literal[0, 1, 2, 3, 4, 5, 6, 7, 8, 9] = Field(
        description="The digit shown in the image (0-9)"
    )

examples = [
    Example(
        image_path="digit_5.png",
        expected_output=DigitClassification(digit=5)
    ),
    Example(
        image_path="digit_3.png",
        expected_output=DigitClassification(digit=3)
    ),
]

optimizer = PydanticOptimizer(
    model=DigitClassification,
    examples=examples,
    evaluate_fn="exact",
    model_id="gpt-4o"
)

result = optimizer.optimize()

# Create optimized model
OptimizedDigitClassification = create_optimized_model(
    DigitClassification, result.optimized_descriptions
)

# Use the optimized model
digit = OptimizedDigitClassification(digit=5)

Working with PDFs

from dspydantic import Example

# Invoice is the model defined in "Basic Usage" above
examples = [
    Example(
        pdf_path="invoice_001.pdf",
        pdf_dpi=300,  # Optional, default is 300
        expected_output=Invoice(
            invoice_number="INV-2024-001",
            total_amount=1234.56,
            date="2024-01-15"
        )
    ),
]
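
The `pdf_dpi` setting controls the resolution used for PDF conversion. Assuming the conversion rasterizes each page to an image (the text above only says "PDF conversion"), the pixel size of a page grows linearly with the DPI, which matters for image size and cost. The helper `page_pixels` below is hypothetical and not part of dspydantic; it only illustrates the arithmetic.

```python
def page_pixels(width_in: float, height_in: float, dpi: int = 300) -> tuple:
    """Pixel dimensions of a page rasterized at `dpi` dots per inch."""
    return round(width_in * dpi), round(height_in * dpi)


# US Letter is 8.5 x 11 inches.
letter_300 = page_pixels(8.5, 11.0, dpi=300)
letter_150 = page_pixels(8.5, 11.0, dpi=150)
print(letter_300)  # (2550, 3300)
print(letter_150)  # (1275, 1650)
```

Halving the DPI cuts the pixel count to a quarter, so a lower value may be worth trying when the documents are large and the text is still legible.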

Nested Models

Nested models work automatically:

from pydantic import BaseModel, Field
from dspydantic import Example

class Address(BaseModel):
    street: str = Field(description="Street address")
    city: str = Field(description="City name")
    zip_code: str = Field(description="ZIP code")

class User(BaseModel):
    name: str = Field(description="User name")
    address: Address = Field(description="User address")

examples = [
    Example(
        text="John Doe, 123 Main St, New York, 10001",
        expected_output=User(
            name="John Doe",
            address=Address(street="123 Main St", city="New York", zip_code="10001")
        )
    ),
]

Field paths will automatically be: "name", "address.street", "address.city", "address.zip_code".
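
To make the dotted-path convention concrete, here is a minimal sketch that flattens a nested dictionary (for example, what `model_dump()` returns for a nested model) into the same kind of paths. The function `flatten_paths` is a hypothetical helper for illustration only and is not part of dspydantic.

```python
from typing import Any, Dict


def flatten_paths(data: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten a nested dict into {dotted.path: leaf_value}."""
    flat: Dict[str, Any] = {}
    for key, value in data.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested objects, extending the path prefix.
            flat.update(flatten_paths(value, path))
        else:
            flat[path] = value
    return flat


nested = {
    "name": "John Doe",
    "address": {"street": "123 Main St", "city": "New York", "zip_code": "10001"},
}
paths = list(flatten_paths(nested))
print(paths)
# ['name', 'address.street', 'address.city', 'address.zip_code']
```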

Custom Evaluation

You can provide your own evaluation function:

from dspydantic import Example, PydanticOptimizer

def evaluate(
    example: Example,
    optimized_descriptions: dict[str, str],
    optimized_system_prompt: str | None,
    optimized_instruction_prompt: str | None,
) -> float:
    """
    Evaluate how well the optimized prompts work.
    
    Returns a score between 0.0 and 1.0.
    """
    # Your evaluation logic here
    # Use optimized_descriptions and prompts with your LLM
    # Compare results with example.expected_output
    return 0.85

optimizer = PydanticOptimizer(
    model=User,
    examples=examples,
    evaluate_fn=evaluate,
    model_id="gpt-4o"
)
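
The scoring step inside a custom evaluation function could look like the sketch below, which returns the fraction of expected fields that were extracted exactly. The LLM call itself is omitted, and `field_accuracy` is a hypothetical helper, not a dspydantic function; inside `evaluate` you might compare the dict produced by your LLM against `example.expected_output` converted to a dict.

```python
from typing import Any, Dict


def field_accuracy(predicted: Dict[str, Any], expected: Dict[str, Any]) -> float:
    """Fraction of expected top-level fields whose predicted value matches exactly."""
    if not expected:
        return 1.0
    matches = sum(
        1
        for key, value in expected.items()
        if key in predicted and predicted[key] == value
    )
    return matches / len(expected)


expected = {"name": "John Doe", "age": 30, "email": "john@example.com"}
predicted = {"name": "John Doe", "age": 31, "email": "john@example.com"}
score = field_accuracy(predicted, expected)
print(round(score, 3))  # 0.667
```

A partial-credit score like this gives the optimizer a smoother signal than an all-or-nothing comparison of whole objects, at the cost of treating every field as equally important.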

Optimizing Prompts

You can also optimize system and instruction prompts:

optimizer = PydanticOptimizer(
    model=User,
    examples=examples,
    evaluate_fn="exact",
    system_prompt="You are a helpful assistant that extracts information.",
    instruction_prompt="Extract the user information from the input text.",
    model_id="gpt-4o"
)

result = optimizer.optimize()

# Create optimized model with updated descriptions
from dspydantic import create_optimized_model
OptimizedUser = create_optimized_model(User, result.optimized_descriptions)

# Access optimized prompts
print("Optimized system prompt:", result.optimized_system_prompt)
print("Optimized instruction prompt:", result.optimized_instruction_prompt)
print("Optimized descriptions:", result.optimized_descriptions)

# Use the optimized model and prompts with your LLM
from openai import OpenAI

client = OpenAI()
messages = []
if result.optimized_system_prompt:
    messages.append({"role": "system", "content": result.optimized_system_prompt})

user_content = "John Doe, 123 Main St, New York, 10001"
if result.optimized_instruction_prompt:
    user_content = f"{result.optimized_instruction_prompt}\n\n{user_content}"
messages.append({"role": "user", "content": user_content})

response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": OptimizedUser.__name__,
            "schema": OptimizedUser.model_json_schema(),
            "strict": True
        }
    }
)

# Parse response using the optimized model
user = OptimizedUser.model_validate_json(response.choices[0].message.content)
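
The message-assembly pattern used above (optional system prompt, instruction prompt prepended to the input text) repeats across the examples in this README, so it can be factored into a small helper. `build_messages` is a hypothetical name introduced here for illustration; it is not provided by dspydantic.

```python
from typing import Dict, List, Optional


def build_messages(
    text: str,
    system_prompt: Optional[str] = None,
    instruction_prompt: Optional[str] = None,
) -> List[Dict[str, str]]:
    """Build a chat message list, skipping prompts that are None or empty."""
    messages: List[Dict[str, str]] = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    # Prepend the instruction prompt to the input text when one is available.
    user_content = f"{instruction_prompt}\n\n{text}" if instruction_prompt else text
    messages.append({"role": "user", "content": user_content})
    return messages


messages = build_messages(
    "John Doe, 123 Main St, New York, 10001",
    system_prompt="You are a helpful assistant that extracts information.",
    instruction_prompt="Extract the user information from the input text.",
)
print(messages[0]["role"], messages[1]["role"])  # system user
```

With an optimization result in hand, the call would be `build_messages(text, result.optimized_system_prompt, result.optimized_instruction_prompt)`.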

Built-in Evaluation Options

Instead of writing a custom evaluation function, you can use built-in options:

  • "exact": Exact matching between extracted and expected values
  • "levenshtein": Fuzzy matching using Levenshtein distance

optimizer = PydanticOptimizer(
    model=User,
    examples=examples,
    evaluate_fn="exact",  # or "levenshtein"
    model_id="gpt-4o"
)
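
For intuition about the "levenshtein" option, the sketch below implements the textbook edit-distance algorithm and one common way to turn it into a 0.0 to 1.0 similarity. How dspydantic normalizes or aggregates the distance internally is not stated here, so treat `similarity` as an assumption rather than the library's actual scoring.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution (or match)
                )
            )
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalize edit distance to [0.0, 1.0]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))   # 3
print(similarity("John Doe", "Jon Doe"))  # 0.875
```

Exact matching would score "Jon Doe" against "John Doe" as a complete miss, while a fuzzy score still rewards a near-correct extraction.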

Examples

See the examples directory for complete working examples.

API Reference

PydanticOptimizer

Main optimizer class.

Parameters:

  • model (type[BaseModel]): The Pydantic model class to optimize
  • examples (list[Example]): List of examples for optimization
  • evaluate_fn (Callable | str | None): Evaluation function or built-in option ("exact", "levenshtein"). If None, uses default evaluation.
  • system_prompt (str | None): Optional initial system prompt to optimize
  • instruction_prompt (str | None): Optional initial instruction prompt to optimize
  • model_id (str): LLM model ID (default: "gpt-4o")
  • api_key (str | None): API key (default: from OPENAI_API_KEY env var)
  • verbose (bool): Print progress (default: False)
  • optimizer_type (str): Optimizer type (default: "miprov2zeroshot")
  • num_threads (int): Number of optimization threads (default: 4)

Returns (from optimize()):

  • OptimizationResult: Contains optimized descriptions, prompts, and metrics

Example

Example data for optimization.

Parameters:

  • expected_output (dict | BaseModel): Expected output as a Pydantic model instance or dict
  • text (str | None): Plain text input
  • image_path (str | Path | None): Path to an image file
  • image_base64 (str | None): Base64-encoded image string
  • pdf_path (str | Path | None): Path to a PDF file
  • pdf_dpi (int): DPI for PDF conversion (default: 300)
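
When an image is already in memory or comes from somewhere other than a local path, `image_base64` is the parameter to use. A minimal sketch of producing such a string with the standard library follows. `encode_image_base64` is a hypothetical helper, and whether dspydantic expects bare base64 (as produced here) or a data URI is not stated in this README, so check before relying on it.

```python
import base64
import tempfile
from pathlib import Path


def encode_image_base64(path) -> str:
    """Read a file and return its contents as a base64-encoded ASCII string."""
    return base64.b64encode(Path(path).read_bytes()).decode("ascii")


# Demonstrate with a throwaway file holding the 8-byte PNG signature
# as stand-in data (not a complete image).
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.png"
    sample.write_bytes(b"\x89PNG\r\n\x1a\n")
    encoded = encode_image_base64(sample)

print(encoded)  # iVBORw0KGgo=
```

The resulting string would then be passed as `Example(image_base64=encoded, expected_output=...)`.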

Example:

# These snippets assume a User model with only `name` and `age` fields

# Text input
Example(
    text="John Doe, 30 years old",
    expected_output=User(name="John Doe", age=30)
)

# Image input
Example(
    image_path="document.png",
    expected_output=User(name="John Doe", age=30)
)

# PDF input
Example(
    pdf_path="document.pdf",
    expected_output=User(name="John Doe", age=30)
)

# Combined text and image
Example(
    text="Extract information from this document",
    image_path="document.png",
    expected_output=User(name="John Doe", age=30)
)

create_optimized_model(model, optimized_descriptions)

Create a new Pydantic model class with optimized field descriptions applied directly to Field definitions. This is the recommended way to use optimized descriptions.

Parameters:

  • model (type[BaseModel]): The original Pydantic model class
  • optimized_descriptions (dict[str, str]): Dictionary mapping field paths to optimized descriptions

Returns:

  • type[BaseModel]: A new Pydantic model class with optimized descriptions in Field definitions

Example:

from dspydantic import create_optimized_model

# Create optimized model class
OptimizedInvoice = create_optimized_model(Invoice, result.optimized_descriptions)

# Use the optimized model directly - it works exactly like the original
# but with optimized descriptions embedded in Field definitions
invoice = OptimizedInvoice(
    invoice_number="INV-2024-001",
    total_amount=1234.56,
    date="2024-01-15"
)

# Get JSON schema with optimized descriptions
optimized_schema = OptimizedInvoice.model_json_schema()

# Use with OpenAI structured outputs
# Include optimized prompts if available
from openai import OpenAI

client = OpenAI()
messages = []
if result.optimized_system_prompt:
    messages.append({"role": "system", "content": result.optimized_system_prompt})

user_content = "Extract invoice data..."
if result.optimized_instruction_prompt:
    user_content = f"{result.optimized_instruction_prompt}\n\n{user_content}"
messages.append({"role": "user", "content": user_content})

response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": OptimizedInvoice.__name__,
            "schema": optimized_schema,
            "strict": True
        }
    }
)

apply_optimized_descriptions(model, optimized_descriptions)

Create a JSON schema dictionary with optimized field descriptions. Useful for one-off schema generation without creating a new model class.

Parameters:

  • model (type[BaseModel]): The original Pydantic model class
  • optimized_descriptions (dict[str, str]): Dictionary mapping field paths to optimized descriptions

Returns:

  • dict: JSON schema dictionary with optimized descriptions

Example:

from dspydantic import apply_optimized_descriptions

# Get optimized schema without creating a new model class
optimized_schema = apply_optimized_descriptions(Invoice, result.optimized_descriptions)

# Use with OpenAI
# Include optimized prompts if available
from openai import OpenAI

client = OpenAI()
messages = []
if result.optimized_system_prompt:
    messages.append({"role": "system", "content": result.optimized_system_prompt})

user_content = "Extract invoice data..."
if result.optimized_instruction_prompt:
    user_content = f"{result.optimized_instruction_prompt}\n\n{user_content}"
messages.append({"role": "user", "content": user_content})

response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": Invoice.__name__,
            "schema": optimized_schema,
            "strict": True
        }
    }
)
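
To show what applying descriptions to a schema by field path involves, here is a sketch that works on a plain JSON schema dictionary and follows local `$ref` entries into `$defs` for nested models. It is one possible approach under the assumption that nested models appear as `#/$defs/...` references; it is not dspydantic's actual implementation, and `set_schema_descriptions` is a hypothetical name.

```python
import copy
from typing import Any, Dict


def set_schema_descriptions(
    schema: Dict[str, Any], descriptions: Dict[str, str]
) -> Dict[str, Any]:
    """Return a copy of `schema` with descriptions set at dotted field paths."""
    result = copy.deepcopy(schema)
    defs = result.get("$defs", {})

    def resolve(node: Dict[str, Any]) -> Dict[str, Any]:
        # Follow a local "#/$defs/Name" reference if the node has one.
        ref = node.get("$ref", "")
        if ref.startswith("#/$defs/"):
            return defs[ref.split("/")[-1]]
        return node

    for path, description in descriptions.items():
        node = result
        parts = path.split(".")
        for depth, part in enumerate(parts):
            prop = resolve(node)["properties"][part]  # KeyError on unknown paths
            if depth == len(parts) - 1:
                prop["description"] = description
            node = prop
    return result


schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string", "description": "User name"},
        "address": {"$ref": "#/$defs/Address", "description": "User address"},
    },
    "$defs": {
        "Address": {
            "type": "object",
            "properties": {
                "street": {"type": "string", "description": "Street address"},
            },
        }
    },
}

updated = set_schema_descriptions(
    schema,
    {"name": "Full name of the person", "address.street": "Street number and name"},
)
print(updated["properties"]["name"]["description"])
print(updated["$defs"]["Address"]["properties"]["street"]["description"])
```

Note that a definition under `$defs` is shared by every field that references it, so in this sketch a description set through one path applies wherever that nested model is reused.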

License

Apache 2.0

Contributing

Contributions are welcome! Please open an issue or submit a pull request.
