Optimize Pydantic model field descriptions using DSPy

dspydantic

Optimize information extraction prompts and Pydantic field descriptions using DSPy.

Overview

dspydantic is a library that uses DSPy to automatically optimize field descriptions in Pydantic models. By providing example data and an evaluation function, the library iteratively improves field descriptions to achieve better structured output quality from LLMs.

Why dspydantic?

When building LLM-powered applications that extract structured data, getting the right field descriptions and prompts is crucial for accuracy. Manually crafting and iterating on these descriptions is time-consuming and often suboptimal. dspydantic solves this by:

Automatically optimizing field descriptions: Instead of manually tweaking descriptions, provide examples and let DSPy search for descriptions that improve extraction accuracy
Optimizing system and instruction prompts: Beyond field descriptions, optimize the entire prompt structure for better results
Data-driven improvements: Uses your actual data and evaluation metrics to iteratively improve, rather than relying on intuition
Works with any input format: Supports plain text, images (including PDFs converted to images), and combinations thereof

Use Cases

dspydantic is particularly useful for:

Document extraction: Extract structured data from invoices, forms, medical records, or any document format
Image analysis: Extract information from images, diagrams, or scanned documents
Text classification: Optimize models for sentiment analysis, categorization, or any text classification task
Multi-modal extraction: Combine text and images for complex extraction scenarios

See the examples directory for complete working examples:

Text extraction: Extract structured information from veterinary EHR text (PetEVAL dataset)
Image classification: Classify handwritten digits from MNIST images
Sentiment analysis: Classify movie review sentiment from IMDB dataset

Features

🔄 Automatic Optimization: Uses DSPy to optimize Pydantic field descriptions
📊 Custom Evaluation: Define your own evaluation function to measure quality
🎯 Multiple Optimizers: Support for MIPROv2, GEPA, BootstrapFewShot, and more
🔧 Easy Integration: Simple API that works with any Pydantic model
📝 Recursive Support: Handles nested models and arrays of objects

Installation

pip install dspydantic

Or using uv:

uv pip install dspydantic

Quick Start

Text Input Example

from pydantic import BaseModel, Field
from dspydantic import PydanticOptimizer, Example

# Define your Pydantic model
class User(BaseModel):
    name: str = Field(description="User name")
    age: int = Field(description="User age")
    email: str = Field(description="Email address")

# Prepare examples with text input
examples = [
    Example(
        input_data={"text": "John Doe, 30 years old, john@example.com"},
        expected_output={"name": "John Doe", "age": 30, "email": "john@example.com"}
    ),
    Example(
        input_data={"text": "Jane Smith, 25, jane.smith@email.com"},
        expected_output={"name": "Jane Smith", "age": 25, "email": "jane.smith@email.com"}
    ),
]

# Optimize with built-in evaluation (uses "exact" matching)
optimizer = PydanticOptimizer(
    model=User,
    examples=examples,
    evaluate_fn="exact",  # Built-in exact matching evaluation
    model_id="gpt-4o",
    api_key="your-api-key",  # Or set OPENAI_API_KEY env var
    verbose=True
)

# Run optimization
result = optimizer.optimize()

# View optimized descriptions
print("Optimized descriptions:")
for field, description in result.optimized_descriptions.items():
    print(f"  {field}: {description}")

# Use optimized descriptions with OpenAI structured outputs
from dspydantic import apply_optimized_descriptions
from openai import OpenAI

optimized_schema = apply_optimized_descriptions(User, result.optimized_descriptions)
client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Extract: Alice Johnson, 28, alice@example.com"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": User.__name__,
            "schema": optimized_schema,
            "strict": True
        }
    }
)

Image Input Example

from pydantic import BaseModel, Field
from dspydantic import PydanticOptimizer, Example, prepare_input_data
from typing import Literal

# Define model for image classification
class DigitClassification(BaseModel):
    digit: Literal[0, 1, 2, 3, 4, 5, 6, 7, 8, 9] = Field(
        description="The digit shown in the image (0-9)"
    )

# Prepare examples with image input
examples = [
    Example(
        input_data=prepare_input_data(image_path="digit_5.png"),
        expected_output={"digit": 5}
    ),
    Example(
        input_data=prepare_input_data(image_path="digit_3.png"),
        expected_output={"digit": 3}
    ),
]

# Optimize
optimizer = PydanticOptimizer(
    model=DigitClassification,
    examples=examples,
    evaluate_fn="exact",
    model_id="gpt-4o",
    api_key="your-api-key",
    verbose=True
)

result = optimizer.optimize()

PDF Input Example

from pydantic import BaseModel, Field
from dspydantic import PydanticOptimizer, Example, prepare_input_data

# Define model for invoice extraction
class Invoice(BaseModel):
    invoice_number: str = Field(description="Invoice number")
    total_amount: float = Field(description="Total amount")
    date: str = Field(description="Invoice date")

# Prepare examples with PDF input
examples = [
    Example(
        input_data=prepare_input_data(pdf_path="invoice_001.pdf"),
        expected_output={
            "invoice_number": "INV-2024-001",
            "total_amount": 1234.56,
            "date": "2024-01-15"
        }
    ),
    Example(
        input_data=prepare_input_data(pdf_path="invoice_002.pdf"),
        expected_output={
            "invoice_number": "INV-2024-002",
            "total_amount": 567.89,
            "date": "2024-01-20"
        }
    ),
]

# Optimize
optimizer = PydanticOptimizer(
    model=Invoice,
    examples=examples,
    evaluate_fn="exact",
    model_id="gpt-4o",
    api_key="your-api-key",
    verbose=True
)

result = optimizer.optimize()

Combined Text and Image Example

from dspydantic import prepare_input_data, Example

# Combine text and image in a single example
examples = [
    Example(
        input_data=prepare_input_data(
            text="Extract information from this receipt",
            image_path="receipt.png"
        ),
        expected_output={"total": 45.99, "merchant": "Coffee Shop"}
    ),
]

Usage

Basic Example

from pydantic import BaseModel, Field
from dspydantic import PydanticOptimizer, Example, extract_field_descriptions, apply_optimized_descriptions

class Invoice(BaseModel):
    invoice_number: str = Field(description="Invoice ID")
    total_amount: float = Field(description="Total amount")
    date: str = Field(description="Invoice date")

# Step 1: Inspect current field descriptions (optional)
current_descriptions = extract_field_descriptions(Invoice)
print("Current descriptions:", current_descriptions)
# Output: {
#     "invoice_number": "Invoice ID",
#     "total_amount": "Total amount",
#     "date": "Invoice date"
# }

# Step 2: Prepare examples
examples = [
    Example(
        input_data={"text": "Invoice #INV-2024-001, Total: $1,234.56, Date: 2024-01-15"},
        expected_output={
            "invoice_number": "INV-2024-001",
            "total_amount": 1234.56,
            "date": "2024-01-15"
        }
    ),
    # Add more examples...
]

# Step 3: Optimize field descriptions
optimizer = PydanticOptimizer(
    model=Invoice,
    examples=examples,
    evaluate_fn="exact",  # Use built-in exact matching evaluation
    model_id="gpt-4o"
)
result = optimizer.optimize()

# Step 4: View optimized descriptions
print("\nOptimized descriptions:")
for field_path, description in result.optimized_descriptions.items():
    print(f"  {field_path}: {description}")

# Step 5: Apply optimized descriptions to create a JSON schema
optimized_schema = apply_optimized_descriptions(Invoice, result.optimized_descriptions)

# Step 6: Use with OpenAI structured outputs
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "Extract invoice data from: INV-2024-001, $1,234.56, 2024-01-15"}
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": Invoice.__name__,
            "schema": optimized_schema,
            "strict": True
        }
    }
)

extracted_data = response.choices[0].message.content
print("\nExtracted data:", extracted_data)
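
Note that `message.content` is a JSON string rather than a Python object. A minimal sketch of decoding it with the standard library follows; the `raw_content` literal stands in for `response.choices[0].message.content` and is an assumed value for illustration, not real API output.

```python
import json

# Stand-in for response.choices[0].message.content (assumed value, not real output)
raw_content = '{"invoice_number": "INV-2024-001", "total_amount": 1234.56, "date": "2024-01-15"}'

# Decode the JSON string into a plain dictionary
invoice_data = json.loads(raw_content)

print(invoice_data["invoice_number"])
print(invoice_data["total_amount"])
```

With Pydantic v2, `Invoice.model_validate_json(raw_content)` would additionally validate the field types against the model.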

Custom Evaluation Function

The evaluation function receives an Example, optimized field descriptions, and optimized prompts, and should return a score between 0.0 and 1.0:

from dspydantic import Example

def evaluate(
    example: Example,
    optimized_descriptions: dict[str, str],
    optimized_system_prompt: str | None,
    optimized_instruction_prompt: str | None,
) -> float:
    """
    Evaluate how well the optimized prompts and descriptions work.
    
    Args:
        example: The example with input_data and expected_output
        optimized_descriptions: Dictionary of field paths to optimized descriptions
        optimized_system_prompt: Optimized system prompt (None if not provided)
        optimized_instruction_prompt: Optimized instruction prompt (None if not provided)
    
    Returns:
        Score between 0.0 and 1.0
    """
    # Example: Use an LLM to extract data and compare with expected output
    # Use the optimized prompts and descriptions with your LLM
    # This is a simplified example - your actual implementation would
    # call your LLM with the optimized prompts/descriptions and compare results
    
    # For demonstration, return a mock score
    return 0.85
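
The mock score leaves the comparison step open. One way to score a prediction is the fraction of expected fields that match exactly; the helper below is a sketch of that idea. `field_accuracy` is a hypothetical name that is not part of dspydantic, and the `predicted` dictionary is assumed to come from your own LLM call.

```python
from typing import Any


def field_accuracy(predicted: dict[str, Any], expected: dict[str, Any]) -> float:
    """Return the fraction of expected fields whose predicted value matches exactly."""
    if not expected:
        # Nothing to compare against; treat as a perfect score
        return 1.0
    matches = sum(
        1 for key, value in expected.items()
        if key in predicted and predicted[key] == value
    )
    return matches / len(expected)


expected = {"name": "John Doe", "age": 30, "email": "john@example.com"}
predicted = {"name": "John Doe", "age": 31, "email": "john@example.com"}  # age is wrong

score = field_accuracy(predicted, expected)
print(round(score, 2))  # 0.67
```

Inside an evaluation function, the return value of such a helper could be returned directly, since it already lies between 0.0 and 1.0.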

System and Instruction Prompts

You can optimize system prompts and instruction prompts alongside field descriptions:

optimizer = PydanticOptimizer(
    model=User,
    examples=examples,
    evaluate_fn=evaluate,
    system_prompt="You are a helpful assistant that extracts information.",
    instruction_prompt="Extract the user information from the input text.",
    model_id="gpt-4o"
)

result = optimizer.optimize()

# Access optimized prompts
print(result.optimized_system_prompt)
print(result.optimized_instruction_prompt)
print(result.optimized_descriptions)

Custom DSPy Language Model

You can pass any DSPy language model directly instead of using model_id:

import dspy
from dspydantic import PydanticOptimizer, Example

# Create a custom DSPy LM with any configuration
custom_lm = dspy.LM(
    "gpt-4o",
    api_key="your-key",
    api_base="https://custom-endpoint.com",  # For custom endpoints
    api_version="2024-01-01",  # For Azure
    # ... any other DSPy LM parameters
)

optimizer = PydanticOptimizer(
    model=User,
    examples=examples,
    evaluate_fn=evaluate,
    lm=custom_lm,  # Pass your custom LM
    verbose=True
)

This is useful when you need:

Custom API endpoints
Special LM configurations
Reusing an existing LM instance
Using DSPy's advanced LM features

Optimizer Types

Choose from different DSPy optimizers:

"miprov2zeroshot" (default): MIPROv2 configured for 0-shot optimization
"miprov2": Full MIPROv2 optimization
"gepa": GEPA optimizer
"bootstrapfewshot": BootstrapFewShot optimizer
"bootstrapfewshotwithrandomsearch": BootstrapFewShotWithRandomSearch

optimizer = PydanticOptimizer(
    model=User,
    examples=examples,
    evaluate_fn=evaluate,
    optimizer_type="miprov2",  # Choose optimizer
    num_threads=4,
    verbose=True
)

Nested Models

The library automatically handles nested Pydantic models:

class Address(BaseModel):
    street: str = Field(description="Street address")
    city: str = Field(description="City name")

class User(BaseModel):
    name: str = Field(description="User name")
    address: Address = Field(description="User address")

# Field paths will be: "name", "address", "address.street", "address.city"
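
A sketch of how dot-notation paths like these can be derived by recursion. It walks an inlined JSON-schema-style dictionary rather than a Pydantic class, and `collect_paths` is a hypothetical helper, not the library's implementation. The sketch also records the description of the `address` field itself, as the `extract_field_descriptions` example in the API reference does.

```python
def collect_paths(schema: dict, prefix: str = "") -> dict[str, str]:
    """Map dot-notation field paths to descriptions in a nested schema dict."""
    paths: dict[str, str] = {}
    for name, prop in schema.get("properties", {}).items():
        path = f"{prefix}.{name}" if prefix else name
        if "description" in prop:
            paths[path] = prop["description"]
        # Recurse into nested objects; leaf fields have no "properties" key
        paths.update(collect_paths(prop, prefix=path))
    return paths


# Hand-written schema mirroring the User/Address models (assumed inlined, no $ref)
user_schema = {
    "properties": {
        "name": {"type": "string", "description": "User name"},
        "address": {
            "type": "object",
            "description": "User address",
            "properties": {
                "street": {"type": "string", "description": "Street address"},
                "city": {"type": "string", "description": "City name"},
            },
        },
    }
}

for path, description in collect_paths(user_schema).items():
    print(f"{path}: {description}")
```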

Working with Field Descriptions Directly

You can use extract_field_descriptions and apply_optimized_descriptions independently to inspect and modify field descriptions without running optimization:

from pydantic import BaseModel, Field
from dspydantic import extract_field_descriptions, apply_optimized_descriptions

class Product(BaseModel):
    name: str = Field(description="Product name")
    price: float = Field(description="Price")
    in_stock: bool = Field(description="Availability")

# Extract current descriptions
descriptions = extract_field_descriptions(Product)
print(descriptions)
# {'name': 'Product name', 'price': 'Price', 'in_stock': 'Availability'}

# Manually improve descriptions (or use optimization results)
improved_descriptions = {
    "name": "The full product name as displayed to customers",
    "price": "Price in USD without currency symbol",
    "in_stock": "True if item is currently available for purchase"
}

# Apply improved descriptions to create a schema
optimized_schema = apply_optimized_descriptions(Product, improved_descriptions)

# Use the optimized schema with any LLM that accepts JSON schemas

Use Cases:

Inspect current descriptions: See what descriptions are currently set in your model
Manual refinement: Manually improve descriptions based on testing or domain knowledge
Schema generation: Create production-ready JSON schemas with optimized descriptions
Integration: Prepare schemas for OpenAI, Anthropic, or other structured output APIs
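
For the OpenAI integration, the wrapper dictionary used in the examples above can be built by a small helper. The sketch below also sets `additionalProperties` to false and lists every property under `required` on each object, which OpenAI's strict mode expects. Whether `apply_optimized_descriptions` already does this is not stated here, so treat that step as an assumption. `make_strict` and `to_openai_response_format` are hypothetical names, and the sketch handles only inlined `properties` and `items`, not `$ref`.

```python
import copy
from typing import Any


def make_strict(schema: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of the schema with strict-mode constraints on every object."""
    result = copy.deepcopy(schema)

    def visit(node: Any) -> None:
        if not isinstance(node, dict):
            return
        properties = node.get("properties")
        if isinstance(properties, dict):
            # Strict mode: no extra keys, and every property is required
            node["additionalProperties"] = False
            node["required"] = list(properties)
            for child in properties.values():
                visit(child)
        # Arrays of objects carry their object schema under "items"
        visit(node.get("items"))

    visit(result)
    return result


def to_openai_response_format(name: str, schema: dict[str, Any]) -> dict[str, Any]:
    """Wrap a JSON schema in the response_format structure."""
    return {
        "type": "json_schema",
        "json_schema": {"name": name, "schema": make_strict(schema), "strict": True},
    }


# Hand-written schema standing in for the output of apply_optimized_descriptions
product_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string", "description": "The full product name"},
        "price": {"type": "number", "description": "Price in USD"},
    },
}

response_format = to_openai_response_format("Product", product_schema)
print(response_format["json_schema"]["schema"]["required"])
```

The helper works on a deep copy, so the schema passed in is left unchanged.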

API Reference

`PydanticOptimizer`

Main optimizer class.

Parameters:

model (type[BaseModel]): The Pydantic model class to optimize
examples (list[Example]): List of examples for optimization
evaluate_fn (Callable | str): Function that evaluates quality, or the name of a built-in evaluator such as "exact". A callable receives (Example, optimized_descriptions, optimized_system_prompt, optimized_instruction_prompt) and returns a score between 0.0 and 1.0
system_prompt (str | None): Optional initial system prompt to optimize
instruction_prompt (str | None): Optional initial instruction prompt to optimize
lm (dspy.LM | None): Optional DSPy language model instance. If provided, this will be used instead of creating a new one. If None, a new dspy.LM will be created from model_id/api_key/etc.
model_id (str): LLM model ID (default: "gpt-4o"). Only used if lm is None.
api_key (str | None): API key (default: from OPENAI_API_KEY env var). Only used if lm is None.
api_base (str | None): API base URL for Azure/custom endpoints. Only used if lm is None.
api_version (str | None): API version for Azure. Only used if lm is None.
num_threads (int): Number of optimization threads (default: 4)
init_temperature (float): Initial temperature (default: 1.0)
verbose (bool): Print progress (default: False)
optimizer_type (str): Optimizer type (default: "miprov2zeroshot")
train_split (float): Training split ratio (default: 0.8)

Returns:

OptimizationResult (returned by optimize()): Contains the optimized descriptions, the optimized system and instruction prompts, and metrics
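
To make `train_split` concrete: a ratio of 0.8 over ten examples leaves eight for training and two for validation. The sketch below shows one simple way to perform such a split. How dspydantic actually partitions the examples (for instance, whether it shuffles them) is not documented here, and `split_examples` is a hypothetical helper.

```python
from typing import Sequence, TypeVar

T = TypeVar("T")


def split_examples(examples: Sequence[T], train_split: float = 0.8) -> tuple[list[T], list[T]]:
    """Split examples into train and validation lists by ratio, preserving order."""
    if not 0.0 < train_split <= 1.0:
        raise ValueError("train_split must be in (0.0, 1.0]")
    cutoff = int(len(examples) * train_split)
    return list(examples[:cutoff]), list(examples[cutoff:])


train, val = split_examples(list(range(10)), train_split=0.8)
print(len(train), len(val))  # 8 2
```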

`extract_field_descriptions(model)`

Extract field descriptions from a Pydantic model recursively.

Parameters:

model (type[BaseModel]): The Pydantic model class to extract descriptions from

Returns:

dict[str, str]: Dictionary mapping field paths to their descriptions. Field paths use dot notation for nested fields (e.g., "address.street").

Example:

from pydantic import BaseModel, Field
from dspydantic import extract_field_descriptions

# Simple model
class User(BaseModel):
    name: str = Field(description="User's full name")
    age: int = Field(description="User's age in years")
    email: str = Field(description="User's email address")

descriptions = extract_field_descriptions(User)
# Returns: {
#     "name": "User's full name",
#     "age": "User's age in years",
#     "email": "User's email address"
# }

# Nested model
class Address(BaseModel):
    street: str = Field(description="Street address")
    city: str = Field(description="City name")
    zip_code: str = Field(description="ZIP code")

class Person(BaseModel):
    name: str = Field(description="Person's name")
    address: Address = Field(description="Home address")
    phone_numbers: list[str] = Field(description="List of phone numbers")

descriptions = extract_field_descriptions(Person)
# Returns: {
#     "name": "Person's name",
#     "address": "Home address",
#     "address.street": "Street address",
#     "address.city": "City name",
#     "address.zip_code": "ZIP code",
#     "phone_numbers": "List of phone numbers"
# }

# Use case: Inspect current descriptions before optimization
current_descriptions = extract_field_descriptions(Person)
print("Current field descriptions:")
for field_path, description in current_descriptions.items():
    print(f"  {field_path}: {description}")

`apply_optimized_descriptions(model, optimized_descriptions)`

Create a modified JSON schema with optimized field descriptions applied. This is useful for creating schemas compatible with OpenAI structured outputs, Anthropic, or other systems that accept JSON schemas.

Parameters:

model (type[BaseModel]): The original Pydantic model class
optimized_descriptions (dict[str, str]): Dictionary mapping field paths to optimized descriptions

Returns:

dict[str, Any]: Modified JSON schema dictionary with optimized descriptions. For OpenAI structured outputs, wrap it as shown in the examples below.

Example - Basic Usage:

from pydantic import BaseModel, Field
from dspydantic import apply_optimized_descriptions

class Invoice(BaseModel):
    invoice_number: str = Field(description="Invoice ID")
    total_amount: float = Field(description="Total amount")
    date: str = Field(description="Invoice date")

# After optimization, you have optimized descriptions
optimized_descriptions = {
    "invoice_number": "The unique alphanumeric identifier found at the top of the invoice",
    "total_amount": "The final amount due including all taxes and fees",
    "date": "The invoice date in YYYY-MM-DD format"
}

# Apply optimized descriptions to create a JSON schema
optimized_schema = apply_optimized_descriptions(Invoice, optimized_descriptions)

# Use with OpenAI structured outputs
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "Extract invoice data from: INV-2024-001, $1,234.56, 2024-01-15"}
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": Invoice.__name__,
            "schema": optimized_schema,
            "strict": True
        }
    }
)

Example - Nested Models:

from pydantic import BaseModel, Field
from dspydantic import apply_optimized_descriptions

class Address(BaseModel):
    street: str = Field(description="Street address")
    city: str = Field(description="City name")
    state: str = Field(description="State abbreviation")

class Customer(BaseModel):
    name: str = Field(description="Customer name")
    email: str = Field(description="Email address")
    address: Address = Field(description="Billing address")

# Optimized descriptions for nested fields use dot notation
optimized_descriptions = {
    "name": "The customer's full legal name",
    "email": "Primary contact email address",
    "address": "Complete billing address information",
    "address.street": "Street number and name",
    "address.city": "City name (not abbreviated)",
    "address.state": "Two-letter US state code (e.g., CA, NY)"
}

# Apply to create optimized schema
optimized_schema = apply_optimized_descriptions(Customer, optimized_descriptions)

# The schema now has optimized descriptions at all levels
print(optimized_schema["properties"]["address"]["properties"]["street"]["description"])
# Output: "Street number and name"
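
The dot-notation lookup can be pictured as a walk through nested `properties`. The sketch below applies a path-to-description mapping to an inlined schema dictionary. `set_descriptions` is a hypothetical helper written for illustration; it is not a description of how dspydantic implements `apply_optimized_descriptions`, and it assumes the schema contains no `$ref` indirection.

```python
import copy
from typing import Any


def set_descriptions(schema: dict[str, Any], descriptions: dict[str, str]) -> dict[str, Any]:
    """Return a copy of the schema with descriptions replaced at each dot path."""
    result = copy.deepcopy(schema)
    for path, description in descriptions.items():
        node = result
        for part in path.split("."):
            node = node.get("properties", {}).get(part)
            if node is None:
                break  # Unknown path: skip it rather than fail
        if node is not None:
            node["description"] = description
    return result


# Hand-written schema mirroring part of the Customer/Address models
customer_schema = {
    "properties": {
        "name": {"type": "string", "description": "Customer name"},
        "address": {
            "type": "object",
            "description": "Billing address",
            "properties": {
                "street": {"type": "string", "description": "Street address"},
            },
        },
    }
}

updated = set_descriptions(
    customer_schema,
    {"address.street": "Street number and name", "address.zip": "ignored"},
)
print(updated["properties"]["address"]["properties"]["street"]["description"])
```

Paths that do not exist in the schema, such as `address.zip` here, are skipped silently; raising an error instead would be a reasonable alternative when typos in paths should be caught early.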

Example - Complete Workflow:

from pydantic import BaseModel, Field
from dspydantic import (
    PydanticOptimizer,
    Example,
    extract_field_descriptions,
    apply_optimized_descriptions
)

class Product(BaseModel):
    name: str = Field(description="Product name")
    price: float = Field(description="Product price")
    category: str = Field(description="Product category")

# Step 1: Extract current descriptions (optional, for inspection)
current_descriptions = extract_field_descriptions(Product)
print("Before optimization:", current_descriptions)

# Step 2: Prepare examples and optimize
examples = [
    Example(
        input_data={"text": "iPhone 15 Pro, $999, Electronics"},
        expected_output={"name": "iPhone 15 Pro", "price": 999.0, "category": "Electronics"}
    ),
    # ... more examples
]

optimizer = PydanticOptimizer(
    model=Product,
    examples=examples,
    model_id="gpt-4o",
    evaluate_fn="exact"
)

result = optimizer.optimize()

# Step 3: Apply optimized descriptions to create a production-ready schema
optimized_schema = apply_optimized_descriptions(Product, result.optimized_descriptions)

# Step 4: Use the optimized schema with your LLM
openai_schema = {
    "type": "json_schema",
    "json_schema": {
        "name": Product.__name__,
        "schema": optimized_schema,
        "strict": True
    }
}

# Now use openai_schema in your API calls

Example - Comparing Before and After:

from pydantic import BaseModel, Field
from dspydantic import extract_field_descriptions, apply_optimized_descriptions

class Document(BaseModel):
    title: str = Field(description="Document title")
    author: str = Field(description="Author name")
    pages: int = Field(description="Number of pages")

# Get original descriptions
original_descriptions = extract_field_descriptions(Document)
print("Original descriptions:")
for path, desc in original_descriptions.items():
    print(f"  {path}: {desc}")

# After optimization, you have improved descriptions
optimized_descriptions = {
    "title": "The main title of the document, typically found at the top of the first page",
    "author": "The full name of the person or organization who created the document",
    "pages": "The total number of pages in the document as a whole number"
}

# Create schemas for comparison
original_schema = Document.model_json_schema()
optimized_schema = apply_optimized_descriptions(Document, optimized_descriptions)

# Compare field descriptions
print("\nComparison:")
for field_name in original_schema["properties"]:
    original_desc = original_schema["properties"][field_name].get("description", "N/A")
    optimized_desc = optimized_schema["properties"][field_name].get("description", "N/A")
    print(f"\n{field_name}:")
    print(f"  Original:  {original_desc}")
    print(f"  Optimized: {optimized_desc}")

License

Apache 2.0

Contributing

Contributions are welcome! Please open an issue or submit a pull request.

Release history

0.1.6

Mar 20, 2026

0.1.5

Mar 20, 2026

0.1.4

Mar 18, 2026

0.1.3

Mar 18, 2026

0.1.2

Mar 16, 2026

0.1.1

Jan 30, 2026

0.1

Jan 27, 2026

0.0.7

Dec 10, 2025

0.0.6

Dec 9, 2025

0.0.5

Dec 9, 2025

0.0.4

Dec 6, 2025

0.0.3

Dec 5, 2025

0.0.2

Dec 5, 2025

This version

0.0.1

Dec 5, 2025

Download files

Source Distribution

dspydantic-0.0.1.tar.gz (204.8 kB)

Uploaded Dec 5, 2025 Source

Built Distribution

dspydantic-0.0.1-py3-none-any.whl (27.2 kB)

Uploaded Dec 5, 2025 Python 3

File details

Details for the file dspydantic-0.0.1.tar.gz.

File metadata

Download URL: dspydantic-0.0.1.tar.gz
Upload date: Dec 5, 2025
Size: 204.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.5.18

File hashes

Hashes for dspydantic-0.0.1.tar.gz
Algorithm	Hash digest
SHA256	`f30bb62f6bf9d4f8509cd11b0754e2ea8bc2c621e7be6b9597978fe7f5e87754`
MD5	`88976298dbfab59f02eaf363ef85135a`
BLAKE2b-256	`0be3deed4a4a85c663a616b26c30673950211c6b0e3845ad9f67839a1206888a`

File details

Details for the file dspydantic-0.0.1-py3-none-any.whl.

File metadata

Download URL: dspydantic-0.0.1-py3-none-any.whl
Upload date: Dec 5, 2025
Size: 27.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.5.18

File hashes

Hashes for dspydantic-0.0.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`cc612d0469747e0be1ab9d2446cad67645e1bdaadf132a21f5de7917dbd6fdff`
MD5	`6ba2cb92e2b713b14a22a0c306483d53`
BLAKE2b-256	`cbfeb60e3bb7c2eb890e41d4bc4ab93e5bd0e5de4f4321783018a89efcff4b8d`
