Automatic Pandera DataFrameModel generator from pandas DataFrames

Pandera Forge 🔨

Pandera Forge is a deterministic generator for Pandera DataFrameModels from pandas, Spark, and Databricks DataFrames. It automatically creates exhaustive, type-safe schema definitions without relying on manual work or LLMs, providing a reliable gauge of your dataset's characteristics including statistics, nullability, uniqueness, and patterns.

Rationale

I have found that LLMs often fail when generating Python code for generic DataFrames, especially for feature engineering tasks. Providing an exhaustive schema definition of the DataFrame helps ground the LLM and prevents trial-and-error mistakes during analytical tasks.

Features

Automatic Schema Generation: Convert pandas, Spark, or Databricks DataFrames into Pandera DataFrameModels
Multi-Backend Support: Works with pandas, Apache Spark, and Databricks
Comprehensive Field Analysis: Detects nullability, uniqueness, min/max values, and data patterns
Pattern Detection: Identifies common patterns in string columns (emails, URLs, phone numbers, etc.)
Type Safety: Generates properly typed fields with appropriate constraints
Column Name Sanitization: Handles problematic column names (spaces, special characters, keywords)
Validation: Validates generated models against source data
Databricks Integration: Direct Unity Catalog support and Delta Lake integration
Extensible: Optional LLM enrichment for enhanced pattern detection

Installation

pip install pandera-forge

For LLM enrichment features:

pip install pandera-forge[llm]

For Apache Spark support:

pip install pandera-forge[spark]

For full Databricks integration:

pip install pandera-forge[databricks]

Quick Start

from pandas import DataFrame, to_datetime
from pandera_forge import ModelGenerator

# Create a sample DataFrame
df = DataFrame({
    "customer_id": [1, 2, 3, 4],
    "email": ["alice@example.com", "bob@example.com", "charlie@example.com", "david@example.com"],
    "age": [25, 30, 35, 40],
    "is_active": [True, True, False, True],
    "signup_date": to_datetime(["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04"])
})

# Generate the model
generator = ModelGenerator()
model_code = generator.generate(df, model_name="CustomerModel")

print(model_code)

This generates:

from pandera.pandas import DataFrameModel, Field, Timestamp
from pandera.typing.pandas import Series, Int64, Int32, Int16, Int8, Float64, Float32, Float16, String, Bool, DateTime, Category, Object
from typing import Optional


class CustomerModel(DataFrameModel):
    customer_id: Series[Int64] = Field(ge=1, le=4, unique=True, isin=[1,2,3,4])  # 4 distinct values, examples: [1, 2, 3]
    email: Series[Object] = Field(unique=True)  # 4 distinct values, examples: ["alice@example.com", "bob@example.com", "charlie@example.com"], pattern: email
    age: Series[Int64] = Field(ge=25, le=40, unique=True, isin=[25, 30, 35, 40])  # 4 distinct values, examples: [25, 30, 35]
    is_active: Series[Bool] = Field(isin=[True, False])  # 2 distinct values, examples: ["True", "False"]
    signup_date: Series[DateTime] = Field(unique=True)  # 4 distinct values, examples: ["2023-01-01 00:00:00", "2023-01-02 00:00:00", "2023-01-03 00:00:00"]

Note: the generated code would also include isin= lists for the email and signup_date columns; they are omitted here for brevity. If a column has more than 10 distinct values, the isin constraint is omitted.
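The constraints in the generated model follow from simple per-column statistics. As a rough illustration only, the rules can be sketched in plain Python. This is not Pandera Forge's actual implementation: `summarize_column` and `isin_limit` are hypothetical names, and the sketch works on a plain list of non-null values rather than a DataFrame column.

```python
from numbers import Real


def summarize_column(values, isin_limit=10):
    """Derive Field-style constraints from a list of non-null, hashable values.

    Hypothetical helper for illustration; not part of pandera_forge.
    """
    distinct = list(dict.fromkeys(values))  # distinct values, first-seen order
    constraints = {}

    # Numeric bounds only make sense for real numbers; bool is excluded
    # even though it is a subclass of int in Python.
    is_numeric = all(
        isinstance(v, Real) and not isinstance(v, bool) for v in values
    )
    if values and is_numeric:
        constraints["ge"] = min(values)
        constraints["le"] = max(values)

    # A column is unique when no value repeats.
    if values and len(distinct) == len(values):
        constraints["unique"] = True

    # Enumerate allowed values only for low-cardinality columns.
    if 0 < len(distinct) <= isin_limit:
        constraints["isin"] = distinct

    return constraints


print(summarize_column([1, 2, 3, 4]))
# {'ge': 1, 'le': 4, 'unique': True, 'isin': [1, 2, 3, 4]}
print(summarize_column([True, True, False, True]))
# {'isin': [True, False]}
```

The two calls mirror the customer_id and is_active columns from the Quick Start example.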

Advanced Usage

Pattern Detection

Pandera Forge automatically detects common patterns in string columns:

from pandas import DataFrame
from pandera_forge import ModelGenerator

df = DataFrame({
    "email": ["user@example.com", "admin@test.org"],
    "phone": ["+1234567890", "+0987654321"],
    "url": ["https://example.com", "https://test.org"]
})

generator = ModelGenerator()
model_code = generator.generate(df, detect_patterns=True)

Handling Messy Data

The generator handles problematic column names and mixed data types:

from pandas import DataFrame
from pandera_forge import ModelGenerator

df = DataFrame({
    "Column With Spaces": [1, 2, 3],
    "123_numeric_start": ["a", "b", "c"],
    "class": [True, False, True],  # Reserved keyword
    "!@#$%": [1.0, 2.0, 3.0]  # Special characters
})

generator = ModelGenerator()
model_code = generator.generate(df)  # Automatically sanitizes column names
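To give a feel for what sanitization involves, here is a minimal sketch using only the standard library. The helper `sanitize_column_name` and its `fallback` argument are hypothetical; the names Pandera Forge actually produces may differ.

```python
import keyword
import re


def sanitize_column_name(name, fallback="column"):
    """Turn an arbitrary column label into a valid Python identifier.

    Hypothetical helper for illustration; not part of pandera_forge.
    """
    # Replace every run of non-alphanumeric characters with one underscore.
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", str(name)).strip("_")
    if not cleaned:
        cleaned = fallback  # nothing usable was left, e.g. "!@#$%"
    if cleaned[0].isdigit():
        cleaned = f"col_{cleaned}"  # identifiers cannot start with a digit
    if keyword.iskeyword(cleaned):
        cleaned = f"{cleaned}_"  # avoid reserved words such as "class"
    return cleaned


for raw in ["Column With Spaces", "123_numeric_start", "class", "!@#$%"]:
    print(repr(raw), "->", sanitize_column_name(raw))
```

For the four columns above this yields Column_With_Spaces, col_123_numeric_start, class_, and column. In Pandera, a field whose attribute name differs from the real column label can point back to it with Field(alias=...); whether Pandera Forge emits aliases in exactly this way is not stated here.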

LLM Enrichment (Optional)

For enhanced pattern detection and documentation using OpenAI, Anthropic, or local LLMs via Ollama:

from pandas import DataFrame
from pandera_forge import ModelGenerator
df: DataFrame = ... # your DataFrame here

# Using OpenAI (default)
generator = ModelGenerator(llm_api_key="your-openai-api-key")
model_code = generator.generate(df, model_name="EnrichedModel")

# Using Anthropic
from pandera_forge.llm_enricher import LLMEnricher
enricher = LLMEnricher(provider="anthropic", api_key="your-anthropic-api-key")
generator = ModelGenerator(llm_enricher=enricher)
model_code = generator.generate(df, model_name="EnrichedModel")

# Using Ollama (local LLMs - no API key needed)
enricher = LLMEnricher(provider="ollama", model="llama3.2")
generator = ModelGenerator(llm_enricher=enricher)
model_code = generator.generate(df, model_name="EnrichedModel")
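The snippets above pass API keys as string literals for brevity. One common alternative is to read the key from an environment variable. The helper below is a hypothetical sketch, and the variable name `OPENAI_API_KEY` is only the conventional choice, not something Pandera Forge requires.

```python
import os


def read_api_key(var_name: str) -> str:
    """Return a non-empty API key from the environment, or raise.

    Hypothetical helper for illustration; not part of pandera_forge.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"environment variable {var_name} is not set")
    return key


# Example wiring (commented out because it needs a real key):
# generator = ModelGenerator(llm_api_key=read_api_key("OPENAI_API_KEY"))
```

Failing early with a clear message is usually easier to debug than an authentication error from the provider later on.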

Ollama Setup:

Install Ollama: https://ollama.ai
Start Ollama: ollama serve
Pull a model: ollama pull llama3.2
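Before constructing an Ollama-backed enricher, it can help to confirm that the server is running and the model has been pulled. The sketch below queries Ollama's HTTP API directly (GET /api/tags on the default port 11434) with the standard library. The function names are hypothetical, and this check is independent of Pandera Forge.

```python
import json
import urllib.request


def parse_model_names(payload: bytes) -> list:
    """Extract model names from the JSON body returned by GET /api/tags."""
    data = json.loads(payload)
    return [m["name"] for m in data.get("models", [])]


def list_ollama_models(base_url="http://localhost:11434", timeout=2.0):
    """Return locally available model names, or None if Ollama is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return parse_model_names(resp.read())
    except (OSError, ValueError):
        # OSError covers refused connections, timeouts, and HTTP errors;
        # ValueError covers a response body that is not valid JSON.
        return None


models = list_ollama_models()
if models is None:
    print("Ollama is not reachable; is `ollama serve` running?")
else:
    print("Available models:", models)
```

Ollama reports names with a tag suffix, so a pulled llama3.2 model typically appears as something like llama3.2:latest.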

Databricks and Spark Support

Pandera Forge now supports Apache Spark DataFrames and Databricks:

from pandera_forge import ModelGenerator

# For Spark DataFrames (assumes an active SparkSession bound to `spark`)
generator = ModelGenerator.create_for_spark()
spark_df = spark.table("my_table")
model_code = generator.generate(spark_df, model_name="MyModel")

# For Databricks with Unity Catalog
from pandera_forge.databricks import DatabricksGenerator

generator = DatabricksGenerator(
    host="https://your-workspace.cloud.databricks.com",
    token="your-token",
    catalog="main",
    schema="default"
)

# Generate from a table
model_code = generator.from_table("customers")

# Generate for all tables in a catalog
models = generator.generate_for_catalog()

For complete Databricks documentation, see docs/DATABRICKS.md.

API Reference

ModelGenerator

Main class for generating Pandera models. Supports pandas, Spark, and Databricks DataFrames.

ModelGenerator(
    llm_api_key: Optional[str] = None,
    llm_enricher: Optional[LLMEnricher] = None,
    backend: str = "pandas"  # "pandas", "spark", or "auto"
)

Factory Methods:

# Create generator for Spark DataFrames
ModelGenerator.create_for_spark(llm_api_key=None, sample_size=10000)

# Create generator for Databricks
ModelGenerator.create_for_databricks(
    host=None, token=None, cluster_id=None,
    catalog=None, schema=None, llm_api_key=None
)

Parameters:

llm_api_key: Optional API key for LLM enrichment features (OpenAI by default)
llm_enricher: Optional pre-configured LLMEnricher instance for custom LLM providers
backend: DataFrame backend to use: "pandas" (default), "spark", or "auto"

Methods:

generate()

generate(
    df: DataFrame,
    model_name: str = "DataFrameModel",
    validate: bool = True,
    include_examples: bool = True,
    detect_patterns: bool = True,
    source_file: Optional[Path] = None
) -> Optional[str]

Generates a Pandera DataFrameModel from a pandas DataFrame.

Parameters:

df: Source DataFrame to generate model from
model_name: Name for the generated model class
validate: Whether to validate the generated model against the source data
include_examples: Whether to include example values in comments
detect_patterns: Whether to detect patterns in string columns
source_file: Optional path to source file for implementation example

Returns:

Generated model code as string, or None if generation failed
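Because generate() returns either a string or None, callers should handle the failure case before using the result. A minimal sketch, assuming you want to save the generated source as a module; `save_model_code` is a hypothetical helper, not part of the library.

```python
import tempfile
from pathlib import Path
from typing import Optional


def save_model_code(model_code: Optional[str], path: Path) -> Path:
    """Write generated model source to `path`, failing loudly on None.

    Hypothetical helper; `model_code` stands for the value returned by
    `generator.generate(...)`.
    """
    if model_code is None:
        raise ValueError("model generation failed; nothing to write")
    compile(model_code, str(path), "exec")  # syntax check only, does not run it
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(model_code, encoding="utf-8")
    return path


# Stand-in string instead of real generator output, to keep this runnable.
with tempfile.TemporaryDirectory() as tmp:
    target = save_model_code("x = 1\n", Path(tmp) / "schemas" / "customer.py")
    print(target.name, "written")
```

The compile() call only checks that the text parses as Python; importing the saved module still requires pandera to be installed.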

PatternDetector

Detects patterns in string columns.

from pandas import Series

PatternDetector.detect_pattern(
    series: Series, 
    min_match_ratio: float = 0.9
)

Supported patterns:

Email addresses
URLs
Phone numbers (US)
UUIDs
IPv4 addresses
Dates (ISO format)
Credit card numbers
Hex colors
MAC addresses
And more...
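The min_match_ratio parameter suggests a threshold-based approach: a column is labelled with a pattern when a large enough share of its values match. The sketch below illustrates that idea with deliberately simplified regexes. It is an assumption about the general technique, not the library's implementation; `PATTERNS` and this standalone `detect_pattern` are hypothetical, and whether the real threshold is inclusive is not stated in this README.

```python
import re

# A small, illustrative subset of patterns with simplified regexes.
PATTERNS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "url": re.compile(r"https?://\S+"),
    "uuid": re.compile(
        r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-"
        r"[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
    ),
    "hex_color": re.compile(r"#(?:[0-9a-fA-F]{3}|[0-9a-fA-F]{6})"),
}


def detect_pattern(values, min_match_ratio=0.9):
    """Return the first pattern name matched by enough values, else None."""
    strings = [v for v in values if isinstance(v, str)]
    if not strings:
        return None
    for name, regex in PATTERNS.items():
        # fullmatch requires the whole value to fit the pattern.
        matches = sum(1 for s in strings if regex.fullmatch(s))
        if matches / len(strings) >= min_match_ratio:
            return name
    return None


print(detect_pattern(["user@example.com", "admin@test.org"]))  # email
print(detect_pattern(["https://example.com", "https://test.org"]))  # url
print(detect_pattern(["hello", "world"]))  # None
```

A ratio below 1.0 tolerates a few dirty values, at the cost of a pattern label that does not hold for every row.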

Use Cases

Data Contract Generation: Automatically generate data contracts from existing datasets
Data Quality Monitoring: Create schemas for validation in data pipelines
Documentation: Generate schema documentation for data teams
Testing: Create test fixtures with proper type constraints
Migration: Convert existing datasets to validated schemas

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

MIT License - see LICENSE file for details

Acknowledgments

Built on top of the excellent Pandera library for pandas validation.

Release history

1.1.0 (this version): Oct 31, 2025
1.0.0: Oct 5, 2025

Download files

Source Distribution

pandera_forge-1.1.0.tar.gz (40.0 kB), uploaded Oct 31, 2025, Source

Built Distribution

pandera_forge-1.1.0-py3-none-any.whl (40.0 kB), uploaded Oct 31, 2025, Python 3

File details

Details for the file pandera_forge-1.1.0.tar.gz.

File metadata

Download URL: pandera_forge-1.1.0.tar.gz
Upload date: Oct 31, 2025
Size: 40.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.14

File hashes

Hashes for pandera_forge-1.1.0.tar.gz
Algorithm	Hash digest
SHA256	`33319b8a5353f287b5f9eac9c5f6d00330a586e6cb1b98471aef4eed926535a7`
MD5	`aca9fee50e9ad86e13386b7ad9b39148`
BLAKE2b-256	`aa47aaf3e1abfaa6e92d8b9fa780c5a3ea77021a25ca83d5334de9101c2ad9c9`

File details

Details for the file pandera_forge-1.1.0-py3-none-any.whl.

File metadata

Download URL: pandera_forge-1.1.0-py3-none-any.whl
Upload date: Oct 31, 2025
Size: 40.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.14

File hashes

Hashes for pandera_forge-1.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`6a5a5ee9f2c382082bab8fdb7263128fcd9a0101e3bd5f72eb517190a9444b30`
MD5	`5699d5b8f6ed84fa7321d2f1bc1dd7f1`
BLAKE2b-256	`0feb80e972524a460c290bda9483905501d3b50b2eac23bae8a5f468852560db`
