❄️ Articuno ❄️

Convert Polars, Pandas, PySpark DataFrames, SQLAlchemy models, or SQLModel classes to Pydantic models with schema inference — and generate clean Python class code. Also supports bidirectional conversion from Pydantic to these formats.


✨ Features

Core Functionality:

  • 🔍 Infer Pydantic models dynamically from Polars, Pandas, or PySpark DataFrames
  • 🗄️ Convert SQLAlchemy and SQLModel model classes to/from Pydantic
  • 📋 Infer models directly from iterables of dictionaries (SQL results, JSON records, etc.)
  • 🎯 Automatic type detection for basic types, nested structures, and temporal data
  • 🔄 Generator-based for memory-efficient processing of large datasets
  • 🔁 Bidirectional conversions between Pydantic and supported backends
  • 🎨 Generate clean Python model code using datamodel-code-generator

Advanced Features:

  • PyArrow support for high-performance Pandas columns (int64[pyarrow], string[pyarrow], timestamp[pyarrow], etc.)
  • 📅 Full temporal type support: datetime, date, timedelta across all backends
  • 🗂️ Nested structures: Supports nested dicts, lists, and complex hierarchies
  • 🔧 Optional field detection: Automatically identifies nullable fields
  • 🎛️ Configurable scanning: max_scan parameter to limit schema inference
  • 🔒 Force optional mode: Make all fields optional regardless of data
  • Model name validation: Ensures valid Python identifiers
  • 🧪 Comprehensively tested: 112 tests, 87% code coverage
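
The model-name validation mentioned above can be approximated with the standard library alone: a valid class name must pass `str.isidentifier()` and must not be a Python keyword. This is an illustrative sketch, not Articuno's actual implementation:

```python
import keyword

def validate_model_name(name: str) -> str:
    """Raise ValueError unless `name` can be used as a Python class name."""
    if not name.isidentifier() or keyword.iskeyword(name):
        raise ValueError(f"{name!r} is not a valid Python identifier")
    return name

validate_model_name("UserModel")  # passes
```

A name like `"2bad"` or `"class"` would raise `ValueError` before any model is built.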

Design:

  • 🪶 Lightweight, dependency-flexible architecture
  • 🔌 Optional dependencies for Polars, Pandas, PyArrow, SQLAlchemy, SQLModel, and PySpark
  • 🎯 Type-checked with mypy
  • 📏 Linted with ruff

📦 Installation

Install the core package:

pip install articuno

Add optional dependencies as needed:

# For Polars support
pip install articuno[polars]

# For Pandas support (with PyArrow)
pip install articuno[pandas]

# For SQLAlchemy support
pip install articuno[sqlalchemy]

# For SQLModel support
pip install articuno[sqlmodel]

# For PySpark support
pip install articuno[pyspark]

# Full install with all backends
pip install articuno[polars,pandas,sqlalchemy,sqlmodel,pyspark]

# Development dependencies (includes pytest, mypy, ruff)
pip install articuno[dev]

🚀 Quick Start

DataFrame to Pydantic Models

from articuno import df_to_pydantic, infer_pydantic_model
import polars as pl

# Create a DataFrame
df = pl.DataFrame({
    "id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
    "score": [95.5, 88.0, 92.3],
    "active": [True, False, True]
})

# Convert to Pydantic instances (returns a generator)
instances = list(df_to_pydantic(df, model_name="UserModel"))
print(instances[0])
# Output: id=1 name='Alice' score=95.5 active=True

# Or just get the model class
Model = infer_pydantic_model(df, model_name="UserModel")
print(Model.model_json_schema())

Dict Iterables to Pydantic

Perfect for SQL query results, API responses, or JSON data:

from articuno import df_to_pydantic, infer_pydantic_model

# From database results, API responses, etc.
records = [
    {"id": 1, "name": "Alice", "email": "alice@example.com"},
    {"id": 2, "name": "Bob", "email": "bob@example.com"},
]

# Automatically infer and create instances
instances = list(df_to_pydantic(records, model_name="User"))

# Or infer just the model
Model = infer_pydantic_model(records, model_name="User")

📓 Example Notebooks

Comprehensive Jupyter notebooks demonstrating all features:

Core Examples

Advanced Examples

Legacy Examples

Note: All example notebooks have been executed and saved with outputs, so you can view the results directly on GitHub without running them.


📚 Advanced Usage

Temporal Types Support

Articuno fully supports datetime, date, and timedelta types:

from datetime import datetime, date, timedelta
from articuno import infer_pydantic_model

data = [
    {
        "event_id": 1,
        "event_date": date(2024, 1, 15),
        "timestamp": datetime(2024, 1, 15, 10, 30),
        "duration": timedelta(hours=2, minutes=30)
    }
]

Model = infer_pydantic_model(data, model_name="Event")
# Fields will have correct datetime.date, datetime.datetime, datetime.timedelta types

PyArrow-Backed Pandas Columns

Full support for high-performance PyArrow dtypes:

import pandas as pd
import pyarrow as pa
from datetime import datetime
from articuno import infer_pydantic_model

df = pd.DataFrame({
    "id": pd.Series([1, 2, 3], dtype="int64[pyarrow]"),
    "name": pd.Series(["Alice", "Bob", "Charlie"], dtype="string[pyarrow]"),
    "created": pd.Series([
        datetime(2024, 1, 1),
        datetime(2024, 1, 2),
        datetime(2024, 1, 3)
    ], dtype=pd.ArrowDtype(pa.timestamp("ms"))),
    "active": pd.Series([True, False, True], dtype="bool[pyarrow]")
})

Model = infer_pydantic_model(df, model_name="ArrowModel")

Supported PyArrow types:

  • int64[pyarrow], int32[pyarrow], etc.
  • string[pyarrow]
  • bool[pyarrow]
  • timestamp[pyarrow] → datetime.datetime
  • date32[pyarrow], date64[pyarrow] → datetime.date
  • duration[pyarrow] → datetime.timedelta

Nested Structures

Handle complex nested data with ease:

from articuno import infer_pydantic_model

data = [
    {
        "user_id": 1,
        "profile": {
            "name": "Alice",
            "age": 30,
            "preferences": {
                "theme": "dark",
                "notifications": True
            }
        },
        "tags": ["python", "data-science"]
    }
]

Model = infer_pydantic_model(data, model_name="UserProfile")
# Nested dicts become nested Pydantic models
# Lists are preserved with List[...] typing
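
The recursive idea behind nested inference can be sketched with the standard library only: dict values recurse into sub-schemas, list values record their element type. This hypothetical helper only illustrates the shape of the technique, not Articuno's internals:

```python
def infer_schema(record: dict) -> dict:
    """Map each field to a type name; nested dicts recurse into sub-schemas."""
    schema = {}
    for key, value in record.items():
        if isinstance(value, dict):
            schema[key] = infer_schema(value)  # becomes a nested model
        elif isinstance(value, list):
            inner = type(value[0]).__name__ if value else "Any"
            schema[key] = f"List[{inner}]"
        else:
            schema[key] = type(value).__name__
    return schema

schema = infer_schema({
    "user_id": 1,
    "profile": {"name": "Alice", "age": 30},
    "tags": ["python"],
})
# schema["profile"] is itself a sub-schema: {"name": "str", "age": "int"}
```

In Articuno the sub-schemas become real nested `BaseModel` classes rather than plain dicts.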

Force Optional Fields

Make all fields optional regardless of the data:

import polars as pl
from articuno import infer_pydantic_model

df = pl.DataFrame({
    "required": [1, 2, 3],
    "also_required": ["a", "b", "c"]
})

# Force all fields to be Optional
Model = infer_pydantic_model(df, force_optional=True)

# Now you can create instances with None values
instance = Model(required=None, also_required=None)

Limit Schema Scanning

For large datasets, limit how many records are scanned:

# Only scan first 100 records for schema inference
Model = infer_pydantic_model(
    large_dataset,
    model_name="LargeModel",
    max_scan=100
)

Memory-Efficient Processing

df_to_pydantic returns a generator for memory efficiency:

# Generator - memory efficient for large datasets
instances_gen = df_to_pydantic(large_df, model_name="Record")

# Process one at a time
for instance in instances_gen:
    process(instance)

# Or collect all at once if needed
instances_list = list(df_to_pydantic(df, model_name="Record"))

Code Generation

Generate clean Python code for your models:

from articuno import infer_pydantic_model
from articuno.codegen import generate_class_code

# Infer model from data
Model = infer_pydantic_model(data, model_name="User")

# Generate Python code
code = generate_class_code(Model)
print(code)

# Or save to file
code = generate_class_code(Model, output_path="models.py")

Pre-defined Models

Use a pre-defined model instead of inferring:

from pydantic import BaseModel
from articuno import df_to_pydantic

class UserModel(BaseModel):
    id: int
    name: str
    email: str

# Use your existing model
instances = list(df_to_pydantic(df, model=UserModel))

SQLAlchemy Model Conversion

Convert between SQLAlchemy declarative models and Pydantic:

from sqlalchemy import Column, Integer, String, Float
from sqlalchemy.orm import DeclarativeBase
from articuno import infer_pydantic_model, pydantic_to_sqlalchemy

class Base(DeclarativeBase):
    pass

class User(Base):
    __tablename__ = "users"
    id = Column(Integer, primary_key=True)
    name = Column(String(255), nullable=False)
    email = Column(String(255), nullable=True)

# Convert SQLAlchemy model to Pydantic
PydanticModel = infer_pydantic_model(User, model_name="UserModel")

# Convert Pydantic model to SQLAlchemy
from pydantic import BaseModel as PydanticBase

class ProductModel(PydanticBase):
    id: int
    name: str
    price: float

SQLAlchemyModel = pydantic_to_sqlalchemy(ProductModel, model_name="Product")

SQLModel Conversion

Convert between SQLModel and Pydantic (SQLModel already extends Pydantic):

from sqlmodel import SQLModel, Field
from articuno import infer_pydantic_model, pydantic_to_sqlmodel

class User(SQLModel, table=True):
    __tablename__ = "users"
    id: int | None = Field(default=None, primary_key=True)
    name: str
    email: str | None = None

# Convert SQLModel to Pydantic
PydanticModel = infer_pydantic_model(User, model_name="UserModel")

# Convert Pydantic to SQLModel
from pydantic import BaseModel

class ProductModel(BaseModel):
    id: int | None = None
    name: str
    price: float

SQLModelClass = pydantic_to_sqlmodel(ProductModel, model_name="Product")

PySpark DataFrame Conversion

Convert between PySpark DataFrames and Pydantic:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, LongType, StringType
from articuno import infer_pydantic_model, df_to_pydantic, pydantic_to_pyspark

spark = SparkSession.builder.appName("articuno").getOrCreate()

# Create PySpark DataFrame
data = [(1, "Alice"), (2, "Bob")]
schema = StructType([
    StructField("id", LongType(), False),
    StructField("name", StringType(), False),
])
df = spark.createDataFrame(data, schema=schema)

# Convert PySpark DataFrame to Pydantic
Model = infer_pydantic_model(df, model_name="UserModel")
instances = list(df_to_pydantic(df, model=Model))

# Convert Pydantic instances to PySpark DataFrame
from pydantic import BaseModel

class UserModel(BaseModel):
    id: int
    name: str

instances = [UserModel(id=1, name="Alice"), UserModel(id=2, name="Bob")]
df = pydantic_to_pyspark(instances, model=UserModel)

⚙️ Supported Type Mappings

| Polars Type | Pandas Type (incl. PyArrow) | Dict/Iterable | SQLAlchemy Type | SQLModel Type | PySpark Type | Pydantic Type |
| --- | --- | --- | --- | --- | --- | --- |
| pl.Int*, pl.UInt* | int64, Int64, int64[pyarrow] | int | Integer, BigInteger | int | IntegerType, LongType | int |
| pl.Float* | float64, float64[pyarrow] | float | Float, Numeric | float | FloatType, DoubleType | float |
| pl.Utf8, pl.String | object, string[pyarrow] | str | String, Text | str | StringType | str |
| pl.Boolean | bool, bool[pyarrow] | bool | Boolean | bool | BooleanType | bool |
| pl.Date | datetime64[ns], date[pyarrow] | date | Date | date | DateType | datetime.date |
| pl.Datetime | datetime64[ns], timestamp[pyarrow] | datetime | DateTime | datetime | TimestampType | datetime.datetime |
| pl.Duration | timedelta64[ns], duration[pyarrow] | timedelta | - | - | - | datetime.timedelta |
| pl.List | list | list | - | List[...] | ArrayType | List[...] |
| pl.Struct | dict (nested) | dict (nested) | - | - | StructType | Nested BaseModel |
| pl.Null | None, NaN | None | nullable=True | Optional[...] | nullable=True | Optional[...] |
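
The last row's nullable-to-`Optional[...]` mapping can be sketched in plain Python: scan the records, and any field that is ever missing or `None` is marked `Optional`. This is an illustrative approximation, not Articuno's actual inference code:

```python
from typing import Optional, get_args

def infer_field_types(records: list) -> dict:
    """Return field -> type; fields that are ever None or absent become Optional."""
    types = {}
    nullable = set()
    keys = {k for r in records for k in r}
    for r in records:
        for key in keys:
            value = r.get(key)
            if value is None:
                nullable.add(key)  # seen as missing or null at least once
            else:
                types[key] = type(value)
    return {k: (Optional[types[k]] if k in nullable else types[k]) for k in keys}

fields = infer_field_types([
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
])
# fields["id"] is int; fields["email"] is Optional[str]
```

The same scan is what makes the force_optional flag cheap: it simply marks every field nullable regardless of what was observed.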

🎯 Real-World Examples

API Response Processing

from datetime import datetime
from articuno import infer_pydantic_model, df_to_pydantic

# Process API responses
api_data = [
    {
        "status": "success",
        "data": {
            "user_id": 123,
            "username": "alice",
            "created_at": datetime(2024, 1, 15, 10, 30)
        }
    }
]

APIResponse = infer_pydantic_model(api_data, model_name="APIResponse")
instances = list(df_to_pydantic(api_data, model=APIResponse))

SQL Query Results

import sqlite3
from articuno import infer_pydantic_model, df_to_pydantic

# Get results from database
conn = sqlite3.connect("database.db")
conn.row_factory = sqlite3.Row
cursor = conn.execute("SELECT * FROM users")
rows = [dict(row) for row in cursor.fetchall()]

# Convert to Pydantic
UserModel = infer_pydantic_model(rows, model_name="User")
users = list(df_to_pydantic(rows, model=UserModel))

E-commerce Order Processing

from datetime import datetime
from articuno import infer_pydantic_model

orders = [
    {
        "order_id": 1001,
        "customer": {"id": 501, "name": "Alice", "email": "alice@example.com"},
        "items": [
            {"product": "Laptop", "quantity": 1, "price": 999.99},
            {"product": "Mouse", "quantity": 2, "price": 29.99}
        ],
        "total": 1059.97,
        "created_at": datetime(2024, 1, 15, 10, 30)
    }
]

Order = infer_pydantic_model(orders, model_name="Order")

🧪 Testing & Quality

Articuno is thoroughly tested and type-checked:

# Run tests
pytest

# Run with coverage
pytest --cov=articuno --cov-report=term-missing

# Type checking
mypy articuno

# Linting
ruff check .

Test Statistics:

  • 112 comprehensive tests
  • 87% code coverage
  • All tests passing ✅
  • Type-checked with mypy ✅
  • Linted with ruff ✅

🔧 Development

# Clone the repository
git clone https://github.com/eddiethedean/articuno.git
cd articuno

# Install with dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run type checking
mypy articuno

# Run linting
ruff check .

💡 Tips & Best Practices

  1. Use generators for large datasets: df_to_pydantic returns a generator by default for memory efficiency
  2. Limit scanning for performance: Use max_scan parameter when dealing with huge datasets
  3. Validate model names: Articuno automatically validates that model names are valid Python identifiers
  4. Handle optional fields: Use force_optional=True when working with sparse data
  5. Type precedence: Articuno correctly handles bool vs int (bool is checked first)

🐛 Troubleshooting

Import Errors

If you get import errors for polars or pandas:

pip install articuno[polars]  # or [pandas]

PyArrow Issues

For PyArrow support:

pip install pyarrow

Generator Indexing

df_to_pydantic returns a generator. Convert to list if you need indexing:

instances = list(df_to_pydantic(df, model_name="Model"))
print(instances[0])  # Now you can index

📖 API Reference

Main Functions

infer_pydantic_model(source, model_name="AutoModel", force_optional=False, max_scan=1000)

Infer a Pydantic model class from a DataFrame or dict iterable.

Parameters:

  • source: Pandas DataFrame, Polars DataFrame, or iterable of dicts
  • model_name: Name for the generated model (must be valid Python identifier)
  • force_optional: Make all fields optional
  • max_scan: Max records to scan for dict iterables

Returns: Type[BaseModel]
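
Conceptually, `max_scan` caps how much of the input is materialized before inference, much like slicing the iterable first. A rough stdlib sketch (hypothetical helper, not Articuno's internals):

```python
from itertools import islice
from typing import Iterable

def sample_records(source: Iterable[dict], max_scan: int = 1000) -> list:
    """Materialize at most `max_scan` records for schema inference."""
    return list(islice(source, max_scan))

def big_source():
    # Simulates a huge lazy source (e.g. a streaming SQL cursor)
    for i in range(10_000_000):
        yield {"id": i}

sample = sample_records(big_source(), max_scan=100)  # only 100 rows pulled
```

Because `islice` is lazy, the remaining millions of rows are never produced.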

df_to_pydantic(source, model=None, model_name=None, force_optional=False, max_scan=1000)

Convert DataFrame or dict iterable to Pydantic instances.

Parameters:

  • source: Pandas DataFrame, Polars DataFrame, or iterable of dicts
  • model: Optional pre-defined model to use
  • model_name: Name for inferred model if model is None
  • force_optional: Make all fields optional
  • max_scan: Max records to scan for dict iterables

Returns: Generator[BaseModel, None, None]
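
The generator contract can be sketched as a lazy map over rows; here a plain class stands in for a Pydantic model, and the real function additionally handles DataFrames and model inference:

```python
from typing import Iterable, Iterator

def rows_to_instances(rows: Iterable[dict], model) -> Iterator:
    """Yield one model instance per row without materializing the dataset."""
    for row in rows:
        yield model(**row)

class Record:
    """Stand-in for a Pydantic model in this sketch."""
    def __init__(self, id: int, name: str):
        self.id, self.name = id, name

gen = rows_to_instances([{"id": 1, "name": "a"}], Record)
first = next(gen)  # rows are converted one at a time, on demand
```

This is why indexing `df_to_pydantic(...)[0]` fails: wrap the result in `list(...)` first if you need random access.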

generate_class_code(model, output_path=None, model_name=None)

Generate Python code from a Pydantic model.

Parameters:

  • model: Pydantic model class
  • output_path: Optional file path to write code to
  • model_name: Optional name override

Returns: str (the generated code)


📄 License

MIT © Odos Matthews


📝 Changelog

v0.9.0

  • ✨ Added full datetime/date/timedelta support across all backends
  • ✨ Added PyArrow temporal type support (timestamp, date, duration)
  • ✨ Added model name validation (ensures valid Python identifiers)
  • 🐛 Fixed bool vs int type precedence
  • 🐛 Fixed DataFrame vs iterable detection order
  • 🐛 Fixed temporary directory cleanup in code generation
  • 🐛 Added defensive checks for empty samples
  • 📝 Improved documentation with generator behavior notes
  • 🧪 Added comprehensive test suite (112 tests, 87% coverage)
  • 🔍 Full mypy type checking
  • 📏 Ruff linting compliance
  • 📓 Added 9 comprehensive example notebooks with outputs
  • 📚 Enhanced README with complete guide and examples
