❄️ Articuno ❄️

Convert Polars or Pandas DataFrames to Pydantic models with schema inference — and generate clean Python class code.



✨ Features

Core Functionality:

  • 🔍 Infer Pydantic models dynamically from Polars or Pandas DataFrames
  • 📋 Infer models directly from iterables of dictionaries (SQL results, JSON records, etc.)
  • 🎯 Automatic type detection for basic types, nested structures, and temporal data
  • 🔄 Generator-based for memory-efficient processing of large datasets
  • 🎨 Generate clean Python model code using datamodel-code-generator

Advanced Features:

  • ⚡ PyArrow support for high-performance Pandas columns (int64[pyarrow], string[pyarrow], timestamp[pyarrow], etc.)
  • 📅 Full temporal type support: datetime, date, timedelta across all backends
  • 🗂️ Nested structures: Supports nested dicts, lists, and complex hierarchies
  • 🔧 Optional field detection: Automatically identifies nullable fields
  • 🎛️ Configurable scanning: max_scan parameter to limit schema inference
  • 🔒 Force optional mode: Make all fields optional regardless of data
  • ✅ Model name validation: Ensures valid Python identifiers
  • 🧪 Comprehensively tested: 112 tests, 87% code coverage

Design:

  • 🪶 Lightweight, dependency-flexible architecture
  • 🔌 Optional dependencies for Polars, Pandas, and PyArrow
  • 🎯 Type-checked with mypy
  • 📏 Linted with ruff

📦 Installation

Install the core package:

pip install articuno

Add optional dependencies as needed:

# For Polars support
pip install articuno[polars]

# For Pandas support (with PyArrow)
pip install articuno[pandas]

# Full install with all backends
pip install articuno[polars,pandas]

# Development dependencies (includes pytest, mypy, ruff)
pip install articuno[dev]

🚀 Quick Start

DataFrame to Pydantic Models

from articuno import df_to_pydantic, infer_pydantic_model
import polars as pl

# Create a DataFrame
df = pl.DataFrame({
    "id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
    "score": [95.5, 88.0, 92.3],
    "active": [True, False, True]
})

# Convert to Pydantic instances (returns a generator)
instances = list(df_to_pydantic(df, model_name="UserModel"))
print(instances[0])
# Output: id=1 name='Alice' score=95.5 active=True

# Or just get the model class
Model = infer_pydantic_model(df, model_name="UserModel")
print(Model.model_json_schema())

Dict Iterables to Pydantic

Perfect for SQL query results, API responses, or JSON data:

from articuno import df_to_pydantic, infer_pydantic_model

# From database results, API responses, etc.
records = [
    {"id": 1, "name": "Alice", "email": "alice@example.com"},
    {"id": 2, "name": "Bob", "email": "bob@example.com"},
]

# Automatically infer and create instances
instances = list(df_to_pydantic(records, model_name="User"))

# Or infer just the model
Model = infer_pydantic_model(records, model_name="User")

📓 Example Notebooks

Comprehensive Jupyter notebooks demonstrate all features, grouped into core, advanced, and legacy examples.

Note: All example notebooks have been executed and saved with outputs, so you can view the results directly on GitHub without running them.


📚 Advanced Usage

Temporal Types Support

Articuno fully supports datetime, date, and timedelta types:

from datetime import datetime, date, timedelta
from articuno import infer_pydantic_model

data = [
    {
        "event_id": 1,
        "event_date": date(2024, 1, 15),
        "timestamp": datetime(2024, 1, 15, 10, 30),
        "duration": timedelta(hours=2, minutes=30)
    }
]

Model = infer_pydantic_model(data, model_name="Event")
# Fields will have correct datetime.date, datetime.datetime, datetime.timedelta types

PyArrow-Backed Pandas Columns

Full support for high-performance PyArrow dtypes:

from datetime import datetime

import pandas as pd
import pyarrow as pa
from articuno import infer_pydantic_model

df = pd.DataFrame({
    "id": pd.Series([1, 2, 3], dtype="int64[pyarrow]"),
    "name": pd.Series(["Alice", "Bob", "Charlie"], dtype="string[pyarrow]"),
    "created": pd.Series([
        datetime(2024, 1, 1),
        datetime(2024, 1, 2),
        datetime(2024, 1, 3)
    ], dtype=pd.ArrowDtype(pa.timestamp("ms"))),
    "active": pd.Series([True, False, True], dtype="bool[pyarrow]")
})

Model = infer_pydantic_model(df, model_name="ArrowModel")

Supported PyArrow types:

  • int64[pyarrow], int32[pyarrow], etc.
  • string[pyarrow]
  • bool[pyarrow]
  • timestamp[pyarrow] → datetime.datetime
  • date32[pyarrow], date64[pyarrow] → datetime.date
  • duration[pyarrow] → datetime.timedelta

Nested Structures

Handle complex nested data with ease:

from articuno import infer_pydantic_model

data = [
    {
        "user_id": 1,
        "profile": {
            "name": "Alice",
            "age": 30,
            "preferences": {
                "theme": "dark",
                "notifications": True
            }
        },
        "tags": ["python", "data-science"]
    }
]

Model = infer_pydantic_model(data, model_name="UserProfile")
# Nested dicts become nested Pydantic models
# Lists are preserved with List[...] typing

Force Optional Fields

Make all fields optional regardless of the data:

import polars as pl
from articuno import infer_pydantic_model

df = pl.DataFrame({
    "required": [1, 2, 3],
    "also_required": ["a", "b", "c"]
})

# Force all fields to be Optional
Model = infer_pydantic_model(df, force_optional=True)

# Now you can create instances with None values
instance = Model(required=None, also_required=None)

Limit Schema Scanning

For large datasets, limit how many records are scanned:

# Only scan first 100 records for schema inference
Model = infer_pydantic_model(
    large_dataset,
    model_name="LargeModel",
    max_scan=100
)
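Under the hood, a max_scan-style limit can be implemented by materializing only the first N records before unioning their keys. The sketch below is a hypothetical illustration using only the standard library; scan_sample and fields_from_sample are made-up helpers, not Articuno's API.

```python
from itertools import islice

def scan_sample(records, max_scan=100):
    """Materialize at most max_scan records for schema inference."""
    return list(islice(iter(records), max_scan))

def fields_from_sample(sample):
    """Union the keys seen across the sampled records, noting each value's type."""
    fields = {}
    for rec in sample:
        for key, value in rec.items():
            fields.setdefault(key, type(value).__name__)
    return fields

# A large lazy source: only the first 100 records are ever pulled.
records = ({"id": i, "name": f"row{i}"} for i in range(1_000_000))
sample = scan_sample(records, max_scan=100)
print(len(sample))                  # 100
print(fields_from_sample(sample))   # {'id': 'int', 'name': 'str'}
```

Because islice consumes lazily, the remaining records are never loaded, which is what makes the limit useful on huge iterables.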

Memory-Efficient Processing

df_to_pydantic returns a generator for memory efficiency:

# Generator - memory efficient for large datasets
instances_gen = df_to_pydantic(large_df, model_name="Record")

# Process one at a time
for instance in instances_gen:
    process(instance)

# Or collect all at once if needed
instances_list = list(df_to_pydantic(df, model_name="Record"))

Code Generation

Generate clean Python code for your models:

from articuno import infer_pydantic_model
from articuno.codegen import generate_class_code

# Infer model from data
Model = infer_pydantic_model(data, model_name="User")

# Generate Python code
code = generate_class_code(Model)
print(code)

# Or save to file
code = generate_class_code(Model, output_path="models.py")

Pre-defined Models

Use a pre-defined model instead of inferring:

from pydantic import BaseModel
from articuno import df_to_pydantic

class UserModel(BaseModel):
    id: int
    name: str
    email: str

# Use your existing model
instances = list(df_to_pydantic(df, model=UserModel))

⚙️ Supported Type Mappings

Polars Type        | Pandas Type (incl. PyArrow)        | Dict/Iterable | Pydantic Type
-------------------|------------------------------------|---------------|-------------------
pl.Int*, pl.UInt*  | int64, Int64, int64[pyarrow]       | int           | int
pl.Float*          | float64, float64[pyarrow]          | float         | float
pl.Utf8, pl.String | object, string[pyarrow]            | str           | str
pl.Boolean         | bool, bool[pyarrow]                | bool          | bool
pl.Date            | datetime64[ns], date[pyarrow]      | date          | datetime.date
pl.Datetime        | datetime64[ns], timestamp[pyarrow] | datetime      | datetime.datetime
pl.Duration        | timedelta64[ns], duration[pyarrow] | timedelta     | datetime.timedelta
pl.List            | list                               | list          | List[...]
pl.Struct          | dict (nested)                      | dict (nested) | Nested BaseModel
pl.Null            | None, NaN                          | None          | Optional[...]
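For the dict/iterable column, the mapping above can be sketched as a chain of isinstance checks. This is an illustration only, not Articuno's actual implementation; infer_annotation is a hypothetical helper. Note the ordering constraints: bool before int (bool subclasses int) and datetime before date (datetime subclasses date).

```python
from datetime import date, datetime, timedelta

def infer_annotation(value):
    """Map a sample dict value to a Pydantic-style annotation string."""
    if value is None:
        return "Optional[...]"
    if isinstance(value, bool):        # must precede int: bool subclasses int
        return "bool"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "str"
    if isinstance(value, datetime):    # must precede date: datetime subclasses date
        return "datetime.datetime"
    if isinstance(value, date):
        return "datetime.date"
    if isinstance(value, timedelta):
        return "datetime.timedelta"
    if isinstance(value, dict):
        return "nested BaseModel"
    if isinstance(value, list):
        return "List[...]"
    raise TypeError(f"unsupported sample value: {type(value)!r}")

print(infer_annotation(True))                # bool
print(infer_annotation(42))                  # int
print(infer_annotation(date(2024, 1, 15)))   # datetime.date
```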

🎯 Real-World Examples

API Response Processing

from datetime import datetime
from articuno import df_to_pydantic, infer_pydantic_model

# Process API responses
api_data = [
    {
        "status": "success",
        "data": {
            "user_id": 123,
            "username": "alice",
            "created_at": datetime(2024, 1, 15, 10, 30)
        }
    }
]

APIResponse = infer_pydantic_model(api_data, model_name="APIResponse")
instances = list(df_to_pydantic(api_data, model=APIResponse))

SQL Query Results

import sqlite3
from articuno import infer_pydantic_model, df_to_pydantic

# Get results from database
conn = sqlite3.connect("database.db")
conn.row_factory = sqlite3.Row
cursor = conn.execute("SELECT * FROM users")
rows = [dict(row) for row in cursor.fetchall()]

# Convert to Pydantic
UserModel = infer_pydantic_model(rows, model_name="User")
users = list(df_to_pydantic(rows, model=UserModel))
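The example above assumes an existing database.db file. The row_factory-to-dicts step can be tried self-contained with an in-memory database, using only the standard library (the Articuno calls are omitted here):

```python
import sqlite3

# In-memory database so the snippet runs without an existing file.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows become mapping-like objects

conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO users (id, name) VALUES (?, ?)",
    [(1, "Alice"), (2, "Bob")],
)

# sqlite3.Row supports dict() conversion, giving the iterable-of-dicts
# shape that infer_pydantic_model and df_to_pydantic accept.
rows = [dict(row) for row in conn.execute("SELECT * FROM users")]
print(rows)  # [{'id': 1, 'name': 'Alice'}, {'id': 2, 'name': 'Bob'}]
```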

E-commerce Order Processing

from datetime import datetime
from articuno import infer_pydantic_model

orders = [
    {
        "order_id": 1001,
        "customer": {"id": 501, "name": "Alice", "email": "alice@example.com"},
        "items": [
            {"product": "Laptop", "quantity": 1, "price": 999.99},
            {"product": "Mouse", "quantity": 2, "price": 29.99}
        ],
        "total": 1059.97,
        "created_at": datetime(2024, 1, 15, 10, 30)
    }
]

Order = infer_pydantic_model(orders, model_name="Order")

🧪 Testing & Quality

Articuno is thoroughly tested and type-checked:

# Run tests
pytest

# Run with coverage
pytest --cov=articuno --cov-report=term-missing

# Type checking
mypy articuno

# Linting
ruff check .

Test Statistics:

  • 112 comprehensive tests
  • 87% code coverage
  • All tests passing ✅
  • Type-checked with mypy ✅
  • Linted with ruff ✅

🔧 Development

# Clone the repository
git clone https://github.com/eddiethedean/articuno.git
cd articuno

# Install with dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run type checking
mypy articuno

# Run linting
ruff check .

💡 Tips & Best Practices

  1. Use generators for large datasets: df_to_pydantic returns a generator by default for memory efficiency
  2. Limit scanning for performance: Use max_scan parameter when dealing with huge datasets
  3. Validate model names: Articuno automatically validates that model names are valid Python identifiers
  4. Handle optional fields: Use force_optional=True when working with sparse data
  5. Type precedence: Articuno correctly handles bool vs int (bool is checked first)
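Two of the tips above rest on standard-library facts that are easy to verify directly: bool is a subclass of int (so a naive int check would swallow booleans), and str.isidentifier() gives a quick identifier check (Articuno's own validation may differ in detail):

```python
# Why bool must be checked before int:
print(isinstance(True, int))   # True  (bool subclasses int)
print(type(True) is int)       # False (its concrete type is bool)

# A stdlib check for valid Python identifiers:
print("UserModel".isidentifier())   # True
print("2BadName".isidentifier())    # False
```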

🐛 Troubleshooting

Import Errors

If you get import errors for polars or pandas:

pip install articuno[polars]  # or [pandas]

PyArrow Issues

For PyArrow support:

pip install pyarrow

Generator Indexing

df_to_pydantic returns a generator. Convert to list if you need indexing:

instances = list(df_to_pydantic(df, model_name="Model"))
print(instances[0])  # Now you can index

📖 API Reference

Main Functions

infer_pydantic_model(source, model_name="AutoModel", force_optional=False, max_scan=1000)

Infer a Pydantic model class from a DataFrame or dict iterable.

Parameters:

  • source: Pandas DataFrame, Polars DataFrame, or iterable of dicts
  • model_name: Name for the generated model (must be valid Python identifier)
  • force_optional: Make all fields optional
  • max_scan: Max records to scan for dict iterables

Returns: Type[BaseModel]

df_to_pydantic(source, model=None, model_name=None, force_optional=False, max_scan=1000)

Convert DataFrame or dict iterable to Pydantic instances.

Parameters:

  • source: Pandas DataFrame, Polars DataFrame, or iterable of dicts
  • model: Optional pre-defined model to use
  • model_name: Name for inferred model if model is None
  • force_optional: Make all fields optional
  • max_scan: Max records to scan for dict iterables

Returns: Generator[BaseModel, None, None]

generate_class_code(model, output_path=None, model_name=None)

Generate Python code from a Pydantic model.

Parameters:

  • model: Pydantic model class
  • output_path: Optional file path to write code to
  • model_name: Optional name override

Returns: str (the generated code)


📄 License

MIT © Odos Matthews


📝 Changelog

v0.9.0

  • ✨ Added full datetime/date/timedelta support across all backends
  • ✨ Added PyArrow temporal type support (timestamp, date, duration)
  • ✨ Added model name validation (ensures valid Python identifiers)
  • 🐛 Fixed bool vs int type precedence
  • 🐛 Fixed DataFrame vs iterable detection order
  • 🐛 Fixed temporary directory cleanup in code generation
  • 🐛 Added defensive checks for empty samples
  • 📝 Improved documentation with generator behavior notes
  • 🧪 Added comprehensive test suite (112 tests, 87% coverage)
  • 🔍 Full mypy type checking
  • 📏 Ruff linting compliance
  • 📓 Added 9 comprehensive example notebooks with outputs
  • 📚 Enhanced README with complete guide and examples



Download files

Download the file for your platform.

Source Distribution

articuno-0.9.0.tar.gz (28.5 kB)

Uploaded Source

Built Distribution


articuno-0.9.0-py3-none-any.whl (15.6 kB)

Uploaded Python 3

File details

Details for the file articuno-0.9.0.tar.gz.

File metadata

  • Download URL: articuno-0.9.0.tar.gz
  • Upload date:
  • Size: 28.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.13

File hashes

Hashes for articuno-0.9.0.tar.gz
Algorithm Hash digest
SHA256 093c6ffb535251c422288cbdedabcca41511c19c0aeae310bbfdc7bba531f6eb
MD5 ba65716b55fd5a190b3849fb134f7d69
BLAKE2b-256 ace1137701baaea7121616bcaead68da3aaa79c53cf7a98e5e3713efc411ff15


File details

Details for the file articuno-0.9.0-py3-none-any.whl.

File metadata

  • Download URL: articuno-0.9.0-py3-none-any.whl
  • Upload date:
  • Size: 15.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.13

File hashes

Hashes for articuno-0.9.0-py3-none-any.whl
Algorithm Hash digest
SHA256 ef2f478cd3283b757e562dc860f573faaab65eed55cb569aa81527b855b53bfd
MD5 9854f99999412ea56cca5f72a73bc92c
BLAKE2b-256 102cd8b693b36d127e7b1c2a974db73eda3c92691ae28c7819c89c9e4cd85114

