❄️ Articuno ❄️
Convert Polars, Pandas, PySpark DataFrames, SQLAlchemy models, or SQLModel classes to Pydantic models with schema inference — and generate clean Python class code. Also supports bidirectional conversion from Pydantic to these formats.
✨ Features
Core Functionality:
- 🔍 Infer Pydantic models dynamically from Polars, Pandas, or PySpark DataFrames
- 🗄️ Convert SQLAlchemy and SQLModel model classes to/from Pydantic
- 📋 Infer models directly from iterables of dictionaries (SQL results, JSON records, etc.)
- 🎯 Automatic type detection for basic types, nested structures, and temporal data
- 🔄 Generator-based for memory-efficient processing of large datasets
- 🔁 Bidirectional conversions between Pydantic and supported backends
- 🎨 Generate clean Python model code using datamodel-code-generator
Advanced Features:
- ⚡ PyArrow support for high-performance Pandas columns (int64[pyarrow], string[pyarrow], timestamp[pyarrow], etc.)
- 📅 Full temporal type support: datetime, date, timedelta across all backends
- 🗂️ Nested structures: Supports nested dicts, lists, and complex hierarchies
- 🔧 Optional field detection: Automatically identifies nullable fields
- 🎛️ Configurable scanning: max_scan parameter to limit schema inference
- 🔒 Force optional mode: Make all fields optional regardless of data
- ✅ Model name validation: Ensures valid Python identifiers
- 🧪 Comprehensively tested: 112 tests, 87% code coverage
Design:
- 🪶 Lightweight, dependency-flexible architecture
- 🔌 Optional dependencies for Polars, Pandas, PyArrow, SQLAlchemy, SQLModel, and PySpark
- 🎯 Type-checked with mypy
- 📏 Linted with ruff
📦 Installation
Install the core package:
pip install articuno
Add optional dependencies as needed:
# For Polars support
pip install articuno[polars]
# For Pandas support (with PyArrow)
pip install articuno[pandas]
# For SQLAlchemy support
pip install articuno[sqlalchemy]
# For SQLModel support
pip install articuno[sqlmodel]
# For PySpark support
pip install articuno[pyspark]
# Full install with all backends
pip install articuno[polars,pandas,sqlalchemy,sqlmodel,pyspark]
# Development dependencies (includes pytest, mypy, ruff)
pip install articuno[dev]
🚀 Quick Start
DataFrame to Pydantic Models
from articuno import df_to_pydantic, infer_pydantic_model
import polars as pl
# Create a DataFrame
df = pl.DataFrame({
"id": [1, 2, 3],
"name": ["Alice", "Bob", "Charlie"],
"score": [95.5, 88.0, 92.3],
"active": [True, False, True]
})
# Convert to Pydantic instances (returns a generator)
instances = list(df_to_pydantic(df, model_name="UserModel"))
print(instances[0])
# Output: id=1 name='Alice' score=95.5 active=True
# Or just get the model class
Model = infer_pydantic_model(df, model_name="UserModel")
print(Model.model_json_schema())
Dict Iterables to Pydantic
Perfect for SQL query results, API responses, or JSON data:
from articuno import df_to_pydantic, infer_pydantic_model
# From database results, API responses, etc.
records = [
{"id": 1, "name": "Alice", "email": "alice@example.com"},
{"id": 2, "name": "Bob", "email": "bob@example.com"},
]
# Automatically infer and create instances
instances = list(df_to_pydantic(records, model_name="User"))
# Or infer just the model
Model = infer_pydantic_model(records, model_name="User")
📓 Example Notebooks
Comprehensive Jupyter notebooks demonstrating all features:
Core Examples
- 01_quick_start.ipynb - Basic usage with Polars, Pandas, and dict iterables
- 02_temporal_types.ipynb - Working with datetime, date, and timedelta
- 03_pyarrow_support.ipynb - PyArrow-backed Pandas columns
- 04_nested_structures.ipynb - Complex nested dictionaries and lists
Advanced Examples
- 05_force_optional_and_max_scan.ipynb - Control optional fields and scanning
- 06_code_generation.ipynb - Generate Python code from models
- 07_api_responses.ipynb - Process REST API responses
- 08_database_results.ipynb - SQL query results to Pydantic
- 09_advanced_features.ipynb - Generators, type precedence, unicode
Legacy Examples
- polars_inference.ipynb - Polars-specific inference
- articuno_pandas_pyarrow_example.ipynb - Pandas with PyArrow
- pandas_nested.ipynb - Nested Pandas structures
- optional_nested_example.ipynb - Optional nested fields
- articuno_inference_demo.ipynb - General inference demo
Note: All example notebooks have been executed and saved with outputs, so you can view the results directly on GitHub without running them.
📚 Advanced Usage
Temporal Types Support
Articuno fully supports datetime, date, and timedelta types:
from datetime import datetime, date, timedelta
from articuno import infer_pydantic_model
data = [
{
"event_id": 1,
"event_date": date(2024, 1, 15),
"timestamp": datetime(2024, 1, 15, 10, 30),
"duration": timedelta(hours=2, minutes=30)
}
]
Model = infer_pydantic_model(data, model_name="Event")
# Fields will have correct datetime.date, datetime.datetime, datetime.timedelta types
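A quick follow-up sketch: instantiating the inferred model validates the temporal fields directly:
event = Model(**data[0])
print(event.duration)    # 2:30:00 (a datetime.timedelta)
print(event.event_date)  # 2024-01-15 (a datetime.date)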
PyArrow-Backed Pandas Columns
Full support for high-performance PyArrow dtypes:
import pandas as pd
import pyarrow as pa
from datetime import datetime
from articuno import infer_pydantic_model
df = pd.DataFrame({
"id": pd.Series([1, 2, 3], dtype="int64[pyarrow]"),
"name": pd.Series(["Alice", "Bob", "Charlie"], dtype="string[pyarrow]"),
"created": pd.Series([
datetime(2024, 1, 1),
datetime(2024, 1, 2),
datetime(2024, 1, 3)
], dtype=pd.ArrowDtype(pa.timestamp("ms"))),
"active": pd.Series([True, False, True], dtype="bool[pyarrow]")
})
Model = infer_pydantic_model(df, model_name="ArrowModel")
Supported PyArrow types:
- int64[pyarrow], int32[pyarrow], etc.
- string[pyarrow]
- bool[pyarrow]
- timestamp[pyarrow] → datetime.datetime
- date32[pyarrow], date64[pyarrow] → datetime.date
- duration[pyarrow] → datetime.timedelta
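For example, a duration[pyarrow] column should map to datetime.timedelta; a minimal sketch using pandas' ArrowDtype:
import pandas as pd
import pyarrow as pa
from articuno import infer_pydantic_model

df = pd.DataFrame({
    "elapsed": pd.Series([pd.Timedelta(minutes=5)], dtype=pd.ArrowDtype(pa.duration("ns")))
})
Model = infer_pydantic_model(df, model_name="Timing")
print(Model.model_fields["elapsed"].annotation)  # expected: datetime.timedelta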
Nested Structures
Handle complex nested data with ease:
data = [
{
"user_id": 1,
"profile": {
"name": "Alice",
"age": 30,
"preferences": {
"theme": "dark",
"notifications": True
}
},
"tags": ["python", "data-science"]
}
]
Model = infer_pydantic_model(data, model_name="UserProfile")
# Nested dicts become nested Pydantic models
# Lists are preserved with List[...] typing
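Because nested dicts become nested models, attribute access works all the way down; a small follow-up sketch:
instance = Model(**data[0])
print(instance.profile.name)               # Alice
print(instance.profile.preferences.theme)  # dark
print(instance.tags)                       # ['python', 'data-science']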
Force Optional Fields
Make all fields optional regardless of the data:
import polars as pl
from articuno import infer_pydantic_model
df = pl.DataFrame({
"required": [1, 2, 3],
"also_required": ["a", "b", "c"]
})
# Force all fields to be Optional
Model = infer_pydantic_model(df, force_optional=True)
# Now you can create instances with None values
instance = Model(required=None, also_required=None)
Limit Schema Scanning
For large datasets, limit how many records are scanned:
# Only scan first 100 records for schema inference
Model = infer_pydantic_model(
large_dataset,
model_name="LargeModel",
max_scan=100
)
Memory-Efficient Processing
df_to_pydantic returns a generator for memory efficiency:
# Generator - memory efficient for large datasets
instances_gen = df_to_pydantic(large_df, model_name="Record")
# Process one at a time
for instance in instances_gen:
process(instance)
# Or collect all at once if needed
instances_list = list(df_to_pydantic(df, model_name="Record"))
Code Generation
Generate clean Python code for your models:
from articuno import infer_pydantic_model
from articuno.codegen import generate_class_code
# Infer model from data
Model = infer_pydantic_model(data, model_name="User")
# Generate Python code
code = generate_class_code(Model)
print(code)
# Or save to file
code = generate_class_code(Model, output_path="models.py")
Pre-defined Models
Use a pre-defined model instead of inferring:
from pydantic import BaseModel
from articuno import df_to_pydantic
class UserModel(BaseModel):
id: int
name: str
email: str
# Use your existing model
instances = list(df_to_pydantic(df, model=UserModel))
SQLAlchemy Model Conversion
Convert between SQLAlchemy declarative models and Pydantic:
from sqlalchemy import Column, Integer, String, Float
from sqlalchemy.orm import DeclarativeBase
from articuno import infer_pydantic_model, pydantic_to_sqlalchemy
class Base(DeclarativeBase):
pass
class User(Base):
__tablename__ = "users"
id = Column(Integer, primary_key=True)
name = Column(String(255), nullable=False)
email = Column(String(255), nullable=True)
# Convert SQLAlchemy model to Pydantic
PydanticModel = infer_pydantic_model(User, model_name="UserModel")
# Convert Pydantic model to SQLAlchemy
from pydantic import BaseModel as PydanticBase
class ProductModel(PydanticBase):
id: int
name: str
price: float
SQLAlchemyModel = pydantic_to_sqlalchemy(ProductModel, model_name="Product")
SQLModel Conversion
Convert between SQLModel and Pydantic (SQLModel already extends Pydantic):
from sqlmodel import SQLModel, Field
from articuno import infer_pydantic_model, pydantic_to_sqlmodel
class User(SQLModel, table=True):
__tablename__ = "users"
id: int | None = Field(default=None, primary_key=True)
name: str
email: str | None = None
# Convert SQLModel to Pydantic
PydanticModel = infer_pydantic_model(User, model_name="UserModel")
# Convert Pydantic to SQLModel
from pydantic import BaseModel
class ProductModel(BaseModel):
id: int | None = None
name: str
price: float
SQLModelClass = pydantic_to_sqlmodel(ProductModel, model_name="Product")
PySpark DataFrame Conversion
Convert between PySpark DataFrames and Pydantic:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, LongType, StringType
from articuno import infer_pydantic_model, df_to_pydantic, pydantic_to_pyspark
spark = SparkSession.builder.appName("articuno").getOrCreate()
# Create PySpark DataFrame
data = [(1, "Alice"), (2, "Bob")]
schema = StructType([
StructField("id", LongType(), False),
StructField("name", StringType(), False),
])
df = spark.createDataFrame(data, schema=schema)
# Convert PySpark DataFrame to Pydantic
Model = infer_pydantic_model(df, model_name="UserModel")
instances = list(df_to_pydantic(df, model=Model))
# Convert Pydantic instances to PySpark DataFrame
from pydantic import BaseModel
class UserModel(BaseModel):
id: int
name: str
instances = [UserModel(id=1, name="Alice"), UserModel(id=2, name="Bob")]
df = pydantic_to_pyspark(instances, model=UserModel)
⚙️ Supported Type Mappings
| Polars Type | Pandas Type (incl. PyArrow) | Dict/Iterable | SQLAlchemy Type | SQLModel Type | PySpark Type | Pydantic Type |
|---|---|---|---|---|---|---|
| pl.Int*, pl.UInt* | int64, Int64, int64[pyarrow] | int | Integer, BigInteger | int | IntegerType, LongType | int |
| pl.Float* | float64, float64[pyarrow] | float | Float, Numeric | float | FloatType, DoubleType | float |
| pl.Utf8, pl.String | object, string[pyarrow] | str | String, Text | str | StringType | str |
| pl.Boolean | bool, bool[pyarrow] | bool | Boolean | bool | BooleanType | bool |
| pl.Date | datetime64[ns], date[pyarrow] | date | Date | date | DateType | datetime.date |
| pl.Datetime | datetime64[ns], timestamp[pyarrow] | datetime | DateTime | datetime | TimestampType | datetime.datetime |
| pl.Duration | timedelta64[ns], duration[pyarrow] | timedelta | - | - | - | datetime.timedelta |
| pl.List | list | list | - | List[...] | ArrayType | List[...] |
| pl.Struct | dict (nested) | dict (nested) | - | - | StructType | Nested BaseModel |
| pl.Null | None, NaN | None | nullable=True | Optional[...] | nullable=True | Optional[...] |
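To see these mappings in practice, inspect the inferred model's fields; a minimal sketch using standard Pydantic v2 introspection:
import polars as pl
from datetime import date
from articuno import infer_pydantic_model

df = pl.DataFrame({"id": [1], "born": [date(2000, 1, 1)], "score": [1.5]})
Model = infer_pydantic_model(df, model_name="Person")

# model_fields is standard Pydantic v2; annotations follow the table above
for name, field in Model.model_fields.items():
    print(name, field.annotation)
# id <class 'int'>, born <class 'datetime.date'>, score <class 'float'>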
🎯 Real-World Examples
API Response Processing
from datetime import datetime
from articuno import infer_pydantic_model, df_to_pydantic
# Process API responses
api_data = [
{
"status": "success",
"data": {
"user_id": 123,
"username": "alice",
"created_at": datetime(2024, 1, 15, 10, 30)
}
}
]
APIResponse = infer_pydantic_model(api_data, model_name="APIResponse")
instances = list(df_to_pydantic(api_data, model=APIResponse))
SQL Query Results
import sqlite3
from articuno import infer_pydantic_model, df_to_pydantic
# Get results from database
conn = sqlite3.connect("database.db")
conn.row_factory = sqlite3.Row
cursor = conn.execute("SELECT * FROM users")
rows = [dict(row) for row in cursor.fetchall()]
# Convert to Pydantic
UserModel = infer_pydantic_model(rows, model_name="User")
users = list(df_to_pydantic(rows, model=UserModel))
E-commerce Order Processing
from datetime import datetime
from articuno import infer_pydantic_model
orders = [
{
"order_id": 1001,
"customer": {"id": 501, "name": "Alice", "email": "alice@example.com"},
"items": [
{"product": "Laptop", "quantity": 1, "price": 999.99},
{"product": "Mouse", "quantity": 2, "price": 29.99}
],
"total": 1059.97,
"created_at": datetime(2024, 1, 15, 10, 30)
}
]
Order = infer_pydantic_model(orders, model_name="Order")
🧪 Testing & Quality
Articuno is thoroughly tested and type-checked:
# Run tests
pytest
# Run with coverage
pytest --cov=articuno --cov-report=term-missing
# Type checking
mypy articuno
# Linting
ruff check .
Test Statistics:
- 112 comprehensive tests
- 87% code coverage
- All tests passing ✅
- Type-checked with mypy ✅
- Linted with ruff ✅
🔧 Development
# Clone the repository
git clone https://github.com/eddiethedean/articuno.git
cd articuno
# Install with dev dependencies
pip install -e ".[dev]"
# Run tests
pytest
# Run type checking
mypy articuno
# Run linting
ruff check .
💡 Tips & Best Practices
- Use generators for large datasets: df_to_pydantic returns a generator by default for memory efficiency
- Limit scanning for performance: Use the max_scan parameter when dealing with huge datasets (see the sketch after this list)
- Validate model names: Articuno automatically validates that model names are valid Python identifiers
- Handle optional fields: Use force_optional=True when working with sparse data
- Type precedence: Articuno correctly handles bool vs int (bool is checked first)
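As a quick illustration of the first two tips, a minimal sketch (the records list is a made-up stand-in for a large dataset):
from articuno import df_to_pydantic

records = [{"id": i, "name": f"user_{i}"} for i in range(100_000)]

# Schema inference scans at most 100 records; instances are yielded lazily
for instance in df_to_pydantic(records, model_name="Record", max_scan=100):
    ...  # handle each instance without materializing the whole list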
🐛 Troubleshooting
Import Errors
If you get import errors for polars or pandas:
pip install articuno[polars] # or [pandas]
PyArrow Issues
For PyArrow support:
pip install pyarrow
Generator Indexing
df_to_pydantic returns a generator. Convert to list if you need indexing:
instances = list(df_to_pydantic(df, model_name="Model"))
print(instances[0]) # Now you can index
📖 API Reference
Main Functions
infer_pydantic_model(source, model_name="AutoModel", force_optional=False, max_scan=1000)
Infer a Pydantic model class from a DataFrame or dict iterable.
Parameters:
- source: Pandas DataFrame, Polars DataFrame, or iterable of dicts
- model_name: Name for the generated model (must be a valid Python identifier)
- force_optional: Make all fields optional
- max_scan: Max records to scan for dict iterables
Returns: Type[BaseModel]
df_to_pydantic(source, model=None, model_name=None, force_optional=False, max_scan=1000)
Convert DataFrame or dict iterable to Pydantic instances.
Parameters:
- source: Pandas DataFrame, Polars DataFrame, or iterable of dicts
- model: Optional pre-defined model to use
- model_name: Name for the inferred model if model is None
- force_optional: Make all fields optional
- max_scan: Max records to scan for dict iterables
Returns: Generator[BaseModel, None, None]
generate_class_code(model, output_path=None, model_name=None)
Generate Python code from a Pydantic model.
Parameters:
- model: Pydantic model class
- output_path: Optional file path to write code to
- model_name: Optional name override
Returns: str (the generated code)
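Putting the three functions together, a minimal end-to-end sketch:
from articuno import df_to_pydantic, infer_pydantic_model
from articuno.codegen import generate_class_code

records = [{"id": 1, "name": "Alice"}, {"id": 2, "name": None}]

Model = infer_pydantic_model(records, model_name="User")  # nullable name detected
users = list(df_to_pydantic(records, model=Model))        # generator collected to a list
print(generate_class_code(Model))                         # clean Python source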
🔗 Links
- GitHub Repository
- Datamodel Code Generator
- Poldantic (Polars integration)
- Polars
- Pandas
- PyArrow
📄 License
MIT © Odos Matthews
🙏 Acknowledgments
- Built with Pydantic
- Code generation powered by datamodel-code-generator
- Polars support via poldantic
📝 Changelog
v0.9.0
- ✨ Added full datetime/date/timedelta support across all backends
- ✨ Added PyArrow temporal type support (timestamp, date, duration)
- ✨ Added model name validation (ensures valid Python identifiers)
- 🐛 Fixed bool vs int type precedence
- 🐛 Fixed DataFrame vs iterable detection order
- 🐛 Fixed temporary directory cleanup in code generation
- 🐛 Added defensive checks for empty samples
- 📝 Improved documentation with generator behavior notes
- 🧪 Added comprehensive test suite (112 tests, 87% coverage)
- 🔍 Full mypy type checking
- 📏 Ruff linting compliance
- 📓 Added 9 comprehensive example notebooks with outputs
- 📚 Enhanced README with complete guide and examples