Generate pandas DataFrames using polyfactory for testing and development
Project description
Polypandas
Generate type-safe pandas DataFrames effortlessly using polyfactory.
Inspired by polyspark.
Why Polypandas?
Creating test data for pandas applications is tedious. Polypandas generates realistic test DataFrames from your Python data models, with automatic schema inference so columns get the right dtypes even when values are null.
```python
from dataclasses import dataclass

from polypandas import pandas_factory

@pandas_factory
@dataclass
class User:
    id: int
    name: str
    email: str

# Generate 1000 rows instantly
df = User.build_dataframe(size=1000)
```
Installation
Base install (pandas + polyfactory):

```shell
pip install polypandas
```

Optional: PyArrow for proper nested struct columns (otherwise nested fields are object columns of dicts):

```shell
pip install "polypandas[pyarrow]"
```

Development (tests, lint, type-checking):

```shell
pip install "polypandas[dev]"
```
Requirements: Python 3.8+, pandas ≥1.3, polyfactory ≥2.0.
Quick start
Decorator (recommended)
```python
from dataclasses import dataclass
from typing import Optional

from polypandas import pandas_factory

@pandas_factory
@dataclass
class Product:
    product_id: int
    name: str
    price: float
    description: Optional[str] = None
    in_stock: bool = True

df = Product.build_dataframe(size=100)
print(df.head())
```

Generate dicts, then convert to DataFrame

```python
dicts = Product.build_dicts(size=1000)
df = Product.create_dataframe_from_dicts(dicts)
```
Classic factory pattern
```python
from polypandas import PandasFactory

class ProductFactory(PandasFactory[Product]):
    __model__ = Product

df = ProductFactory.build_dataframe(size=100)
```
Convenience function (no factory class)
```python
from polypandas import build_pandas_dataframe

df = build_pandas_dataframe(Product, size=100)
```
Pydantic models
```python
from pydantic import BaseModel

from polypandas import pandas_factory

@pandas_factory
class Order(BaseModel):
    order_id: int
    customer_id: int
    total: float

df = Order.build_dataframe(size=500)
```
Nested structs (optional PyArrow)
With `pip install "polypandas[pyarrow]"`, nested dataclasses become proper struct columns (PyArrow-backed). Without PyArrow they are object columns of dicts.
```python
from dataclasses import dataclass

from polypandas import pandas_factory

@dataclass
class Address:
    street: str
    city: str
    zipcode: str

@pandas_factory
@dataclass
class Person:
    id: int
    name: str
    address: Address

# Auto: use PyArrow when available and the model has nested structs
df = Person.build_dataframe(size=50)

# Force PyArrow (when installed)
df = Person.build_dataframe(size=50, use_pyarrow=True)

# Force the standard path (nested column = object column of dicts)
df = Person.build_dataframe(size=50, use_pyarrow=False)
```
Helpers:
- `has_nested_structs(Model)` — `True` if the model has any nested struct or list-of-struct field.
- `infer_pyarrow_schema(Model)` — returns a `pyarrow.Schema` when PyArrow is installed, else `None`.
- `is_pyarrow_available()` — runtime check for PyArrow.
Key features
- Factory pattern — Uses polyfactory for data generation.
- Type-safe schema — Python types become pandas dtypes automatically.
- Robust null handling — Schema from types avoids dtype issues with all-null columns.
- Nested structs — Optional PyArrow support for proper struct columns; otherwise object columns of dicts.
- Complex types — Lists and dicts as object columns; nested models as structs (with PyArrow) or dicts.
- Flexible models — Dataclasses, Pydantic v2 models, TypedDicts.
- Testing utilities — `assert_dataframe_equal`, `assert_schema_equal`, `assert_column_exists`, and more.
- Data I/O — Save/load Parquet, JSON, CSV; JSON lines for dicts.
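The "robust null handling" point refers to a real pandas pitfall: without an explicit dtype, an all-null column gives pandas nothing to infer from. The following pandas-only sketch demonstrates the underlying behavior (it does not use polypandas itself; deriving the dtype from type hints is what the library's schema inference automates):

```python
import pandas as pd

# A list of plain Nones gives pandas nothing to infer from, so the
# column falls back to object dtype.
naive = pd.DataFrame({"score": [None, None, None]})

# Declaring the dtype up front (as schema inference from type hints does)
# keeps the column typed even though every value is null.
typed = pd.DataFrame({"score": pd.array([None, None, None], dtype="Int64")})

print(naive["score"].dtype)  # object
print(typed["score"].dtype)  # Int64
```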
Type mapping
| Python type | Pandas dtype |
|---|---|
| `str` | `object` |
| `int` | `int64` |
| `float` | `float64` |
| `bool` | `bool` |
| `datetime` | `datetime64[ns]` |
| `date` | `datetime64[ns]` |
| `Optional[T]` | same as `T` |
| `List[T]` | `object` |
| `Dict[K, V]` | `object` |
| Nested model | `object`, or PyArrow struct (with `[pyarrow]`) |
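As a rough illustration of the table above, the scalar and container rules can be reproduced with stdlib `typing` introspection. This is a simplified sketch, not polypandas's actual implementation (the library exposes `python_type_to_pandas_dtype` for the real mapping), and the helper name `to_pandas_dtype` is made up here:

```python
from datetime import date, datetime
from typing import Dict, List, Optional, Union, get_args, get_origin

_BASE = {
    str: "object",
    int: "int64",
    float: "float64",
    bool: "bool",
    datetime: "datetime64[ns]",
    date: "datetime64[ns]",
}

def to_pandas_dtype(tp) -> str:
    origin = get_origin(tp)
    if origin is Union:  # Optional[T] is Union[T, None]: unwrap to T
        args = [a for a in get_args(tp) if a is not type(None)]
        if len(args) == 1:
            return to_pandas_dtype(args[0])
    if origin in (list, dict):  # List[T] / Dict[K, V] stay object columns
        return "object"
    return _BASE[tp]  # KeyError here ~ an unsupported type

print(to_pandas_dtype(Optional[int]))  # int64
print(to_pandas_dtype(List[str]))      # object
```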
API reference
Factory
| API | Description |
|---|---|
| `@pandas_factory` | Decorator: adds `build_dataframe`, `build_dicts`, `create_dataframe_from_dicts` to the model. |
| `PandasFactory[Model]` | Base factory class; set `__model__ = Model`. |
| `build_dataframe(size=10, schema=None, use_pyarrow=None, **kwargs)` | Build a pandas DataFrame. |
| `build_dicts(size=10, **kwargs)` | Build a list of dicts (no DataFrame). |
| `create_dataframe_from_dicts(data, schema=None)` | Turn a list of dicts into a DataFrame. |
| `build_pandas_dataframe(model, size=10, schema=None, use_pyarrow=None, **kwargs)` | One-off build without a factory class. |
Schema
| API | Description |
|---|---|
| `infer_schema(model, schema=None)` | Infer a dict of column name → pandas dtype. |
| `python_type_to_pandas_dtype(python_type)` | Map a Python type to a pandas dtype string. |
| `has_nested_structs(model)` | Whether the model has nested struct/list-of-struct fields. |
| `infer_pyarrow_schema(model)` | PyArrow schema for the model, or `None` if PyArrow is not installed. |
Runtime
| API | Description |
|---|---|
| `is_pandas_available()` | Whether pandas can be imported. |
| `is_pyarrow_available()` | Whether PyArrow can be imported. |
Testing
| API | Description |
|---|---|
| `assert_dataframe_equal(df1, df2, ...)` | Compare DataFrames (optional order, dtypes, tolerances). |
| `assert_schema_equal(df1, df2, ...)` | Compare column dtypes. |
| `assert_dtypes_equal(df1, df2, ...)` | Alias for schema/dtype comparison. |
| `assert_approx_count(df, expected_count, tolerance=0.1)` | Assert row count within tolerance. |
| `assert_column_exists(df, *columns)` | Assert columns exist. |
| `assert_no_duplicates(df, columns=None)` | Assert no duplicate rows. |
| `get_column_stats(df, column)` | Basic stats (count, nulls, distinct, min/max/mean for numeric). |
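For intuition, the fields listed for `get_column_stats` can be computed with plain pandas. This is an equivalent sketch of those stats, not the library's code:

```python
import pandas as pd

df = pd.DataFrame({"amount": [10.0, 20.0, None, 20.0]})
col = df["amount"]

# The same fields get_column_stats reports, computed by hand.
stats = {
    "count": int(col.count()),       # non-null values
    "nulls": int(col.isna().sum()),
    "distinct": int(col.nunique()),  # distinct non-null values
    "min": col.min(),
    "max": col.max(),
    "mean": col.mean(),
}
print(stats)
```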
I/O
| API | Description |
|---|---|
| `save_as_parquet(df, path, **kwargs)` | Save DataFrame as Parquet. |
| `save_as_json(df, path, **kwargs)` | Save as JSON. |
| `save_as_csv(df, path, header=True, **kwargs)` | Save as CSV. |
| `load_parquet(path, **kwargs)` | Load Parquet into a DataFrame. |
| `load_json(path, **kwargs)` | Load JSON. |
| `load_csv(path, **kwargs)` | Load CSV. |
| `load_and_validate(path, expected_schema=None, ...)` | Load and optionally validate columns/dtypes. |
| `save_dicts_as_json(data, path)` | Save a list of dicts as JSON lines. |
| `load_dicts_from_json(path)` | Load JSON lines into a list of dicts. |
Exceptions
- `PolypandasError` — base exception.
- `PandasNotAvailableError` — pandas required but not installed.
- `SchemaInferenceError` — schema cannot be inferred.
- `UnsupportedTypeError` — type has no pandas/PyArrow mapping.
- `DataIOError` — I/O failure.
- `DataFrameComparisonError` — assertion failure in testing helpers.
Testing utilities
```python
from polypandas import (
    assert_dataframe_equal,
    assert_schema_equal,
    assert_approx_count,
    assert_column_exists,
    assert_no_duplicates,
    get_column_stats,
)

assert_dataframe_equal(df1, df2, check_order=False, rtol=1e-5)
assert_schema_equal(df1, df2)
assert_column_exists(df, "user_id", "name", "email")
assert_no_duplicates(df, columns=["user_id"])
stats = get_column_stats(df, "amount")
```
Data I/O
DataFrames:
```python
from polypandas import (
    save_as_parquet,
    save_as_json,
    save_as_csv,
    load_parquet,
    load_json,
    load_csv,
    load_and_validate,
    infer_schema,
)

save_as_parquet(df, "users.parquet")
save_as_csv(df, "users.csv", header=True)
df = load_parquet("users.parquet")
df = load_and_validate("users.parquet", expected_schema=infer_schema(User))
```
JSON lines (list of dicts):
```python
from polypandas import save_dicts_as_json, load_dicts_from_json

dicts = User.build_dicts(size=100)
save_dicts_as_json(dicts, "users.jsonl")
loaded = load_dicts_from_json("users.jsonl")
```
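The JSON-lines format these helpers use is simply one JSON object per line. A stdlib-only sketch of that round trip (illustrative; not the library's implementation, and it writes to a string rather than a file):

```python
import json

records = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]

# Write: serialize each dict as one JSON object per line.
text = "\n".join(json.dumps(r) for r in records) + "\n"

# Read: parse each non-empty line back into a dict.
loaded = [json.loads(line) for line in text.splitlines() if line]
print(loaded == records)  # True
```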
License & related
- License: MIT — see LICENSE.
- Docs: `docs/roadmap.md` for roadmap and ideas.
- Related: polyspark, polyfactory.
Project details
File details
Details for the file polypandas-0.1.0.tar.gz.
File metadata
- Download URL: polypandas-0.1.0.tar.gz
- Upload date:
- Size: 68.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.8.18
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `9ffe748ec3ea38899127b961ad216fb1df6753384c98f09cbb6d6db0fdd55b05` |
| MD5 | `ac6dac271f8f106551735bbe408022d4` |
| BLAKE2b-256 | `40215e2003a5ad5095ec32c174c17d6eb36b89d83e9fea36a1f629a19188546f` |
File details
Details for the file polypandas-0.1.0-py3-none-any.whl.
File metadata
- Download URL: polypandas-0.1.0-py3-none-any.whl
- Upload date:
- Size: 15.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.8.18
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `9b16fd0119d5379673af6b47a83a7c1838d79409940ac8f6b3477da7e08421ef` |
| MD5 | `c8de30b1b260fb37969aa15108d265a4` |
| BLAKE2b-256 | `fced92660f0bec86f7993891cd8c6fe038318cf98d2bd11389c1c8a3aa171bb4` |