
Polypandas

Generate type-safe pandas DataFrames effortlessly using polyfactory.

Inspired by polyspark.


Why Polypandas?

Creating test data for pandas applications is tedious. Polypandas makes it effortless by generating realistic test DataFrames from your Python data models, with automatic schema inference so columns get the right dtypes even when values are null.

from dataclasses import dataclass
from polypandas import pandas_factory

@pandas_factory
@dataclass
class User:
    id: int
    name: str
    email: str

# Generate 1000 rows instantly
df = User.build_dataframe(size=1000)

Installation

Base install (pandas + polyfactory):

pip install polypandas

Optional: PyArrow for proper nested struct columns (otherwise nested fields are object columns of dicts):

pip install "polypandas[pyarrow]"

Development (tests, lint, type-checking):

pip install "polypandas[dev]"

Requirements: Python 3.8+, pandas ≥1.3, polyfactory ≥2.0.


Quick start

Decorator (recommended)

from dataclasses import dataclass
from typing import Optional
from polypandas import pandas_factory

@pandas_factory
@dataclass
class Product:
    product_id: int
    name: str
    price: float
    description: Optional[str] = None
    in_stock: bool = True

df = Product.build_dataframe(size=100)
print(df.head())

Generate dicts, then convert to DataFrame

dicts = Product.build_dicts(size=1000)
df = Product.create_dataframe_from_dicts(dicts)

Classic factory pattern

from polypandas import PandasFactory

class ProductFactory(PandasFactory[Product]):
    __model__ = Product

df = ProductFactory.build_dataframe(size=100)

Convenience function (no factory class)

from polypandas import build_pandas_dataframe

df = build_pandas_dataframe(Product, size=100)

Pydantic models

from pydantic import BaseModel
from polypandas import pandas_factory

@pandas_factory
class Order(BaseModel):
    order_id: int
    customer_id: int
    total: float

df = Order.build_dataframe(size=500)

Nested structs (optional PyArrow)

With pip install polypandas[pyarrow], nested dataclasses become proper struct columns (PyArrow-backed). Without PyArrow they are object columns of dicts.

from dataclasses import dataclass
from polypandas import pandas_factory

@dataclass
class Address:
    street: str
    city: str
    zipcode: str

@pandas_factory
@dataclass
class Person:
    id: int
    name: str
    address: Address

# Auto: use PyArrow when available and model has nested structs
df = Person.build_dataframe(size=50)

# Force PyArrow (when installed)
df = Person.build_dataframe(size=50, use_pyarrow=True)

# Force standard path (nested column = object of dicts)
df = Person.build_dataframe(size=50, use_pyarrow=False)

Helpers:

  • has_nested_structs(Model) — True if the model has any nested struct or list-of-struct field.
  • infer_pyarrow_schema(Model) — Returns a pyarrow.Schema when PyArrow is installed, else None.
  • is_pyarrow_available() — Runtime check for PyArrow.
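
The nested-struct check itself is straightforward to picture. Here is a minimal stdlib sketch of the same idea (a hypothetical `model_has_nested_structs`, not the polypandas implementation): a field counts as nested if it is itself a dataclass, or a list of dataclasses.

```python
import dataclasses
from typing import List, get_args, get_origin

def model_has_nested_structs(model: type) -> bool:
    """True if any field is a dataclass, or a list of dataclasses."""
    for f in dataclasses.fields(model):
        # Direct nested dataclass field, e.g. address: Address
        if dataclasses.is_dataclass(f.type):
            return True
        # List-of-struct field, e.g. members: List[Person]
        if get_origin(f.type) is list and any(
            dataclasses.is_dataclass(arg) for arg in get_args(f.type)
        ):
            return True
    return False

@dataclasses.dataclass
class Address:
    street: str
    city: str

@dataclasses.dataclass
class Person:
    id: int
    address: Address

@dataclasses.dataclass
class Team:
    name: str
    members: List[Person]

@dataclasses.dataclass
class Flat:
    id: int
    name: str
```

Note this simple version assumes resolved (non-string) annotations; real code would also handle `from __future__ import annotations`.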

Key features

  • Factory pattern — Uses polyfactory for data generation.
  • Type-safe schema — Python types become pandas dtypes automatically.
  • Robust null handling — Schema from types avoids dtype issues with all-null columns.
  • Nested structs — Optional PyArrow support for proper struct columns; otherwise object columns of dicts.
  • Complex types — Lists and dicts as object columns; nested models as structs (with PyArrow) or dicts.
  • Flexible models — Dataclasses, Pydantic v2 models, TypedDicts.
  • Testing utilities — assert_dataframe_equal, assert_schema_equal, assert_column_exists, and more.
  • Data I/O — Save/load Parquet, JSON, CSV; JSON lines for dicts.
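
The "Robust null handling" point is worth illustrating with plain pandas: with no declared dtype, an all-null column collapses to object, while supplying the dtype up front, as a type-derived schema lets polypandas do, preserves it. A minimal sketch using only pandas (the schema dict here is illustrative, not polypandas's internal format):

```python
import pandas as pd

# Naive construction: pandas cannot infer a dtype from all-null values,
# so "nickname" becomes an object column full of None.
rows = [{"id": 1, "nickname": None}, {"id": 2, "nickname": None}]
naive = pd.DataFrame(rows)
print(naive.dtypes["nickname"])   # object

# Schema-driven construction: cast each column to the dtype implied by
# the model's annotations, so an Optional[float] column stays float64
# even when every generated value happens to be null.
schema = {"id": "int64", "score": "float64"}
rows2 = [{"id": 1, "score": None}, {"id": 2, "score": None}]
typed = pd.DataFrame(rows2).astype(schema)
print(typed.dtypes["score"])      # float64
```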

Type mapping

Python types map to pandas dtypes as follows:

  • str → object
  • int → int64
  • float → float64
  • bool → bool
  • datetime → datetime64[ns]
  • date → datetime64[ns]
  • Optional[T] → same as T
  • List[T] → object
  • Dict[K, V] → object
  • nested model → object, or a PyArrow struct with [pyarrow]
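
The mapping above can be sketched as a small lookup that unwraps Optional[T] before falling back to object. This is a conceptual reimplementation, not polypandas's python_type_to_pandas_dtype:

```python
from datetime import date, datetime
from typing import Dict, List, Optional, Union, get_args, get_origin

# Base lookup from the mapping above; containers and unknown
# types fall back to "object".
_BASE = {
    str: "object",
    int: "int64",
    float: "float64",
    bool: "bool",
    datetime: "datetime64[ns]",
    date: "datetime64[ns]",
}

def to_pandas_dtype(tp) -> str:
    # Optional[T] is Union[T, None]: strip the None and recurse.
    if get_origin(tp) is Union:
        non_none = [a for a in get_args(tp) if a is not type(None)]
        if len(non_none) == 1:
            return to_pandas_dtype(non_none[0])
    return _BASE.get(tp, "object")

print(to_pandas_dtype(Optional[float]))  # float64
print(to_pandas_dtype(List[int]))        # object
```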

API reference

Factory

  • @pandas_factory — Decorator: adds build_dataframe, build_dicts, and create_dataframe_from_dicts to the model.
  • PandasFactory[Model] — Base factory class; set __model__ = Model.
  • build_dataframe(size=10, schema=None, use_pyarrow=None, **kwargs) — Build a pandas DataFrame.
  • build_dicts(size=10, **kwargs) — Build a list of dicts (no DataFrame).
  • create_dataframe_from_dicts(data, schema=None) — Turn a list of dicts into a DataFrame.
  • build_pandas_dataframe(model, size=10, schema=None, use_pyarrow=None, **kwargs) — One-off build without a factory class.

Schema

  • infer_schema(model, schema=None) — Infer a dict of column name → pandas dtype.
  • python_type_to_pandas_dtype(python_type) — Map a Python type to a pandas dtype string.
  • has_nested_structs(model) — Whether the model has nested struct or list-of-struct fields.
  • infer_pyarrow_schema(model) — PyArrow schema for the model, or None if PyArrow is not installed.

Runtime

  • is_pandas_available() — Whether pandas can be imported.
  • is_pyarrow_available() — Whether PyArrow can be imported.

Testing

  • assert_dataframe_equal(df1, df2, ...) — Compare DataFrames (optional row order, dtypes, numeric tolerances).
  • assert_schema_equal(df1, df2, ...) — Compare column dtypes.
  • assert_dtypes_equal(df1, df2, ...) — Alias for schema/dtype comparison.
  • assert_approx_count(df, expected_count, tolerance=0.1) — Assert the row count is within tolerance.
  • assert_column_exists(df, *columns) — Assert that columns exist.
  • assert_no_duplicates(df, columns=None) — Assert there are no duplicate rows.
  • get_column_stats(df, column) — Basic stats (count, nulls, distinct; min/max/mean for numeric columns).

I/O

  • save_as_parquet(df, path, **kwargs) — Save a DataFrame as Parquet.
  • save_as_json(df, path, **kwargs) — Save as JSON.
  • save_as_csv(df, path, header=True, **kwargs) — Save as CSV.
  • load_parquet(path, **kwargs) — Load Parquet into a DataFrame.
  • load_json(path, **kwargs) — Load JSON.
  • load_csv(path, **kwargs) — Load CSV.
  • load_and_validate(path, expected_schema=None, ...) — Load and optionally validate columns/dtypes.
  • save_dicts_as_json(data, path) — Save a list of dicts as JSON lines.
  • load_dicts_from_json(path) — Load JSON lines into a list of dicts.

Exceptions

  • PolypandasError — base
  • PandasNotAvailableError — pandas required but not installed
  • SchemaInferenceError — schema cannot be inferred
  • UnsupportedTypeError — type has no pandas/PyArrow mapping
  • DataIOError — I/O failure
  • DataFrameComparisonError — assertion failure in testing helpers

Testing utilities

from polypandas import (
    assert_dataframe_equal,
    assert_schema_equal,
    assert_approx_count,
    assert_column_exists,
    assert_no_duplicates,
    get_column_stats,
)

assert_dataframe_equal(df1, df2, check_order=False, rtol=1e-5)
assert_schema_equal(df1, df2)
assert_column_exists(df, "user_id", "name", "email")
assert_no_duplicates(df, columns=["user_id"])
stats = get_column_stats(df, "amount")
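
These assertions are thin conveniences over checks you can express in plain pandas. A rough sketch of what assert_column_exists and assert_no_duplicates amount to (hypothetical reimplementations, not the library's code):

```python
import pandas as pd

def check_columns_exist(df: pd.DataFrame, *columns: str) -> None:
    # Fail with the list of missing columns, not just the first one.
    missing = [c for c in columns if c not in df.columns]
    if missing:
        raise AssertionError(f"missing columns: {missing}")

def check_no_duplicates(df: pd.DataFrame, columns=None) -> None:
    # columns=None checks whole rows; otherwise only the given subset.
    dupes = df.duplicated(subset=columns)
    if dupes.any():
        raise AssertionError(f"{int(dupes.sum())} duplicate rows")

df = pd.DataFrame({"user_id": [1, 2, 3], "name": ["a", "b", "b"]})
check_columns_exist(df, "user_id", "name")    # passes silently
check_no_duplicates(df, columns=["user_id"])  # passes: ids are unique
```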

Data I/O

DataFrames:

from polypandas import (
    save_as_parquet,
    save_as_json,
    save_as_csv,
    load_parquet,
    load_json,
    load_csv,
    load_and_validate,
    infer_schema,
)

save_as_parquet(df, "users.parquet")
save_as_csv(df, "users.csv", header=True)

df = load_parquet("users.parquet")
df = load_and_validate("users.parquet", expected_schema=infer_schema(User))

JSON lines (list of dicts):

from polypandas import save_dicts_as_json, load_dicts_from_json

dicts = User.build_dicts(size=100)
save_dicts_as_json(dicts, "users.jsonl")
loaded = load_dicts_from_json("users.jsonl")
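
JSON lines is simply one JSON object per line, so the pair above is easy to picture in stdlib terms (a conceptual sketch, not the library's implementation):

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory

def write_jsonl(records, path):
    # One JSON object per line; no enclosing array.
    with open(path, "w") as fh:
        for record in records:
            fh.write(json.dumps(record) + "\n")

def read_jsonl(path):
    # Skip blank lines, parse each remaining line independently.
    with open(path) as fh:
        return [json.loads(line) for line in fh if line.strip()]

with TemporaryDirectory() as tmp:
    path = Path(tmp) / "users.jsonl"
    write_jsonl([{"id": 1, "name": "ada"}, {"id": 2, "name": "bob"}], path)
    round_tripped = read_jsonl(path)
```

The line-per-record layout is what makes the format friendly to streaming and appending, unlike a single JSON array.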

License

MIT.
