A fast dataframe implementation with Pydantic integration
FastDataFrame
FastDataFrame bridges Pydantic models and dataframe/table backends. A FastDataFrame model owns backend-neutral column definitions, and backend modules expose stateless functions for Polars, PyArrow, and Apache Iceberg.
Supported backends:
- Polars DataFrame and LazyFrame
- PyArrow schemas
- Apache Iceberg schemas/tables through PyIceberg
Core idea
Define the schema once:
from typing import Annotated
from pydantic import BaseModel, Field
from fastdataframe import ColumnInfo, FastDataFrameModel, Int32
class User(BaseModel):
    user_id: Annotated[
        int,
        Field(validation_alias="userId", serialization_alias="user_id"),
        ColumnInfo(dtype=Int32()),
    ]
    name: str
    score: float = 0.0
    nickname: str | None = None
FastUser = FastDataFrameModel.from_base_model(User)
Then generate backend-native schemas with stateless backend functions:
import fastdataframe.polars as fpl
import fastdataframe.pyarrow as farrow
import fastdataframe.iceberg as fice
polars_schema = fpl.schema(FastUser)
arrow_schema = farrow.schema(FastUser)
iceberg_schema = fice.schema(FastUser)
Column definitions
FastDataFrameModel owns immutable, backend-neutral column definitions:
FastUser.column_definitions
FastUser.column_map
ColumnInfo is optional user-authored metadata. Fields without ColumnInfo receive default metadata.
class Trade(FastDataFrameModel):
    trade_id: str
    quantity: Annotated[int, ColumnInfo(dtype=Int32(), is_unique=False)]
Name accessors
Resolved names are available as immutable accessors keyed by Python field name:
FastUser.serialization_names.user_id # "user_id"
FastUser.validation_names.user_id # "userId"
FastUser.storage_names.user_id # "user_id"
FastUser.serialization_names["user_id"]
The storage name is the canonical dataframe/table column name and defaults to the Pydantic serialization name.
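The accessor behaviour described above (attribute access, item access, and immutability) can be pictured with a small standard-library sketch. `NameAccessor` is a hypothetical class written for illustration; it is not FastDataFrame's actual implementation, and the name mappings are copied by hand from the `User` example.

```python
from types import MappingProxyType


class NameAccessor:
    """Read-only view of resolved names keyed by Python field name (hypothetical)."""

    __slots__ = ("_names",)

    def __init__(self, names: dict[str, str]) -> None:
        # Copy into a read-only mapping so later changes to `names` do not leak in.
        object.__setattr__(self, "_names", MappingProxyType(dict(names)))

    def __getattr__(self, field: str) -> str:
        # Only reached when normal attribute lookup fails, i.e. for field names.
        if field.startswith("_"):
            raise AttributeError(field)
        try:
            return self._names[field]
        except KeyError:
            raise AttributeError(field) from None

    def __getitem__(self, field: str) -> str:
        return self._names[field]

    def __setattr__(self, key: str, value: object) -> None:
        raise AttributeError("name accessors are immutable")


serialization = {"user_id": "user_id", "name": "name"}
validation = {"user_id": "userId", "name": "name"}

serialization_names = NameAccessor(serialization)
validation_names = NameAccessor(validation)
# Assumption for this sketch: no override is given, so the storage name
# falls back to the serialization name.
storage_names = NameAccessor(serialization)
```

Both `storage_names.user_id` and `storage_names["user_id"]` resolve to the same string here, and assigning to either accessor raises `AttributeError`.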
Dtype refinements
ColumnInfo(dtype=...) can refine backend schema generation while the Python annotation remains the semantic type.
Initial backend-neutral scalar dtypes include:
- Boolean, String, Binary
- Int8, Int16, Int32, Int64
- Float32, Float64
- Date, Time, Timestamp
- Decimal
Unsigned integer dtypes are intentionally not included initially. Small signed integers are widened when mapped to Iceberg where necessary.
Polars
import polars as pl
import fastdataframe.polars as fpl
raw = pl.DataFrame({"user_id": ["1"], "name": ["Alice"], "score": ["1.5"], "nickname": [None]})
cast_df = fpl.cast(FastUser, raw)
errors = fpl.validate_schema(FastUser, cast_df)
fpl.string_schema(FastUser) returns a schema with all columns as strings for ingest flows.
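The ingest pattern behind an all-string schema is to read every column as text first and cast afterwards. The backend-free sketch below illustrates that flow on a single row; `CASTERS` and `cast_row` are hypothetical stand-ins and do not reflect how `fpl.cast` is implemented.

```python
from typing import Callable, Optional

# Hypothetical per-column casters standing in for the model's target dtypes.
CASTERS: dict[str, Callable[[str], object]] = {
    "user_id": int,
    "name": str,
    "score": float,
    "nickname": str,
}


def cast_row(raw: dict[str, Optional[str]]) -> dict[str, object]:
    """Cast one all-string row to typed values, passing None through unchanged."""
    return {
        column: None if value is None else CASTERS[column](value)
        for column, value in raw.items()
    }


# Every value arrives as a string (or None), as it would under a string schema.
raw_row = {"user_id": "1", "name": "Alice", "score": "1.5", "nickname": None}
typed_row = cast_row(raw_row)
```

Reading as strings first keeps parsing failures in the explicit cast step, where they can be reported per column, rather than in the file reader.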
PyArrow
import fastdataframe.pyarrow as farrow
schema = farrow.schema(FastUser)
string_schema = farrow.string_schema(FastUser)
PyArrow schemas encode nullability from Optional / None unions. Pydantic defaults do not imply nullable storage.
Iceberg
import fastdataframe.iceberg as fice
schema = fice.schema(FastUser)
Iceberg migration support is additive-only by default:
fice.apply_additive_migration(FastUser, table)
Destructive deletes are intentionally not automatic.
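An additive-only policy can be sketched as a plan that lists columns to add and leaves everything else untouched. The function below is illustrative only: `plan_additive_migration` is a hypothetical helper operating on plain column-name lists, not the logic inside `fice.apply_additive_migration`.

```python
def plan_additive_migration(
    model_columns: list[str], table_columns: list[str]
) -> tuple[list[str], list[str]]:
    """Return (columns to add, columns left in place) for an additive-only migration."""
    existing = set(table_columns)
    wanted = set(model_columns)
    # Columns the model declares but the table lacks are added, in model order.
    to_add = [column for column in model_columns if column not in existing]
    # Columns the table has but the model no longer declares are reported, never dropped.
    left_in_place = [column for column in table_columns if column not in wanted]
    return to_add, left_in_place


to_add, left_in_place = plan_additive_migration(
    model_columns=["user_id", "name", "score", "nickname"],
    table_columns=["user_id", "name", "very_old_score"],
)
```

Returning the untouched columns separately makes any destructive step a deliberate, reviewed follow-up rather than a side effect of the migration.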
For Polars-to-Iceberg persistence, data is written through the FastDataFrame-generated PyArrow schema boundary:
fice.append_polars(FastUser, table, cast_df)
This is important because Polars schemas do not encode column nullability in the same way as PyArrow and Iceberg.
Column lifecycle
Deprecated fields remain model fields, remain in schemas, and must be nullable:
class UserV2(FastDataFrameModel):
    old_score: Annotated[float | None, ColumnInfo(deprecated=True)]
    score: float
Deprecated and removed column names can be reserved through model config to prevent unsafe reuse:
class UserV3(FastDataFrameModel):
    model_config = {
        "fastdataframe_deprecated_column_names": {"old_score"},
        "fastdataframe_removed_column_names": {"very_old_score"},
    }
    score: float
Removed names remain reserved even if a backend later physically deletes the column.
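The reuse check implied by reserved names amounts to a set intersection between the model's column names and the reserved sets. The sketch below assumes that reading; `find_reserved_name_conflicts` is a hypothetical helper, and how FastDataFrame actually detects and reports such conflicts is not shown in this document.

```python
def find_reserved_name_conflicts(
    column_names: set[str], deprecated: set[str], removed: set[str]
) -> list[str]:
    """Return, in sorted order, the column names that collide with a reserved name."""
    reserved = deprecated | removed
    return sorted(column_names & reserved)


deprecated_names = {"old_score"}
removed_names = {"very_old_score"}

# A model that only declares `score` does not touch any reserved name.
safe = find_reserved_name_conflicts({"score"}, deprecated_names, removed_names)

# Reintroducing `very_old_score` would reuse a removed name and should be rejected.
unsafe = find_reserved_name_conflicts(
    {"score", "very_old_score"}, deprecated_names, removed_names
)
```

Keeping removed names in the reserved set means the check still fires after the physical column is gone, which is the case where accidental reuse is hardest to notice.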
Installation
pip install fastdataframe
# or with optional backends
pip install 'fastdataframe[polars,pyarrow,iceberg]'
Development
uv sync --all-extras
uv run pytest tests/
uv run ruff check .
uv run ruff format .
uv run ty check
License
MIT