A Datastore mode DataFrame interface with pandas and optional Polars support.

datastore-pandas

datastore-pandas is a schema-aware DataFrame interface for Firestore in Datastore mode. It provides a pandas API by default and an optional Polars adapter through datastore-pandas[polars]. It is designed for Datastore's real execution model: indexed entity queries, key lookups, projections, cursor scans, transactions, and batched entity writes.

It does not try to turn Datastore into SQL or a general analytics engine. Datastore is an operational NoSQL entity store, so joins, aggregations, and broad server-side query planning are outside the package boundary. This package keeps that boundary explicit so DataFrame workflows remain convenient without hiding Datastore's limits.

Current Status

This is an initial implementation scaffold. The code already covers the core shape of the package:

  • explicit schema objects for safe DataFrame-to-entity conversion
  • first-class Datastore key representation and row-to-key mapping
  • sparse entity writes that omit missing DataFrame values by default
  • typed conversion for strings, integers, floats, booleans, timestamps, blobs, keys, arrays, geo points, and embedded entities
  • query construction for projections, keys-only queries, distinct projections, ancestor queries, filters, orderings, and limits
  • batched writes with bounded concurrency
  • read-merge-write patch updates for partial DataFrames
  • dry-run write planning, read-only write blocking, and skip-unchanged writes
  • instantiated kind accessors for bound schema/client/query defaults
  • deterministic key-policy and audit timestamp helpers
  • logical duplicate cleanup planning and opt-in delete execution
  • a transaction helper for small read-modify-write workflows
  • index-planning helpers that produce index.yaml-style suggestions
  • emulator examples for local integration testing
  • an optional Polars adapter with matching read, chunked-read, write, and patch operations

The package is not yet a complete production client. The current high-level write path relies on google-cloud-datastore; lower-level mutation support is still the right next step for property masks, compare-and-swap writes, generated-key result metadata, and conflict details.

License And Disclaimer

This repository uses The Unlicense, a public-domain-style dedication. The repository does not identify an owner and does not make ownership claims over the software.

Before using, copying, modifying, or relying on this repository, read the Disclaimer. In short: this project is experimental, provided without warranty, not recommended for any particular use, not guaranteed to be maintained, and includes AI-generated or AI-assisted software and documentation.

Release history is documented in the Changelog.

Installation

From this repository:

cd <repo>
python -m pip install -e ".[test]"

For the optional Polars adapter:

python -m pip install -e ".[test,polars]"

For release metadata and distribution checks:

python -m pip install -e ".[test,polars,release]"
python scripts\check_release.py

When installed from a package index, use:

python -m pip install datastore-pandas
python -m pip install "datastore-pandas[polars]"

The base package depends on pandas and google-cloud-datastore. The Polars adapter is optional and is installed with the polars extra.

Minimum supported third-party versions:

Package                 Minimum  Notes
Python                  3.10     Earlier Python versions are not supported.
google-cloud-datastore  2.20.0   Required for all Datastore reads and writes.
pandas                  2.0.0    Required by the base package and timestamp conversion layer.
polars                  1.0.0    Optional; required only for datastore_pandas.polars.
pytest                  8.0.0    Test/dev extra only.
ruff                    0.8.0    Test/dev extra only.
build                   1.2.0    Release extra only.
twine                   6.0.0    Release extra only.

The current local verification pass used Python 3.11 with google-cloud-datastore 2.24.0, pandas 3.0.3, polars 1.40.1, pytest 9.0.3, and ruff 0.15.14. For repeatable production or CI runs, pin exact dependency versions in your application environment.

For local development against the emulator, no Google Cloud credentials are needed when DATASTORE_EMULATOR_HOST is set. For real Datastore mode projects, configure Application Default Credentials:

gcloud auth application-default login

Quick Start

import datastore_pandas as dsp

schema = dsp.Schema(
    kind="Workout",
    key=dsp.KeySpec([
        ("User", dsp.KeyPart("user_id", kind="name")),
        ("Workout", dsp.KeyPart("workout_id", kind="name")),
    ]),
    properties={
        "started_at": dsp.Field(dsp.TimestampType(), nullable=False),
        "duration_sec": dsp.Field(dsp.Int64Type(), nullable=False),
        "distance_m": dsp.Field(dsp.Float64Type()),
        "activity_type": dsp.Field(dsp.StringType(), nullable=False),
        "notes": dsp.Field(dsp.StringType(), indexed=False),
    },
    strict=True,
)

df = dsp.read_datastore(
    kind="Workout",
    schema=schema,
    filters=[("activity_type", "=", "run")],
    projection=["started_at", "duration_sec", "distance_m"],
    order=["-started_at"],
    include_key=True,
)

report = dsp.to_datastore(
    df,
    schema=schema,
    mode="upsert",
    batch_size=400,
    max_workers=8,
)
report.raise_for_errors()

For Polars, install the optional extra and use the adapter module. It shares the schema, query, key, batching, projection, and sparse-write behavior of the pandas API:

import datastore_pandas as dsp
import datastore_pandas.polars as dsp_pl

df = dsp_pl.read_datastore(kind="Workout", schema=schema, limit=1000)
report = dsp_pl.to_datastore(df, schema=schema, batch_size=400)

Why Schema Is Required For Writes

Datastore mode does not enforce one fixed schema per kind. Two entities of the same kind can have different property sets and different property types. pandas and Polars, however, rectangularize data into columns. Without an explicit schema, a write adapter cannot safely tell whether a missing DataFrame cell means:

  • the property should be omitted
  • the property should be written as Datastore null
  • the property is required and the row is invalid
  • the property is an accidental extra column from another entity shape

datastore-pandas makes that decision explicit with Schema and Field.

schema = dsp.Schema(
    kind="Document",
    properties={
        "title": dsp.Field(dsp.StringType(), nullable=False),
        "summary": dsp.Field(dsp.StringType()),
        "raw_text": dsp.Field(dsp.StringType(), indexed=False),
    },
)

By default, nullable missing values are omitted:

encoded, excluded = schema.encode_properties({
    "title": "present",
    "summary": None,
})

assert encoded == {"title": "present"}

To intentionally write a Datastore null property, opt in:

dsp.Field(dsp.StringType(), missing_policy="null")

For required fields, use nullable=False. Missing or NA values will raise a schema error before the row reaches Datastore.
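
The three outcomes for a missing value (omit, write null, reject) can be sketched without the package. This is a minimal illustration, not the package's internals: encode_row, is_missing, and the dict-based field specs are hypothetical names, and the "omit" policy string is assumed by analogy with missing_policy="null".

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing, as DataFrame cells usually are."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def encode_row(row, fields):
    """Apply a per-field missing policy to one row.

    `fields` maps a property name to a dict with the hypothetical keys
    "nullable" (bool, default True) and "missing_policy" ("omit" or "null").
    """
    unknown = set(row) - set(fields)
    if unknown:
        # An accidental extra column is rejected instead of silently written.
        raise ValueError(f"unexpected columns: {sorted(unknown)}")
    encoded = {}
    for name, spec in fields.items():
        value = row.get(name)
        if not is_missing(value):
            encoded[name] = value
        elif not spec.get("nullable", True):
            raise ValueError(f"required property {name!r} is missing")
        elif spec.get("missing_policy", "omit") == "null":
            encoded[name] = None  # store an explicit null
        # Otherwise the property is omitted from the entity entirely.
    return encoded


fields = {
    "title": {"nullable": False},
    "summary": {},
    "reviewed_by": {"missing_policy": "null"},
}
result = encode_row({"title": "present", "summary": float("nan")}, fields)
```

Here summary is dropped, reviewed_by is kept as an explicit null, and a row without title would raise before any write is attempted.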

Sparse Entities And DataFrames

Sparse entities are a first-class design case. For example, a Workout kind might store swim, bike, and run entities together:

Property         Swim  Bike  Run
started_at       yes   yes   yes
duration_sec     yes   yes   yes
pool_length_m    yes   no    no
bike_power_w     no    yes   no
run_cadence_spm  no    no    yes

When these entities are read into pandas or Polars, the DataFrame must contain all columns, so absent Datastore properties appear as missing values. On write, those missing values should not become stored null properties on every entity. The default missing_policy is therefore omit.

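This matters for correctness, index size, write cost, and query behavior. A stored null on an indexed property is itself an indexed value: it adds index entries and makes the entity eligible for queries that filter or sort on that property, while an omitted property does neither.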

Key Management

Keys are identity, not ordinary properties. datastore-pandas preserves the full Datastore key shape:

  • project
  • database
  • namespace
  • ancestor path
  • kind names
  • string name IDs
  • numeric IDs
  • incomplete leaf keys for auto-ID allocation

Example:

key = dsp.DatastoreKey(
    project="my-project",
    namespace="tenant-a",
    path=(
        ("User", "sample-user"),
        ("Workout", 123456789),
    ),
)

For DataFrame writes, KeySpec maps row columns to key path elements:

key = dsp.KeySpec(
    [
        ("User", dsp.KeyPart("user_id", kind="name")),
        ("Workout", dsp.KeyPart("workout_id", kind="name")),
    ],
    namespace_source="tenant",
)

The package keeps numeric IDs and string names distinct: the numeric ID 123 and the string name "123" identify different Datastore keys.
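
The row-to-key mapping can be pictured as below. This is a sketch under stated assumptions: build_key_path and the (kind, column, id_kind) tuple layout are hypothetical stand-ins for KeySpec and KeyPart, not the package's implementation.

```python
def build_key_path(row, key_spec):
    """Map row columns to a Datastore-style key path.

    `key_spec` is a list of (kind, column, id_kind) tuples, where id_kind is
    "name" for string names or "id" for numeric IDs (hypothetical layout).
    """
    path = []
    for kind, column, id_kind in key_spec:
        value = row[column]
        if id_kind == "name":
            if not isinstance(value, str) or not value:
                raise TypeError(f"{column!r} must be a non-empty string name")
        elif id_kind == "id":
            # bool is a subclass of int, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, int):
                raise TypeError(f"{column!r} must be an integer ID")
        else:
            raise ValueError(f"unknown id_kind: {id_kind!r}")
        path.append((kind, value))
    return tuple(path)


spec_by_name = [("User", "user_id", "name"), ("Workout", "workout_id", "name")]
spec_by_id = [("User", "user_id", "name"), ("Workout", "workout_id", "id")]

named = build_key_path({"user_id": "sample-user", "workout_id": "123"}, spec_by_name)
numbered = build_key_path({"user_id": "sample-user", "workout_id": 123}, spec_by_id)
```

The two resulting paths differ only in the type of the leaf identifier, which is enough to address two different entities. Validating the type per key part avoids silently coercing a numeric column into string names or the reverse.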

Reading

Use read_datastore for normal DataFrame reads:

df = dsp.read_datastore(
    kind="Workout",
    schema=schema,
    filters=[("activity_type", "=", "bike")],
    order=["-started_at"],
    limit=1000,
    include_key=True,
)

Use datastore_pandas.polars.read_datastore with the same arguments when you want a Polars DataFrame.

Use projections to read a subset of properties. Datastore projection queries can return only indexed properties:

df = dsp.read_datastore(
    kind="Workout",
    schema=schema,
    projection=["started_at", "duration_sec", "distance_m"],
    order=["-started_at"],
)

Use keys-only queries when planning deletes, existence checks, or staged fan-out lookups:

keys = dsp.read_datastore(
    kind="Workout",
    keys_only=True,
    include_key=True,
)

Use iter_datastore for chunked processing:

for chunk in dsp.iter_datastore(kind="Workout", schema=schema, chunksize=1000):
    process(chunk)

Writing

Use to_datastore for full entity writes:

report = dsp.to_datastore(
    df,
    schema=schema,
    mode="upsert",
    batch_size=400,
    max_workers=8,
)
report.raise_for_errors()

The write path:

  • validates rows against the schema
  • builds Datastore keys from KeySpec or __key__
  • converts DataFrame values to Datastore-safe values
  • omits nullable missing values by default
  • excludes unindexed fields from indexes
  • rejects duplicate complete keys in one commit
  • chunks writes into bounded batches
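
The last two steps of the write path, duplicate-key rejection and bounded batching, can be sketched as follows. plan_batches is a hypothetical helper written for illustration, and keys are represented as plain tuples rather than the package's key objects.

```python
def plan_batches(keys, batch_size=400):
    """Split complete keys into bounded batches.

    Raises if the same key appears twice inside one batch, since a single
    commit cannot meaningfully write one entity twice.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batches = []
    for start in range(0, len(keys), batch_size):
        batch = keys[start:start + batch_size]
        seen = set()
        for key in batch:
            if key in seen:
                raise ValueError(f"duplicate key in one commit: {key!r}")
            seen.add(key)
        batches.append(batch)
    return batches


keys = [(("Workout", n),) for n in range(1, 1001)]
batches = plan_batches(keys, batch_size=400)
sizes = [len(batch) for batch in batches]
```

With 1,000 keys and a batch size of 400, the plan contains batches of 400, 400, and 200 keys. Each batch could then be handed to a worker pool bounded by max_workers.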

For policy-aware writes, opt into planning:

plan = dsp.plan_datastore_write(
    df,
    schema=schema,
    skip_unchanged=True,
)

dry_report = dsp.to_datastore(df, schema=schema, dry_run=True)
report = dsp.to_datastore(df, schema=schema, skip_unchanged=True)

skip_unchanged=True reads existing entities before writing. Full writes compare the effective replacement payload, so extra existing properties still count as a change. Patch writes compare only the patched properties.
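
The difference between the two comparisons can be illustrated with plain dictionaries. is_unchanged is a hypothetical function that mirrors the rule described above; it is not the package's planner.

```python
def is_unchanged(existing, payload, patch=False):
    """Decide whether writing `payload` would change the stored entity.

    Full writes replace the entity, so any difference counts, including
    properties that exist in storage but not in the payload. Patch writes
    compare only the properties being patched.
    """
    if existing is None:
        return False  # nothing stored yet, so the write is a change
    if patch:
        return all(
            name in existing and existing[name] == value
            for name, value in payload.items()
        )
    return existing == payload


existing = {"title": "a", "notes": "old", "legacy_flag": True}

# Full write: legacy_flag would be dropped, so this is a change.
full_same = is_unchanged(existing, {"title": "a", "notes": "old"})
# Patch write: only notes is compared.
patch_same = is_unchanged(existing, {"notes": "old"}, patch=True)
patch_diff = is_unchanged(existing, {"notes": "new"}, patch=True)
```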

insert and update modes are represented in the API, but full correctness for those modes depends on the active Datastore batch backend exposing insert/update methods. upsert is the safest path in the initial scaffold.

Instantiated Kind Accessors

dsp.kind(...) creates a bound accessor for one schema/kind. It is not a DataFrame subclass; it is a scoped Datastore accessor that returns and accepts pandas or Polars DataFrames.

workouts = dsp.kind(
    schema=schema,
    client=client,
    namespace="tenant-a",
    filters=[("activity_type", "=", "run")],
    backend="pandas",
    read_only=False,
)

df = workouts.read(limit=1000)
plan = workouts.plan_write(df, skip_unchanged=True)
report = workouts.write(df, skip_unchanged=True)

Bound accessors can carry an ancestor scope:

user_workouts = workouts.with_ancestor(
    dsp.DatastoreKey(namespace="tenant-a", path=(("User", "sample-user"),))
)

When an ancestor is bound, writes are validated so generated keys must stay under that ancestor path and namespace.
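
A bound-ancestor check of this kind amounts to a namespace comparison plus a path-prefix test. The function below is an illustrative sketch with hypothetical names, using tuples of (kind, id_or_name) pairs as key paths.

```python
def is_under_ancestor(key_path, key_namespace, ancestor_path, ancestor_namespace):
    """Return True when the key is a strict descendant of the ancestor."""
    return (
        key_namespace == ancestor_namespace
        and len(key_path) > len(ancestor_path)
        and tuple(key_path[:len(ancestor_path)]) == tuple(ancestor_path)
    )


ancestor = (("User", "sample-user"),)

inside = is_under_ancestor(
    (("User", "sample-user"), ("Workout", "w-1")), "tenant-a", ancestor, "tenant-a"
)
wrong_parent = is_under_ancestor(
    (("User", "other-user"), ("Workout", "w-1")), "tenant-a", ancestor, "tenant-a"
)
wrong_namespace = is_under_ancestor(
    (("User", "sample-user"), ("Workout", "w-1")), "tenant-b", ancestor, "tenant-a"
)
```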

For deterministic keys, key_policy is a convenience wrapper around KeySpec:

schema = dsp.Schema(
    kind="Workout",
    key=dsp.key_policy(
        "Workout",
        id_field="workout_id",
        namespace_field="tenant",
        ancestors=[("User", "user_id")],
    ),
    properties={...},
)

Accessors can also apply custom audit timestamp fields before planning/writing:

workouts = dsp.kind(
    schema=schema,
    client=client,
    audit=dsp.AuditPolicy(
        created_at="created_at",
        updated_at="updated_at",
        imported_at="imported_at",
    ),
)

Logical duplicate cleanup is explicit and dry-run by default:

plan = workouts.plan_duplicate_cleanup(
    by=["external_id"],
    order=["-updated_at"],
)

dry_report = workouts.cleanup_duplicates(
    by=["external_id"],
    order=["-updated_at"],
)

delete_report = workouts.cleanup_duplicates(
    by=["external_id"],
    order=["-updated_at"],
    dry_run=False,
)

This cleanup is for logical duplicates by property values. Datastore cannot store two entities with the same exact key.
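
A logical duplicate plan can be sketched as group, sort, keep the first, delete the rest. The function below is a hypothetical illustration; it assumes that ordering by "-updated_at" means the most recently updated row in each group survives.

```python
from collections import defaultdict


def plan_duplicates(rows, by, newest_field):
    """Group rows by the `by` properties and keep the newest row per group.

    Returns (keys_to_keep, keys_to_delete). Nothing is deleted here; the
    result is only a plan, mirroring the dry-run default described above.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[name] for name in by)].append(row)
    keep, delete = [], []
    for group in groups.values():
        ordered = sorted(group, key=lambda r: r[newest_field], reverse=True)
        keep.append(ordered[0]["__key__"])
        delete.extend(r["__key__"] for r in ordered[1:])
    return keep, delete


rows = [
    {"__key__": ("Workout", 1), "external_id": "a", "updated_at": 10},
    {"__key__": ("Workout", 2), "external_id": "a", "updated_at": 30},
    {"__key__": ("Workout", 3), "external_id": "b", "updated_at": 20},
    {"__key__": ("Workout", 4), "external_id": "a", "updated_at": 20},
]
keep, delete = plan_duplicates(rows, by=["external_id"], newest_field="updated_at")
```

For external_id "a", the row updated at 30 is kept and the other two are planned for deletion; the single "b" row is untouched.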

DataFrame Models

dsp.dspdf(...) is a second layer over dsp.kind(...). It owns a current DataFrame and retains source context:

events = dsp.dspdf(
    kind="Workout",
    schema=schema,
    client=client,
    backend="pandas",
    namespace="tenant-a",
    keep_original=True,
)

events = events.load()
events.df["duration_sec"] = events.df["duration_sec"] + 60
report = events.write(skip_unchanged=True)

Derived or aggregated frames cannot write back to the source kind by accident:

summary = events.aggregate(
    by=["user_id"],
    metrics={"duration_sec": "sum"},
    target_schema=summary_schema,
)

summary.write_to(schema=summary_schema)

If a transformed DataFrame drops source keys, changes row shape, or is explicitly marked as derived, use with_target(...) or write_to(...) with a target schema.

Schema Inference

Datastore entities are typed, but a kind can still contain sparse and mixed-shape entities. The package can infer a conservative schema from sampled entities or a DataFrame:

report = dsp.infer_schema(kind="Event", client=client, sample_size=1000)
schema = report.schema
print(report.mixed_fields)

events = dsp.dspdf(kind="Event", client=client, infer_schema=True).load()

If one property has different value types across entities, the default mixed_type_policy="object" uses a pass-through field type and records the observed variants in the inference report. Use mixed_type_policy="error" to fail on mixed types, or "string" to coerce mixed values on writes.
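
The idea behind the "object" policy can be shown with a small stand-alone sketch. infer_field_types is a hypothetical function that uses Python type names instead of the package's field types.

```python
def infer_field_types(entities):
    """Infer one type name per property from sampled entity dicts.

    Properties seen with more than one value type fall back to "object"
    and are reported separately with their observed variants.
    """
    observed = {}
    for entity in entities:
        for name, value in entity.items():
            if value is None:
                continue  # nulls do not contribute a type
            observed.setdefault(name, set()).add(type(value).__name__)
    inferred = {
        name: next(iter(types)) if len(types) == 1 else "object"
        for name, types in observed.items()
    }
    mixed = {name: sorted(types) for name, types in observed.items() if len(types) > 1}
    return inferred, mixed


sample = [
    {"title": "a", "score": 1},
    {"title": "b", "score": "high"},
    {"title": "c", "tags": ["x"]},
]
inferred, mixed = infer_field_types(sample)
```

In this sample, score is inferred as "object" and reported as mixed with the variants int and str, while title and tags keep a single type. A property that appears in only some entities, such as tags, is sparse rather than mixed.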

Patching Partial DataFrames

Projection queries and sparse application workflows often produce partial DataFrames. Do not send those through replacement-style writes unless you intend to replace the full entity.

Use patch_datastore:

patch = df[["__key__", "notes", "last_reviewed_at"]]

dsp.patch_datastore(
    patch,
    schema=schema,
    properties=["notes", "last_reviewed_at"],
)

The current implementation uses read-merge-write so omitted properties are preserved. A lower-level Datastore Commit backend should eventually replace this for native property_mask support.
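
Read-merge-write can be sketched with a dictionary standing in for Datastore. patch_entity and the in-memory store are hypothetical; a real implementation would also need to handle concurrent writers, for example by doing the read and write inside a transaction.

```python
def patch_entity(store, key, patch, properties):
    """Update only `properties` on the stored entity and keep everything else."""
    existing = store.get(key)
    if existing is None:
        raise KeyError(f"cannot patch missing entity: {key!r}")
    merged = dict(existing)  # read
    for name in properties:  # merge only the declared properties
        if name in patch:
            merged[name] = patch[name]
    store[key] = merged  # write back the full entity
    return merged


store = {
    ("Workout", 1): {
        "started_at": "2024-01-01T00:00:00Z",
        "duration_sec": 1800,
        "notes": "old",
    }
}
merged = patch_entity(
    store,
    ("Workout", 1),
    {"notes": "reviewed", "duration_sec": 5},
    properties=["notes"],
)
```

Only notes changes. The duration_sec value in the patch row is ignored because it is not listed in properties, and started_at is preserved even though the patch never mentions it.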

Transactions

Transactions are for small atomic workflows, not bulk ingestion:

with dsp.Transaction(client) as tx:
    row = tx.get(counter_key, schema=counter_schema)
    row["value"] += 1
    tx.put(row, schema=counter_schema)

Keep transactions small, retryable, and focused on read-modify-write logic. Bulk writes should use to_datastore.

Index Planning

plan_indexes provides conservative local guidance for composite index needs:

query = dsp.QuerySpec(
    kind="Workout",
    filters=[("activity_type", "=", "run"), ("started_at", ">=", start)],
    order=["started_at"],
)

plan = dsp.plan_indexes(query)
for suggestion in plan.suggestions:
    print(suggestion.to_index_yaml())

This is not a replacement for Datastore Query Explain. It is intended to catch common index shapes before runtime and generate a starting point for index.yaml.
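
As a rough picture of what such a suggestion involves, the sketch below orders properties as equality filters first, then inequality filters, then remaining sort orders, and prints an index.yaml-style entry. suggest_index is a hypothetical helper and deliberately simpler than a real planner; the heuristic that single-property queries need no composite index is an assumption of this sketch.

```python
from datetime import datetime, timezone


def suggest_index(kind, filters, order):
    """Return an index.yaml-style suggestion, or None for single-property queries."""
    directions = {}
    for item in order:
        name = item.lstrip("-")
        directions[name] = "desc" if item.startswith("-") else "asc"

    names = []

    def add(name):
        if name not in names:
            names.append(name)

    for name, op, _ in filters:
        if op == "=":
            add(name)
    for name, op, _ in filters:
        if op != "=":
            add(name)
    for name in directions:
        add(name)

    if len(names) < 2:
        return None  # assumed to be served by built-in single-property indexes

    lines = ["indexes:", f"- kind: {kind}", "  properties:"]
    for name in names:
        lines.append(f"  - name: {name}")
        if name in directions:
            lines.append(f"    direction: {directions[name]}")
    return "\n".join(lines)


start = datetime(2024, 1, 1, tzinfo=timezone.utc)
suggestion = suggest_index(
    "Workout",
    filters=[("activity_type", "=", "run"), ("started_at", ">=", start)],
    order=["started_at"],
)
print(suggestion)
```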

Emulator Examples

The repository includes a Docker Compose setup that runs the Firestore emulator in Datastore mode and a complete sparse-data example suite:

docker compose -f examples\emulator\docker-compose.yml up --build

In a second PowerShell terminal:

$env:DATASTORE_EMULATOR_HOST = "localhost:8081"
$env:DATASTORE_PROJECT_ID = "datastore-pandas-emulator"
python -m pip install -e ".[test,polars]"
python examples\emulator\run_all.py --rows 20000 --workers 8
python examples\emulator\run_all.py --backend polars --rows 20000 --workers 8

The emulator examples include:

  • generate_mock_data.py: creates heterogeneous swim/bike/run workout rows
  • load_mock_data.py: loads large DataFrames with batched concurrent writes
  • query_examples.py: demonstrates full reads, projections, keys-only queries, and distinct projections
  • patch_sparse_rows.py: patches a subset of properties without filling sparse entities with nulls
  • transaction_example.py: increments a counter transactionally
  • policy_examples.py: validates the instantiated accessor, dry-run/read-only write policies, skip-unchanged writes, audit fields, bound ancestor validation, and logical duplicate cleanup against the emulator
  • dataframe_model_examples.py: validates dspdf(...) source context, row-preserving write-back, derived aggregate target writes, and schema inference for mixed property types
  • edge_case_examples.py: focused executable checks for no-client dry-run and read-only planning, clear write reports, deterministic keys, bound ancestors, audit timestamp aliases, retry/batch behavior, injectable fake clients, skip-unchanged writes, duplicate cleanup, derived model targets, and mixed-type schema inference
  • inspect_sparse_entities.py: inspects raw entities to confirm sparse properties are omitted
  • index_planning.py: prints index suggestions
  • reset_emulator_data.py: clears sample entities from the emulator
  • public_divvy_ancestor_test.py: downloads public Divvy bike-share trip data, loads Dataset -> Station -> Ride ancestor paths, and validates ancestor, projection, and keys-only queries
  • large_linked_dataset.py: loads a synthetic linked-kind dataset with Tenant -> LinkedUser -> LinkedSession -> LinkedEvent ancestor paths and KeyType references across LinkedUser, LinkedDevice, LinkedSession, and LinkedEvent; the default local run uses 200,000 events, while --events 1000001 is available as a host-memory stress test

The main examples accept --backend pandas or --backend polars.

Full instructions are in the emulator examples README.

Release Checks

The project includes a local release validation helper:

python -m pip install -e ".[test,polars,release]"
python scripts\check_release.py

The helper checks that pyproject.toml, src/datastore_pandas/__init__.py, and CHANGELOG.md agree on the release version, then runs ruff, compileall, pytest, python -m build, and twine check --strict.

The GitHub publish workflow runs the same metadata/test checks before building and publishing distributions through PyPI Trusted Publishing. Publishing is triggered by a GitHub Release.

Type Mapping

Datastore concept  Package type
integer            Int64Type
double             Float64Type
boolean            BoolType
timestamp          TimestampType
string             StringType
blob               BlobType
key                KeyType / DatastoreKey
geo point          GeoPointType / GeoPoint
array              ArrayType(...)
embedded entity    EmbeddedEntityType

Important conversion rules:

  • timestamps are normalized to UTC
  • Datastore timestamp precision is microseconds
  • integers are validated against signed 64-bit bounds
  • indexed strings and blobs must fit Datastore indexed-value limits (1,500 bytes per indexed value)
  • arrays cannot contain nested arrays
  • unindexed fields should be declared with indexed=False
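
Three of these rules are easy to illustrate with the standard library. The helpers below are hypothetical sketches rather than the package's converters; in particular, treating naive datetimes as already being in UTC is an assumption made here for simplicity.

```python
from datetime import datetime, timedelta, timezone

INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


def normalize_timestamp(value):
    """Return a timezone-aware UTC datetime (microsecond precision)."""
    if value.tzinfo is None:
        # Assumption: naive values are already expressed in UTC.
        return value.replace(tzinfo=timezone.utc)
    return value.astimezone(timezone.utc)


def check_int64(value):
    """Validate that an integer fits in a signed 64-bit value."""
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError(f"expected an integer, got {type(value).__name__}")
    if not INT64_MIN <= value <= INT64_MAX:
        raise OverflowError(f"{value} does not fit in a signed 64-bit integer")
    return value


def check_array(values):
    """Reject arrays that directly contain another array."""
    if any(isinstance(item, (list, tuple)) for item in values):
        raise ValueError("arrays cannot contain nested arrays")
    return list(values)


local = datetime(2024, 5, 1, 9, 30, tzinfo=timezone(timedelta(hours=2)))
utc = normalize_timestamp(local)
largest = check_int64(2**63 - 1)
tags = check_array(("swim", "bike"))
```

Python's datetime already stores microseconds, so no rounding is needed in this sketch. Values from a nanosecond-resolution source would need an explicit truncation or rounding decision before reaching this step.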

Design Principles

  • Prefer Datastore-native operations over pretending arbitrary pandas operations can be pushed down.
  • Make schemas explicit for writes.
  • Treat keys as first-class identity values.
  • Omit nullable missing values by default to preserve sparse entities.
  • Use projection, keys-only, ancestor, and cursor-aware queries where appropriate.
  • Keep transactions explicit and small.
  • Use a SQL or analytical database for analytics, joins, and broad scans.

Repository Layout

src/datastore_pandas/
  accessor.py      instantiated kind accessor API
  audit.py         custom audit timestamp policies
  batches.py       batch planning and duplicate-key checks
  convert.py       row/entity conversion
  errors.py        package exceptions
  inference.py     schema inference and mixed-type reports
  io.py            read_datastore, iter_datastore, to_datastore, patch_datastore
  keys.py          DatastoreKey, KeySpec, KeyPart
  model.py         dspdf DataFrame-owning model layer
  planning.py      dry-run, read-only, and skip-unchanged write planning
  polars.py        optional Polars adapter
  query.py         QuerySpec and index planning
  reports.py       write result reporting
  schema.py        Schema and Field
  transaction.py   transaction context manager
  types.py         Datastore type converters

scripts/
  check_release.py release metadata, test, build, and twine-check helper

examples/
  basic_usage.py
  instantiated_accessor.py
  emulator/
    docker-compose.yml
    Dockerfile
    README.md
    edge_case_examples.py
    run_all.py

Limitations And Next Steps

Current limitations:

  • pytest and google-cloud-datastore must be installed locally to run the full test and emulator flow.
  • patch_datastore uses read-merge-write instead of native mutation property masks.
  • write reports do not yet include generated keys, entity versions, update times, or conflict details from lower-level mutation results.
  • compare-and-swap writes using base_version or update_time are not implemented yet.
  • aggregation queries and Query Explain are design targets but not implemented in the package API yet.
  • the index planner is conservative and should be validated against emulator and production Query Explain output.
  • the Polars adapter shares the same Datastore backend; transaction helpers still work with dictionaries rather than DataFrame-native transaction objects.

Useful next work:

  • add a lower-level Datastore Commit backend
  • support native property masks and conflict detection
  • add generated-key allocation and result mapping
  • add aggregation helpers such as count
  • add Query Explain integration
  • add live emulator integration tests in CI

Download files

Source Distribution

datastore_pandas-0.2.1.tar.gz (64.5 kB view details)

Uploaded Source

Built Distribution

datastore_pandas-0.2.1-py3-none-any.whl (41.8 kB view details)

Uploaded Python 3

File details

Details for the file datastore_pandas-0.2.1.tar.gz.

File metadata

  • Download URL: datastore_pandas-0.2.1.tar.gz
  • Upload date:
  • Size: 64.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for datastore_pandas-0.2.1.tar.gz
Algorithm Hash digest
SHA256 95f0f5ff5b8e4ee66c91bf54fc67538208cf9024da6512551035e8f012c4496f
MD5 12eb2a39fe9fc461dce8702d98fa66c2
BLAKE2b-256 649b8fc74c2395388899c135f53f0db4d557e137aff95f0e1c06033da1c6fe39

Provenance

The following attestation bundles were made for datastore_pandas-0.2.1.tar.gz:

Publisher: publish.yml on gregsuniverse/datastore-pandas

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file datastore_pandas-0.2.1-py3-none-any.whl.

File hashes

Hashes for datastore_pandas-0.2.1-py3-none-any.whl
Algorithm Hash digest
SHA256 f75d52d7a889cf08435965b5bdfb1960d538686f841889a6b11b1066bd72e488
MD5 7edba41fd60ed916913eee64693d5bf7
BLAKE2b-256 add571a4d8396a4a87456f8eb33b08ce44d9fc89ad24cb42d8ae0118791f3cb4

Provenance

The following attestation bundles were made for datastore_pandas-0.2.1-py3-none-any.whl:

Publisher: publish.yml on gregsuniverse/datastore-pandas

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
