A pandas interface for Firestore in Datastore mode with schema-aware typing, key management, projections, batching, and transactions.
datastore-pandas
datastore-pandas is a schema-aware pandas interface for Firestore in Datastore
mode. It is designed for Datastore's real execution model: indexed entity queries,
key lookups, projections, cursor scans, transactions, and batched entity writes.
It does not try to turn Datastore into BigQuery. BigQuery can expose a broad pandas-like API because it has SQL, columnar execution, joins, aggregations, and server-side query planning. Datastore is an operational NoSQL entity store. This package keeps that boundary explicit so DataFrame workflows remain convenient without hiding Datastore's limits.
Current Status
This is an initial implementation scaffold. The code already covers the core shape of the package:
- explicit schema objects for safe DataFrame-to-entity conversion
- first-class Datastore key representation and row-to-key mapping
- sparse entity writes that omit missing DataFrame values by default
- typed conversion for strings, integers, floats, booleans, timestamps, blobs, keys, arrays, geo points, and embedded entities
- query construction for projections, keys-only queries, distinct projections, ancestor queries, filters, orderings, and limits
- batched writes with bounded concurrency
- read-merge-write patch updates for partial DataFrames
- a transaction helper for small read-modify-write workflows
- index-planning helpers that produce index.yaml-style suggestions
- emulator examples for local integration testing
The package is not yet a complete production client. The current high-level write
path relies on google-cloud-datastore; lower-level mutation support is still the
right next step for property masks, compare-and-swap writes, generated-key result
metadata, and conflict details.
License And Disclaimer
This repository uses The Unlicense, a public-domain-style dedication. The repository does not identify an owner and does not make ownership claims over the software.
Before using, copying, modifying, or relying on this repository, read the Disclaimer. In short: this project is experimental, provided without warranty, not recommended for any particular use, not guaranteed to be maintained, and includes AI-generated or AI-assisted software and documentation.
Installation
From this repository:
cd <repo>
python -m pip install -e ".[test]"
For local development against the emulator, no Google Cloud credentials are needed
when DATASTORE_EMULATOR_HOST is set. For real Datastore mode projects, configure
Application Default Credentials:
gcloud auth application-default login
Quick Start
import datastore_pandas as dsp

schema = dsp.Schema(
    kind="Workout",
    key=dsp.KeySpec([
        ("User", dsp.KeyPart("user_id", kind="name")),
        ("Workout", dsp.KeyPart("workout_id", kind="name")),
    ]),
    properties={
        "started_at": dsp.Field(dsp.TimestampType(), nullable=False),
        "duration_sec": dsp.Field(dsp.Int64Type(), nullable=False),
        "distance_m": dsp.Field(dsp.Float64Type()),
        "activity_type": dsp.Field(dsp.StringType(), nullable=False),
        "notes": dsp.Field(dsp.StringType(), indexed=False),
    },
    strict=True,
)

df = dsp.read_datastore(
    kind="Workout",
    schema=schema,
    filters=[("activity_type", "=", "run")],
    projection=["started_at", "duration_sec", "distance_m"],
    order=["-started_at"],
    include_key=True,
)

# Caution: df came from a projection, so it holds only three properties.
# An upsert replaces whole entities; see "Patching Partial DataFrames".
report = dsp.to_datastore(
    df,
    schema=schema,
    mode="upsert",
    batch_size=400,
    max_workers=8,
)
report.raise_for_errors()
Why Schema Is Required For Writes
Datastore mode does not enforce one fixed schema per kind. Two entities of the same kind can have different property sets and different property types. pandas, however, rectangularizes data into columns. Without an explicit schema, a write adapter cannot safely tell whether a missing DataFrame cell means:
- the property should be omitted
- the property should be written as Datastore null
- the property is required and the row is invalid
- the property is an accidental extra column from another entity shape
datastore-pandas makes that decision explicit with Schema and Field.
schema = dsp.Schema(
    kind="Document",
    properties={
        "title": dsp.Field(dsp.StringType(), nullable=False),
        "summary": dsp.Field(dsp.StringType()),
        "raw_text": dsp.Field(dsp.StringType(), indexed=False),
    },
)
By default, nullable missing values are omitted:
encoded, excluded = schema.encode_properties({
    "title": "present",
    "summary": None,
})
assert encoded == {"title": "present"}
To intentionally write a Datastore null property, opt in:
dsp.Field(dsp.StringType(), missing_policy="null")
For required fields, use nullable=False. Missing or NA values will raise a
schema error before the row reaches Datastore.
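The three outcomes for a missing cell can be sketched in plain Python. This is an illustration, not the package's implementation: `FieldRule` and `encode_row` are hypothetical names, `None` stands in for a pandas NA value, and rejecting unknown columns is an assumption about what strict validation might do. Only the `omit` and `null` policies and the `nullable=False` behavior come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldRule:
    """Hypothetical stand-in for a schema field; not the package's Field class."""
    nullable: bool = True
    missing_policy: str = "omit"  # "omit" or "null"


def encode_row(row: dict, rules: dict) -> dict:
    """Apply per-field rules to one row, treating None as a missing value."""
    extra = set(row) - set(rules)
    if extra:
        # Assumption: an unknown column is rejected rather than silently written.
        raise ValueError(f"unexpected columns: {sorted(extra)}")
    encoded = {}
    for name, rule in rules.items():
        value = row.get(name)
        if value is not None:
            encoded[name] = value
        elif not rule.nullable:
            raise ValueError(f"required property {name!r} is missing")
        elif rule.missing_policy == "null":
            encoded[name] = None  # store an explicit null
        # Otherwise the property is omitted from the entity entirely.
    return encoded


rules = {
    "title": FieldRule(nullable=False),
    "summary": FieldRule(),
    "raw_text": FieldRule(missing_policy="null"),
}
print(encode_row({"title": "present", "summary": None}, rules))
```

In this sketch `summary` disappears from the result while `raw_text` is kept as an explicit null, which is the distinction a schema-less adapter cannot make.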
Sparse Entities And DataFrames
Sparse entities are a first-class design case. For example, a Workout kind might
store swim, bike, and run entities together:
| Property | Swim | Bike | Run |
|---|---|---|---|
| started_at | yes | yes | yes |
| duration_sec | yes | yes | yes |
| pool_length_m | yes | no | no |
| bike_power_w | no | yes | no |
| run_cadence_spm | no | no | yes |
When these entities are read into pandas, the DataFrame must contain all columns,
so absent Datastore properties appear as NA. On write, those NA values should
not become stored null properties on every entity. The default missing_policy is
therefore omit.
This matters for correctness, index size, write cost, and query behavior.
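The round trip described above can be illustrated without pandas. The sketch below is an assumption-laden simplification: `to_table` and `to_sparse` are hypothetical helpers, and `None` plays the role that NA plays in a real DataFrame.

```python
def to_table(entities):
    """Rectangularize entities: every row gets every column, None where absent."""
    columns = []
    for entity in entities:
        for name in entity:
            if name not in columns:
                columns.append(name)
    rows = [{name: entity.get(name) for name in columns} for entity in entities]
    return columns, rows


def to_sparse(rows):
    """Drop missing cells so each entity keeps only the properties it has."""
    return [{k: v for k, v in row.items() if v is not None} for row in rows]


entities = [
    {"duration_sec": 1800, "pool_length_m": 25},
    {"duration_sec": 3600, "bike_power_w": 210},
    {"duration_sec": 2400, "run_cadence_spm": 172},
]
columns, rows = to_table(entities)

# Writing with an "omit" policy recovers the original sparse entities.
assert to_sparse(rows) == entities
```

Writing the rectangular rows back unchanged would instead store two extra null properties on every entity in this example.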
Key Management
Keys are identity, not ordinary properties. datastore-pandas preserves the full
Datastore key shape:
- project
- database
- namespace
- ancestor path
- kind names
- string name IDs
- numeric IDs
- incomplete leaf keys for auto-ID allocation
Example:
key = dsp.DatastoreKey(
    project="my-project",
    namespace="tenant-a",
    path=(
        ("User", "sample-user"),
        ("Workout", 123456789),
    ),
)
For DataFrame writes, KeySpec maps row columns to key path elements:
key = dsp.KeySpec(
    [
        ("User", dsp.KeyPart("user_id", kind="name")),
        ("Workout", dsp.KeyPart("workout_id", kind="name")),
    ],
    namespace_source="tenant",
)
The package keeps numeric IDs and string names distinct. 123 and "123" are
different Datastore keys.
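A small sketch shows why the distinction has to be enforced when mapping row columns to key paths. `build_path` is a hypothetical helper, the `(kind, column, id_kind)` tuples are a simplification of the KeySpec/KeyPart pairing shown above, and the `"id"` part kind is an assumption (the examples in this document only use `"name"`).

```python
def build_path(row, parts):
    """Build a key path from a row without coercing between names and IDs."""
    path = []
    for kind, column, id_kind in parts:
        value = row[column]
        if id_kind == "name":
            if not isinstance(value, str):
                raise TypeError(f"{column!r} must be a string name")
        elif id_kind == "id":
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, int):
                raise TypeError(f"{column!r} must be an integer ID")
        else:
            raise ValueError(f"unknown key part kind: {id_kind!r}")
        path.append((kind, value))
    return tuple(path)


by_id = build_path(
    {"user_id": "sample-user", "workout_id": 123},
    [("User", "user_id", "name"), ("Workout", "workout_id", "id")],
)
by_name = build_path(
    {"user_id": "sample-user", "workout_id": "123"},
    [("User", "user_id", "name"), ("Workout", "workout_id", "name")],
)

# The two paths identify different entities.
assert by_id != by_name
```

Raising on a type mismatch, rather than converting `"123"` to `123`, keeps a CSV round trip or dtype change from silently addressing a different entity.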
Reading
Use read_datastore for normal DataFrame reads:
df = dsp.read_datastore(
    kind="Workout",
    schema=schema,
    filters=[("activity_type", "=", "bike")],
    order=["-started_at"],
    limit=1000,
    include_key=True,
)
Use projections to read only indexed properties:
df = dsp.read_datastore(
    kind="Workout",
    schema=schema,
    projection=["started_at", "duration_sec", "distance_m"],
    order=["-started_at"],
)
Use keys-only queries when planning deletes, existence checks, or staged fan-out lookups:
keys = dsp.read_datastore(
    kind="Workout",
    keys_only=True,
    include_key=True,
)
Use iter_datastore for chunked processing:
for chunk in dsp.iter_datastore(kind="Workout", schema=schema, chunksize=1000):
    process(chunk)
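Chunked reads of this kind are typically driven by query cursors. The following is a minimal sketch of that loop, not the package's code: `fetch_page` is a hypothetical stand-in for one query call, and an integer offset replaces the opaque cursor a real Datastore query returns.

```python
def fetch_page(data, cursor, limit):
    """Hypothetical stand-in for one query page: returns (rows, next_cursor)."""
    end = min(cursor + limit, len(data))
    next_cursor = end if end < len(data) else None
    return data[cursor:end], next_cursor


def iter_chunks(data, chunksize):
    """Yield pages until the stand-in query reports no further cursor."""
    cursor = 0
    while cursor is not None:
        rows, cursor = fetch_page(data, cursor, chunksize)
        if rows:
            yield rows


for chunk in iter_chunks(list(range(5)), chunksize=2):
    print(chunk)
```

Only one chunk is held in memory at a time, which is the point of iterating rather than reading the whole kind into a single DataFrame.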
Writing
Use to_datastore for full entity writes:
report = dsp.to_datastore(
    df,
    schema=schema,
    mode="upsert",
    batch_size=400,
    max_workers=8,
)
report.raise_for_errors()
The write path:
- validates rows against the schema
- builds Datastore keys from KeySpec or __key__
- converts pandas values to Datastore-safe values
- omits nullable missing values by default
- excludes unindexed fields from indexes
- rejects duplicate complete keys in one commit
- chunks writes into bounded batches
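The last two steps, duplicate-key rejection and bounded batching, can be sketched with the standard library. `plan_batches` and `write_all` are hypothetical names, `commit` is a caller-supplied stand-in for a real batch commit, and the defaults simply mirror the `batch_size=400` and `max_workers=8` used in the example above.

```python
from concurrent.futures import ThreadPoolExecutor


def plan_batches(keys, batch_size):
    """Split keys into batches, rejecting a key repeated within one batch."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    batches = [keys[i:i + batch_size] for i in range(0, len(keys), batch_size)]
    for index, batch in enumerate(batches):
        if len(set(batch)) != len(batch):
            raise ValueError(f"duplicate key in batch {index}")
    return batches


def write_all(keys, commit, batch_size=400, max_workers=8):
    """Run commit(batch) for each batch with at most max_workers in flight."""
    batches = plan_batches(keys, batch_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map returns results in the same order as the batches.
        return list(pool.map(commit, batches))


# Using len as a fake commit that reports how many keys it received.
print(write_all(list(range(10)), commit=len, batch_size=4, max_workers=2))
```

Keys must be hashable for the duplicate check, which is one reason to represent them as immutable values.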
insert and update modes are represented in the API, but full correctness for
those modes depends on the active Datastore batch backend exposing insert/update
methods. upsert is the safest path in the initial scaffold.
Patching Partial DataFrames
Projection queries and sparse application workflows often produce partial DataFrames. Do not send those through replacement-style writes unless you intend to replace the full entity.
Use patch_datastore:
patch = df[["__key__", "notes", "last_reviewed_at"]]
dsp.patch_datastore(
    patch,
    schema=schema,
    properties=["notes", "last_reviewed_at"],
)
The current implementation uses read-merge-write so omitted properties are
preserved. A lower-level Datastore Commit backend should eventually replace this
for native property_mask support.
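Read-merge-write itself is simple to picture. The sketch below uses a plain dict as a stand-in for Datastore and a hypothetical `patch_entity` helper; it illustrates the pattern and makes no claim about how `patch_datastore` is implemented internally.

```python
def patch_entity(store, key, changes, properties):
    """Read the stored entity, overlay only the listed properties, write it back."""
    current = dict(store[key])  # read (raises KeyError if the entity is absent)
    for name in properties:
        if changes.get(name) is not None:
            current[name] = changes[name]  # merge
    store[key] = current  # write
    return current


store = {"k1": {"title": "Intro", "notes": "old", "raw_text": "..."}}
patch_entity(store, "k1", {"notes": "new", "title": "ignored"}, ["notes"])

# Properties outside the patch list are preserved as stored.
assert store["k1"] == {"title": "Intro", "notes": "new", "raw_text": "..."}
```

The read and the write are separate steps here, so a concurrent writer could change the entity in between; that gap is what a native property mask or a transaction would close.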
Transactions
Transactions are for small atomic workflows, not bulk ingestion:
with dsp.Transaction(client) as tx:
    row = tx.get(counter_key, schema=counter_schema)
    row["value"] += 1
    tx.put(row, schema=counter_schema)
Keep transactions small, retryable, and focused on read-modify-write logic.
Bulk writes should use to_datastore.
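What "retryable" means for a read-modify-write can be shown with a toy optimistic-concurrency store. Everything here is hypothetical (`VersionedStore`, `Conflict`, `increment`); it is single-threaded, in-memory, and only models the retry shape, not Datastore's actual transaction semantics.

```python
class Conflict(Exception):
    """Raised when the entity changed between read and commit."""


class VersionedStore:
    """Toy in-memory store; each key holds a (version, value) pair."""

    def __init__(self):
        self._data = {}

    def read(self, key):
        return self._data.get(key, (0, 0))

    def commit(self, key, expected_version, value):
        version, _ = self.read(key)
        if version != expected_version:
            raise Conflict(key)
        self._data[key] = (version + 1, value)


def increment(store, key, max_attempts=3):
    """Read-modify-write that starts over from the read on conflict."""
    for _ in range(max_attempts):
        version, value = store.read(key)
        try:
            store.commit(key, version, value + 1)
            return value + 1
        except Conflict:
            continue
    raise Conflict(f"gave up after {max_attempts} attempts")


store = VersionedStore()
increment(store, "counter")
increment(store, "counter")
assert store.read("counter") == (2, 2)
```

Because each retry repeats the whole function body, the body should have no side effects outside the store, which is easier to guarantee when the transaction is small.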
Index Planning
plan_indexes provides conservative local guidance for composite index needs:
query = dsp.QuerySpec(
    kind="Workout",
    filters=[("activity_type", "=", "run"), ("started_at", ">=", start)],
    order=["started_at"],
)
plan = dsp.plan_indexes(query)
for suggestion in plan.suggestions:
    print(suggestion.to_index_yaml())
This is not a replacement for Datastore Query Explain. It is intended to catch
common index shapes before runtime and generate a starting point for index.yaml.
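For a query shape like the one above, a composite index conventionally lists equality-filter properties first, then the inequality property, then any remaining sort properties. The sketch below applies that ordering and prints index.yaml-style text. `suggest_index` is a hypothetical function, it treats every operator other than `=` as an inequality, and it does not model cases where built-in single-property indexes already suffice.

```python
def suggest_index(kind, filters, order):
    """Return index.yaml-style text for one query shape (simplified rules)."""
    directions = {}
    for item in order:
        # A leading "-" marks a descending sort, as in the examples above.
        directions[item.lstrip("-")] = "desc" if item.startswith("-") else "asc"
    equality = [name for name, op, _ in filters if op == "="]
    inequality = [name for name, op, _ in filters if op != "="]
    # dict.fromkeys de-duplicates while keeping first-seen order.
    names = list(dict.fromkeys(equality + inequality + list(directions)))
    lines = ["indexes:", f"- kind: {kind}", "  properties:"]
    for name in names:
        lines.append(f"  - name: {name}")
        if directions.get(name) == "desc":
            lines.append("    direction: desc")
    return "\n".join(lines)


print(suggest_index(
    "Workout",
    filters=[("activity_type", "=", "run"), ("started_at", ">=", "2024-01-01")],
    order=["started_at"],
))
```

Output of this kind is a starting point to paste into index.yaml and then check against the emulator or Query Explain.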
Emulator Examples
The repository includes a Docker Compose setup that runs the Firestore emulator in Datastore mode and a complete sparse-data example suite:
docker compose -f examples\emulator\docker-compose.yml up --build
In a second terminal:
$env:DATASTORE_EMULATOR_HOST = "localhost:8081"
$env:DATASTORE_PROJECT_ID = "datastore-pandas-emulator"
python -m pip install -e ".[test]"
python examples\emulator\run_all.py --rows 20000 --workers 8
The emulator examples include:
- generate_mock_data.py: creates heterogeneous swim/bike/run workout rows
- load_mock_data.py: loads large DataFrames with batched concurrent writes
- query_examples.py: demonstrates full reads, projections, keys-only queries, and distinct projections
- patch_sparse_rows.py: patches a subset of properties without filling sparse entities with nulls
- transaction_example.py: increments a counter transactionally
- inspect_sparse_entities.py: inspects raw entities to confirm sparse properties are omitted
- index_planning.py: prints index suggestions
- reset_emulator_data.py: clears sample entities from the emulator
- public_divvy_ancestor_test.py: downloads public Divvy bike-share trip data, loads Dataset -> Station -> Ride ancestor paths, and validates ancestor, projection, and keys-only queries
Full instructions are in examples/emulator/README.md.
Type Mapping
| Datastore concept | Package type |
|---|---|
| integer | Int64Type |
| double | Float64Type |
| boolean | BoolType |
| timestamp | TimestampType |
| string | StringType |
| blob | BlobType |
| key | KeyType / DatastoreKey |
| geo point | GeoPointType / GeoPoint |
| array | ArrayType(...) |
| embedded entity | EmbeddedEntityType |
Important conversion rules:
- timestamps are normalized to UTC
- Datastore timestamp precision is microseconds
- integers are validated against signed 64-bit bounds
- indexed strings and blobs must fit Datastore indexed-value limits
- arrays cannot contain nested arrays
- unindexed fields should be declared with indexed=False
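Three of these rules are easy to express with the standard library. The helpers below are hypothetical and only illustrate the checks; in particular, treating a naive datetime as already being in UTC is an assumption of this sketch, not documented package behavior.

```python
from datetime import datetime, timedelta, timezone

INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def normalize_timestamp(value: datetime) -> datetime:
    """Return an aware UTC datetime; naive inputs are assumed to be UTC."""
    if value.tzinfo is None:
        return value.replace(tzinfo=timezone.utc)
    return value.astimezone(timezone.utc)


def check_int64(value: int) -> int:
    """Reject non-integers and values outside the signed 64-bit range."""
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError("expected an integer")
    if not INT64_MIN <= value <= INT64_MAX:
        raise OverflowError(f"{value} does not fit in a signed 64-bit integer")
    return value


def check_array(values: list) -> list:
    """Reject arrays that directly contain another array."""
    if any(isinstance(item, (list, tuple)) for item in values):
        raise ValueError("arrays cannot contain nested arrays")
    return values


local = datetime(2024, 1, 1, 12, 0, tzinfo=timezone(timedelta(hours=2)))
assert normalize_timestamp(local) == datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc)
```

Python's `datetime` already stores microseconds, so no rounding is needed here; pandas timestamps with nanosecond precision would have to be rounded or truncated before conversion.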
Design Principles
- Prefer Datastore-native operations over pretending arbitrary pandas operations can be pushed down.
- Make schemas explicit for writes.
- Treat keys as first-class identity values.
- Omit nullable missing values by default to preserve sparse entities.
- Use projection, keys-only, ancestor, and cursor-aware queries where appropriate.
- Keep transactions explicit and small.
- Use BigQuery or BigQuery DataFrames for analytics, joins, and broad scans.
Repository Layout
src/datastore_pandas/
    batches.py       batch planning and duplicate-key checks
    convert.py       row/entity conversion
    errors.py        package exceptions
    io.py            read_datastore, iter_datastore, to_datastore, patch_datastore
    keys.py          DatastoreKey, KeySpec, KeyPart
    query.py         QuerySpec and index planning
    reports.py       write result reporting
    schema.py        Schema and Field
    transaction.py   transaction context manager
    types.py         Datastore type converters
examples/
    basic_usage.py
    emulator/
        docker-compose.yml
        Dockerfile
        README.md
        run_all.py
Limitations And Next Steps
Current limitations:
- pytest and google-cloud-datastore must be installed locally to run the full test and emulator flow.
- patch_datastore uses read-merge-write instead of native mutation property masks.
- write reports do not yet include generated keys, entity versions, update times, or conflict details from lower-level mutation results.
- compare-and-swap writes using base_version or update_time are not implemented yet.
- aggregation queries and Query Explain are design targets but not implemented in the package API yet.
- the index planner is conservative and should be validated against emulator and production Query Explain output.
Useful next work:
- add a lower-level Datastore Commit backend
- support native property masks and conflict detection
- add generated-key allocation and result mapping
- add aggregation helpers such as count
- add Query Explain integration
- add live emulator integration tests in CI