Dataset SDK for consistent read/write [batch, online, streaming] data.

Welcome to zdatasets

Development

  • When starting development, set the version in pyproject.toml to a dev version, e.g. 1.3.0.dev1.
  • Bump the dev version (e.g. 1.3.0.dev1 → 1.3.0.dev2) each time you have a change you want to test in other repositories.
  • After every change, confirm that the GitHub workflow runs succeed at https://github.com/zillow/zdatasets/actions.
  • Dev versions are published to TestPyPI at https://test.pypi.org/project/zdatasets/#history.
  • While testing your changes, you may need to reference your pull request in another repository's pyproject.toml instead of using the dev version. For example:
dataset = [
  "zdatasets[kubernetes] @ git+https://github.com/zillow/zdatasets.git@refs/pull/42/head"
]
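
The dev-version bump described above (1.3.0.dev1 → 1.3.0.dev2) can be sketched as a small helper. This is only an illustration: `bump_dev_version` is a hypothetical name and not part of zdatasets, and it assumes the version string always ends in `.devN`.

```python
import re

# Matches versions such as "1.3.0.dev1": a dotted numeric base plus ".devN".
_DEV_VERSION = re.compile(r"^(?P<base>\d+(?:\.\d+)*)\.dev(?P<n>\d+)$")


def bump_dev_version(version: str) -> str:
    """Return the next dev version (hypothetical helper, not part of zdatasets)."""
    match = _DEV_VERSION.match(version)
    if match is None:
        raise ValueError(f"not a dev version: {version!r}")
    # Keep the base version and increment only the dev counter.
    return f"{match['base']}.dev{int(match['n']) + 1}"


print(bump_dev_version("1.3.0.dev1"))
```

The helper only computes the next version string; writing it back to pyproject.toml is left to whatever tooling the project already uses.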

Example

import pandas as pd
from metaflow import FlowSpec, step

from zdatasets import Dataset, Mode
from zdatasets.metaflow import DatasetParameter
from zdatasets.plugins import BatchOptions


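# Can also be invoked from the CLI. The JSON value must reach the flow as a single
# shell argument, so keep it on one line inside the single quotes:
#  > python zdatasets/tutorials/0_hello_dataset_flow.py run \
#    --hello_dataset '{"name": "HelloDataset", "mode": "READ_WRITE", "options": {"type": "BatchOptions", "partition_by": "region"}}'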
class HelloDatasetFlow(FlowSpec):
    hello_dataset = DatasetParameter(
        "hello_dataset",
        default=Dataset("HelloDataset", mode=Mode.READ_WRITE, options=BatchOptions(partition_by="region")),
    )

    @step
    def start(self):
        df = pd.DataFrame({"region": ["A", "A", "A", "B", "B", "B"], "zpid": [1, 2, 3, 4, 5, 6]})
        print("saving data_frame: \n", df.to_string(index=False))

        # Example of writing to a dataset
        self.hello_dataset.write(df)

        # save this as an output dataset
        self.output_dataset = self.hello_dataset

        self.next(self.end)

    @step
    def end(self):
        print(f"I have dataset \n{self.output_dataset=}")

        # Read back only the region="A" partition as a pandas DataFrame
        df: pd.DataFrame = self.output_dataset.to_pandas(partitions=dict(region="A"))
        print('self.output_dataset.to_pandas(partitions=dict(region="A")):')
        print(df.to_string(index=False))


if __name__ == "__main__":
    HelloDatasetFlow()
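
The CLI comment in the example passes the dataset as a JSON string. Hand-quoting JSON in a shell is error-prone, so the following sketch builds the same value with the standard library. The field names (`name`, `mode`, `options`, `type`, `partition_by`) are taken from that comment; `build_hello_dataset_arg` is a hypothetical helper, not part of zdatasets, and the sketch does not assume anything about which other fields `DatasetParameter` accepts.

```python
import json
import shlex


def build_hello_dataset_arg(name: str, mode: str, partition_by: str) -> str:
    """Build the JSON value for --hello_dataset (hypothetical helper)."""
    payload = {
        "name": name,
        "mode": mode,
        "options": {"type": "BatchOptions", "partition_by": partition_by},
    }
    return json.dumps(payload)


arg = build_hello_dataset_arg("HelloDataset", "READ_WRITE", "region")

# shlex.quote wraps the JSON so the shell passes it as one argument.
command = (
    "python zdatasets/tutorials/0_hello_dataset_flow.py run "
    "--hello_dataset " + shlex.quote(arg)
)
print(command)
```

Printing the command rather than running it keeps the sketch free of side effects; the output can be pasted into a shell.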
