Dataset SDK for consistent read/write [batch, online, streaming] data.

Project description

Welcome to @datasets


import pandas as pd
from metaflow import FlowSpec, step

from datasets import Dataset, Mode
from datasets.metaflow import DatasetParameter
from datasets.plugins import BatchOptions

# Can also invoke from CLI:
#  > python datasets/tutorials/ run \
#    --hello_dataset '{"name": "HelloDataset", "mode": "READ_WRITE", \
#    "options": {"type": "BatchOptions", "partition_by": "region"}}'
class HelloDatasetFlow(FlowSpec):
    hello_dataset = DatasetParameter(
        "hello_dataset",
        default=Dataset("HelloDataset", mode=Mode.READ_WRITE, options=BatchOptions(partition_by="region")),
    )

    @step
    def start(self):
        df = pd.DataFrame({"region": ["A", "A", "A", "B", "B", "B"], "zpid": [1, 2, 3, 4, 5, 6]})
        print("saving data_frame: \n", df.to_string(index=False))

        # Example of writing to a dataset
        self.hello_dataset.write(df)

        # save this as an output dataset
        self.output_dataset = self.hello_dataset

        # Metaflow requires every non-final step to name its successor
        self.next(self.end)

    @step
    def end(self):
        print(f"I have dataset \n{self.output_dataset=}")

        # Read back only the region="A" partition as a pandas DataFrame
        df: pd.DataFrame = self.output_dataset.to_pandas(partitions=dict(region="A"))
        print(df.to_string(index=False))

if __name__ == "__main__":
    HelloDatasetFlow()

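To make the partitioned read in the `end` step more concrete, here is a minimal pandas-only sketch of what selecting the `region="A"` partition amounts to for the DataFrame used above. It does not call the datasets SDK and makes no claim about how the SDK stores or filters partitions internally; the helper name `select_partition` is hypothetical and exists only for this illustration.

```python
import pandas as pd


def select_partition(df: pd.DataFrame, **partitions) -> pd.DataFrame:
    """Hypothetical helper: keep rows whose columns match every key=value pair."""
    mask = pd.Series(True, index=df.index)
    for column, value in partitions.items():
        # Narrow the mask one partition column at a time
        mask &= df[column] == value
    return df[mask].reset_index(drop=True)


# Same sample data as the flow's start step
df = pd.DataFrame({"region": ["A", "A", "A", "B", "B", "B"], "zpid": [1, 2, 3, 4, 5, 6]})

# Analogous in spirit to to_pandas(partitions=dict(region="A"))
region_a = select_partition(df, region="A")
print(region_a.to_string(index=False))
```

Partitioning by `region` means a reader interested in one region only has to touch that region's rows, which is the effect this in-memory filter imitates.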
Download files

Source Distribution

zdatasets-0.2.5.tar.gz (54.7 kB)

Built Distribution

zdatasets-0.2.5-py3-none-any.whl (84.6 kB)
