Dataset SDK for consistent read/write [batch, online, streaming] data.
Welcome to zdatasets
Development
- Set the version to a dev version, e.g. 1.3.0.dev1, in pyproject.toml when starting development.
- Bump the dev version (e.g., 1.3.0.dev1 → 1.3.0.dev2) every time you have a change you want to test in other repositories.
- After every change, confirm that the GitHub workflow runs succeed at https://github.com/zillow/zdatasets/actions.
- The dev versions are published in test PyPI at https://test.pypi.org/project/zdatasets/#history.
- While testing your changes, you may need to reference your merge request in other repositories' pyproject.toml instead of using the dev version. For example,
dataset = [
"zdatasets[kubernetes] @ git+https://github.com/zillow/zdatasets.git@refs/pull/42/head"
]
- Bump the release version (e.g., 1.3.0.dev2 → 1.3.1) before merging your code change.
- Confirm the release of the new version in PyPI at https://pypi.org/project/zdatasets/#history.
- Create the release in https://github.com/zillow/zdatasets/releases.
- For any authentication issues when publishing to PyPI, ask for help in the #open-source Slack channel.
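The dev-version bump described above can be scripted. The following is a minimal sketch, not part of zdatasets: `bump_dev_version` is a hypothetical helper, and it assumes the version string has the plain `X.Y.Z.devN` shape used in the examples above.

```python
import re


def bump_dev_version(version: str) -> str:
    """Increment the trailing .devN segment (hypothetical helper, not a zdatasets API)."""
    match = re.fullmatch(r"(?P<base>\d+(?:\.\d+)*)\.dev(?P<n>\d+)", version)
    if match is None:
        # Release versions such as "1.3.1" have no .devN segment to bump.
        raise ValueError(f"not a dev version: {version!r}")
    return f"{match['base']}.dev{int(match['n']) + 1}"


print(bump_dev_version("1.3.0.dev1"))  # 1.3.0.dev2
```

The result still has to be written back to pyproject.toml by hand or by your own tooling; the sketch only computes the next version string.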
Example
import pandas as pd
from metaflow import FlowSpec, step

from zdatasets import Dataset, Mode
from zdatasets.metaflow import DatasetParameter
from zdatasets.plugins import BatchOptions


# Can also invoke from CLI:
#  > python zdatasets/tutorials/0_hello_dataset_flow.py run \
#    --hello_dataset '{"name": "HelloDataset", "mode": "READ_WRITE", \
#    "options": {"type": "BatchOptions", "partition_by": "region"}}'
class HelloDatasetFlow(FlowSpec):
    hello_dataset = DatasetParameter(
        "hello_dataset",
        default=Dataset("HelloDataset", mode=Mode.READ_WRITE, options=BatchOptions(partition_by="region")),
    )

    @step
    def start(self):
        df = pd.DataFrame({"region": ["A", "A", "A", "B", "B", "B"], "zpid": [1, 2, 3, 4, 5, 6]})
        print("saving data_frame: \n", df.to_string(index=False))

        # Example of writing to a dataset
        self.hello_dataset.write(df)

        # save this as an output dataset
        self.output_dataset = self.hello_dataset

        self.next(self.end)

    @step
    def end(self):
        print(f"I have dataset \n{self.output_dataset=}")

        # output_dataset to_pandas(partitions=dict(region="A")) only
        df: pd.DataFrame = self.output_dataset.to_pandas(partitions=dict(region="A"))
        print('self.output_dataset.to_pandas(partitions=dict(region="A")):')
        print(df.to_string(index=False))


if __name__ == "__main__":
    HelloDatasetFlow()
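
To make the partition behavior in the example easier to picture, here is a plain-Python stand-in that groups the same rows by region and reads back only region "A". It does not use zdatasets and says nothing about how the library actually stores data; `write_partitioned` and `read_partition` are hypothetical names chosen for illustration.

```python
from collections import defaultdict

# Same rows as the DataFrame in the flow above.
rows = [{"region": r, "zpid": z} for r, z in zip("AAABBB", range(1, 7))]


def write_partitioned(rows, partition_by):
    """Group rows by the value of the partition column (illustrative only)."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[partition_by]].append(row)
    return dict(partitions)


def read_partition(store, value):
    """Return only the rows stored under one partition value."""
    return list(store.get(value, []))


store = write_partitioned(rows, "region")
region_a = read_partition(store, "A")
print([row["zpid"] for row in region_a])  # [1, 2, 3]
```

Under this reading, the `to_pandas(partitions=dict(region="A"))` call in the `end` step would be expected to return the three rows with zpid 1, 2, and 3.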