GroupBy operations for dask.array
Project description
flox
This project explores strategies for fast GroupBy reductions with dask.array. It used to be called dask_groupby
This repo explores strategies for a distributed GroupBy with dask arrays. It was motivated by
(See a presentation about this package).
Acknowledgements
This work was funded in part by NASA-ACCESS 80NSSC18M0156 "Community tools for analysis of NASA Earth Observing System Data in the Cloud" (PI J. Hamman), and NCAR's Earth System Data Science Initiative. It was motivated by many discussions in the Pangeo community.
API
There are three functions
flox.groupby_reduce(dask_array, by_dask_array, "mean")
"pure" dask array interfaceflox.xarray.xarray_reduce(xarray_object, by_dataarray, "mean")
"pure" xarray interface
Implementation
The core GroupBy operation is outsourced to
numpy_groupies. The GroupBy
reduction is first applied blockwise. Those intermediate results are
combined by concatenating to form a new array which is then reduced
again. The combining of intermediate results uses dask's _tree_reduce
till all group results are in one block. At that point the result is
"finalized" and returned to the user. Here is an example of writing a
custom Aggregation (again inspired by dask.dataframe)
mean = Aggregation(
# name used for dask tasks
name="mean",
# blockwise reduction
chunk=("sum", "count"),
# combine intermediate results: sum the sums, sum the counts
combine=("sum", "sum"),
# generate final result as sum / count
finalize=lambda sum_, count: sum_ / count,
# Used when "reindexing" at combine-time
fill_value=0,
)
Using _tree_reduce
complicates the implementation. An
alternative simpler implementation would be to use the "tensordot"
trick.
But this requires knowledge of "expected group labels" at
compute-time.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file flox-0.2.0.tar.gz
.
File metadata
- Download URL: flox-0.2.0.tar.gz
- Upload date:
- Size: 50.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.5.0 importlib_metadata/4.8.1 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.8.6
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | bbcab816249795d7a23e42b37f39c2ff04bf789c768a1a0a2551fe9d3d6c20fd |
|
MD5 | bc1100443d121874662add0bf1b33903 |
|
BLAKE2b-256 | a9371307157823037363818f71d9a54e825058f12d741c313a7ad6ecc558025a |
File details
Details for the file flox-0.2.0-py3-none-any.whl
.
File metadata
- Download URL: flox-0.2.0-py3-none-any.whl
- Upload date:
- Size: 41.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.5.0 importlib_metadata/4.8.1 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.8.6
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | c990d2f23cd37e1398472366fce504f0b6216419b192d117348eed523b2e9844 |
|
MD5 | 134cb7842ba52f70f1cf01176759b6f3 |
|
BLAKE2b-256 | 6d31167dac1fabaffcfcc3de8aadc1d5846bf133da1780b79c52b02518ff8094 |