
Minimal task scheduling abstraction


Dask provides multi-core execution on larger-than-memory datasets using blocked algorithms and task scheduling. It maps high-level NumPy, Pandas, and list operations on large datasets onto many operations on small in-memory datasets, expressed as task graphs. It then executes these graphs in parallel on a single machine with the multiprocessing and multithreaded schedulers, and on many machines with the distributed scheduler. New schedulers can also be written or adapted in a straightforward manner. Dask lets us use traditional NumPy, Pandas, and list programming while operating on inconveniently large data in a small amount of space.

  • dask is a specification to describe task dependency graphs.

  • dask.array is a drop-in NumPy replacement (for a subset of NumPy) that encodes blocked algorithms in dask dependency graphs.

  • dask.bag encodes blocked algorithms on Python lists of arbitrary Python objects.

  • dask.dataframe encodes blocked algorithms on Pandas DataFrames.

  • dask.async is a shared-memory asynchronous scheduler that efficiently executes dask dependency graphs on multiple cores.

See full documentation at http://dask.pydata.org. Read developer-focused blogposts about dask’s development. Or try dask in your browser with example notebooks on Binder.
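To make the term "blocked algorithm" concrete, here is a minimal pure-Python sketch of the idea. It is not dask's implementation. The helper names `chunked` and `blocked_mean` are hypothetical and chosen for illustration only. Each block is reduced to a small partial result (a sum and a count), and the partial results are then combined, so only one block needs to be in memory at a time.

```python
def chunked(seq, size):
    """Yield consecutive blocks of at most `size` items from `seq`."""
    for start in range(0, len(seq), size):
        yield seq[start:start + size]


def blocked_mean(seq, size):
    """Mean of `seq`, computed from per-block partial sums and counts."""
    # Each block is reduced independently; these reductions could run in parallel.
    partials = [(sum(block), len(block)) for block in chunked(seq, size)]
    # Combine the small partial results into the final answer.
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


data = list(range(1, 101))          # stand-in for a dataset too large for memory
result = blocked_mean(data, size=8)
print(result)
```

The per-block reductions do not depend on each other, which is what lets a scheduler run them on several cores.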

Use dask.array

Dask.array implements a NumPy clone on larger-than-memory datasets using multiple cores.

>>> import dask.array as da

>>> x = da.random.normal(10, 0.1, size=(100000, 100000), chunks=(1000, 1000))

>>> x.mean(axis=0)[:3].compute()
array([ 10.00026926,  10.0000592 ,  10.00038236])

Use dask.dataframe

Dask.dataframe implements a Pandas clone on larger-than-memory datasets using multiple cores.

>>> import dask.dataframe as dd
>>> df = dd.read_csv('nyc-taxi-*.csv.gz')

>>> g = df.groupby('medallion')
>>> g.trip_time_in_secs.mean().head(5)
medallion
0531373C01FD1416769E34F5525B54C8     795.875026
867D18559D9D2941173AD7A0F3B33E77     924.187954
BD34A40EDD5DC5368B0501F704E952E7     717.966875
5A47679B2C90EA16E47F772B9823CE51     763.005149
89CE71B8514E7674F1C662296809DDF6     869.274052
Name: trip_time_in_secs, dtype: float64
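A grouped mean like the one above can be computed block by block. The following is a rough pure-Python sketch of that idea under stated assumptions: the rows are made-up sample data, the helpers `partial_groupby` and `combine` are hypothetical names, and this is not how dask.dataframe is implemented internally.

```python
from collections import defaultdict


def partial_groupby(rows, key, value):
    """Per-block step: accumulate [sum, count] of `value` for each `key`."""
    acc = defaultdict(lambda: [0.0, 0])
    for row in rows:
        entry = acc[row[key]]
        entry[0] += row[value]
        entry[1] += 1
    return acc


def combine(partials):
    """Merge per-block [sum, count] pairs and turn them into means."""
    merged = defaultdict(lambda: [0.0, 0])
    for part in partials:
        for k, (s, n) in part.items():
            merged[k][0] += s
            merged[k][1] += n
    return {k: s / n for k, (s, n) in merged.items()}


# Two small blocks standing in for two CSV files (illustrative values only).
blocks = [
    [{"medallion": "A", "trip_time_in_secs": 600},
     {"medallion": "B", "trip_time_in_secs": 900}],
    [{"medallion": "A", "trip_time_in_secs": 1000}],
]
means = combine(partial_groupby(b, "medallion", "trip_time_in_secs") for b in blocks)
print(means)
```

Keeping a sum and a count per group, rather than a per-block mean, is what makes the combined result exact when groups span several blocks.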

Use dask.bag

Dask.bag implements a large collection of Python objects that mimics the toolz interface.

>>> import dask.bag as db
>>> import json
>>> b = db.from_filenames('2014-*.json.gz').map(json.loads)

>>> alices = b.filter(lambda d: d['name'] == 'Alice')
>>> alices.take(3)
({'name': 'Alice', 'city': 'LA',  'balance': 100},
 {'name': 'Alice', 'city': 'LA',  'balance': 200},
 {'name': 'Alice', 'city': 'NYC', 'balance': 300})

>>> dict(alices.pluck('city').frequencies())
{'LA': 10000, 'NYC': 20000, ...}
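For readers unfamiliar with the `filter`, `pluck`, and `frequencies` vocabulary, the sketch below shows what the same pipeline means on a small in-memory list, using only the standard library. The records are invented for illustration; dask.bag applies the same steps across many partitions in parallel.

```python
from collections import Counter

# A tiny in-memory stand-in for the parsed JSON records (made-up data).
records = [
    {"name": "Alice", "city": "LA", "balance": 100},
    {"name": "Bob", "city": "NYC", "balance": 50},
    {"name": "Alice", "city": "NYC", "balance": 300},
    {"name": "Alice", "city": "LA", "balance": 200},
]

# filter: keep only the records whose name is 'Alice'
alices = [d for d in records if d["name"] == "Alice"]

# pluck('city'): extract one field from every record
cities = [d["city"] for d in alices]

# frequencies: count how often each value occurs
frequencies = dict(Counter(cities))
print(frequencies)
```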

Use Dask Graphs

Dask.array, dask.dataframe, and dask.bag are thin layers on top of dask graphs, which represent computational task graphs of regular Python functions on regular Python objects.

As an example consider the following simple program:

def inc(i):
    return i + 1

def add(a, b):
    return a + b

x = 1
y = inc(x)
z = add(y, 10)

We encode this computation as a dask graph in the following way:

d = {'x': 1,
     'y': (inc, 'x'),
     'z': (add, 'y', 10)}

A dask graph is just a dictionary that maps keys to either plain data or tasks. A task is a tuple whose first element is a function and whose remaining elements are the arguments for that function; an argument that names another key refers to that key's result. While this representation of the computation above may be less aesthetically pleasing, it may now be analyzed, optimized, and computed by other Python code, not just the Python interpreter.

A simple dask dictionary
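To show that such a dictionary really can be "computed by other Python code", here is a deliberately naive recursive evaluator. It is a sketch, not dask's scheduler: the functions `get` and `resolve` below are hypothetical helpers written for this example, they only treat string arguments as keys, and they do no caching or parallel execution.

```python
def inc(i):
    return i + 1


def add(a, b):
    return a + b


d = {'x': 1,
     'y': (inc, 'x'),
     'z': (add, 'y', 10)}


def resolve(graph, arg):
    """Replace an argument that names a key with that key's computed value."""
    if isinstance(arg, str) and arg in graph:
        return get(graph, arg)
    return arg


def get(graph, key):
    """Compute the value of `key`, recursively computing its dependencies."""
    value = graph[key]
    # A task is a tuple whose first element is callable.
    if isinstance(value, tuple) and value and callable(value[0]):
        func, args = value[0], value[1:]
        return func(*(resolve(graph, a) for a in args))
    # Anything else is plain data.
    return value


print(get(d, 'z'))
```

In practice you would hand the graph to one of dask's own schedulers, for example `dask.threaded.get(d, 'z')`, which also handles shared intermediate results and parallel execution.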

Install

Dask is easily installable through your favorite Python package manager:

conda install dask

or

pip install dask[array]
or
pip install dask[bag]
or
pip install dask[dataframe]
or
pip install dask[complete]

Dependencies

dask.core supports Python 2.6+ and Python 3.3+ with a common codebase. It is pure Python and requires no dependencies beyond the standard library, which makes it a lightweight dependency.

dask.array depends on numpy.

dask.bag depends on toolz and cloudpickle.
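If you are unsure which optional dependencies are present in an environment, a small check like the following can help. It is only a sketch: the mapping simply restates the dependencies listed above, `missing_dependencies` is a hypothetical helper name, and `importlib.util.find_spec` requires Python 3.4 or newer.

```python
import importlib.util

# Optional dependencies as listed in this section.
OPTIONAL_DEPS = {
    "dask.array": ["numpy"],
    "dask.bag": ["toolz", "cloudpickle"],
}


def missing_dependencies(requirements=OPTIONAL_DEPS):
    """Return {component: [packages that cannot be imported]}."""
    missing = {}
    for component, packages in requirements.items():
        # find_spec returns None when a top-level package is not installed.
        absent = [p for p in packages if importlib.util.find_spec(p) is None]
        if absent:
            missing[component] = absent
    return missing


print(missing_dependencies())
```

An empty dictionary means every listed package can be imported.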

Examples

Dask examples are available in the following repository: https://github.com/dask/dask-examples.

You can also find them in Anaconda.org: https://notebooks.anaconda.org/dask/.

LICENSE

New BSD. See License File.


