
daskqueue


daskqueue is a small Python library built on top of Dask and Dask Distributed that implements a very lightweight distributed task queue.

Think of this library as a simpler version of Celery built entirely on Dask. Running Celery in an HPC environment (for instance) is usually very tricky, whereas spawning a Dask cluster is a lot easier to manage, debug, and clean up.

Motivation

Dask is an amazing library for parallel computing written entirely in Python. It is easy to install and offers both a high-level API wrapping common collections (arrays, bags, dataframes) and a low-level API for writing custom code (task graphs with Delayed and Futures).

For all its greatness, Dask implements a central scheduler (basically a simple Tornado event loop) involved in every decision, which can sometimes create a central bottleneck. This is a pretty serious limitation when trying to use Dask in high-throughput situations. A simple task queue is usually the best approach when trying to distribute millions of tasks.

The daskqueue python library leverages Dask Actors to implement distributed queues with a simple load-balancing QueuePool and a Consumer class to consume messages from these queues.

We used Actors because:

  • Actors are stateful: they can hold on to and mutate state. They are allowed to update their state in-place, which is very useful when spawning distributed queues!

  • NO CENTRAL SCHEDULING NEEDED: Operations on actors do not inform the central scheduler, and so do not contribute to the 4000 task/second scheduler overhead. They also avoid an extra network hop and so have lower latencies. Actors can communicate between themselves in a P2P manner, which is pretty neat when you have a huge number of queues and consumers.

Note: Dask provides a Queue implementation, but those queues are mediated by the central scheduler, so they are not ideal for sending large amounts of data (everything you send is routed through a central point) and they add additional overhead on the scheduler when putting millions of tasks.
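For reference, here is a minimal sketch of the Dask Actor API that daskqueue builds on. The Counter class is a hypothetical example, not part of daskqueue:

from distributed import Client

class Counter:
    """A stateful actor: its state lives on one worker and is mutated in-place."""
    def __init__(self):
        self.n = 0

    def increment(self):
        self.n += 1
        return self.n

client = Client(address="scheduler_address")

# actor=True pins the object to a single worker and returns a proxy to it
counter = client.submit(Counter, actor=True).result()

# Method calls go directly to the hosting worker, bypassing the central scheduler
future = counter.increment()  # returns an ActorFuture
assert future.result() == 1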

Install

daskqueue requires Python 3.6 or newer. You can install it manually by cloning the repository:

$ git clone https://github.com/AmineDiro/daskqueue.git
$ cd daskqueue/
$ pip install .
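
daskqueue is also published on PyPI, so installing the released package should work as well:

$ pip install daskqueue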

Usage

This simple example shows how to copy files in parallel using Dask workers and a distributed queue:

import os
import shutil
from typing import List, Tuple

from distributed import Client

from daskqueue import ConsumerBaseClass, ConsumerPool, QueuePool


def get_random_msg(start_dir: str, list_files: List[str], size: int) -> List[Tuple[str, str]]:
    # Return `size` (src, dst) pairs to copy; the implementation is left to the user.
    pass


class CopyWorker(ConsumerBaseClass):
    ## You should always implement a concrete `process_item` where you define your processing code.
    # Take a look at the Implementation section.
    def process_item(self, item):
        src, dst = item
        shutil.copy(src, dst)


if __name__ == "__main__":
    client = Client(address="scheduler_address")

    # Params
    n_queues = 5
    n_consumers = 20
    start_dir = ""
    dst_dir = ""

    # Create a pool of distributed queues on the cluster
    queue_pool = client.submit(QueuePool, n_queues, actor=True).result()

    # Start the consumer pool
    consumer_pool = ConsumerPool(client, CopyWorker, n_consumers, queue_pool)
    consumer_pool.start()

    # Parallel file copy
    l_files = os.listdir(start_dir)

    ## Put work items on the queue
    for _ in range(100):
        msg = get_random_msg(start_dir, l_files, size=1000)
        queue_pool.put_many(msg)

    consumer_pool.join()

Take a look at the examples/ folder for more usage examples.

Implementation

You should think of daskqueue as a very simple distributed version of aiomultiprocess. We have three basic classes:

  • QueueActor: Wraps a simple AsyncIO Queue object in a Dask Actor, providing an interface for putting and getting items in a distributed AND asynchronous fashion. Each queue runs in a separate Dask worker and can interface with different actors in the cluster.

  • QueuePool: Basic pool actor; it holds a reference to the queues and their sizes. It interfaces with the Client and the Consumers. The QueuePool implements a simple scheduling policy on put and get (see the sketch after this list):

    • On put: randomly chooses a queue, puts the item into it, then updates the queue size reference
    • On get_max_queue: returns the queue with the highest item count, then updates the queue size reference
  • ConsumerBaseClass: Abstract class implementing all the fetching logic for your workers. You should build your own workers by inheriting from this class, then spawn them in your Dask cluster. The Consumers have a start() method where we run an async while True loop to get a queue reference from the QueuePool, then communicate directly with the Queue, providing highly scalable workflows. The Consumer then gets an item from the queue and schedules process_item on the Dask worker's ThreadPoolExecutor, freeing the worker's event loop to communicate with the scheduler, fetch tasks asynchronously, etc.
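To make the scheduling policy concrete, here is a rough, non-authoritative sketch of the put/get logic described above (names and details are illustrative; the real QueuePool is an async Dask actor):

import random

class QueuePoolSketch:
    """Illustrative sketch of QueuePool's scheduling, not the actual implementation."""

    def __init__(self, queues):
        # Keep a reference to each queue and our last known item count for it
        self.queue_sizes = {queue: 0 for queue in queues}

    def put(self, item):
        # On put: pick a random queue, put the item into it, update the size reference
        queue = random.choice(list(self.queue_sizes))
        queue.put(item)
        self.queue_sizes[queue] += 1

    def get_max_queue(self):
        # On get_max_queue: hand the consumer the fullest queue, update the size reference
        queue = max(self.queue_sizes, key=self.queue_sizes.get)
        self.queue_sizes[queue] = max(0, self.queue_sizes[queue] - 1)
        return queue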

Performance and Limitations

The daskqueue library is very well suited for IO-bound jobs: by running multiple consumers and queues communicating asynchronously, we can bypass the Dask scheduler's limit and process **millions of tasks** 🥰!!

The example copy code above was run on a cluster of 20 consumers and 5 queues. The tasks are basic file copies between two locations (copying from an NFS filer). We copied 200,000 files (~1.1 TB) without ever breaking a sweat!

We can clearly see the network saturation:

[Image: network bandwidth saturating during the parallel copy]

Looking at the scheduler metrics, we see a mean of 19.3%:

[Image: scheduler metrics]

As for limitations, given the current implementation, you should be mindful of the following (this list will be updated regularly):

  • The workers don't implement a min or max number of tasks fetched and scheduled on the event loop; they will continuously fetch an item, process it, and so on.
  • We run the tasks in the worker's ThreadPool, so we inherit all the limitations that the standard Dask submit method has.
  • Tasks that require multiprocessing/multithreading within a worker cannot be scheduled at this time, although this is something we are currently working on implementing.
  • The QueuePool implements simple scheduling on put and get. Alternative schedulers could assign jobs to queues using arbitrary criteria, but no other scheduler implementation is available for QueuePool at this time.

TODO

  • Consumer should run arbitrary funcs (à la Celery)
  • Use the worker's thread pool for long-running tasks (probe finished to get results)
  • CI/CD
  • Implement reliability: task retries, ack mechanisms ... ?
  • Implement a distributed join to know when to stop the cluster
  • Implement a concurrency_limit as the maximum number of active, concurrent jobs each worker process will pick up from its queue at once
  • Run the tasks in any specified WorkerPlugin executor
  • Implement the various Queue exceptions
  • Wrap consumers in a Consumers class
  • Bypass the Queue mechanism by using ZeroMQ?
  • Tests

Contributing

Contributions are what make the open-source community such an amazing place to learn, inspire, and create. This project is still very, very rough! Any contributions you make will benefit everybody else and are greatly appreciated 😍 😍 😍 !

Please try to create bug reports that are:

  • Reproducible. Include steps to reproduce the problem.
  • Specific. Include as much detail as possible: which version, what environment, etc.
  • Unique. Do not duplicate existing open issues.
  • Scoped to a Single Bug. One bug per issue.

License

daskqueue is copyright Amine Dirhoussi, and licensed under the MIT license. I am providing code in this repository to you under an open source license. This is my personal repository; the license you receive to my code is from me and not from my employer. See the LICENSE file for details.
