Skip to main content

pympipool - Scale serial and MPI-parallel python functions over hundreds of compute nodes all from within a jupyter notebook or serial python process.

Project description

pympipool - scale python functions over multiple compute nodes

Unittests Coverage Status

Up-scaling python functions for high performance computing (HPC) can be challenging. While the python standard library provides interfaces for multiprocessing and asynchronous task execution, namely multiprocessing and concurrent.futures both are limited to the execution on a single compute node. So a series of python libraries have been developed to address the up-scaling of python functions for HPC. Starting in the datascience and machine learning community with solutions like dask over more HPC focused solutions like parsl up to Python bindings for the message passing interface (MPI) named mpi4py. Each of these solutions has their advantages and disadvantages, in particular the mixing of MPI parallel python functions and serial python functions in combined workflows remains challenging.

To address these challenges pympipool is developed with three goals in mind:

  • Reimplement the standard python library interfaces namely multiprocessing.pool.Pool and concurrent.futures.Executor as closely as possible, to minimize the barrier of up-scaling an existing workflow to be used on HPC resources.
  • Integrate MPI parallel python functions based on mpi4py on the same level as serial python functions, so both can be combined in a single workflow. This allows the users to parallelize their workflows one function at a time. Internally this is achieved by coupling a serial python process to a MPI parallel python process.
  • Embrace Jupyter notebooks for the interactive development of HPC workflows, as they allow the users to document their though process right next to the python code and their results all within one document.

Features

As different users and different workflows have different requirements in terms of the level of parallelization, the pympipool implements a series of five different interfaces:

  • pympipool.Pool: Following the multiprocessing.pool.Pool the pympipool.Pool class implements the map() and starmap() functions. Internally these connect to an MPI parallel subprocess running the mpi4py.futures.MPIPoolExecutor. So by increasing the number of workers, by setting the max_workers parameter the pympipool.Pool can scale the execution of serial python functions beyond a single compute node. For MPI parallel python functions the pympipool.MPISpawnPool is derived from the pympipool.Pool and uses MPI_Spawn() to execute those. For more details see below.
  • pympipool.Executor: The easiest way to execute MPI parallel python functions right next to serial python functions is the pympipool.Executor. It implements the executor interface defined by the concurrent.futures.Executor. So functions are submitted to the pympipool.Executor using the submit() function, which returns an concurrent.futures.Future object. With these concurrent.futures.Future objects asynchronous workflows can be constructed which periodically check if the computation is completed done() and then query the results using the result() function. The limitation of the pympipool.Executor is lack of load balancing, each pympipool.Executor acts as a serial first in first out (FIFO) queue. So it is the task of the user to balance the load of many different tasks over multiple pympipool.Executor instances.
  • pympipool.HPCExecutor: To address the limitation of the pympipool.Executor that only a single task is executed at any time, the pympipool.HPCExecutor provides a wrapper around multiple pympipool.Executor objects. It balances the queues of the individual pympipool.Executor objects to maximize the throughput for the given resources. This functionality comes with an additional overhead of another thread, acting as a broker between the task queue of the pympipool.HPCExecutor and the individual pympipool.Executor objects.
  • pympipool.PoolExecutor: To combine the functionality of the pympipool.Pool and the pympipool.Executor the pympipool.PoolExecutor again connects to the mpi4py.futures.MPIPoolExecutor. Still in contrast to the pympipool.Pool it does not implement the map() and starmap() functions but rather the submit() function based on the concurrent.futures.Executor interface. In this case the load balancing happens internally and the maximum number of workers max_workers defines the maximum number of parallel tasks. But only serial python tasks can be executed in contrast to the pympipool.Executor which can also execute MPI parallel python tasks.
  • pympipool.MPISpawnPool: An alternative way to support MPI parallel functions in addition to the pympipool.Executor is the pympipool.MPISpawnPool. Just like the pympipool.Pool it supports the map() and starmap() functions. The additional ranks_per_task parameter defines how many MPI ranks are used per task. All functions are executed with the same number of MPI ranks. The limitation of this approach is that it uses MPI_Spawn() to create new MPI ranks for the execution of the individual tasks. Consequently, this approach is not as scalable as the pympipool.Executor but it offers load balancing for a large number of similar MPI parallel tasks.
  • pympipool.SocketInterface: The key functionality of the pympipool package is the coupling of a serial python process with an MPI parallel python process. This happens in the background using a combination of the zero message queue and cloudpickle to communicate binary python objects. The pympipool.SocketInterface is an abstraction of this interface, which is used in the other classes inside pympipool and might also be helpful for other projects.

In addition to using MPI to start a number of processes on different HPC computing resources, pympipool also supports the flux-framework as additional backend. By setting the optional enable_flux_backend parameter to True the flux backend can be enabled for the pympipool.Pool, pympipool.Executor and pympipool.PoolExecutor. Other optional parameters include the selection of the working directory where the python function should be executed cwd and the option to oversubscribe MPI tasks which is an OpenMPI specific feature which can be enabled by setting oversubscribe to True. For more details on the pympipool classes and their application, the extended documentation is linked below.

Documentation

License

pympipool is released under the BSD license https://github.com/pyiron/pympipool/blob/main/LICENSE . It is a spin-off of the pyiron project https://github.com/pyiron/pyiron therefore if you use pympipool for calculation which result in a scientific publication, please cite:

@article{pyiron-paper,
  title = {pyiron: An integrated development environment for computational materials science},
  journal = {Computational Materials Science},
  volume = {163},
  pages = {24 - 36},
  year = {2019},
  issn = {0927-0256},
  doi = {https://doi.org/10.1016/j.commatsci.2018.07.043},
  url = {http://www.sciencedirect.com/science/article/pii/S0927025618304786},
  author = {Jan Janssen and Sudarsan Surendralal and Yury Lysogorskiy and Mira Todorova and Tilmann Hickel and Ralf Drautz and Jörg Neugebauer},
  keywords = {Modelling workflow, Integrated development environment, Complex simulation protocols},
}

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pympipool-0.7.3.tar.gz (44.6 kB view hashes)

Uploaded Source

Built Distribution

pympipool-0.7.3-py3-none-any.whl (28.1 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page