Data-level parallelism with mpi in python
dlp_mpi - Data-level parallelism with mpi for python
|Run an serial algorithm on multiple examples||Use dlp_mpi to run the loop body in parallel||Use dlp_mpi to run a function in parallel|
# python script.py import time examples = list(range(10)) results =  for example in examples: # Some heavy workload: # CPU or IO time.sleep(0.2) result = example # Remember the results results.append(result) # Summarize your experiment print(sum(results))
# mpiexec -np 8 python script.py import time import dlp_mpi examples = list(range(10)) results =  for example in dlp_mpi.split_managed( examples): # Some heavy workload: # CPU or IO time.sleep(0.2) result = example # Remember the results results.append(result) results = dlp_mpi.gather(results) if dlp_mpi.IS_MASTER: results = [ result for worker_results in results for result in worker_results ] # Summarize your experiment print(results)
# mpiexec -np 8 python script.py import time import dlp_mpi examples = list(range(10)) results =  def work_load(example): # Some heavy workload: # CPU or IO time.sleep(0.2) result = example for result in dlp_mpi.map_unordered( work_load, examples): # Remember the results results.append(result) if dlp_mpi.IS_MASTER: # Summarize your experiment print(results)
This package uses
mpi4py to provide utilities to parallize algorithms that are applied to multiple examples.
The core idea is: Start
N processes and each process works on a subset of all examples.
To start the processes
mpiexec can be used. Most HPC systems support MPI to scatter the workload across multiple hosts. For the command, look in the documentation for your HPC system and search for MPI launches.
Since each process should operate on different examples, MPI provides the variables
SIZE is the number of workers and
RANK is a unique identifier from
SIZE - 1.
The simplest way to improve the execution time is to process
examples[RANK::SIZE] on each worker.
This is a round robin load balancing (
An more advanced load balaning is
dlp_mpi.split_managed, where one process manages the load and assigns a new task to a worker, once he finishes the last task.
When in the end of a program all results should be summariesd or written in a single file, comunication between all processes is nessesary.
For this purpose
mpi4py.MPI.COMM_WORLD.gather) can be used. This function sends all data to the root process (For serialisation is
As alternative to splitting the data, this package also provides a
map style parallelization (see example in the beginning):
work_load in parallel and executes the
for body in serial.
The communication between the processes is only the
result and the index to get the
ith example from the examples. i.e.: The example aren't transferred between the processes.
Without this package your code runs serial.
The execution time of the following code snippets will demonstrated, how it runs with this package.
Regarding the color: The
examples = ... is the setup code.
Therefore it is blue in the code and the block that represents the execution time is also blue.
This simples way to paralize the workload (dark orange) is to do an round robin assignment of the load:
for example in dlp_mpi.split_round_robin(examples).
This function call is equivalent to
for example in examples[dlp_mpi.RANK::dlp_mpi.SIZE].
So there is zero comunications between the workers.
Only when it is nessesary to do some final work on the results of all data (e.g. calculating average metrics) a comunication is nessesary.
This is done with the
This functions returns the worker results in an list on the master process and the worker process gets a
None return value.
Depending on the workload the round robin assingment can be suboptimal.
See the example block diagramm.
Worker 1 got tasks that are relative long.
So this worker used much more time than the others.
To overcome the limitations of the round robin assingment, this package propose to use a manager to assign the work to the workers. This optimizes the utilisation of the workers. Once a worker finished an example, he request a new one from the manager and gets one assigned. Note: The comunication is only which example should be processed (i.e. the index of the example) not the example itself.
An alternative to splitting the iterator is to use a
The function is the executed on the workers and the return value is send back to the manager.
Be carefull, that the loop body is fast enough, otherwise it can be a bottleneck.
When a worker sends a task to the manager, the manager sends back a new task and enter the for loop body.
While the manager is in the loop body, he cannot react on requests of other workers, see the block diagramm:
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
|Filename, size||File type||Python version||Upload date||Hashes|
|Filename, size dlp_mpi-0.0.2-py3-none-any.whl (12.0 kB)||File type Wheel||Python version py3||Upload date||Hashes View|
|Filename, size dlp_mpi-0.0.2.tar.gz (14.5 kB)||File type Source||Python version None||Upload date||Hashes View|