
BigMPI4py: Python module for parallelization of Big Data objects


# BigMPI4py

BigMPI4py is a Python module built on MPI4py (https://mpi4py.readthedocs.io), Lisandro Dalcin’s implementation of the Message Passing Interface (MPI) for Python, which allows data structures to be parallelized within Python code.

Although many common parallelization cases can be solved with MPI4py alone, there are cases where big data structures cannot be distributed across cores within the MPI4py infrastructure. This is a well-known limitation of MPI4py: MPI calls cannot handle buffers with more than about 2^31 entries, which is commonly referred to as the 2 GB limit.
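To make the size limit concrete, here is a minimal sketch of the arithmetic. It assumes the limit comes from the message count being stored as a signed 32-bit integer, and that pickle-based calls such as `comm.scatter()` count the message in bytes; the function name and the example array shape are illustrative only.

```python
# Largest value of a signed 32-bit C int, assumed here to be the MPI count limit.
MPI_COUNT_MAX = 2**31 - 1


def fits_in_one_message(n_entries, max_count=MPI_COUNT_MAX):
    """Return True if a message with n_entries entries fits in a single MPI call."""
    return n_entries <= max_count


# Hypothetical example: a 20000 x 20000 array of 8-byte floats.
n_values = 20_000 * 20_000   # 400,000,000 values
n_bytes = n_values * 8       # 3,200,000,000 bytes

print(fits_in_one_message(n_values))  # True: fine when counted as typed values
print(fits_in_one_message(n_bytes))   # False: too large when counted as bytes
```

Under these assumptions the same array can be within the limit as a typed buffer and over it as a pickled object.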

Some solutions to this problem exist, such as dividing the datasets into “chunks” that satisfy the size criterion, or using other MPI implementations such as BigMPI (https://github.com/jeffhammond/BigMPI). Using BigMPI requires learning its syntax and adapting Python scripts to it, which can be difficult and requires knowledge of C-based programming languages that many users lack. The “chunking” strategy can be used directly in Python, but it has to be adapted manually to each data type and, in many cases, to the number of elements that each node will receive, which can be difficult to get right while staying under the 2 GB limit.
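The manual “chunking” strategy can be sketched as follows. This is not BigMPI4py’s implementation, only an illustration of the bookkeeping involved: the helper `chunk_bounds` and its parameters are hypothetical, and it assumes every row occupies the same number of bytes.

```python
def chunk_bounds(n_rows, row_bytes, max_bytes=2**31 - 1):
    """Split n_rows rows into (start, stop) ranges whose size stays under max_bytes.

    Assumes row_bytes > 0 and that all rows have the same size.
    """
    # At least one row per chunk, even if a single row is larger than max_bytes.
    rows_per_chunk = max(1, max_bytes // row_bytes)
    return [
        (start, min(start + rows_per_chunk, n_rows))
        for start in range(0, n_rows, rows_per_chunk)
    ]


# Tiny numbers to keep the example readable: 10 rows of 4 bytes, 12 bytes per chunk.
for start, stop in chunk_bounds(n_rows=10, row_bytes=4, max_bytes=12):
    print(start, stop)
# 0 3
# 3 6
# 6 9
# 9 10
```

Each `(start, stop)` pair could then be sent as a separate message, for example `data[start:stop]`. The bookkeeping has to be redone for every data type whose size per element differs.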

BigMPI4py adopts the “chunking” strategy and automatically distributes the most common Python data types, such as numpy arrays, pandas dataframes, lists, nested lists, or lists of arrays and dataframes. Users of BigMPI4py can therefore parallelize their pipelines by calling BigMPI4py’s functions on their data.

So far, BigMPI4py implements some of MPI’s collective communication operations, such as MPI.Comm.scatter(), MPI.Comm.bcast(), MPI.Comm.gather() and MPI.Comm.allgather(), which are the most commonly used in parallelization. BigMPI4py also implements the point-to-point communication operation MPI.Comm.sendrecv().
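For readers unfamiliar with these collectives, the following sketch shows a scatter and gather round trip using the plain mpi4py methods `comm.scatter()` and `comm.gather()`. It does not use BigMPI4py’s own functions, whose signatures are not described here. The function `square_in_parallel` is a hypothetical example, and `comm` is assumed to behave like mpi4py’s `MPI.COMM_WORLD`.

```python
def square_in_parallel(comm, values, root=0):
    """Scatter `values` from root, square the local part, gather the results on root.

    `comm` is assumed to be an mpi4py communicator such as MPI.COMM_WORLD.
    Returns the list of squares on the root rank and None on every other rank.
    """
    rank = comm.Get_rank()
    size = comm.Get_size()

    if rank == root:
        # comm.scatter expects exactly one item per rank, so build `size`
        # contiguous sublists (the last ones may be shorter or empty).
        step = -(-len(values) // size)  # ceiling division
        parts = [values[i * step:(i + 1) * step] for i in range(size)]
    else:
        parts = None

    local = comm.scatter(parts, root=root)          # each rank receives one sublist
    local_result = [v * v for v in local]           # work done independently per rank
    gathered = comm.gather(local_result, root=root) # list of sublists on root only

    if rank != root:
        return None
    # Flatten in rank order, which matches the original order of `values`.
    return [x for part in gathered for x in part]
```

With mpi4py installed, this could be called as `square_in_parallel(MPI.COMM_WORLD, list(range(8)))` after `from mpi4py import MPI`, in a script launched with an MPI launcher such as `mpirun`. Because these lowercase methods pickle the objects they send, they are the ones affected by the size limit described above.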

BigMPI4py also detects whether a vectorized parallelization using the MPI.Comm.Scatterv() and MPI.Comm.Gatherv() operations can be used, which reduces the time spent communicating objects.
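Scatterv and Gatherv work on typed buffers and need to know how many items each rank receives and where each rank’s block starts. The sketch below computes those two lists for an even split. It shows only what the vectorized calls require, not how BigMPI4py decides when to use them, and the helper name is hypothetical.

```python
def counts_and_displacements(n_items, n_ranks):
    """Return per-rank item counts and start offsets for an even split.

    The first (n_items % n_ranks) ranks receive one extra item.
    """
    base, extra = divmod(n_items, n_ranks)
    counts = [base + 1 if r < extra else base for r in range(n_ranks)]
    # Each displacement is the sum of the counts of all previous ranks.
    displs = [sum(counts[:r]) for r in range(n_ranks)]
    return counts, displs


counts, displs = counts_and_displacements(n_items=10, n_ranks=4)
print(counts)  # [3, 3, 2, 2]
print(displs)  # [0, 3, 6, 8]
```

In mpi4py, lists like these are typically passed as part of the send buffer specification, for example `comm.Scatterv([sendbuf, counts, displs, MPI.DOUBLE], recvbuf, root=0)` for a buffer of doubles.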

Check out the tutorial notebook to see how to use BigMPI4py.

Also, see our paper on bioRxiv to learn how the software works: https://www.biorxiv.org/content/early/2019/01/17/517441

Alex M. Ascensión and Marcos J. Araúzo-Bravo. BigMPI4py: Python module for parallelization of Big Data objects; bioRxiv, (2019). doi: 10.1101/517441.

