An easy-to-use library to speed up pandas computations by parallelizing them across multiple CPUs.


Pandaral·lel

(Figure: the same computation without parallelization and with parallelization.)

Installation

$ pip install pandarallel [--upgrade] [--user]

Requirements

On Windows, Pandaral·lel works only if the Python session (python, ipython, jupyter notebook, jupyter lab, ...) is run from the Windows Subsystem for Linux (WSL).
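As a rough illustration of this requirement, the sketch below classifies a platform. The function name `platform_status` is hypothetical, and spotting WSL by looking for "microsoft" in the kernel release string is a common heuristic, not something Pandaral·lel documents.

```python
import platform


def platform_status(system=None, release=None):
    """Classify a platform according to the requirement stated above.

    `system` and `release` default to the values reported by the running
    interpreter; they are parameters only so the function is easy to try out.
    """
    system = platform.system() if system is None else system
    release = platform.release() if release is None else release

    if system == "Windows":
        # Native Windows session: not supported according to the text above.
        return "unsupported: run Python from WSL instead"
    if system == "Linux" and "microsoft" in release.lower():
        # Heuristic: WSL kernels usually carry "microsoft" in their release.
        return "supported: Linux under WSL"
    if system in ("Linux", "Darwin"):
        return "supported: nothing special to do"
    return "unknown platform"


print(platform_status())
```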

On Linux & macOS, nothing special has to be done.

Warning

Parallelization has a cost (instantiating new processes, sending data via shared memory, ...), so it pays off only if the amount of computation to parallelize is large enough. For very small amounts of data, parallelization is not always worth it.
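One way to check whether parallelization pays off for a given workload is to time both variants on a representative sample. The helper below is a minimal sketch: `best_time` is a hypothetical name, not part of Pandaral·lel, and the stand-in workload uses only the standard library so that the snippet runs anywhere.

```python
import time


def best_time(fn, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` calls of fn()."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


# Stand-in workload so the sketch runs without pandas installed.
data = list(range(10_000))
elapsed = best_time(lambda: [x * x for x in data])
print(f"best of 3: {elapsed:.6f} s")
```

With pandas and Pandaral·lel set up, the same helper can compare `lambda: series.apply(func)` against `lambda: series.parallel_apply(func)`. If the parallel time is not clearly lower, the data is probably too small to benefit.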

Examples

An example of each API is available here.

Benchmark

For some examples, here is the comparative benchmark with and without using Pandaral·lel.

Computer used for this benchmark:

OS: Linux Ubuntu 16.04
Hardware: Intel Core i7 @ 3.40 GHz - 4 cores

For these examples, parallel operations run approximately 4x faster than the standard operations (except for series.map, which runs only 3.2x faster).
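To put these figures in perspective, dividing the speedup by the number of cores gives the parallel efficiency. The short calculation below uses only the numbers quoted above (4 cores, roughly 4x and 3.2x); `efficiency` is a name chosen for this illustration.

```python
def efficiency(speedup, cores):
    """Fraction of the ideal linear speedup that was actually achieved."""
    return speedup / cores


CORES = 4  # the benchmark machine described above

for label, speedup in [("most operations", 4.0), ("series.map", 3.2)]:
    print(f"{label}: {efficiency(speedup, CORES):.0%} of ideal")
```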

API

First, you have to import pandarallel:

from pandarallel import pandarallel

Then, you have to initialize it.

pandarallel.initialize()

This method takes 5 optional parameters:

- shm_size_mb: Deprecated.
- nb_workers: Number of workers used for parallelization (int). If not set, all available CPUs are used.
- progress_bar: Display progress bars if set to True (bool, False by default).
- verbose: The verbosity level (int, 2 by default):
  - 0: don't display any logs
  - 1: display only warning logs
  - 2: display all logs
- use_memory_fs: (bool, None by default)
  - If set to None and a memory file system is available, Pandarallel uses it to transfer data between the main process and the workers. If no memory file system is available, Pandarallel falls back to multiprocessing data transfer (pipe).
  - If set to True, Pandarallel uses the memory file system to transfer data between the main process and the workers, and raises a SystemError if no memory file system is available.
  - If set to False, Pandarallel uses multiprocessing data transfer (pipe) between the main process and the workers.
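The three cases for use_memory_fs can be summarized as a small decision function. This is a sketch of the behavior described above, not Pandarallel's actual code; `choose_transfer` and its string return values are made up for the illustration.

```python
def choose_transfer(use_memory_fs, memory_fs_available):
    """Return "memory_fs" or "pipe" following the rules listed above."""
    if use_memory_fs is None:
        # Automatic mode: prefer the memory file system when it exists.
        return "memory_fs" if memory_fs_available else "pipe"
    if use_memory_fs:
        if not memory_fs_available:
            # The text above says a SystemError is raised in this case.
            raise SystemError("memory file system is not available")
        return "memory_fs"
    # Explicit False: always use multiprocessing data transfer (pipe).
    return "pipe"


for setting in (None, True, False):
    print(setting, "->", choose_transfer(setting, memory_fs_available=True))
```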

Using a memory file system reduces data transfer time between the main process and the workers, especially for large data.

A memory file system is considered available only if the directory /dev/shm exists and the user has read and write permissions on it.
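This availability rule translates into a short check. The function below is an assumed equivalent written with the standard library, not a copy of Pandarallel's own test; the `path` parameter exists only so the function can be tried against other directories.

```python
import os


def memory_fs_available(path="/dev/shm"):
    """True if `path` is a directory the current user can read and write."""
    return os.path.isdir(path) and os.access(path, os.R_OK | os.W_OK)


print("memory file system available:", memory_fs_available())
```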

In practice, a memory file system is available only on some Linux distributions (including Ubuntu).

With df a pandas DataFrame, series a pandas Series, func a function to apply/map, args, args1, args2 some arguments, and col_name a column name:

| Without parallelization | With parallelization |
| --- | --- |
| `df.apply(func)` | `df.parallel_apply(func)` |
| `df.applymap(func)` | `df.parallel_applymap(func)` |
| `df.groupby(args).apply(func)` | `df.groupby(args).parallel_apply(func)` |
| `df.groupby(args1).col_name.rolling(args2).apply(func)` | `df.groupby(args1).col_name.rolling(args2).parallel_apply(func)` |
| `df.groupby(args1).col_name.expanding(args2).apply(func)` | `df.groupby(args1).col_name.expanding(args2).parallel_apply(func)` |
| `series.map(func)` | `series.parallel_map(func)` |
| `series.apply(func)` | `series.parallel_apply(func)` |
| `series.rolling(args).apply(func)` | `series.rolling(args).parallel_apply(func)` |

You will find a complete example here for each row in this table.

Troubleshooting

I have 8 CPUs, but parallel_apply speeds up computation by only about 4x. Why?

Pandarallel can only speed up computation by a factor of up to roughly the number of physical cores your computer has. Most recent CPUs (like the Intel Core i7) use hyperthreading. For example, a 4-core hyperthreaded CPU presents 8 CPUs to the operating system but really has only 4 physical computation units.

On Ubuntu, you can get the number of cores with $ grep -m 1 'cpu cores' /proc/cpuinfo.
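The same information can be read from Python. The sketch below assumes a Linux-style /proc/cpuinfo; `cpu_cores_field` is a hypothetical helper that parses the same 'cpu cores' field as the grep command, while `os.cpu_count()` reports logical CPUs (the hyperthreaded count).

```python
import os


def cpu_cores_field(cpuinfo_text):
    """Return the first 'cpu cores' value in cpuinfo_text, or None if absent."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "cpu cores":
            return int(value)
    return None


print("logical CPUs:", os.cpu_count())

try:
    with open("/proc/cpuinfo") as f:
        print("cpu cores:", cpu_cores_field(f.read()))
except OSError:
    # /proc/cpuinfo does not exist on macOS or native Windows.
    print("cpu cores: /proc/cpuinfo not available")
```

Note that the 'cpu cores' field counts cores per physical CPU package, so on a machine with several sockets it is lower than the total number of physical cores.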

I use Jupyter Lab and, instead of progress bars, I see this kind of output:
VBox(children=(HBox(children=(IntProgress(value=0, description='0.00%', max=625000), Label(value='0 / 625000')…

Run the following 3 lines, and you should be able to see the progress bars:

$ pip install ipywidgets
$ jupyter nbextension enable --py widgetsnbextension
$ jupyter labextension install @jupyter-widgets/jupyterlab-manager

(You may also have to install nodejs if asked.)

Download files

Source Distribution

pandarallel-1.5.6.tar.gz (15.6 kB), uploaded Mar 3, 2022

File metadata

Upload date: Mar 3, 2022
Size: 15.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/32.0 requests/2.27.1 requests-toolbelt/0.9.1 urllib3/1.26.8 tqdm/4.63.0 importlib-metadata/4.11.2 keyring/23.5.0 rfc3986/2.0.0 colorama/0.4.4 CPython/3.10.0a7

File hashes

Hashes for pandarallel-1.5.6.tar.gz

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `aa85ad3a9c6242ab02fa1c7101cafd78c9ba546948482aaa667671a232677f07` |
| MD5 | `309f23d69cf23bf77a9b5eb39645597c` |
| BLAKE2b-256 | `e6973d88084964cf6b4618027b18cff2ce42803a56f2a131cbdda60a70808085` |