
# Pandaral·lel
An easy to use library to speed up computation (by parallelizing on multi CPUs) with [pandas](https://pandas.pydata.org/).


| Without parallelisation | ![Without Pandarallel](https://github.com/nalepae/pandarallel/blob/master/docs/progress_apply.gif) |
| :----------------------: | -------------------------------------------------------------------------------------------------------- |
| **With parallelisation** | ![With Pandarallel](https://github.com/nalepae/pandarallel/blob/master/docs/progress_parallel_apply.gif) |

<table>
<tr>
<td>Latest Release</td>
<td>
<a href="https://pypi.org/project/pandarallel/">
<img src="https://img.shields.io/pypi/v/pandarallel.svg" alt="latest release" />
</a>
</td>
</tr>
<tr>
<td>License</td>
<td>
<a href="https://github.com/nalepae/pandarallel/blob/master/LICENSE">
<img src="https://img.shields.io/pypi/l/pandarallel.svg" alt="license" />
</a>
</td>
</tr>
</table>

## Installation
`$ pip install pandarallel [--user]`


## Requirements
- [pandas](https://pypi.org/project/pandas/)
- [pyarrow](https://pypi.org/project/pyarrow/)


## Warnings
- The v1.0 of this library is not yet released; the API may change at any time.
- Parallelization has a cost (instantiating new processes, transmitting data via shared memory, etc.), so it is efficient only if the amount of computation to parallelize is high enough. For very small amounts of data, parallelization is not always worth it.
- Functions applied must NOT be lambda functions.

```python
from math import sin

import pandas as pd
from pandarallel import pandarallel

pandarallel.initialize()

df = pd.DataFrame({"a": [1.0, 2.0, 3.0]})  # example data

# FORBIDDEN: applied functions must be picklable,
# so lambda functions cannot be used
# df.parallel_apply(lambda x: sin(x.a ** 2), axis=1)

# ALLOWED
def func(x):
    return sin(x.a ** 2)

df.parallel_apply(func, axis=1)
```

## Examples
An example of each API is available [here](https://github.com/nalepae/pandarallel/blob/master/docs/examples.ipynb).

## Benchmark
For the `DataFrame.apply` example [here](https://github.com/nalepae/pandarallel/blob/master/docs/examples.ipynb), here is the benchmark comparing the "standard" `apply` with `parallel_apply` (error bars are too small to be displayed).
Computer used for this benchmark:
- OS: Linux Ubuntu 16.04
- Hardware: Intel Core i7 @ 3.40 GHz (4 cores)
- Number of workers (parallel processes) used: 4

![Benchmark](https://github.com/nalepae/pandarallel/blob/master/docs/apply_vs_parallel_apply.png)

For this given example, `parallel_apply` runs approximately 3.7× faster than the "standard" `apply`.


## API
First, you have to import `pandarallel`:
```python
from pandarallel import pandarallel
```

Then, you have to initialize it.
```python
pandarallel.initialize()
```
This method takes 3 optional parameters:
- `shm_size_mo`: The size of the Pandarallel shared memory, in MB. If the
default is too small, a larger value can be set. By default,
it is set to 2 GB. (int)
- `nb_workers`: The number of workers. By default, it is set to the number
of cores your operating system sees. (int)
- `progress_bar`: Set it to `True` to display a progress bar.
**WARNING**: The progress bar is an experimental feature and can cause a
significant performance loss.
Not available for `DataFrameGroupBy.parallel_apply`.

With `df` a pandas DataFrame, `series` a pandas Series, and `func` a function to
apply/map:

| Without parallelisation | With parallelisation |
| -------------------------------------------------- | ----------------------------------------------------------- |
| `df.apply(func)` | `df.parallel_apply(func)` |
| `df.applymap(func)` | `df.parallel_applymap(func)` |
| `df.groupby(<args>).apply(func)` | `df.groupby(<args>).parallel_apply(func)` |
| `series.map(func)` | `series.parallel_map(func)` |
| `series.apply(func)` | `series.parallel_apply(func)` |
| `series.rolling(<args>).apply(func)` | `series.rolling(<args>).parallel_apply(func)` |
| `df.groupby(<args1>).rolling(<args2>).apply(func)` | `df.groupby(<args1>).rolling(<args2>).parallel_apply(func)` |
