A simple drop-in replacement for parallelized pandas `apply`
Parapply
A simple, parallelized drop-in replacement for pandas `apply()` on large Series / DataFrames, using joblib. It works by dividing the Series / DataFrame into multiple chunks and running `apply` on the chunks concurrently. As a rule of thumb, use `parapply` only if you have 10 million rows or more (see the benchmark below).
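The chunk-and-apply idea described above can be sketched roughly as follows. This is not parapply's actual source: `chunked_apply` and `_apply_chunk` are hypothetical names, the defaults are arbitrary, and the sketch only covers Series and row-wise (`axis=1`) DataFrame applies, where splitting by rows is safe.

```python
import numpy as np
import pandas as pd
from joblib import Parallel, delayed


def _apply_chunk(chunk, fun, axis):
    """Run the ordinary pandas apply on one chunk."""
    # Series.apply has no axis argument; DataFrame.apply does.
    if isinstance(chunk, pd.Series):
        return chunk.apply(fun)
    return chunk.apply(fun, axis=axis)


def chunked_apply(obj, fun, axis=1, n_jobs=2, n_chunks=4):
    """Hypothetical sketch: split obj by rows, apply per chunk, reassemble."""
    if isinstance(obj, pd.DataFrame) and axis not in (1, "columns"):
        # Column-wise applies need whole columns, so row chunks would be wrong.
        raise ValueError("this sketch only supports axis=1 for DataFrames")

    # Never ask for more chunks than rows, and always use at least one chunk.
    n_chunks = max(1, min(n_chunks, len(obj)))

    # Split row positions into near-equal contiguous groups.
    positions = np.array_split(np.arange(len(obj)), n_chunks)
    chunks = [obj.iloc[idx] for idx in positions]

    # Each chunk is processed by a joblib worker; results come back in order.
    results = Parallel(n_jobs=n_jobs)(
        delayed(_apply_chunk)(chunk, fun, axis) for chunk in chunks
    )

    # Concatenating in chunk order restores the original row order and index.
    return pd.concat(results)


if __name__ == "__main__":
    srs = pd.Series(np.arange(6, dtype=float))
    print(chunked_apply(srs, lambda x: x ** 2, n_jobs=2, n_chunks=3))
```

Because every chunk has to be shipped to a worker and the results stitched back together, this kind of approach carries a fixed overhead, which is consistent with the advice to reserve it for very large inputs.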
Install by running `pip install parapply`. Requires joblib, numpy, and pandas (obviously!).
Simple Usage
+ Series: `parapply(srs, fun)` instead of `srs.apply(fun)`
+ DataFrames: `parapply(df, fun, axis)` instead of `df.apply(fun, axis)`
For more fine-grained control:
+ `n_jobs` sets the number of concurrent jobs,
+ `n_chunks` sets the number of chunks to split the Series / DataFrame into.
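To get a feel for how these two knobs interact, the sketch below computes the chunk sizes that a given `n_chunks` would produce for a given number of rows. `plan_chunks` is a hypothetical helper, not part of parapply, and the fallbacks it uses (one job per CPU, one chunk per job) are assumptions for illustration; the library's real defaults are documented in its docstrings.

```python
import os


def plan_chunks(n_rows, n_jobs=None, n_chunks=None):
    """Hypothetical helper: return (n_jobs, chunk sizes) for a row count."""
    # Assumed fallback: one job per available CPU.
    if n_jobs is None:
        n_jobs = os.cpu_count() or 1
    # Assumed fallback: one chunk per job.
    if n_chunks is None:
        n_chunks = n_jobs

    # Keep the chunk count between 1 and the number of rows.
    n_chunks = max(1, min(n_chunks, n_rows))

    # Spread rows as evenly as possible; the first `extra` chunks get one more.
    base, extra = divmod(n_rows, n_chunks)
    sizes = [base + 1] * extra + [base] * (n_chunks - extra)
    return n_jobs, sizes


if __name__ == "__main__":
    print(plan_chunks(10, n_jobs=2, n_chunks=4))
```

With more chunks than jobs, each worker handles several smaller chunks in turn, which can even out the load when some rows are slower than others, at the cost of more scheduling overhead.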
Examples:
import pandas as pd
import numpy as np
from parapply import parapply
# Series example
np.random.seed(0)
srs = pd.Series(np.random.random(size=(5, )))
pd_apply_result = srs.apply(lambda x: x ** 2)
parapply_result = parapply(srs, lambda x: x ** 2)
print(pd_apply_result)
# 0 0.301196
# 1 0.511496
# 2 0.363324
# 3 0.296898
# 4 0.179483
# dtype: float64
print(parapply_result)
# 0 0.301196
# 1 0.511496
# 2 0.363324
# 3 0.296898
# 4 0.179483
# dtype: float64
# DataFrame example with axis = 1
np.random.seed(1)
df = pd.DataFrame(data={
'a': np.random.random(size=(5, )),
'b': np.random.random(size=(5, )),
'c': np.random.random(size=(5, )),
})
pd_apply_result = df.apply(sum, axis=1)
parapply_result = parapply(df, sum, axis=1)
print(pd_apply_result)
# 0 0.928555
# 1 1.591804
# 2 0.550127
# 3 1.577217
# 4 0.712960
# dtype: float64
print(parapply_result)
# 0 0.928555
# 1 1.591804
# 2 0.550127
# 3 1.577217
# 4 0.712960
# dtype: float64
Refer to docstrings for more information.
Quick and dirty benchmarks
I ran a quick and dirty benchmark comparing the time taken to apply `lambda x: x ** 2` to Series of varying length, using pandas `apply` and `parapply` with several `n_jobs` settings.

The semilog plot of the results shows that significant runtime differences between pandas `apply` and `parapply` only show up from about 10 million data points onwards.
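A comparison like this can be reproduced with a small timing harness along the following lines. This is a sketch, not the script behind the benchmark above: `best_time` and `benchmark` are hypothetical helpers, and the sizes in the demo are kept small so that it finishes quickly.

```python
import time

import numpy as np
import pandas as pd


def best_time(fn, repeats=3):
    """Return the fastest wall-clock time (seconds) over several runs of fn."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def benchmark(candidates, sizes, fun):
    """Time every candidate apply function on random Series of each size."""
    rows = []
    for n in sizes:
        srs = pd.Series(np.random.random(size=n))
        row = {"n_rows": n}
        for label, apply_fn in candidates.items():
            row[label] = best_time(lambda: apply_fn(srs, fun))
        rows.append(row)
    return pd.DataFrame(rows).set_index("n_rows")


if __name__ == "__main__":
    candidates = {"pandas apply": lambda s, f: s.apply(f)}
    # With parapply installed, further entries could be added, for example
    # (assuming n_jobs is accepted as a keyword argument):
    #   candidates["parapply, n_jobs=2"] = lambda s, f: parapply(s, f, n_jobs=2)
    print(benchmark(candidates, sizes=[10**3, 10**4, 10**5], fun=lambda x: x ** 2))
```

Taking the best of several runs reduces noise from other processes; plotting the resulting table with a logarithmic x-axis gives a semilog view of how runtime grows with Series length.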
Acknowledgements
Thanks to @aaronlhe for introducing me to the world of unit tests!