A simple, parallelized drop-in replacement for pandas `apply`

Parapply

A simple, parallelized drop-in replacement for pandas apply() on large Series / DataFrames, built on joblib. It works by dividing the Series / DataFrame into multiple chunks and running apply on the chunks concurrently. As a rule of thumb, use parapply only when you have at least 1 million rows.
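The chunk-and-apply idea described above can be sketched roughly as follows. This is not parapply's actual implementation: `chunked_apply`, its `backend` parameter, and the choice of near-equal contiguous chunks are assumptions made for illustration. The sketch is only meaningful for row-wise work (Series apply, or DataFrame apply with axis=1), since each chunk sees only its own rows.

```python
import numpy as np
import pandas as pd
from joblib import Parallel, delayed


def _apply_chunk(chunk, fun, **kwargs):
    # Plain pandas apply on a single chunk; this is what each job runs.
    return chunk.apply(fun, **kwargs)


def chunked_apply(data, fun, n_jobs=2, n_chunks=4, backend=None, **kwargs):
    """Hypothetical sketch: split rows into chunks, apply per chunk, reassemble."""
    if len(data) == 0:
        # Nothing to split; fall back to plain pandas.
        return data.apply(fun, **kwargs)
    # Never ask for more chunks than there are rows.
    n_chunks = max(1, min(n_chunks, len(data)))
    # Split row positions into near-equal contiguous blocks.
    positions = np.array_split(np.arange(len(data)), n_chunks)
    chunks = [data.iloc[idx] for idx in positions]
    # Run one apply per chunk; joblib returns results in submission order.
    results = Parallel(n_jobs=n_jobs, backend=backend)(
        delayed(_apply_chunk)(chunk, fun, **kwargs) for chunk in chunks
    )
    # Chunks were contiguous and ordered, so concatenation restores row order.
    return pd.concat(results)
```

Calling `chunked_apply(srs, fun)` or `chunked_apply(df, fun, axis=1)` should then give the same result as the corresponding pandas call. Splitting and reassembling has a cost of its own, which is one plausible reason the rule of thumb above recommends parallelizing only for large inputs.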

Simple Usage

  • Series: parapply(srs, fun) instead of srs.apply(fun)
  • DataFrames: parapply(df, fun, axis) instead of df.apply(fun, axis)

For more fine-grained control:

  • n_jobs to decide the number of concurrent jobs
  • n_chunks for the number of chunks to split the Series / DataFrame into
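To get a feel for what a given n_chunks means in practice, the short sketch below prints the chunk sizes produced by a near-equal contiguous split. How parapply actually divides the rows is not documented here, so the use of `np.array_split` and the helper name `chunk_sizes` are assumptions for illustration only.

```python
import numpy as np


def chunk_sizes(n_rows, n_chunks):
    # Hypothetical helper: sizes of the blocks a near-equal split would produce.
    return [len(part) for part in np.array_split(np.arange(n_rows), n_chunks)]


print(chunk_sizes(10, 4))         # [3, 3, 2, 2]
print(chunk_sizes(1_000_000, 8))  # eight chunks of 125000 rows each
```

Under this assumption, making n_chunks larger than n_jobs would give each job several smaller chunks rather than one large one, which may help balance the load when some rows are slower to process than others.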

Examples:

import pandas as pd
import numpy as np
from parapply import parapply

# Series example
np.random.seed(0)
srs = pd.Series(np.random.random(size=(5, )))
pd_apply_result = srs.apply(lambda x: x ** 2)
parapply_result = parapply(srs, lambda x: x ** 2)
print(pd_apply_result)

# 0    0.301196
# 1    0.511496
# 2    0.363324
# 3    0.296898
# 4    0.179483
# dtype: float64

print(parapply_result)

# 0    0.301196
# 1    0.511496
# 2    0.363324
# 3    0.296898
# 4    0.179483
# dtype: float64

# DataFrame example with axis = 1
np.random.seed(1)
df = pd.DataFrame(data={
    'a': np.random.random(size=(5, )),
    'b': np.random.random(size=(5, )),
    'c': np.random.random(size=(5, )),
})

pd_apply_result = df.apply(sum, axis=1)
parapply_result = parapply(df, sum, axis=1)
print(pd_apply_result)

# 0    0.928555
# 1    1.591804
# 2    0.550127
# 3    1.577217
# 4    0.712960
# dtype: float64

print(parapply_result)

# 0    0.928555
# 1    1.591804
# 2    0.550127
# 3    1.577217
# 4    0.712960
# dtype: float64

Refer to docstrings for more information.

Requirements

  • pandas (obviously)
  • numpy
  • joblib

Quick and dirty benchmark

TODO: To be updated

Installation

TODO: To be updated

Acknowledgements

Thanks to @aaronlhe for introducing me to the world of unit tests!


Files for parapply, version 0.0.1
Filename                          Size     File type   Python version
parapply-0.0.1-py3-none-any.whl   6.0 kB   Wheel       py3
parapply-0.0.1.tar.gz             4.3 kB   Source      None
