
A simple, parallelized drop-in replacement for pandas `apply`


Parapply

A simple, parallelized drop-in replacement for pandas apply() on large Series / DataFrames, built on joblib. It works by splitting the Series / DataFrame into multiple chunks and running apply on the chunks concurrently. As a rule of thumb, use parapply only if you have 10 million rows or more (see the benchmark below).
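To make the chunk-and-apply idea concrete, here is a minimal sketch of the general technique using joblib. It is not parapply's actual implementation: the helper name `chunked_apply` and its default values are hypothetical, and the sketch only covers element-wise apply on a Series and row-wise apply (axis=1) on a DataFrame.

```python
import numpy as np
import pandas as pd
from joblib import Parallel, delayed


def chunked_apply(obj, fun, axis=1, n_jobs=2, n_chunks=4):
    """Hypothetical helper: split obj into row chunks, apply fun to each
    chunk in its own joblib job, then concatenate the partial results."""
    if isinstance(obj, pd.DataFrame) and axis != 1:
        raise NotImplementedError("this sketch only handles row-wise apply (axis=1)")

    # Split by integer position so that any index type works.
    positions = np.array_split(np.arange(len(obj)), n_chunks)
    chunks = [obj.iloc[pos] for pos in positions if len(pos) > 0]

    def apply_chunk(chunk):
        if isinstance(chunk, pd.DataFrame):
            return chunk.apply(fun, axis=axis)
        return chunk.apply(fun)

    # Each chunk is processed by a separate joblib task.
    results = Parallel(n_jobs=n_jobs)(delayed(apply_chunk)(c) for c in chunks)

    # Chunks were taken in order, so concatenating restores the original order.
    return pd.concat(results)


srs = pd.Series(np.arange(10, dtype=float))
result = chunked_apply(srs, lambda x: x ** 2)
```

Each chunk has to be shipped to a worker and its result shipped back, which is one plausible reason the approach only pays off on very large inputs.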

Install by running pip install parapply. Requires joblib, numpy, and pandas (obviously!).

Simple Usage

Series: parapply(srs, fun) instead of srs.apply(fun)

DataFrames: parapply(df, fun, axis) instead of df.apply(fun, axis)

For more fine-grained control:

+ n_jobs sets the number of concurrent jobs
+ n_chunks sets the number of chunks the Series / DataFrame is split into

Examples:

import pandas as pd
import numpy as np
from parapply import parapply

# Series example
np.random.seed(0)
srs = pd.Series(np.random.random(size=(5, )))
pd_apply_result = srs.apply(lambda x: x ** 2)
parapply_result = parapply(srs, lambda x: x ** 2)
print(pd_apply_result)

# 0    0.301196
# 1    0.511496
# 2    0.363324
# 3    0.296898
# 4    0.179483
# dtype: float64

print(parapply_result)

# 0    0.301196
# 1    0.511496
# 2    0.363324
# 3    0.296898
# 4    0.179483
# dtype: float64

# DataFrame example with axis = 1
np.random.seed(1)
df = pd.DataFrame(data={
    'a': np.random.random(size=(5, )),
    'b': np.random.random(size=(5, )),
    'c': np.random.random(size=(5, )),
})

pd_apply_result = df.apply(sum, axis=1)
parapply_result = parapply(df, sum, axis=1)
print(pd_apply_result)

# 0    0.928555
# 1    1.591804
# 2    0.550127
# 3    1.577217
# 4    0.712960
# dtype: float64

print(parapply_result)

# 0    0.928555
# 1    1.591804
# 2    0.550127
# 3    1.577217
# 4    0.712960
# dtype: float64

Refer to docstrings for more information.

Quick and dirty benchmarks

I ran a quick and dirty benchmark comparing the time taken to apply lambda x: x ** 2 to Series of varying lengths, using pandas apply and parapply with several n_jobs settings:

[Figure: runtime vs. log(number of data points)]

The semilog plot above shows that significant runtime differences between pandas apply and parapply appear from 10 million data points onwards.
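A comparison of this kind can be reproduced with a small timing harness. The sketch below is an assumption about how such a benchmark might be set up, not the script used for the plot: the helper names `time_call` and `benchmark` are hypothetical, and the sizes are kept small so that it runs quickly.

```python
import time

import numpy as np
import pandas as pd


def square(x):
    return x ** 2


def time_call(fn, repeats=3):
    """Return the best wall-clock time in seconds over a few repeats."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def benchmark(variants, sizes):
    """Time every variant (a callable taking (srs, fun)) on Series of each size."""
    rows = []
    for n in sizes:
        srs = pd.Series(np.random.random(size=(n, )))
        for name, variant in variants.items():
            seconds = time_call(lambda: variant(srs, square))
            rows.append({"n": n, "variant": name, "seconds": seconds})
    return pd.DataFrame(rows)


variants = {"pandas apply": lambda srs, fun: srs.apply(fun)}

# To include parapply (requires `pip install parapply`), something like the
# following could be added; passing n_jobs as a keyword is an assumption here:
# from parapply import parapply
# variants["parapply, n_jobs=2"] = lambda srs, fun: parapply(srs, fun, n_jobs=2)

timings = benchmark(variants, sizes=[1_000, 10_000])
print(timings)
```

To see the regime described above, the sizes would need to reach 10 million and beyond, which takes considerably longer to run.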

Acknowledgements

Thanks to @aaronlhe for introducing me to the world of unit tests!

