Parallel-pandas
Makes it easy to parallelize your calculations in pandas on all your CPUs.
Installation
Quickstart
import pandas as pd
import numpy as np
from parallel_pandas import ParallelPandas
#initialize parallel-pandas
ParallelPandas.initialize(n_cpu=16, split_factor=4, disable_pr_bar=True)
# create big DataFrame
df = pd.DataFrame(np.random.random((1_000_000, 100)))
# calculate multiple quantiles. Pandas only uses one core of CPU
%%timeit
res = df.quantile(q=[.25, .5, .95], axis=1)
3.66 s ± 31.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#p_quantile is parallel analogue of quantile methods. Can use all cores of your CPU.
%%timeit
res = df.p_quantile(q=[.25, .5, .95], axis=1)
679 ms ± 10.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
As you can see the p_quantile
method is 5 times faster!
Usage
Warning: You will only notice performance gains if your data is very large.
Under the hood, **parallel-pandas** works very simply. The Dataframe or Series is split into chunks along the first or second axis. Then these chunks are passed to a pool of processes or threads where the desired method is executed on each part. At the end, the parts are concatenated to get the final result.
When initializing parallel-pandas you can specify the following options:
n_cpu
- the number of cores of your CPU that you want to use (default None
- use all cores of CPU)
split_factor
- Affects the number of chunks into which the DataFrame/Series is split according to the formula chunks_number = split_factor*n_cpu
(default 1).
show_vmem
- Shows a progress bar with available RAM (default False
)
disable_pr_bar
- Disable the progress bar for parallel tasks (default False
)
For example
import pandas as pd
import numpy as np
from parallel_pandas import ParallelPandas
#initialize parallel-pandas
ParallelPandas.initialize(n_cpu=16, split_factor=4, disable_pr_bar=False)
# create big DataFrame
df = pd.DataFrame(np.random.random((1_000_000, 100)))
During initialization, we specified split_factor=4
and n_cpu = 16
, so the DataFrame will be split into 64 chunks (in the case of the describe
method, axis = 1) and the progress bar shows the progress for each chunk
Parallel counterparts for pandas Series methods
methods |
parallel analogue |
executor |
pd.Series.apply() |
pd.Series.p_apply() |
threads / processes |
pd.Series.map() |
pd.Series.p_map() |
threads / processes |
Parallel counterparts for pandas SeriesGroupBy methods
methods |
parallel analogue |
executor |
pd.SeriesGroupBy.apply() |
pd.SeriesGroupBy.p_apply() |
threads / processes |
Parallel counterparts for pandas Dataframe methods
methods |
parallel analogue |
executor |
df.mean() |
df.p_mean() |
threads |
df.min() |
df.p_min() |
threads |
df.max() |
df.p_max() |
threads |
df.median() |
df.p_max() |
threads |
df.kurt() |
df.p_kurt() |
threads |
df.skew() |
df.p_skew() |
threads |
df.sum() |
df.p_sum() |
threads |
df.prod() |
df.p_prod() |
threads |
df.var() |
df.p_var() |
threads |
df.sem() |
df.p_sem() |
threads |
df.std() |
df.p_std() |
threads |
df.cummin() |
df.p_cummin() |
threads |
df.cumsum() |
df.p_cumsum() |
threads |
df.cummax() |
df.p_cummax() |
threads |
df.cumprod() |
df.p_cumprod() |
threads |
df.apply() |
df.p_apply() |
threads / processes |
df.applymap() |
df.p_applymap() |
processes |
df.replace() |
df.p_replace() |
threads |
df.describe() |
df.p_describe() |
threads |
df.nunique() |
df.p_nunique() |
threads / processes |
df.mad() |
df.p_mad() |
threads |
df.idxmin() |
df.p_idxmin() |
threads |
df.idxmax() |
df.p_idxmax() |
threads |
df.rank() |
df.p_rank() |
threads |
df.mode() |
df.p_mode() |
threads/processes |
Parallel counterparts for pandas DataframeGroupBy methods
methods |
parallel analogue |
executor |
DataFrameGroupBy.apply() |
DataFrameGroupBy.p_apply() |
threads / processes |
Parallel counterparts for pandas window methods
Rolling
methods |
parallel analogue |
executor |
pd.core.window.Rolling.apply() |
pd.core.window.Rolling.p_apply() |
threads / processes |
pd.core.window.Rolling.min() |
pd.core.window.Rolling.p_min() |
threads / processes |
pd.core.window.Rolling.max() |
pd.core.window.Rolling.p_max() |
threads / processes |
pd.core.window.Rolling.mean() |
pd.core.window.Rolling.p_mean() |
threads / processes |
pd.core.window.Rolling.sum() |
pd.core.window.Rolling.p_sum() |
threads / processes |
pd.core.window.Rolling.var() |
pd.core.window.Rolling.p_var() |
threads / processes |
pd.core.window.Rolling.sem() |
pd.core.window.Rolling.p_sem() |
threads / processes |
pd.core.window.Rolling.skew() |
pd.core.window.Rolling.p_skew() |
threads / processes |
pd.core.window.Rolling.kurt() |
pd.core.window.Rolling.p_kurt() |
threads / processes |
pd.core.window.Rolling.median() |
pd.core.window.Rolling.p_median() |
threads / processes |
pd.core.window.Rolling.quantile() |
pd.core.window.Rolling.p_quantile() |
threads / processes |
pd.core.window.Rolling.rank() |
pd.core.window.Rolling.p_rank() |
threads / processes |
RollingGroupby
methods |
parallel analogue |
executor |
pd.core.window.RollingGroupby.apply() |
pd.core.window.RollingGroupby.p_apply() |
threads / processes |
pd.core.window.RollingGroupby.min() |
pd.core.window.RollingGroupby.p_min() |
threads / processes |
pd.core.window.RollingGroupby.max() |
pd.core.window.RollingGroupby.p_max() |
threads / processes |
pd.core.window.RollingGroupby.mean() |
pd.core.window.RollingGroupby.p_mean() |
threads / processes |
pd.core.window.RollingGroupby.sum() |
pd.core.window.RollingGroupby.p_sum() |
threads / processes |
pd.core.window.RollingGroupby.var() |
pd.core.window.RollingGroupby.p_var() |
threads / processes |
pd.core.window.RollingGroupby.sem() |
pd.core.window.RollingGroupby.p_sem() |
threads / processes |
pd.core.window.RollingGroupby.skew() |
pd.core.window.RollingGroupby.p_skew() |
threads / processes |
pd.core.window.RollingGroupby.kurt() |
pd.core.window.RollingGroupby.p_kurt() |
threads / processes |
pd.core.window.RollingGroupby.median() |
pd.core.window.RollingGroupby.p_median() |
threads / processes |
pd.core.window.RollingGroupby.quantile() |
pd.core.window.RollingGroupby.p_quantile() |
threads / processes |
pd.core.window.RollingGroupby.rank() |
pd.core.window.RollingGroupby.p_rank() |
threads / processes |
Expanding
methods |
parallel analogue |
executor |
pd.core.window.Expanding.apply() |
pd.core.window.Expanding.p_apply() |
threads / processes |
pd.core.window.Expanding.min() |
pd.core.window.Expanding.p_min() |
threads / processes |
pd.core.window.Expanding.max() |
pd.core.window.Expanding.p_max() |
threads / processes |
pd.core.window.Expanding.mean() |
pd.core.window.Expanding.p_mean() |
threads / processes |
pd.core.window.Expanding.sum() |
pd.core.window.Expanding.p_sum() |
threads / processes |
pd.core.window.Expanding.var() |
pd.core.window.Expanding.p_var() |
threads / processes |
pd.core.window.Expanding.sem() |
pd.core.window.Expanding.p_sem() |
threads / processes |
pd.core.window.Expanding.skew() |
pd.core.window.Expanding.p_skew() |
threads / processes |
pd.core.window.Expanding.kurt() |
pd.core.window.Expanding.p_kurt() |
threads / processes |
pd.core.window.Expanding.median() |
pd.core.window.Expanding.p_median() |
threads / processes |
pd.core.window.Expanding.quantile() |
pd.core.window.Expanding.p_quantile() |
threads / processes |
pd.core.window.Expanding.rank() |
pd.core.window.Expanding.p_rank() |
threads / processes |
ExpandingGroupby
methods |
parallel analogue |
executor |
pd.core.window.ExpandingGroupby.apply() |
pd.core.window.ExpandingGroupby.p_apply() |
threads / processes |
pd.core.window.ExpandingGroupby.min() |
pd.core.window.ExpandingGroupby.p_min() |
threads / processes |
pd.core.window.ExpandingGroupby.max() |
pd.core.window.ExpandingGroupby.p_max() |
threads / processes |
pd.core.window.ExpandingGroupby.mean() |
pd.core.window.ExpandingGroupby.p_mean() |
threads / processes |
pd.core.window.ExpandingGroupby.sum() |
pd.core.window.ExpandingGroupby.p_sum() |
threads / processes |
pd.core.window.ExpandingGroupby.var() |
pd.core.window.ExpandingGroupby.p_var() |
threads / processes |
pd.core.window.ExpandingGroupby.sem() |
pd.core.window.ExpandingGroupby.p_sem() |
threads / processes |
pd.core.window.ExpandingGroupby.skew() |
pd.core.window.ExpandingGroupby.p_skew() |
threads / processes |
pd.core.window.ExpandingGroupby.kurt() |
pd.core.window.ExpandingGroupby.p_kurt() |
threads / processes |
pd.core.window.ExpandingGroupby.median() |
pd.core.window.ExpandingGroupby.p_median() |
threads / processes |
pd.core.window.ExpandingGroupby.quantile() |
pd.core.window.ExpandingGroupby.p_quantile() |
threads / processes |
pd.core.window.ExpandingGroupby.rank() |
pd.core.window.ExpandingGroupby.p_rank() |
threads / processes |
ExponentialMovingWindow
methods |
parallel analogue |
executor |
pd.core.window.ExponentialMovingWindow.mean() |
pd.core.window.ExponentialMovingWindow.p_mean() |
threads / processes |
pd.core.window.ExponentialMovingWindow.sum() |
pd.core.window.ExponentialMovingWindow.p_sum() |
threads / processes |
pd.core.window.ExponentialMovingWindow.var() |
pd.core.window.ExponentialMovingWindow.p_var() |
threads / processes |
pd.core.window.ExponentialMovingWindow.std() |
pd.core.window.ExponentialMovingWindow.p_std() |
threads / processes |
ExponentialMovingWindowGroupby
methods |
parallel analogue |
executor |
pd.core.window.ExponentialMovingWindowGroupby.mean() |
pd.core.window.ExponentialMovingWindowGroupby.p_mean() |
threads / processes |
pd.core.window.ExponentialMovingWindowGroupby.sum() |
pd.core.window.ExponentialMovingWindowGroupby.p_sum() |
threads / processes |
pd.core.window.ExponentialMovingWindowGroupby.var() |
pd.core.window.ExponentialMovingWindowGroupby.p_var() |
threads / processes |
pd.core.window.ExponentialMovingWindowGroupby.std() |
pd.core.window.ExponentialMovingWindowGroupby.p_std() |
threads / processes |