Parallelized wrappers for df.apply and df[col].apply
pandas-parallel-apply
Parallel wrappers for df.apply(fn), df[col].apply(fn), series.apply(fn) and df.groupby([cols]).apply(fn), with tqdm included.
Installation
pip install pandas-parallel-apply
Examples
See examples/ for usage on a dummy dataframe and series.
Usage
Apply on each row of a dataframe
df.apply(fn)
-> DataFrameParallel(df, n_cores: int = None, pbar: bool = True).apply(fn)
Apply on a column of a dataframe and return the Series
df[col].apply(fn, axis=1)
-> DataFrameParallel(df, n_cores: int = None, pbar: bool = True)[col].apply(fn, axis=1)
Apply on a series
series.apply(fn)
-> SeriesParallel(series, n_cores: int = None, pbar: bool = True).apply(fn)
GroupBy apply
df.groupby([cols]).apply(fn)
-> DataFrameParallel(df, n_cores: int = None, pbar: bool = True).groupby([cols]).apply(fn)
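For reference, here is a small sketch of the serial pandas calls that appear on the left-hand side of each mapping above, run on a dummy dataframe. The dataframe and the helper functions (row_times_ten, double, group_sum) are made up for illustration. axis=1 is passed explicitly for the row-wise case because plain pandas applies the function per column by default.

```python
import pandas as pd

# Dummy data, invented for this illustration
df = pd.DataFrame({"group": ["a", "a", "b", "b"], "value": [1, 2, 3, 4]})
series = pd.Series([1, 2, 3])


def row_times_ten(row):
    # With axis=1, each row arrives as a Series indexed by column name
    return row["value"] * 10


def double(x):
    # Receives one scalar element of the column / series
    return x * 2


def group_sum(group_df):
    # Receives the sub-dataframe of one group
    return group_df["value"].sum()


per_row = df.apply(row_times_ten, axis=1)           # apply on each row
per_col = df["value"].apply(double)                 # apply on one column
per_series = series.apply(double)                   # apply on a series
per_group = df.groupby(["group"]).apply(group_sum)  # groupby apply

# Per the mappings above, the parallel counterparts wrap the dataframe or
# series first, e.g. DataFrameParallel(df, n_cores=...).apply(fn).
```

Functions defined at module level, as here, are generally the safer choice for anything that is later handed to a multiprocessing pool, since lambdas cannot be pickled by the standard library.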
Disclaimers
- This is an experimental repository. It may lead to unexpected behaviour.
- Not all the merging semantics of pandas are supported. Pandas has weird and complex methods of converting the return value of an apply. For example, a series apply function may return a dataframe, a series, a dict, a list, etc., and each of these is converted in its own specific way. Some cases may not be supported.
- Groupby apply functions are currently much slower than their serial variant. We are still experimenting with how to make them faster. The results look correct, just 10-100x slower for some small examples. It may get better as dataframes get bigger.
- Using n_cores=0 will call the underlying pandas code directly, so the interface is just a wrapper. Using n_cores=1 will create a multiprocessing pool of just 1 core, so the code is parallel (thus not running on the main process), but may not yield much speed improvement, except for not blocking the main process. This may be useful in some GUI apps.
- We recommend only the object-oriented approach. You can use the internal apply_on_df_parallel, apply_on_df_col_parallel, apply_on_series_parallel and apply_on_groupby_parallel functions, but this usually adds unnecessary complexity to the code.
- You can ignore the n_cores argument to all the constructors. If not set, it will default to the environment variable PANDAS_PARALLEL_APPLY_N_CORES. If this is also not set, it defaults to 0 (serial apply).
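To make the disclaimer about apply return conversions concrete, the sketch below shows how plain pandas treats four different return types from the same series.apply call. It uses only pandas itself, not this package. A parallel wrapper presumably has to reproduce each of these conversions when it merges per-chunk results, which is where unsupported cases can come from.

```python
import pandas as pd

s = pd.Series([1, 2, 3])

# Scalar return: the result is a Series of scalars
as_scalar = s.apply(lambda x: x + 1)

# Series return: pandas expands the result into a DataFrame,
# one column per key of the returned Series
as_series = s.apply(lambda x: pd.Series({"a": x, "b": x * 2}))

# Dict return: the result stays a Series whose elements are dicts
as_dict = s.apply(lambda x: {"a": x})

# List return: the result stays a Series whose elements are lists
as_list = s.apply(lambda x: [x, x])
```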
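The n_cores rules described above (explicit argument, then the PANDAS_PARALLEL_APPLY_N_CORES environment variable, then 0 for serial) can be sketched roughly as follows. This is not the package's actual code: resolve_n_cores, series_apply_sketch and _apply_chunk are hypothetical names, and the chunking strategy and the integer parsing of the environment variable are assumptions made for illustration.

```python
import os
from multiprocessing import Pool

import pandas as pd

ENV_VAR = "PANDAS_PARALLEL_APPLY_N_CORES"  # name taken from the text above


def resolve_n_cores(n_cores=None):
    """Hypothetical helper: explicit argument, else env variable, else 0."""
    if n_cores is not None:
        return n_cores
    # Assumption: the variable holds a plain integer
    return int(os.environ.get(ENV_VAR, 0))


def _apply_chunk(args):
    # Runs inside a worker process: a plain pandas apply on one chunk
    chunk, fn = args
    return chunk.apply(fn)


def series_apply_sketch(series, fn, n_cores=None):
    """Hypothetical sketch of a serial-or-pool dispatch for series.apply."""
    n_cores = resolve_n_cores(n_cores)
    if n_cores == 0 or len(series) == 0:
        # Serial path: call pandas directly, no pool is created
        return series.apply(fn)
    # Split into at most n_cores contiguous chunks (ceiling division)
    n_chunks = min(n_cores, len(series))
    size = -(-len(series) // n_chunks)
    chunks = [series.iloc[i:i + size] for i in range(0, len(series), size)]
    # n_cores=1 still creates a pool, so the work leaves the main process
    with Pool(n_cores) as pool:
        parts = pool.map(_apply_chunk, [(chunk, fn) for chunk in chunks])
    return pd.concat(parts)


def square(x):
    return x * x


values = pd.Series([1, 2, 3, 4])
serial = series_apply_sketch(values, square, n_cores=0)
```

When a sketch like this is run with n_cores of 1 or more, fn must be a picklable module-level function, and on platforms that spawn worker processes the calling code should sit under an if __name__ == "__main__": guard.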
That's all.
Source Distribution
Hashes for pandas-parallel-apply-2.0.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 48478156797cee9fd59922a703b59d76395c6c32fb3df1df2117d6b775339851
MD5 | be800da8267e99e5e38e2e610d9c395e
BLAKE2b-256 | 11be99a514d69f3766e96747949b66a24347ee2458448d27246a1e0c6d13574d