Wrapper for df and df[col].apply parallelized
Project description
pandas-parallel-apply
Parallel wrappers for df.apply(fn)
, df[col].apply(fn)
, series.apply(fn)
, and df.groupby([cols]).apply(fn)
, with tqdm progress bars included.
Installation
pip install pandas-parallel-apply
Import with:
from pandas_parallel_apply import DataFrameParallel, SeriesParallel
Examples
See examples/
for usage on some dummy dataframes and series.
Usage
# Apply on each row of a dataframe
df.apply(fn)
# ->
DataFrameParallel(df, n_cores: int = None, pbar: bool = True).apply(fn)
# Apply on a column of a dataframe (returns a Series)
df[col].apply(fn, axis=1)
# ->
DataFrameParallel(df, n_cores: int = None, pbar: bool = True)[col].apply(fn, axis=1)
# Apply on a series
series.apply(fn)
# ->
SeriesParallel(series, n_cores: int = None, pbar: bool = True).apply(fn)
# GroupBy apply
df.groupby([cols]).apply(fn)
# ->
DataFrameParallel(df, n_cores: int = None, pbar: bool = True).groupby([cols]).apply(fn)
How it works
It takes the length of your dataframe (or series, or grouper) = N and the n_cores
provided to the constructors (K).
It then splits the dataframe in K chunks of N/K size and spawns K new processes, each processing the desired chunks.
Only row-wise (perfect parallelable) operations are supported, so df.apply(fn, axis=1)
is okay, but
df.apply(fn, axis=0)
is not because it may require rows that are on other workers.
It is assumed that each row is processed in similar time, so the N/K chunks will finishe more or less at the same time.
Future Improvement
Not supported but may be interesting: define also a number of chunks (C>K), so the df is actually split in N/C chunks, and theses are passed using a round-robin approach to the K processes. Right now, C=K, so whenever one process finishes, it will not be assigned any more work.
n_cores semantics
n_cores < -1
-> throws an errorn_cores == -1
-> usescpu_count()
- 1 coresn_cores == 0
-> uses serial/standard pandas functionsn_cores == 1
-> spawns a single process alongside the main onen_cores > 1
-> spanws N processes and chunks the dfn_cores > cpu_cpunt()
-> throws an warningn_cores > len(df)
-> limits tolen(df)
On CPU-bound tasks (calculations), n_cores = -1
is likely to be fastest. On network-bound operations (e.g., where threads may invoke network calls),
using a very high n_cores
value may be beneficial.
Disclaimers
-
This is an experimental repository. It may lead to unexpected behaviour.
-
Not all the merging semantics of pandas are supported. Pandas has weird and complex methods of converting an apply return. For example, a series apply function may return a dataframe, a series, a dict, a list, etc. All of these are converted in some specific way. Some cases may not be supported.
-
Groupby apply functions are much slower than their serial variant currently. Still experimenting with how to make it faster. It looks correct, just 10-100x slower for some small examples. May be better as dataframe get bigger.
-
Using
n_cores = 1
will create a multiprocessing pool of just 1 core, so the code is parallel (thus not running on the main process), but may not yield much speed improvement, except for not blocking the main process. May be useful in some GUI apps.
That's all.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for pandas-parallel-apply-2.1.tar.gz
Algorithm | Hash digest | |
---|---|---|
SHA256 | ca0c508e0e13ef5211e67ea9603c118a9a7d693a44e24eaad77d6e5a389682b7 |
|
MD5 | e58da36236672507543e5ce4db370f5b |
|
BLAKE2b-256 | 05aa61cf45ff71187f7387d4111c501ee7b67c4397947871c780e7057449a857 |
Hashes for pandas_parallel_apply-2.1-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 08c85da5134a5bb5b852917b10125e5c6aead72a30a30271e51073a7841dd48b |
|
MD5 | 3e63f240439dcc7509ed5c09de5a7771 |
|
BLAKE2b-256 | 257ce3f00789e0205c222bb43cb243df3e64cb330e7d611426db811b4bf1c924 |