
Parallelized wrappers for df.apply and df[col].apply

pandas-parallel-apply

Parallel wrappers for df.apply(fn), df[col].apply(fn), series.apply(fn) and df.groupby([cols]).apply(fn) with tqdm included

Installation

pip install pandas-parallel-apply

Examples

See examples/ for usage on some dummy dataframe and series.

Usage

Apply on each row of a dataframe

df.apply(fn, axis=1) -> DataFrameParallel(df, n_cores: int = None, pbar: bool = True).apply(fn, axis=1)

Apply on a column of a dataframe and return the Series

df[col].apply(fn) -> DataFrameParallel(df, n_cores: int = None, pbar: bool = True)[col].apply(fn)

Apply on a series

series.apply(fn) -> SeriesParallel(series, n_cores: int = None, pbar: bool = True).apply(fn)

GroupBy apply

df.groupby([cols]).apply(fn) -> DataFrameParallel(df, n_cores: int = None, pbar: bool = True).groupby([cols]).apply(fn)
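
The mappings above can be sketched as a small script. This is a minimal, untested-against-the-library illustration: the import path `from pandas_parallel_apply import ...` is an assumption (only the class names and constructor arguments appear above), and `double` and `run_example` are hypothetical names introduced for this example.

```python
def double(x):
    """Element-wise function, kept at module level so it can be pickled for worker processes."""
    return 2 * x


def run_example():
    # Imports are local so this module can be loaded without the packages installed.
    import pandas as pd
    # Assumption: both classes are importable from the package root.
    from pandas_parallel_apply import DataFrameParallel, SeriesParallel

    df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [10, 20, 30, 40]})

    # Plain pandas call on one column, followed by its parallel counterpart.
    serial = df["a"].apply(double)
    parallel = DataFrameParallel(df, n_cores=2, pbar=False)["a"].apply(double)

    # The same idea for a standalone series.
    series_parallel = SeriesParallel(df["b"], n_cores=2, pbar=False).apply(double)

    return serial, parallel, series_parallel
```

Because n_cores=2 involves worker processes, it is probably safest to call `run_example()` from under an `if __name__ == "__main__":` guard in a script.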

Disclaimers

  • This is an experimental repository. It may lead to unexpected behaviour.

  • Not all of pandas' result-merging semantics are supported. Pandas converts the return value of an apply function in several type-dependent ways: a function passed to a series apply may return a dataframe, a series, a dict, a list, etc., and each of these is converted differently. Some of these cases may not be supported.

  • Groupby apply is currently much slower than its serial counterpart. We are still experimenting with ways to make it faster. The results look correct, but it runs 10-100x slower on some small examples. It may fare better as dataframes get bigger.

  • Using n_cores=0 calls the underlying pandas code directly, so the interface is just a wrapper. Using n_cores=1 creates a multiprocessing pool with a single worker, so the code runs in a separate process rather than in the main one. This may not yield much of a speedup, but it avoids blocking the main process, which may be useful in some GUI apps.

  • We recommend only the object-oriented approach. You can use the internal functions apply_on_df_parallel, apply_on_df_col_parallel, apply_on_series_parallel and apply_on_groupby_parallel, but doing so usually adds unnecessary complexity to the code.

  • You can omit the n_cores argument in all the constructors. If it is not set, it defaults to the value of the environment variable PANDAS_PARALLEL_APPLY_N_CORES. If that is not set either, it defaults to 0 (serial apply).
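
The n_cores rules in the list above (explicit argument first, then the environment variable, then 0; 0 meaning a direct serial call and 1 or more meaning a worker pool) can be sketched with the standard library alone. This is not the package's actual implementation: `resolve_n_cores` and `apply_values` are hypothetical helpers, and a plain list stands in for a series.

```python
import os
from multiprocessing import Pool

ENV_VAR = "PANDAS_PARALLEL_APPLY_N_CORES"


def resolve_n_cores(n_cores=None):
    """Explicit argument wins, then the environment variable, then 0 (serial)."""
    if n_cores is not None:
        return n_cores
    return int(os.environ.get(ENV_VAR, "0"))


def apply_values(fn, values, n_cores=None):
    """Apply fn to every value, serially or in a process pool (hypothetical helper)."""
    n = resolve_n_cores(n_cores)
    if n == 0:
        # Serial path: runs in the calling process, no pool is created.
        return [fn(v) for v in values]
    # Parallel path: even n == 1 runs fn in a separate worker process.
    with Pool(processes=n) as pool:
        return pool.map(fn, values)
```

With the variable unset, `apply_values(abs, [-1, 2, -3])` takes the serial path. Exporting `PANDAS_PARALLEL_APPLY_N_CORES=4` before starting the program would switch the same call to a pool of four workers without any code change.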

That's all.


Download files

Source Distribution

pandas-parallel-apply-2.0.tar.gz (7.3 kB)
