

pandas-parallel-apply

Parallel wrappers for df.apply(fn), df[col].apply(fn), series.apply(fn), and df.groupby([cols]).apply(fn), with tqdm progress bars included.

Installation

pip install pandas-parallel-apply

Import with:

from pandas_parallel_apply import DataFrameParallel, SeriesParallel

Examples

See examples/ for usage on some dummy dataframes and series.

Usage

# Apply on each row of a dataframe
df.apply(fn, axis=1)
# ->
DataFrameParallel(df, n_cores: int = None, pbar: bool = True).apply(fn, axis=1)

# Apply on a column of a dataframe (returns a Series)
df[col].apply(fn)
# ->
DataFrameParallel(df, n_cores: int = None, pbar: bool = True)[col].apply(fn)

# Apply on a series
series.apply(fn)
# -> 
SeriesParallel(series, n_cores: int = None, pbar: bool = True).apply(fn)

# GroupBy apply
df.groupby([cols]).apply(fn)
# ->
DataFrameParallel(df, n_cores: int = None, pbar: bool = True).groupby([cols]).apply(fn)
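For instance, a complete row-wise run could look like this. The dataframe and row_sum are made up for illustration; the DataFrameParallel call follows the signatures above.

import pandas as pd
from pandas_parallel_apply import DataFrameParallel

def row_sum(row):
    # module-level function, so it pickles cleanly for multiprocessing
    return row["a"] + row["b"]

if __name__ == "__main__":  # required by multiprocessing on spawn platforms
    df = pd.DataFrame({"a": range(10_000), "b": range(10_000)})
    result = DataFrameParallel(df, n_cores=4, pbar=True).apply(row_sum, axis=1)
    print(result.head())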

How it works

It takes the length of your dataframe (or series, or grouper), N, and the n_cores provided to the constructor, K. It then splits the dataframe into K chunks of roughly N/K rows each and spawns K new processes, each applying the function to its own chunk, before concatenating the partial results back together.
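A rough sketch of that strategy, using only pandas and the standard library. This is an illustration of the idea, not the library's actual code.

import pandas as pd
from multiprocessing import Pool

def _apply_chunk(args):
    # worker: run the plain pandas apply on one chunk
    chunk, fn = args
    return chunk.apply(fn, axis=1)

def parallel_apply(df, fn, n_cores):
    # split into K chunks of ~N/K rows, one per worker
    size = -(-len(df) // n_cores)  # ceil(N / K)
    chunks = [df.iloc[i:i + size] for i in range(0, len(df), size)]
    with Pool(n_cores) as pool:  # K worker processes; fn must be picklable
        parts = pool.map(_apply_chunk, [(c, fn) for c in chunks])
    return pd.concat(parts)  # reassemble in the original row order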

Only row-wise (perfectly parallelizable) operations are supported, so df.apply(fn, axis=1) is fine, but df.apply(fn, axis=0) is not, because a column-wise function may need rows that live on other workers.

It is assumed that each row takes roughly the same time to process, so the K chunks will finish at more or less the same time.

Future Improvement

Not supported yet, but potentially interesting: allow a number of chunks C > K, so the df is split into C chunks of N/C rows each, and these are handed to the K processes in a round-robin fashion. Right now C = K, so a process that finishes early is not assigned any more work. A sketch of this scheme follows.
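Note that multiprocessing.Pool.map already dispatches pending tasks dynamically, so the only change from the sketch in "How it works" is the number of chunks. parallel_apply_chunked is illustrative, not part of the library.

import pandas as pd
from multiprocessing import Pool

def _apply_chunk(args):
    chunk, fn = args
    return chunk.apply(fn, axis=1)

def parallel_apply_chunked(df, fn, n_cores, n_chunks):
    # C = n_chunks > K = n_cores: Pool.map hands pending chunks to
    # whichever worker frees up first, so an early finisher keeps
    # picking up work instead of idling
    size = -(-len(df) // n_chunks)  # ceil(N / C)
    chunks = [df.iloc[i:i + size] for i in range(0, len(df), size)]
    with Pool(n_cores) as pool:
        parts = pool.map(_apply_chunk, [(c, fn) for c in chunks])
    return pd.concat(parts)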

n_cores semantics

  • n_cores < -1 -> raises an error
  • n_cores == -1 -> uses cpu_count() - 1 cores
  • n_cores == 0 -> uses the serial/standard pandas functions
  • n_cores == 1 -> spawns a single process alongside the main one
  • n_cores > 1 -> spawns n_cores processes and chunks the df
  • n_cores > cpu_count() -> issues a warning
  • n_cores > len(df) -> limits n_cores to len(df)

On CPU-bound tasks (calculations), n_cores = -1 is likely to be fastest. On network-bound operations (e.g., where threads may invoke network calls), using a very high n_cores value may be beneficial.
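Put together, the resolution logic might look like this hypothetical helper (resolve_n_cores is illustrative, not the library's internal name):

import warnings
from multiprocessing import cpu_count

def resolve_n_cores(n_cores: int, df_len: int) -> int:
    if n_cores < -1:
        raise ValueError(f"n_cores must be >= -1, got {n_cores}")
    if n_cores == -1:
        n_cores = cpu_count() - 1  # leave one core for the main process
    elif n_cores > cpu_count():
        warnings.warn(f"n_cores={n_cores} exceeds cpu_count()={cpu_count()}")
    # 0 falls through: the caller uses the plain serial pandas code path
    return min(n_cores, df_len)  # never spawn more workers than rows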

Disclaimers

  • This is an experimental repository. It may lead to unexpected behaviour.

  • Not all of pandas' merging semantics are supported. Pandas has intricate rules for converting the return value of an apply: a series apply function may return a dataframe, a series, a dict, a list, etc., and each of these is converted in its own specific way. Some cases may not be supported.

  • Groupby apply functions are currently much slower than their serial variants. Still experimenting with how to make them faster. The results look correct, just 10-100x slower on some small examples. They may fare better as the dataframe gets bigger.

  • Using n_cores = 1 will create a multiprocessing pool with just one worker, so the code still runs off the main process but may not yield much of a speedup beyond not blocking the main process. This may be useful in some GUI apps.

  • You can use parallelism=multithread or parallelism=multiprocess (the second is the default) for all constructors. Using multiprocess requires the functions to be picklable, though (a lambda, for example, won't work; the function must be defined at module level), as shown in the sketch below.
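For example, assuming the default multiprocess mode (double is an illustrative function; the SeriesParallel call follows the signatures above):

import pandas as pd
from pandas_parallel_apply import SeriesParallel

def double(x):
    # module-level, so it can be pickled and sent to worker processes
    return x * 2

if __name__ == "__main__":
    series = pd.Series(range(100))
    ok = SeriesParallel(series, n_cores=2).apply(double)  # works
    # bad = SeriesParallel(series, n_cores=2).apply(lambda x: x * 2)
    # ^ fails under the default multiprocess mode: lambdas can't be pickled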

That's all.

