multiprocess_chunks
Chunk-based, multiprocess processing of iterables.
Uses the multiprocess package to perform the multiprocessing and the cloudpickle package to pickle hard-to-pickle objects.
Why is this useful?
When using the built-in Python multiprocessing.Pool.map method, the items being iterated are pickled individually. This can lead to a lot of pickling, which can negatively affect performance. This is particularly true, and not necessarily obvious, if extra data is passed into the mapped function f via a lambda. For example:
from multiprocessing import Pool

p = Pool()
d = {...}  # a large dict of some sort
# (illustrative; the built-in Pool cannot actually pickle a lambda -- see the note on pickling below)
p.map(lambda x: x + d[x], [1, 2, 3, ...])
In this case both x and d are pickled, individually, for every item in [1, 2, 3, ...]. The methods in this package divide the [1, 2, 3, ...] list into chunks and pickle each chunk, and d, only a small number of times.
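To make this concrete, here is a minimal sketch of the chunking idea using only the standard library. This is not this package's implementation; chunked and process_chunk are hypothetical helpers. The point is that d is pickled once per chunk rather than once per item:

from multiprocessing import Pool

def chunked(lst, n_chunks):
    # Split lst into n_chunks roughly equal slices.
    size = (len(lst) + n_chunks - 1) // n_chunks
    return [lst[i:i + size] for i in range(0, len(lst), size)]

def process_chunk(args):
    chunk, d = args  # d travels with its chunk, so it is pickled once per chunk
    return [x + d[x] for x in chunk]

if __name__ == "__main__":
    d = {i: i * 10 for i in range(100)}
    items = list(range(100))
    with Pool(4) as p:
        chunk_results = p.map(process_chunk, [(c, d) for c in chunked(items, 4)])
    results = [x for chunk in chunk_results for x in chunk]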
Installation
pip install multiprocess-chunks
Usage
There are two methods to choose from: map_list_as_chunks and map_list_in_chunks.
map_list_as_chunks
This method divides the iterable passed to it into chunks and processes the chunks in parallel, in separate processes. It returns the mapped chunks.
Parameters:
def map_list_as_chunks(l, f, extra_data, cpus=None, max_chunk_size=None)

- l: The iterable to process in multiprocess.
- f: The function that processes each chunk. It takes two parameters: chunk, extra_data.
- extra_data: Data that is passed into f for each chunk.
- cpus: The number of CPUs to use. If None, the number of cores on the system will be used. This value decides how many chunks to create.
- max_chunk_size: Limits the chunk size.
Example:
from multiprocess_chunks import map_list_as_chunks
l = range(0, 10)
f = lambda chunk, ed: [c * ed for c in chunk]
result = map_list_as_chunks(l, f, 5, 2)
# result = [ [0, 5, 10, 15, 20], [25, 30, 35, 40, 45] ]
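The max_chunk_size parameter caps how many items land in each chunk. A hedged illustration follows; the commented output is an assumption, since the exact chunk boundaries depend on the library's chunking logic:

result = map_list_as_chunks(range(0, 10), f, 5, cpus=2, max_chunk_size=3)
# Each inner list now holds at most 3 mapped items, e.g. something like
# [ [0, 5, 10], [15, 20, 25], [30, 35, 40], [45] ] -- boundaries depend on the library.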
map_list_in_chunks
This method divides the iterable passed to it into chunks and processes the chunks in parallel, in separate processes. It then unwinds the processed chunks and returns the processed items.
Parameters:
def map_list_in_chunks(l, f, extra_data)

- l: The iterable to process in multiprocess.
- f: The function that processes each item. It takes two parameters: item, extra_data.
- extra_data: Data that is passed into f for each item.
Example:
from multiprocess_chunks import map_list_in_chunks
l = range(0, 10)
f = lambda item, ed: item * ed
result = map_list_in_chunks(l, f, 5)
# result = [0, 5, 10, 15, 20, 25, 30, 35, 40, 45]
Essentially, map_list_in_chunks gives the same output as multiprocessing.Pool.map but, behind the scenes, it is chunking and being efficient about pickling.
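For comparison, a sketch of the same computation via the built-in Pool.map. Note that the built-in pickling forces a top-level function here (a lambda would fail to pickle), which is exactly the limitation this package's cloudpickle-based approach avoids:

from multiprocessing import Pool

def times_five(item):
    return item * 5

if __name__ == "__main__":
    with Pool() as p:
        same = p.map(times_five, range(0, 10))
    # same = [0, 5, 10, 15, 20, 25, 30, 35, 40, 45]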
A note on pickling
This package uses the pathos package to perform the multiprocessing and the cloudpickle package to perform pickling. This allows it to pickle objects that Python's built-in multiprocessing cannot.
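A small illustration of the difference, using standard cloudpickle usage independent of this package:

import pickle
import cloudpickle

f = lambda x: x + 1
# pickle.dumps(f) raises PicklingError: a lambda has no importable name.
blob = cloudpickle.dumps(f)   # cloudpickle serializes the function by value
g = pickle.loads(blob)        # cloudpickle output loads with plain pickle
assert g(1) == 2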
Performance
How much better will your code perform? There are many factors at play here. The only way to know is to do your own timings.
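A minimal timing sketch to get you started; the workload and sizes here are placeholders, so substitute your own data and function:

import time
from multiprocess_chunks import map_list_in_chunks

d = {i: i for i in range(1_000_000)}
items = list(range(1_000_000))

start = time.perf_counter()
map_list_in_chunks(items, lambda x, ed: x + ed[x], d)
print("map_list_in_chunks took", time.perf_counter() - start, "seconds")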