Hash Chunker
Generator that yields hash chunks for distributed data processing.
TLDR
pip install hash-chunker
from hash_chunker import HashChunker
chunks = list(HashChunker().get_chunks(chunk_size=1000, all_items_count=2000))
assert chunks == [("0000000000", "8000000000"), ("8000000000", "ffffffffff")]
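The pairs in the assertion above look like evenly spaced boundaries over a 10-character hexadecimal hash space. The sketch below reproduces that output under this assumption. It is not the library's implementation: `hash_ranges` and its `hash_length` parameter are hypothetical names used only for illustration.

```python
import math
from typing import Iterator, Tuple


def hash_ranges(
    chunk_size: int, all_items_count: int, hash_length: int = 10
) -> Iterator[Tuple[str, str]]:
    """Yield (start, stop) hex boundaries that split the hash space evenly.

    Hypothetical helper: assumes hashes are lowercase hex strings of
    `hash_length` characters, spread uniformly over the whole space.
    """
    # Number of chunks needed to hold all items, at least one.
    chunk_count = max(1, math.ceil(all_items_count / chunk_size))
    space = 16 ** hash_length
    max_hash = "f" * hash_length
    for i in range(chunk_count):
        start = format(space * i // chunk_count, f"0{hash_length}x")
        if i == chunk_count - 1:
            # The last chunk ends at the largest possible hash value.
            stop = max_hash
        else:
            stop = format(space * (i + 1) // chunk_count, f"0{hash_length}x")
        yield start, stop


print(list(hash_ranges(chunk_size=1000, all_items_count=2000)))
# [('0000000000', '8000000000'), ('8000000000', 'ffffffffff')]
```

Because the boundaries depend only on the two counts, each worker can be handed a range without scanning the data first.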
Description
Imagine a situation where you need to process a huge number of data rows in parallel. Each data row has a hash field, and the task is to use that field for chunking.
Possible reasons for using a hash field rather than an integer id field:
- There is no auto-increment id field.
- The id field has large gaps (1, 2, 3, 100500, 100501, 1000000).
- Chunking by id would split data that must stay in one chunk across different chunks (in user behavioral analytics, the id can be an auto-increment over all users' actions while the user_session hash is tied to a specific user, so chunking by id may spread one user session over several chunks).
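The last point can be illustrated with a small sketch. It assumes the session hash is the first 10 hex characters of an MD5 digest and that ranges are half-open except for the last one, which includes its upper bound. Both choices, along with the names `session_hash`, `in_chunk` and `assign_rows`, are assumptions made for this example and are not taken from the library.

```python
import hashlib
from typing import Dict, List, Tuple

Chunk = Tuple[str, str]
Row = Tuple[int, str]  # (auto-increment id, session identifier)


def session_hash(session_id: str, hash_length: int = 10) -> str:
    """Hypothetical hash field: a prefix of the MD5 hex digest."""
    return hashlib.md5(session_id.encode("utf-8")).hexdigest()[:hash_length]


def in_chunk(value: str, start: str, stop: str, is_last: bool) -> bool:
    """Fixed-length lowercase hex strings compare in numeric order."""
    if is_last:
        return start <= value <= stop
    return start <= value < stop


def assign_rows(rows: List[Row], chunks: List[Chunk]) -> Dict[int, List[int]]:
    """Map each chunk index to the ids of the rows whose hash falls in it."""
    result: Dict[int, List[int]] = {i: [] for i in range(len(chunks))}
    for row_id, session_id in rows:
        value = session_hash(session_id)
        for i, (start, stop) in enumerate(chunks):
            if in_chunk(value, start, stop, i == len(chunks) - 1):
                result[i].append(row_id)
                break
    return result


chunks = [("0000000000", "8000000000"), ("8000000000", "ffffffffff")]
rows = [
    (1, "session-a"),
    (2, "session-b"),
    (3, "session-a"),
    (100500, "session-b"),
    (100501, "session-a"),
]
assignment = assign_rows(rows, chunks)
```

Rows that share a session always share a hash, so they end up in the same chunk regardless of how far apart their ids are.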