Generator that yields hash chunks for distributed data processing.
Project description
Hash Chunker
Generator that yields hash chunks for distributed data processing.
TLDR
pip install hash-chunker
from hash_chunker import HashChunker
chunks = list(HashChunker().get_chunks(chunk_size=1, all_items_count=2))
assert chunks == [("0000000000", "8000000000"), ("8000000000", "ffffffffff")]
Description
Imagine a situation when you need to process huge amount data rows in parallel. Each data row has a hash field and the task is to use it for chunking.
Possible reasons for using hash field and not int id field:
- No auto increment id field.
- Id field has many blank lines (1,2,3, 100500, 100501, 1000000).
- Chunking by id will break data that must be in one chunk to different chunks (in user behavioral analytics id can be autoincrement for all users actions and user_session hash is linked to concrete user, so if we chunk by id one user session may not be in one chunk).
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
hash_chunker-0.1.2.tar.gz
(3.4 kB
view hashes)
Built Distribution
Close
Hashes for hash_chunker-0.1.2-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | c1d9af5285678cdf127b222d75a791645b74436467aa0b2afa802d7890cb1e8a |
|
MD5 | 78aa9e63edd1053e07d01c1a72be245d |
|
BLAKE2b-256 | 85c774288cb744dcd97148195a4f3dd218276adb9b84eda2be0c0995d4f6050b |