Generator that yields hash chunks for distributed data processing.
Hash Chunker
TLDR
pip install hash-chunker
from hash_chunker import HashChunker

chunks = list(HashChunker().get_chunks(chunk_size=1000, all_items_count=2000))
assert chunks == [("", "8000000000"), ("8000000000", "ffffffffff")]

# Use the chunks as tasks for multiprocessing: each chunk is a (from_hash, to_hash) pair.
query_part = "hash > :from_hash AND hash <= :to_hash"
for chunk in chunks:
    params = {"from_hash": chunk[0], "to_hash": chunk[1]}
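To show how one chunk might drive a query, here is a minimal sketch that runs the query fragment above against an in-memory SQLite database. The `events` table, its columns, and the MD5-derived 10-character hashes are assumptions made for illustration. The chunk boundaries are copied from the assertion above, so the sketch runs without hash-chunker installed.

```python
import hashlib
import sqlite3

# Hypothetical table: 2000 rows, each with a 10-character lowercase hex hash.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, hash TEXT)")
conn.executemany(
    "INSERT INTO events (id, hash) VALUES (?, ?)",
    [(i, hashlib.md5(str(i).encode()).hexdigest()[:10]) for i in range(2000)],
)

# Chunk boundaries copied from the TLDR assertion.
chunks = [("", "8000000000"), ("8000000000", "ffffffffff")]

query = (
    "SELECT COUNT(*) FROM events "
    "WHERE hash > :from_hash AND hash <= :to_hash"
)

# Each iteration stands in for one independent task; a real pipeline
# would hand each chunk to a separate worker.
counts = []
for chunk in chunks:
    params = {"from_hash": chunk[0], "to_hash": chunk[1]}
    counts.append(conn.execute(query, params).fetchone()[0])

# Every row falls into exactly one chunk.
total = sum(counts)
```

Because the lower bound is exclusive and the upper bound inclusive, adjacent chunks never overlap and no row is skipped.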
Description
Imagine you need to process a huge number of data rows in parallel. Each data row has a hash field, and the task is to use that field for chunking.
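One way to picture chunking by hash is to split the space of fixed-width hex hashes into equal ranges. The sketch below is an illustration only and is not taken from hash-chunker's source. The function name `sketch_hash_chunks` and the `hex_digits` parameter are hypothetical. It assumes hashes are lowercase hex strings, spread roughly uniformly.

```python
import math


def sketch_hash_chunks(chunk_size, all_items_count, hex_digits=10):
    """Yield (from_hash, to_hash) pairs covering the whole hex hash space.

    Hypothetical helper: evenly spaced boundaries, exclusive lower bound,
    inclusive upper bound.
    """
    chunks_count = math.ceil(all_items_count / chunk_size)
    space = 16 ** hex_digits  # number of distinct hashes of this width
    previous = ""  # the empty string sorts before every hash
    for i in range(1, chunks_count + 1):
        if i == chunks_count:
            upper = "f" * hex_digits  # last chunk ends at the maximum hash
        else:
            upper = format(space * i // chunks_count, f"0{hex_digits}x")
        yield previous, upper
        previous = upper


example = list(sketch_hash_chunks(chunk_size=1000, all_items_count=2000))
```

With these assumptions, `example` has the same two boundaries as the TLDR assertion. Chunks hold about `chunk_size` rows only if the hashes are evenly distributed.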
Possible reasons for chunking by a hash field rather than an integer id field:
- There is no auto-increment id field.
- The id field has large gaps (1, 2, 3, 100500, 100501, 1000000), so equal id ranges hold very different numbers of rows.
- Chunking by id would split data that must stay together across different chunks. In user behavioral analytics, for example, the id may auto-increment across all users' actions while the user_session hash is tied to a specific user, so chunking by id can spread one user session over several chunks.
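The last point can be checked on a toy example. This is a minimal sketch, assuming hypothetical event rows whose hash is derived from the session name with MD5. The id ranges and helper names are made up for illustration.

```python
import hashlib

# Hypothetical events: ids auto-increment across all users,
# so the three sessions interleave.
events = [{"id": i, "session": f"session-{i % 3}"} for i in range(1, 13)]
for event in events:
    digest = hashlib.md5(event["session"].encode()).hexdigest()
    event["hash"] = digest[:10]

hash_chunks = [("", "8000000000"), ("8000000000", "ffffffffff")]
id_chunks = [(0, 6), (6, 12)]  # made-up id ranges, same (low, high] rule


def chunk_index(value, chunks):
    """Return the position of the chunk whose (low, high] range holds value."""
    for index, (low, high) in enumerate(chunks):
        if low < value <= high:
            return index
    raise ValueError(f"{value!r} is outside every chunk")


def chunks_per_session(key, chunks):
    """Map each session to the set of chunk indexes its events land in."""
    seen = {}
    for event in events:
        index = chunk_index(event[key], chunks)
        seen.setdefault(event["session"], set()).add(index)
    return seen


by_hash = chunks_per_session("hash", hash_chunks)
by_id = chunks_per_session("id", id_chunks)
```

Here every session in `by_hash` maps to a single chunk, because all of its events share one hash. In `by_id`, each session is spread over both chunks.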