Generator that yields hash chunks for distributed data processing.
Python Hash Chunker
TLDR
# pip install hash-chunker
from hash_chunker import HashChunker
chunks = list(HashChunker().get_chunks(chunk_size=1000, all_items_count=2000))
assert chunks == [("", "8000000000"), ("8000000000", "ffffffffff")]
# or
hash_chunker = HashChunker(chunk_hash_length=3)
chunks = list(hash_chunker.get_chunks(500, 1500))
assert chunks == [('', '555'), ('555', 'aaa'), ('aaa', 'fff')]
# or
chunks = list(HashChunker().get_fixed_chunks(2))
assert chunks == [("", "8000000000"), ("8000000000", "ffffffffff")]
# use chunks as tasks for multiprocessing: each chunk bounds one query
query_part = "hash > :from_hash AND hash <= :to_hash"
for chunk in chunks:
    params = {"from_hash": chunk[0], "to_hash": chunk[1]}
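To make the query fragment above concrete, here is a minimal sketch that runs it against an in-memory SQLite table. The `items` table, the helper names, and the SHA-256 test data are hypothetical, and the chunk boundaries are copied from the `chunk_hash_length=3` example rather than produced by the library. Comparing on a 3-character prefix of the hash (instead of the full hash) is an assumption made here so that every row of the toy data falls into exactly one range.

```python
import hashlib
import sqlite3

# Boundaries copied from the README example for
# HashChunker(chunk_hash_length=3).get_chunks(500, 1500).
CHUNKS = [("", "555"), ("555", "aaa"), ("aaa", "fff")]
PREFIX_LEN = 3  # assumed to match chunk_hash_length above


def build_demo_db(row_count):
    """Create a hypothetical in-memory table with one hex digest per row."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, hash TEXT NOT NULL)")
    rows = [(i, hashlib.sha256(str(i).encode()).hexdigest()) for i in range(row_count)]
    conn.executemany("INSERT INTO items (id, hash) VALUES (?, ?)", rows)
    return conn


def count_rows_in_chunk(conn, chunk):
    """Count rows whose hash prefix lies in the half-open range (from_hash, to_hash]."""
    query = (
        "SELECT COUNT(*) FROM items "
        f"WHERE substr(hash, 1, {PREFIX_LEN}) > :from_hash "
        f"AND substr(hash, 1, {PREFIX_LEN}) <= :to_hash"
    )
    params = {"from_hash": chunk[0], "to_hash": chunk[1]}
    return conn.execute(query, params).fetchone()[0]


conn = build_demo_db(1500)
counts = [count_rows_in_chunk(conn, chunk) for chunk in CHUNKS]
```

Each call to `count_rows_in_chunk` is independent of the others, so in a real job each chunk could be handed to a separate worker process with its own database connection.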
Description
Imagine you need to process a huge number of data rows in parallel. Each row has a hash field, and the task is to use that field for chunking.
Possible reasons for chunking by a hash field rather than an integer id field:
- There is no auto-increment id field.
- The id field has large gaps (1, 2, 3, 100500, 100501, 1000000).
- Chunking by id can split data that must stay together across different chunks. In user behavioral analytics, for example, the id may auto-increment across all users' actions while the user_session hash is tied to a specific user, so chunking by id may spread one user session over several chunks.
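The last point can be illustrated with a small, self-contained sketch. The user names, the `session_hash` helper, and the two-way split are made up for this example and do not come from the library. It interleaves events from four users under one auto-increment id and checks how many sessions end up in more than one chunk under each strategy.

```python
import hashlib
from collections import defaultdict


def session_hash(user):
    """Hypothetical session hash: first 10 hex characters of a SHA-256 digest."""
    return hashlib.sha256(user.encode()).hexdigest()[:10]


# Ids auto-increment across all users, so each user's events are interleaved.
users = ["alice", "bob", "carol", "dave"]
events = [
    (event_id, session_hash(users[event_id % len(users)]))
    for event_id in range(1, 13)
]


def chunks_per_session(events, chunk_of):
    """Map each session hash to the set of chunk labels its events fall into."""
    seen = defaultdict(set)
    for event_id, s_hash in events:
        seen[s_hash].add(chunk_of(event_id, s_hash))
    return seen


# Chunk by id: ids 1-6 go to chunk 0, ids 7-12 go to chunk 1.
by_id = chunks_per_session(events, lambda event_id, s_hash: 0 if event_id <= 6 else 1)

# Chunk by hash: ranges ("", "8000000000"] and ("8000000000", "ffffffffff"].
by_hash = chunks_per_session(
    events, lambda event_id, s_hash: 0 if s_hash <= "8000000000" else 1
)

split_by_id = sum(len(labels) > 1 for labels in by_id.values())
split_by_hash = sum(len(labels) > 1 for labels in by_hash.values())
print(split_by_id, split_by_hash)
```

Because the chunk label depends only on the session hash in the second strategy, all events of one session necessarily land in the same chunk, while the id-based split cuts through every interleaved session in this toy data.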
Installation
The recommended way to install Hash Chunker is with pip.
pip install hash-chunker
Usage
Import Hash Chunker.
from hash_chunker import HashChunker
Create a class instance.
hash_chunker = HashChunker()
# or use the chunk_hash_length keyword argument to limit the length of generated hashes
hash_chunker = HashChunker(chunk_hash_length=3)
Get chunks by providing chunk_size and all_items_count.
chunks = list(hash_chunker.get_chunks(chunk_size=500, all_items_count=1500))
# or pass the arguments positionally
chunks = list(hash_chunker.get_chunks(500, 1500))
# or use yielded chunks in loop
for chunk in hash_chunker.get_chunks(500, 1500):
    print(chunk)
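To give a feel for how chunk_size and all_items_count could translate into hash boundaries, here is a minimal sketch. It is not taken from the hash-chunker source: the function name `sketch_chunks`, the rounding rule, and the assumption that the number of chunks is `ceil(all_items_count / chunk_size)` are all illustrative. It does reproduce the two outputs shown in the TLDR examples of this README.

```python
import math


def sketch_chunks(chunk_size, all_items_count, hash_length=10):
    """Yield (from_hash, to_hash) pairs that split the hex hash space evenly.

    Illustrative only: an assumed scheme, not the library's implementation.
    """
    chunk_count = max(1, math.ceil(all_items_count / chunk_size))
    space = 16 ** hash_length  # number of distinct hex strings of this length
    previous = ""
    for i in range(1, chunk_count + 1):
        # Evenly spaced upper bound, capped at the largest hash ("ff...f").
        upper = min(space * i // chunk_count, space - 1)
        current = f"{upper:0{hash_length}x}"
        yield previous, current
        previous = current


default_chunks = list(sketch_chunks(chunk_size=1000, all_items_count=2000))
short_chunks = list(sketch_chunks(500, 1500, hash_length=3))
```

The sketch relies on hashes being roughly uniformly distributed, so that equal slices of the hash space hold roughly equal numbers of rows; actual chunk sizes will vary around chunk_size.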
Support
You may report bugs, ask for help, and discuss various other issues on the bug tracker.
File details
Details for the file hash_chunker-0.1.9.tar.gz.
File metadata
- Download URL: hash_chunker-0.1.9.tar.gz
- Upload date:
- Size: 4.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.2.1 CPython/3.10.4 Linux/5.15.0-48-generic
File hashes
Algorithm | Hash digest
---|---
SHA256 | 3719be108c0a5e986cf82f5bb4964e4987ba42e64e3716bcc0664e1248fb0e90
MD5 | b875a269d443ae775fd2e70e6ae142df
BLAKE2b-256 | 657e04eaa65531882dd279982c0f2a89917b2a51873a45479d12386c497b94b2
File details
Details for the file hash_chunker-0.1.9-py3-none-any.whl.
File metadata
- Download URL: hash_chunker-0.1.9-py3-none-any.whl
- Upload date:
- Size: 4.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.2.1 CPython/3.10.4 Linux/5.15.0-48-generic
File hashes
Algorithm | Hash digest
---|---
SHA256 | b800c57735475569c108b0abe086c65868b5e69e258ae3d12b7284c09d1dba6e
MD5 | 2c11f84618b947c475e3f029f9df9e89
BLAKE2b-256 | b0493f37ae4a1ffc1c5e1ea36e5350d82c1a4a2890dfca3f17135c3068a4d53d