Python Hash Chunker

Generator that yields hash chunks for distributed data processing.

TLDR

# pip install hash-chunker
from hash_chunker import HashChunker

chunks = list(HashChunker().get_chunks(chunk_size=1000, all_items_count=2000))
assert chunks == [("", "8000000000"), ("8000000000", "ffffffffff")]

# or
hash_chunker = HashChunker(chunk_hash_length=3)
chunks = list(hash_chunker.get_chunks(500, 1500))
assert chunks == [('', '555'), ('555', 'aaa'), ('aaa', 'fff')]

# or
chunks = list(HashChunker().get_fixed_chunks(2))
assert chunks == [("", "8000000000"), ("8000000000", "ffffffffff")]

# use chunks as tasks for multiprocessing
query_part = "hash > :from_hash AND hash <= :to_hash"
for chunk in chunks:
    # each chunk is a (from_hash, to_hash) pair
    params = {"from_hash": chunk[0], "to_hash": chunk[1]}
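
The range query above can be tried end to end with a small sketch. It assumes an in-memory SQLite table named `events` with a `hash` column filled with 10-character hex prefixes of MD5 digests; the table, the column, and the data are hypothetical and are not part of Hash Chunker. The chunk boundaries are copied from the example above, so the sketch runs without the package installed.

```python
import hashlib
import sqlite3

# Chunk boundaries copied from the example above (two chunks, 10-character hashes).
chunks = [("", "8000000000"), ("8000000000", "ffffffffff")]

# Hypothetical table: 1000 rows, each with a 10-character hex hash.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, hash TEXT)")
rows = [(i, hashlib.md5(str(i).encode()).hexdigest()[:10]) for i in range(1000)]
conn.executemany("INSERT INTO events VALUES (?, ?)", rows)

# Same predicate as query_part above, with sqlite3 named parameters.
query = "SELECT COUNT(*) FROM events WHERE hash > :from_hash AND hash <= :to_hash"

counts = []
for from_hash, to_hash in chunks:
    params = {"from_hash": from_hash, "to_hash": to_hash}
    counts.append(conn.execute(query, params).fetchone()[0])

# Every row falls into exactly one chunk, so the counts add up to the table size.
print(counts, sum(counts))
```

Because each range excludes its lower bound and includes its upper bound, neighbouring chunks do not overlap, and in a real setup each `params` dictionary could be handed to a separate worker.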

Description

Imagine that you need to process a huge number of data rows in parallel. Each data row has a hash field, and the task is to use that field for chunking.

Possible reasons for using a hash field instead of an integer id field:

  • There is no auto-increment id field.
  • The id field has many gaps (1, 2, 3, 100500, 100501, 1000000).
  • Chunking by id would split data that must stay in one chunk across different chunks. For example, in user behavioral analytics the id can be an auto-increment across all users' actions, while the user_session hash is tied to a specific user, so chunking by id may spread one user session over several chunks.
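
The last point can be illustrated with a small sketch. The events, session names, and the `chunk_index` helper are hypothetical and only serve the illustration; the hash boundaries are copied from the TLDR example for `chunk_hash_length=3`.

```python
import hashlib

# Boundaries copied from the TLDR example for chunk_hash_length=3.
chunks = [("", "555"), ("555", "aaa"), ("aaa", "fff")]


def chunk_index(row_hash, chunks):
    """Return the index of the chunk whose (from, to] range contains row_hash."""
    for index, (from_hash, to_hash) in enumerate(chunks):
        if from_hash < row_hash <= to_hash:
            return index
    raise ValueError(f"hash {row_hash!r} is outside all chunks")


# Hypothetical events: ids auto-increment across all sessions,
# so events of one session are interleaved with the others.
sessions = ["alice-1", "bob-1", "carol-1"]
events = []
for event_id in range(1, 10):
    session = sessions[event_id % 3]
    session_hash = hashlib.md5(session.encode()).hexdigest()[:3]
    events.append({"id": event_id, "session": session, "session_hash": session_hash})

hash_chunks_by_session = {}
id_chunks_by_session = {}
for event in events:
    # Chunking by session hash: the chunk depends only on the session.
    hash_chunks_by_session.setdefault(event["session"], set()).add(
        chunk_index(event["session_hash"], chunks)
    )
    # Chunking by id in ranges of three ids: 1-3, 4-6, 7-9.
    id_chunks_by_session.setdefault(event["session"], set()).add(
        (event["id"] - 1) // 3
    )

print(hash_chunks_by_session)  # one chunk index per session
print(id_chunks_by_session)    # several chunk indexes per session
```

With hash-based chunking every session maps to a single chunk, while the id ranges split each of these sessions across all three chunks.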

Installation

The recommended way to install Hash Chunker is with pip.

pip install hash-chunker

Usage

Import Hash Chunker.

from hash_chunker import HashChunker

Create class instance.

hash_chunker = HashChunker()

# or use the chunk_hash_length keyword argument to limit the length of generated hashes
hash_chunker = HashChunker(chunk_hash_length=3)

Get chunks by providing chunk_size and all_items_count.

chunks = list(hash_chunker.get_chunks(chunk_size=500, all_items_count=1500))

# or pass the arguments positionally, without names
chunks = list(hash_chunker.get_chunks(500, 1500))

# or iterate over the yielded chunks in a loop
for chunk in hash_chunker.get_chunks(500, 1500):
    print(chunk)
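
To build intuition for where the boundaries come from, here is a minimal sketch that splits the hexadecimal hash space into equal ranges. The function `sketch_chunks` is hypothetical and is not the library's implementation; it assumes the number of chunks is `all_items_count / chunk_size` rounded up, and it is only checked against the two outputs shown in the TLDR section.

```python
import math


def sketch_chunks(chunk_size, all_items_count, hash_length=10):
    """Yield (from_hash, to_hash) pairs that split the hex space evenly.

    Hypothetical re-creation for illustration only, not Hash Chunker's code.
    """
    chunk_count = math.ceil(all_items_count / chunk_size)
    space = 16 ** hash_length  # number of distinct hex strings of this length
    previous = ""
    for i in range(1, chunk_count + 1):
        if i == chunk_count:
            # The last chunk ends at the largest possible hash.
            current = "f" * hash_length
        else:
            # Upper bound of chunk i, zero-padded to hash_length hex digits.
            current = format(space * i // chunk_count, f"0{hash_length}x")
        yield previous, current
        previous = current


print(list(sketch_chunks(1000, 2000)))
print(list(sketch_chunks(500, 1500, hash_length=3)))
```

For inputs other than the two documented ones, the real `get_chunks` may round boundaries differently, so treat this only as an explanation of the general idea.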

Support

You may report bugs, ask for help, and discuss various other issues on the bug tracker.
