Badger Batcher
Badger Batcher contains useful utilities for batching a sequence of records
Free software: MIT license
Documentation: https://badger-batcher.readthedocs.io.
Installation
$ pip install badger_batcher
Features
Import Batcher:
>>> from badger_batcher import Batcher
Split records based on a max limit for batch length:
>>> records = [f"record: {rec}" for rec in range(5)]
>>> batcher = Batcher(records, max_batch_size=2)
>>> batcher.batches()
[['record: 0', 'record: 1'], ['record: 2', 'record: 3'], ['record: 4']]
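Under the hood this behaves like greedy chunking by record count. A minimal sketch of the idea (not the library's actual implementation; `chunk_by_count` is a hypothetical name used only for illustration):

```python
def chunk_by_count(records, max_batch_size):
    # Greedily fill each batch until it holds max_batch_size records,
    # then start a new one; the last batch may be shorter.
    batches, current = [], []
    for rec in records:
        current.append(rec)
        if len(current) == max_batch_size:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches

records = [f"record: {rec}" for rec in range(5)]
print(chunk_by_count(records, 2))
# [['record: 0', 'record: 1'], ['record: 2', 'record: 3'], ['record: 4']]
```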
Split records with a max limit for batch length and a max limit for record size:
>>> records = [b"aaaa", b"bb", b"ccccc", b"d"]
>>> batcher = Batcher(
... records,
... max_batch_size=2,
... max_record_size=4,
... size_calc_fn=len,
... when_record_size_exceeded="skip",
... )
>>> batcher.batches()
[[b'aaaa', b'bb'], [b'd']]
Split records with both a max batch length and a max batch size:
>>> records = [b"a", b"a", b"a", b"b", b"ccc", b"toolargeforbatch", b"dd", b"e"]
>>> batcher = Batcher(
... records,
... max_batch_len=3,
... max_batch_size=5,
... size_calc_fn=len,
... when_record_size_exceeded="skip",
... )
>>> batcher.batches()
[[b'a', b'a', b'a'], [b'b', b'ccc'], [b'dd', b'e']]
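The rules at work in this example can be sketched as a single greedy pass. This is a hypothetical reimplementation for illustration, not the library's actual code, and it assumes the semantics the output suggests: records over the record-size limit are skipped, and a batch is closed when adding a record would exceed either the record count or the cumulative size cap:

```python
def batch_with_limits(records, max_batch_len, max_batch_size,
                      max_record_size, size_calc_fn=len):
    # Greedily pack records into batches capped by both record count
    # (max_batch_len) and total size (max_batch_size).
    batches, current, current_size = [], [], 0
    for rec in records:
        rec_size = size_calc_fn(rec)
        if rec_size > max_record_size:
            continue  # the "skip" policy drops oversized records
        if current and (len(current) >= max_batch_len
                        or current_size + rec_size > max_batch_size):
            batches.append(current)
            current, current_size = [], 0
        current.append(rec)
        current_size += rec_size
    if current:
        batches.append(current)
    return batches

records = [b"a", b"a", b"a", b"b", b"ccc", b"toolargeforbatch", b"dd", b"e"]
print(batch_with_limits(records, max_batch_len=3, max_batch_size=5,
                        max_record_size=5))
# [[b'a', b'a', b'a'], [b'b', b'ccc'], [b'dd', b'e']]
```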
When processing big chunks of data, consider iterating instead:
>>> import sys
>>> records = (f"record: {rec}" for rec in range(sys.maxsize))
>>> batcher = Batcher(records, max_batch_size=2)
>>> for batch in batcher:
... # do something for each batch
... some_fancy_fn(batch)
If you need to encode records, encode them before passing them to Batcher. Batcher will not eagerly realize the whole iterable, so use a generator for larger inputs.
>>> records = ["a", "a", "a", "b", "ccc", "bbbb", "dd", "e"]
>>> encoded_records_gen = (record.encode("utf-16-le") for record in records)
>>> batcher = Batcher(
... encoded_records_gen,
... max_batch_len=3,
... max_record_size=6,
... max_batch_size=10,
... size_calc_fn=len,
... when_record_size_exceeded="skip",
... )
>>> batched_records = batcher.batches()
>>> batched_records
[
[b"a\x00", b"a\x00", b"a\x00"],
[b"b\x00", b"c\x00c\x00c\x00"],
[b"d\x00d\x00", b"e\x00"],
]
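The examples above only ever pass `size_calc_fn=len`, but the parameter appears to accept any callable that maps a record to a size. Assuming that holds, you could for instance batch the original strings while budgeting by their encoded byte length; the `utf16_size` helper below is purely illustrative:

```python
def utf16_size(record: str) -> int:
    # Size a str record by its UTF-16-LE encoded byte length rather
    # than its character count (each BMP character takes 2 bytes).
    return len(record.encode("utf-16-le"))

print(utf16_size("ccc"))  # 6
```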
Full example, e.g. for Kinesis Streams-like processing:
import random

from badger_batcher import Batcher


def get_records():
    records = (
        f"""{{'id': '{i}', 'body': {('x' * random.randint(100_000, 7_000_000))}}}"""
        for i in range(10_000)
    )
    return records


records = get_records()
encoded_records = (record.encode("utf-8") for record in records)

batcher = Batcher(
    encoded_records,
    max_batch_len=500,
    max_record_size=1000 * 1000,
    max_batch_size=5 * 1000 * 1000,
    size_calc_fn=len,
    when_record_size_exceeded="skip",
)

for i, batch in enumerate(batcher):
    # do something with each batch
    ...
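To actually ship each batch to Kinesis, the records need to be shaped into the entry format that the boto3 Kinesis client's `put_records` call expects. A sketch of that step (the helper name is hypothetical, and the random partition keys are just one possible choice):

```python
import uuid

def to_put_records_entries(batch):
    # put_records takes Records=[{"Data": <bytes>, "PartitionKey": <str>}, ...];
    # here every record gets a random partition key to spread load across shards.
    return [{"Data": rec, "PartitionKey": str(uuid.uuid4())} for rec in batch]

entries = to_put_records_entries([b"rec-1", b"rec-2"])
print([e["Data"] for e in entries])
# [b'rec-1', b'rec-2']
```

Inside the loop above, each batch could then be sent with something like `client.put_records(StreamName="my-stream", Records=to_put_records_entries(batch))`, where `client = boto3.client("kinesis")` and `"my-stream"` is a placeholder stream name.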
Credits
This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.
History
0.1.0 (2021-04-09)
First release on PyPI.