Python bindings for the HeavyKeeper algorithm implemented in Rust

heavykeeper

Python bindings for the HeavyKeeper algorithm - a fast, memory-efficient sketch-based algorithm for finding the top-K most frequent items in data streams.

Overview

HeavyKeeper is a probabilistic data structure that identifies the most frequent items in a data stream using minimal memory. This implementation provides Python bindings for a high-performance Rust implementation of the algorithm.

Key Features

  • 🚀 High Performance: Rust-based implementation with Python bindings via PyO3
  • 💾 Memory Efficient: Uses probabilistic sketching to track millions of items with minimal memory
  • 🎯 Top-K Tracking: Efficiently maintains the K most frequent items
  • 🔄 Stream Processing: Designed for continuous data streams
  • 📊 Approximate Counts: Provides estimated frequencies with high accuracy
  • 🧪 Battle Tested: Includes comprehensive benchmarks and tests

Use Cases

  • Log Analysis: Find the most frequent IP addresses, user agents, or error messages
  • Text Processing: Identify the most common words in large documents
  • Network Monitoring: Track heavy hitters in network traffic
  • Clickstream Analysis: Find the most popular pages or user actions
  • Time Series Data: Monitor frequently occurring events or anomalies

Installation

From Source (Development)

# Clone the repository
git clone https://github.com/pmcgleenon/heavykeeper-py.git
cd heavykeeper-py

# Install Rust (if not already installed)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Build and install the Python package (requires maturin, e.g. pip install maturin)
maturin develop

# Or build a wheel
maturin build --release

Requirements

  • Python 3.11+
  • Rust toolchain (for building from source)

Quick Start

from heavykeeper import HeavyKeeper

# Create a HeavyKeeper instance
# k=100: track top 100 items
# width=2048: sketch width (affects accuracy)
# depth=8: number of hash functions (affects accuracy)
# decay=0.9: aging factor for old items
hk = HeavyKeeper(k=100, width=2048, depth=8, decay=0.9)

# Add items to the stream
items = ["apple", "banana", "apple", "cherry", "apple", "banana"]
for item in items:
    hk.add(item)

# Query individual items
print(f"Is 'apple' in top-K? {hk.query('apple')}")
print(f"Estimated count for 'apple': {hk.count('apple')}")

# Get all top-K items
top_items = hk.list()  # Returns list of (item, count) tuples
print("Top items:", top_items)

# Get as dictionary
top_dict = hk.get_topk()  # Returns {item: count} dictionary
print("Top items dict:", top_dict)

API Reference

HeavyKeeper(k, width, depth, decay)

Creates a new HeavyKeeper instance.

Parameters:

  • k (int): Number of top items to track
  • width (int): Width of the sketch (number of buckets)
  • depth (int): Depth of the sketch (number of hash functions)
  • decay (float): Decay factor for aging items (between 0.0 and 1.0)

Methods

add(item: str) -> None

Add an item to the sketch.

query(item: str) -> bool

Check if an item is being tracked in the top-K list.

count(item: str) -> int

Get the estimated count for an item (returns 0 if not tracked).

list() -> List[Tuple[str, int]]

Get the top-K items as a list of (item, count) tuples, sorted by count.

get_topk() -> Dict[str, int]

Get the top-K items as a dictionary mapping items to counts.

len() -> int

Get the current number of items being tracked.

is_empty() -> bool

Check if the sketch is empty.
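
When checking a sketch's output, it can help to compare against exact counts on a small sample. The class below is a hypothetical reference helper (ExactTopK is not part of the package) that mirrors the method names listed above using collections.Counter. It is a minimal sketch for testing, not a description of how the library works internally.

```python
from collections import Counter
from typing import Dict, List, Tuple


class ExactTopK:
    """Exact top-K counter with HeavyKeeper-style method names (hypothetical helper)."""

    def __init__(self, k: int) -> None:
        self.k = k
        self._counts: Counter = Counter()

    def add(self, item: str) -> None:
        # Exact counting: memory grows with the number of distinct items.
        self._counts[item] += 1

    def list(self) -> List[Tuple[str, int]]:
        # The k most frequent items, highest count first.
        return self._counts.most_common(self.k)

    def get_topk(self) -> Dict[str, int]:
        return dict(self.list())

    def query(self, item: str) -> bool:
        # True only if the item is currently among the exact top k.
        return item in self.get_topk()

    def count(self, item: str) -> int:
        # Mirrors the documented behaviour: 0 when the item is not tracked.
        return self.get_topk().get(item, 0)


exact = ExactTopK(k=2)
for item in ["apple", "banana", "apple", "cherry", "apple", "banana"]:
    exact.add(item)
print(exact.list())
```

Feeding the same items to both ExactTopK and a HeavyKeeper instance, then comparing the get_topk() results, gives a quick sense of how close the estimates are for a given width, depth, and decay.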

Benchmarking

The repository includes a simple script for performance testing:

Word Count Benchmark

# Basic benchmark with a text file
python benchmark_wordcount.py -k 10 -f data/war_and_peace.txt --time
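
The contents of benchmark_wordcount.py are not reproduced here. As a rough idea of what a word-count timing run involves, the following hypothetical helper (time_word_adds and SupportsAdd are names invented for this example) feeds every word of a text to any object with an add(item) method and reports the elapsed time. The tokenization rule is an assumption, not the script's actual behaviour.

```python
import re
import time
from typing import Iterable, Protocol, Tuple


class SupportsAdd(Protocol):
    """Anything with an add(item) method, e.g. a HeavyKeeper instance."""

    def add(self, item: str) -> None: ...


def time_word_adds(sketch: SupportsAdd, lines: Iterable[str]) -> Tuple[int, float]:
    """Feed each lowercase word to sketch.add; return (word_count, seconds)."""
    words = 0
    start = time.perf_counter()
    for line in lines:
        # Assumed tokenization: runs of ASCII letters and apostrophes.
        for word in re.findall(r"[a-z']+", line.lower()):
            sketch.add(word)
            words += 1
    return words, time.perf_counter() - start
```

With the package installed, a call such as time_word_adds(HeavyKeeper(k=10, width=2048, depth=8, decay=0.9), open(path, encoding="utf-8")) would time the add path on a file of your choice.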

Parameter Tuning

Choosing Parameters

  • k: Set to the number of top items you need
  • width: Larger values improve accuracy but use more memory (try 1024-8192)
  • depth: More hash functions improve accuracy (try 4-16)
  • decay: Controls how quickly old items are forgotten (0.8-0.99)
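
To make the roles of width, depth, and decay concrete, here is a simplified pure-Python illustration of the HeavyKeeper update rule as it is commonly described. It is a teaching sketch, not the package's Rust implementation: the class name ToyHeavyKeeper, the BLAKE2b hashing, storing the item itself instead of a fingerprint, and the assumption that a colliding bucket is decremented with probability decay ** count are all choices made for this example.

```python
import hashlib
import random


class ToyHeavyKeeper:
    """Simplified illustration of HeavyKeeper (assumptions noted in comments)."""

    def __init__(self, k, width, depth, decay, seed=0):
        self.k, self.width, self.depth, self.decay = k, width, depth, decay
        # depth rows of width buckets; each bucket is [item, count].
        self.rows = [[[None, 0] for _ in range(width)] for _ in range(depth)]
        self.topk = {}  # item -> estimated count, at most k entries
        self.rng = random.Random(seed)

    def _bucket_index(self, row, item):
        # One hash function per row, derived by prefixing the row number.
        digest = hashlib.blake2b(f"{row}:{item}".encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.width

    def add(self, item):
        estimate = 0
        for row in range(self.depth):
            bucket = self.rows[row][self._bucket_index(row, item)]
            if bucket[1] == 0:
                bucket[0], bucket[1] = item, 1      # empty bucket: claim it
            elif bucket[0] == item:
                bucket[1] += 1                      # same item: count up
            elif self.rng.random() < self.decay ** bucket[1]:
                # Assumed rule: large counts are exponentially harder to decay.
                bucket[1] -= 1
                if bucket[1] == 0:
                    bucket[0], bucket[1] = item, 1  # incumbent evicted
            if bucket[0] == item:
                estimate = max(estimate, bucket[1])
        self._update_topk(item, estimate)

    def _update_topk(self, item, estimate):
        if estimate == 0 or self.k <= 0:
            return
        if item in self.topk or len(self.topk) < self.k:
            self.topk[item] = estimate
            return
        weakest = min(self.topk, key=self.topk.get)
        if estimate > self.topk[weakest]:
            del self.topk[weakest]
            self.topk[item] = estimate

    def list(self):
        return sorted(self.topk.items(), key=lambda pair: pair[1], reverse=True)


toy = ToyHeavyKeeper(k=3, width=1024, depth=4, decay=0.9)
for item in ["apple", "banana", "apple", "cherry", "apple", "banana"]:
    toy.add(item)
print(toy.list())
```

In this toy version a wider sketch means fewer items share a bucket, and a deeper one gives each item more rows in which it might hold a bucket uncontested, which is why both parameters are described as improving accuracy.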

Memory Usage

Approximate memory usage: width × depth × 16 bytes + k × (item_size + 16 bytes)

For typical usage (width=2048, depth=8, k=100):

  • Sketch: ~262 KB (2048 × 8 × 16 = 262,144 bytes)
  • Top-K storage: depends on item sizes
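
The approximation above is easy to evaluate for a candidate configuration. The helper below is a hypothetical convenience function (estimate_memory_bytes and avg_item_size are names chosen for this example) that simply applies the formula; actual memory use of the Rust implementation may differ.

```python
def estimate_memory_bytes(width: int, depth: int, k: int, avg_item_size: int) -> int:
    """Apply the approximation: width * depth * 16 + k * (item_size + 16)."""
    sketch_bytes = width * depth * 16          # the sketch buckets
    topk_bytes = k * (avg_item_size + 16)      # the top-K list
    return sketch_bytes + topk_bytes


# Typical configuration from this section, assuming 32-byte items on average.
total = estimate_memory_bytes(width=2048, depth=8, k=100, avg_item_size=32)
print(f"{total} bytes")  # 266944 bytes
```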

Accuracy vs Performance

  • Higher width and depth → better accuracy, more memory
  • Lower decay → faster adaptation to changes, less stability
  • Higher k → more items tracked, slightly more overhead

Development

Building

# Development build
maturin develop

# Release build
maturin build --release

# Build with debugging
maturin develop --debug

Testing

# Run the test suite
python test_heavykeeper.py

# Run benchmarks
python benchmark_wordcount.py -k 10 -f test_file.txt

Project Structure

heavykeeper-py/
├── src/
│   └── lib.rs          # Rust implementation and Python bindings
├── benchmark_*.py      # Performance benchmarks
├── test_heavykeeper.py # Test suite
├── Cargo.toml          # Rust dependencies
├── pyproject.toml      # Python package configuration
└── README.md           # This file

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • Based on the HeavyKeeper algorithm
  • Built with PyO3 for Rust-Python interoperability
  • Uses Maturin for building Python extensions

Download files

Source Distribution

  • heavykeeper-0.2.2.tar.gz (1.2 MB): Source

Built Distributions

  • heavykeeper-0.2.2-cp312-cp312-win_amd64.whl (152.0 kB): CPython 3.12, Windows x86-64
  • heavykeeper-0.2.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (296.9 kB): CPython 3.12, manylinux: glibc 2.17+ x86-64
  • heavykeeper-0.2.2-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (287.1 kB): CPython 3.12, manylinux: glibc 2.17+ ARM64
  • heavykeeper-0.2.2-cp312-cp312-macosx_11_0_arm64.whl (253.3 kB): CPython 3.12, macOS 11.0+ ARM64
  • heavykeeper-0.2.2-cp312-cp312-macosx_10_12_x86_64.whl (265.3 kB): CPython 3.12, macOS 10.12+ x86-64
  • heavykeeper-0.2.2-cp311-cp311-win_amd64.whl (151.9 kB): CPython 3.11, Windows x86-64
  • heavykeeper-0.2.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (297.0 kB): CPython 3.11, manylinux: glibc 2.17+ x86-64
  • heavykeeper-0.2.2-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (287.1 kB): CPython 3.11, manylinux: glibc 2.17+ ARM64
  • heavykeeper-0.2.2-cp311-cp311-macosx_11_0_arm64.whl (255.5 kB): CPython 3.11, macOS 11.0+ ARM64
  • heavykeeper-0.2.2-cp311-cp311-macosx_10_12_x86_64.whl (267.3 kB): CPython 3.11, macOS 10.12+ x86-64

File details

Details for the file heavykeeper-0.2.2.tar.gz.

File metadata

  • Download URL: heavykeeper-0.2.2.tar.gz
  • Size: 1.2 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for heavykeeper-0.2.2.tar.gz
Algorithm Hash digest
SHA256 12c2cfd985ac4d885b9d97a27603e63bbeb5bd02b17ae5a2a33d0f110fc867b2
MD5 e734f8df4b936c3ac33252b58a31390f
BLAKE2b-256 1c19587db2a38533fb3ca463ed115ee8e36e095d237c67e5983cacce46351959

Provenance

The following attestation bundles were made for heavykeeper-0.2.2.tar.gz:

Publisher: release.yml on pmcgleenon/heavykeeper-py

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
