Skip to main content

Python bindings for the HeavyKeeper algorithm implemented in Rust

Project description

heavykeeper

License: MIT Python

Python bindings for the HeavyKeeper algorithm - a fast, memory-efficient sketch-based algorithm for finding the top-K most frequent items in data streams.

Overview

HeavyKeeper is a probabilistic data structure that identifies the most frequent items in a data stream using minimal memory. This implementation provides Python bindings for a high-performance Rust implementation of the algorithm.

Key Features

  • 🚀 High Performance: Rust-based implementation with Python bindings via PyO3
  • 💾 Memory Efficient: Uses probabilistic sketching to track millions of items with minimal memory
  • 🎯 Top-K Tracking: Efficiently maintains the K most frequent items
  • 🔄 Stream Processing: Designed for continuous data streams
  • 📊 Approximate Counts: Provides estimated frequencies with high accuracy
  • 🧪 Battle Tested: Includes comprehensive benchmarks and tests

Use Cases

  • Log Analysis: Find the most frequent IP addresses, user agents, or error messages
  • Text Processing: Identify the most common words in large documents
  • Network Monitoring: Track heavy hitters in network traffic
  • Clickstream Analysis: Find the most popular pages or user actions
  • Time Series Data: Monitor frequently occurring events or anomalies

Installation

From Source (Development)

# Clone the repository
git clone https://github.com/pmcgleen/heavykeeper-py.git
cd heavykeeper-py

# Install Rust (if not already installed)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Build and install the Python package
maturin develop

# Or build a wheel
maturin build --release

Requirements

  • Python 3.11+
  • Rust toolchain (for building from source)

Quick Start

from heavykeeper import HeavyKeeper

# Create a HeavyKeeper instance
# k=100: track top 100 items
# width=2048: sketch width (affects accuracy)
# depth=8: number of hash functions (affects accuracy)
# decay=0.9: aging factor for old items
hk = HeavyKeeper(k=100, width=2048, depth=8, decay=0.9)

# Add items to the stream
items = ["apple", "banana", "apple", "cherry", "apple", "banana"]
for item in items:
    hk.add(item)

# Query individual items
print(f"Is 'apple' in top-K? {hk.query('apple')}")
print(f"Estimated count for 'apple': {hk.count('apple')}")

# Get all top-K items
top_items = hk.list()  # Returns list of (item, count) tuples
print("Top items:", top_items)

# Get as dictionary
top_dict = hk.get_topk()  # Returns {item: count} dictionary
print("Top items dict:", top_dict)

API Reference

HeavyKeeper(k, width, depth, decay)

Creates a new HeavyKeeper instance.

Parameters:

  • k (int): Number of top items to track
  • width (int): Width of the sketch (number of buckets)
  • depth (int): Depth of the sketch (number of hash functions)
  • decay (float): Decay factor for aging items (between 0.0 and 1.0)

Methods

add(item: str) -> None

Add an item to the sketch.

query(item: str) -> bool

Check if an item is being tracked in the top-K list.

count(item: str) -> int

Get the estimated count for an item (returns 0 if not tracked).

list() -> List[Tuple[str, int]]

Get the top-K items as a list of (item, count) tuples, sorted by count.

get_topk() -> Dict[str, int]

Get the top-K items as a dictionary mapping items to counts.

len() -> int

Get the current number of items being tracked.

is_empty() -> bool

Check if the sketch is empty.

Benchmarking

The repository includes several benchmark scripts for performance testing:

Word Count Benchmark

# Basic benchmark with a text file
python benchmark_wordcount.py -k 100 -f your_text_file.txt --time

# Using different parsing methods
python benchmark_wordcount.py -k 100 -f large_file.txt --method mmap --time

# Custom parameters
python benchmark_wordcount.py -k 50 -w 4096 -d 4 -y 0.8 -f data.txt --time

Parameter Tuning

Choosing Parameters

  • k: Set to the number of top items you need
  • width: Larger values improve accuracy but use more memory (try 1024-8192)
  • depth: More hash functions improve accuracy (try 4-16)
  • decay: Controls how quickly old items are forgotten (0.8-0.99)

Memory Usage

Approximate memory usage: width × depth × 16 bytes + k × (item_size + 16 bytes)

For typical usage (width=2048, depth=8, k=100):

  • Sketch: ~262 KB
  • Top-K storage: ~depends on item sizes

Accuracy vs Performance

  • Higher width and depth → better accuracy, more memory
  • Lower decay → faster adaptation to changes, less stability
  • Higher k → more items tracked, slightly more overhead

Development

Building

# Development build
maturin develop

# Release build  
maturin build --release

# Build with debugging
maturin develop --debug

Testing

# Run the test suite
python test_heavykeeper.py

# Run benchmarks
python benchmark_wordcount.py -k 10 -f test_file.txt

Project Structure

heavykeeper-py/
├── src/
│   └── lib.rs          # Rust implementation and Python bindings
├── benchmark_*.py      # Performance benchmarks
├── test_heavykeeper.py # Test suite
├── Cargo.toml          # Rust dependencies
├── pyproject.toml      # Python package configuration
└── README.md           # This file

Algorithm Details

HeavyKeeper uses a sketch-based approach with the following components:

  1. Count-Min Sketch: Probabilistic counting structure
  2. Heavy Part: Stores the actual top-K items
  3. Exponential Decay: Ages out old items over time
  4. Hash Functions: Multiple hash functions for better accuracy

The algorithm provides strong theoretical guarantees while maintaining excellent practical performance.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • Based on the HeavyKeeper algorithm research
  • Built with PyO3 for Rust-Python interoperability
  • Uses Maturin for building Python extensions

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

heavykeeper-0.2.0.tar.gz (12.3 kB view details)

Uploaded Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

heavykeeper-0.2.0-cp312-cp312-win_amd64.whl (155.2 kB view details)

Uploaded CPython 3.12Windows x86-64

heavykeeper-0.2.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (294.9 kB view details)

Uploaded CPython 3.12manylinux: glibc 2.17+ x86-64

heavykeeper-0.2.0-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (294.2 kB view details)

Uploaded CPython 3.12manylinux: glibc 2.17+ ARM64

heavykeeper-0.2.0-cp312-cp312-macosx_11_0_arm64.whl (257.9 kB view details)

Uploaded CPython 3.12macOS 11.0+ ARM64

heavykeeper-0.2.0-cp312-cp312-macosx_10_12_x86_64.whl (266.7 kB view details)

Uploaded CPython 3.12macOS 10.12+ x86-64

heavykeeper-0.2.0-cp311-cp311-win_amd64.whl (155.6 kB view details)

Uploaded CPython 3.11Windows x86-64

heavykeeper-0.2.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (295.3 kB view details)

Uploaded CPython 3.11manylinux: glibc 2.17+ x86-64

heavykeeper-0.2.0-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (294.0 kB view details)

Uploaded CPython 3.11manylinux: glibc 2.17+ ARM64

heavykeeper-0.2.0-cp311-cp311-macosx_11_0_arm64.whl (262.1 kB view details)

Uploaded CPython 3.11macOS 11.0+ ARM64

heavykeeper-0.2.0-cp311-cp311-macosx_10_12_x86_64.whl (270.9 kB view details)

Uploaded CPython 3.11macOS 10.12+ x86-64

File details

Details for the file heavykeeper-0.2.0.tar.gz.

File metadata

  • Download URL: heavykeeper-0.2.0.tar.gz
  • Upload date:
  • Size: 12.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for heavykeeper-0.2.0.tar.gz
Algorithm Hash digest
SHA256 7051bfaf5a7109b24f435553d0958821c046d8868b7c0a96dc1d76315f0926d2
MD5 a060870931b0bee7d11b20f9848aab62
BLAKE2b-256 9727b86b10bf2a9e79205a7170a90d53237901095965dba29c9f948d6dbccec7

See more details on using hashes here.

Provenance

The following attestation bundles were made for heavykeeper-0.2.0.tar.gz:

Publisher: release.yml on pmcgleenon/heavykeeper-py

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file heavykeeper-0.2.0-cp312-cp312-win_amd64.whl.

File metadata

File hashes

Hashes for heavykeeper-0.2.0-cp312-cp312-win_amd64.whl
Algorithm Hash digest
SHA256 6ab7402c3f70aa1577881e3103c425718ed2bce31f80e45f667277e5565baf59
MD5 2835a0911a16776606e1ccd65586878a
BLAKE2b-256 ad3c79c8622f3c0a7df6748ebece3d9ce629506372dfbf16aa4d9934d90bdbd2

See more details on using hashes here.

Provenance

The following attestation bundles were made for heavykeeper-0.2.0-cp312-cp312-win_amd64.whl:

Publisher: release.yml on pmcgleenon/heavykeeper-py

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file heavykeeper-0.2.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for heavykeeper-0.2.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 772f0e45d6817e96e03f80bce2fb5c660178a5efda6acbb52347e45850cf7aa2
MD5 ba1c76c109d18838e28be8fef0d0edf0
BLAKE2b-256 3f3c0d4c7a931ea8df695e122fe7aa3fb89badb918b020674b61ef576e00dda0

See more details on using hashes here.

Provenance

The following attestation bundles were made for heavykeeper-0.2.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

Publisher: release.yml on pmcgleenon/heavykeeper-py

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file heavykeeper-0.2.0-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for heavykeeper-0.2.0-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 bfbbdfda28bf3b80b00c523854746a1854302e8e7b4c63e1e9362e2119b98deb
MD5 86837feaa1180fbdc10974ca096df02e
BLAKE2b-256 91f51f06ce8f1f47ff1878ca4abb45492ab2b648d85f4c792eeee4b8a39a1cf2

See more details on using hashes here.

Provenance

The following attestation bundles were made for heavykeeper-0.2.0-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl:

Publisher: release.yml on pmcgleenon/heavykeeper-py

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file heavykeeper-0.2.0-cp312-cp312-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for heavykeeper-0.2.0-cp312-cp312-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 e4651f8c4372afa0ef998c269bea1e245ad08f54a0fe557a2da008139bac513b
MD5 39a75ebf2a9810d001a2f0ec91c45943
BLAKE2b-256 3148a49ed7b6170936cc519c20ee9aaff41056cf0b2f6e0b1ab48862ab706463

See more details on using hashes here.

Provenance

The following attestation bundles were made for heavykeeper-0.2.0-cp312-cp312-macosx_11_0_arm64.whl:

Publisher: release.yml on pmcgleenon/heavykeeper-py

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file heavykeeper-0.2.0-cp312-cp312-macosx_10_12_x86_64.whl.

File metadata

File hashes

Hashes for heavykeeper-0.2.0-cp312-cp312-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 9e841a667e55530217c5248725e44bb835e3c31cc7b22cfbecfa3b1690147148
MD5 89e3a374e51e26268498bc106debf25a
BLAKE2b-256 2addaa22626b77395c9ee1860a9ea1146bb222f6b5a80442afb10b6ee71904df

See more details on using hashes here.

Provenance

The following attestation bundles were made for heavykeeper-0.2.0-cp312-cp312-macosx_10_12_x86_64.whl:

Publisher: release.yml on pmcgleenon/heavykeeper-py

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file heavykeeper-0.2.0-cp311-cp311-win_amd64.whl.

File metadata

File hashes

Hashes for heavykeeper-0.2.0-cp311-cp311-win_amd64.whl
Algorithm Hash digest
SHA256 0d6998a5800fdf024f6f80fbd6d99bd336b50786b1e454ab89416079096c9c50
MD5 eb23fba1f6c55701340d822f151bb6e0
BLAKE2b-256 d94e357a6a427bf668b78081e6782c126ba308cf8d5efc3acd48ce2168e7b7c6

See more details on using hashes here.

Provenance

The following attestation bundles were made for heavykeeper-0.2.0-cp311-cp311-win_amd64.whl:

Publisher: release.yml on pmcgleenon/heavykeeper-py

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file heavykeeper-0.2.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for heavykeeper-0.2.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 1431568bafaf8fd0d01d73f8ab7d62defd6149bbf2647dff374fc09798b83d79
MD5 bf57e41cc0102845889283a07a16427f
BLAKE2b-256 640b4d6b93fcbd44501a4e83abb91a7c5e185ba086d4e5d088df0ff7fa815896

See more details on using hashes here.

Provenance

The following attestation bundles were made for heavykeeper-0.2.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

Publisher: release.yml on pmcgleenon/heavykeeper-py

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file heavykeeper-0.2.0-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for heavykeeper-0.2.0-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 b9c5ea660982d60e8be4d2aa52d1e8de7d984d559179c68c74433ae64c5477ec
MD5 99b58cde35c8c4180edf2c048d927d9e
BLAKE2b-256 9920c1ebda86265b790ad5cfead61c07b60f89ef8a2588ff679dc5a69a34aeda

See more details on using hashes here.

Provenance

The following attestation bundles were made for heavykeeper-0.2.0-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl:

Publisher: release.yml on pmcgleenon/heavykeeper-py

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file heavykeeper-0.2.0-cp311-cp311-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for heavykeeper-0.2.0-cp311-cp311-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 b8af3069d9c48b70ab7ab2e54c0440d87acdc10ea7555e30f9d33d68176490c0
MD5 f3a5981073891324290d7416a3d0c7d7
BLAKE2b-256 e42b66acf852f82c2da32876fb135f39b8f1a4e1f2ca4bb7a299be4a47de86c4

See more details on using hashes here.

Provenance

The following attestation bundles were made for heavykeeper-0.2.0-cp311-cp311-macosx_11_0_arm64.whl:

Publisher: release.yml on pmcgleenon/heavykeeper-py

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file heavykeeper-0.2.0-cp311-cp311-macosx_10_12_x86_64.whl.

File metadata

File hashes

Hashes for heavykeeper-0.2.0-cp311-cp311-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 ceb350d595dcddbb5a1ecd3c8482ed075e9a8f26ebe4c5e30fa7462964c1bab4
MD5 24d1d9e9ce4a893dbe6cdad59388a60f
BLAKE2b-256 d1de01f3452ad20cc865b62823c2a1812b7a5940124f6c9d8cc75fdffd6453f2

See more details on using hashes here.

Provenance

The following attestation bundles were made for heavykeeper-0.2.0-cp311-cp311-macosx_10_12_x86_64.whl:

Publisher: release.yml on pmcgleenon/heavykeeper-py

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page