Python bindings for the HeavyKeeper algorithm implemented in Rust

heavykeeper

Python bindings for the HeavyKeeper algorithm - a fast, memory-efficient sketch-based algorithm for finding the top-K most frequent items in data streams.

Overview

HeavyKeeper is a probabilistic data structure that identifies the most frequent items in a data stream using minimal memory. This package provides Python bindings for a high-performance Rust implementation of the algorithm.

Key Features

  • 🚀 High Performance: Rust-based implementation with Python bindings via PyO3
  • 💾 Memory Efficient: Uses probabilistic sketching to track millions of items with minimal memory
  • 🎯 Top-K Tracking: Efficiently maintains the K most frequent items
  • 🔄 Stream Processing: Designed for continuous data streams
  • 📊 Approximate Counts: Provides estimated frequencies with high accuracy
  • 🧪 Tested: Ships with a test suite and benchmark scripts

Use Cases

  • Log Analysis: Find the most frequent IP addresses, user agents, or error messages
  • Text Processing: Identify the most common words in large documents
  • Network Monitoring: Track heavy hitters in network traffic
  • Clickstream Analysis: Find the most popular pages or user actions
  • Time Series Data: Monitor frequently occurring events or anomalies

Installation

From Source (Development)

# Clone the repository
git clone https://github.com/pmcgleenon/heavykeeper-py.git
cd heavykeeper-py

# Install Rust (if not already installed)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Install maturin, then build and install the Python package
pip install maturin
maturin develop

# Or build a wheel
maturin build --release

Requirements

  • Python 3.7+
  • Rust toolchain (for building from source)

Quick Start

from heavykeeper import HeavyKeeper

# Create a HeavyKeeper instance
# k=100: track top 100 items
# width=2048: sketch width (affects accuracy)
# depth=8: number of hash functions (affects accuracy)
# decay=0.9: aging factor for old items
hk = HeavyKeeper(k=100, width=2048, depth=8, decay=0.9)

# Add items to the stream
items = ["apple", "banana", "apple", "cherry", "apple", "banana"]
for item in items:
    hk.add(item)

# Query individual items
print(f"Is 'apple' in top-K? {hk.query('apple')}")
print(f"Estimated count for 'apple': {hk.count('apple')}")

# Get all top-K items
top_items = hk.list()  # Returns list of (item, count) tuples
print("Top items:", top_items)

# Get as dictionary
top_dict = hk.get_topk()  # Returns {item: count} dictionary
print("Top items dict:", top_dict)

API Reference

HeavyKeeper(k, width, depth, decay)

Creates a new HeavyKeeper instance.

Parameters:

  • k (int): Number of top items to track
  • width (int): Width of the sketch (number of buckets)
  • depth (int): Depth of the sketch (number of hash functions)
  • decay (float): Decay factor for aging items (between 0.0 and 1.0)

Methods

add(item: str) -> None

Add an item to the sketch.

query(item: str) -> bool

Check if an item is being tracked in the top-K list.

count(item: str) -> int

Get the estimated count for an item (returns 0 if not tracked).

list() -> List[Tuple[str, int]]

Get the top-K items as a list of (item, count) tuples, sorted by count.

get_topk() -> Dict[str, int]

Get the top-K items as a dictionary mapping items to counts.

len() -> int

Get the current number of items being tracked.

is_empty() -> bool

Check if the sketch is empty.
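
As an illustration of how the methods above fit together, the following minimal sketch streams words from lines of text into any object that exposes the `add` and `list` methods documented above. The helper name `feed_words` and the regular-expression tokenizer are assumptions made for this example and are not part of the package. `_ExactCounter` is a hypothetical stand-in so the snippet runs without the extension installed.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Assumed tokenizer for this example: runs of letters, digits, or apostrophes.
WORD_RE = re.compile(r"[A-Za-z0-9']+")


def feed_words(sketch, lines: Iterable[str]) -> int:
    """Add every lowercased word in `lines` to `sketch`.

    `sketch` is any object with an `add(item: str)` method, such as a
    HeavyKeeper instance. Returns the number of words added.
    """
    added = 0
    for line in lines:
        for word in WORD_RE.findall(line.lower()):
            sketch.add(word)
            added += 1
    return added


class _ExactCounter:
    """Hypothetical stand-in with the same add/list method names."""

    def __init__(self) -> None:
        self._counts: Counter = Counter()

    def add(self, item: str) -> None:
        self._counts[item] += 1

    def list(self) -> List[Tuple[str, int]]:
        return self._counts.most_common()


demo = _ExactCounter()
total = feed_words(demo, ["the cat and the hat", "The end"])
print(total, demo.list()[0])
```

With the package installed, passing `HeavyKeeper(k=100, width=2048, depth=8, decay=0.9)` in place of `_ExactCounter()` should work the same way, with the results read back through `list()` or `get_topk()`.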

Benchmarking

The repository includes several benchmark scripts for performance testing:

Word Count Benchmark

# Basic benchmark with a text file
python benchmark_wordcount.py -k 100 -f your_text_file.txt --time

# Using different parsing methods
python benchmark_wordcount.py -k 100 -f large_file.txt --method mmap --time

# Custom parameters
python benchmark_wordcount.py -k 50 -w 4096 -d 4 -y 0.8 -f data.txt --time

Parameter Tuning

Choosing Parameters

  • k: Set to the number of top items you need
  • width: Larger values improve accuracy but use more memory (try 1024-8192)
  • depth: More hash functions improve accuracy (try 4-16)
  • decay: Controls how quickly old items are forgotten (0.8-0.99)

Memory Usage

Approximate memory usage: width × depth × 16 bytes + k × (item_size + 16 bytes)

For typical usage (width=2048, depth=8, k=100):

  • Sketch: ~262 KB (2048 × 8 × 16 = 262,144 bytes)
  • Top-K storage: depends on item sizes
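
The approximation above can be written as a small helper for comparing configurations. This is only a sketch of the formula as stated in this section. The function name `estimate_memory_bytes` and the 20-byte average item size are assumptions for illustration, and actual usage of the Rust implementation may differ.

```python
def estimate_memory_bytes(width: int, depth: int, k: int, avg_item_size: int) -> int:
    """Apply: width * depth * 16 bytes + k * (item_size + 16 bytes)."""
    sketch_bytes = width * depth * 16
    topk_bytes = k * (avg_item_size + 16)
    return sketch_bytes + topk_bytes


# Sketch part only for the typical configuration (k=0 drops the top-K term).
sketch_only = estimate_memory_bytes(width=2048, depth=8, k=0, avg_item_size=0)

# Full estimate, assuming items average 20 bytes each (hypothetical figure).
full = estimate_memory_bytes(width=2048, depth=8, k=100, avg_item_size=20)

print(sketch_only)  # 262144
print(full)         # 265744
```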

Accuracy vs Performance

  • Higher width and depth → better accuracy, more memory
  • Lower decay → faster adaptation to changes, less stability
  • Higher k → more items tracked, slightly more overhead
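
One way to judge these trade-offs on your own data is to compare the reported top-K against exact counts computed on a sample that fits in memory. The sketch below is a hedged illustration: `topk_recall` is a hypothetical helper, not part of the package, and it expects the reported items in the `(item, count)` tuple form that `list()` is documented to return.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def topk_recall(stream: Iterable[str],
                reported: List[Tuple[str, int]],
                k: int) -> float:
    """Fraction of the exact top-k items that also appear in `reported`."""
    exact = Counter(stream)
    true_top = {item for item, _ in exact.most_common(k)}
    if not true_top:
        return 1.0  # nothing to find, so nothing was missed
    reported_items = {item for item, _ in reported}
    return len(true_top & reported_items) / len(true_top)


sample = ["a"] * 5 + ["b"] * 3 + ["c"] * 2 + ["d"]
# Pretend a sketch reported "a" and "c"; the exact top 2 is "a" and "b".
recall = topk_recall(sample, [("a", 5), ("c", 2)], k=2)
print(recall)  # 0.5
```

Running this for several `width`, `depth`, and `decay` settings on the same sample gives a rough picture of how much accuracy each extra unit of memory buys.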

Development

Building

# Development build
maturin develop

# Release build  
maturin build --release

# Build with debugging
maturin develop --debug

Testing

# Run the test suite
python test_heavykeeper.py

# Run benchmarks
python benchmark_wordcount.py -k 10 -f test_file.txt

Project Structure

heavykeeper-py/
├── src/
│   └── lib.rs          # Rust implementation and Python bindings
├── benchmark_*.py      # Performance benchmarks
├── test_heavykeeper.py # Test suite
├── Cargo.toml          # Rust dependencies
├── pyproject.toml      # Python package configuration
└── README.md           # This file

Algorithm Details

HeavyKeeper uses a sketch-based approach with the following components:

  1. Count-Min Sketch: Probabilistic counting structure
  2. Heavy Part: Stores the actual top-K items
  3. Exponential Decay: Ages out old items over time
  4. Hash Functions: Multiple hash functions for better accuracy

The algorithm provides strong theoretical guarantees while maintaining excellent practical performance.
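
To make the interaction of these components concrete, here is a toy pure-Python sketch of the update rule as described in the published HeavyKeeper algorithm. It is not the Rust code in this package. The class name `ToyHeavyKeeper`, the BLAKE2b-based hashing, the dictionary used for the top-K store (the paper describes a min-heap), and the use of `decay ** count` as the decrement probability are all assumptions made for illustration.

```python
import hashlib
import random
from typing import Dict, List, Tuple


class ToyHeavyKeeper:
    """Illustrative HeavyKeeper-style sketch; not the package's implementation."""

    def __init__(self, k: int, width: int, depth: int, decay: float, seed: int = 0) -> None:
        self.k, self.width, self.depth, self.decay = k, width, depth, decay
        # Each bucket is [fingerprint, count]; a count of 0 marks an empty bucket.
        self.rows: List[List[List[int]]] = [
            [[0, 0] for _ in range(width)] for _ in range(depth)
        ]
        self.topk: Dict[str, int] = {}
        self.rng = random.Random(seed)

    @staticmethod
    def _hash(item: str, salt: int) -> int:
        digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8,
                                 salt=salt.to_bytes(8, "little")).digest()
        return int.from_bytes(digest, "little")

    def add(self, item: str) -> None:
        fingerprint = self._hash(item, self.depth)  # salt differs from all row salts
        estimate = 0
        for row_index, row in enumerate(self.rows):
            bucket = row[self._hash(item, row_index) % self.width]
            if bucket[1] == 0:
                bucket[0], bucket[1] = fingerprint, 1      # claim an empty bucket
            elif bucket[0] == fingerprint:
                bucket[1] += 1                             # same item: count up
            elif self.rng.random() < self.decay ** bucket[1]:
                bucket[1] -= 1                             # collision: decay the owner
                if bucket[1] == 0:
                    bucket[0], bucket[1] = fingerprint, 1  # take over the bucket
            if bucket[0] == fingerprint:
                estimate = max(estimate, bucket[1])
        self._update_topk(item, estimate)

    def _update_topk(self, item: str, estimate: int) -> None:
        if item in self.topk:
            self.topk[item] = max(self.topk[item], estimate)
        elif estimate > 0 and len(self.topk) < self.k:
            self.topk[item] = estimate
        elif estimate > 0 and self.topk:
            weakest = min(self.topk, key=self.topk.get)
            if estimate > self.topk[weakest]:
                del self.topk[weakest]
                self.topk[item] = estimate

    def list(self) -> List[Tuple[str, int]]:
        return sorted(self.topk.items(), key=lambda pair: pair[1], reverse=True)


toy = ToyHeavyKeeper(k=2, width=64, depth=4, decay=0.9)
for word in ["apple"] * 50 + ["banana"] * 30 + ["cherry"] * 5:
    toy.add(word)
print(toy.list())
```

Because a large counter is decremented only with a small probability, frequent items keep their buckets while infrequent ones are gradually displaced, which is why the infrequent "cherry" does not enter the top 2 here.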

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • Based on the HeavyKeeper algorithm research
  • Built with PyO3 for Rust-Python interoperability
  • Uses Maturin for building Python extensions
