Skip to main content

High-performance Count-Min Sketch implementation with C++ and Python versions

Project description

Count-Min Sketch

A high-performance, template-based C++ implementation of the Count-Min Sketch probabilistic data structure with Python bindings.

Features

  • Template-Based Design: Supports any hashable key type (strings, integers, etc.)
  • Thread-Safe: Uses atomic operations for concurrent access
  • High Performance: Optimized C++ implementation with efficient memory usage
  • Python Bindings: Easy-to-use Python interface via pybind11
  • Comprehensive Testing: Full test suite with Google Test
  • CMake Build System: Modern, cross-platform build configuration

Project Structure

count-min-sketch/
├── include/cmsketch/           # Public header files
│   ├── cmsketch.h             # Main header (include this)
│   ├── count_min_sketch.h     # Core Count-Min Sketch template class
│   ├── hash_util.h            # Hash utility functions
│   └── version.h              # Version information
├── src/cmsketch/              # C++ source files
│   ├── count_min_sketch.cc    # Core implementation
│   └── version.cc             # Version implementation
├── src/                       # Additional source files
│   ├── main.cc               # Example application
│   └── python_bindings.cc    # Python bindings
├── tests/                     # Unit tests
│   ├── CMakeLists.txt        # Test configuration
│   ├── test_count_min_sketch.cpp
│   ├── test_hash_functions.cpp
│   └── test_sketch_config.cpp
├── docs/                      # Documentation
│   ├── CMakeLists.txt        # Documentation build
│   └── Doxyfile.in           # Doxygen configuration
├── cmake/                     # CMake modules
│   └── cmsketchConfig.cmake.in
├── CMakeLists.txt            # Main build configuration
├── pyproject.toml            # Python package configuration
├── example.py                # Python example
└── build.sh                  # Build script

Building

Prerequisites

  • C++17 compatible compiler (GCC 7+, Clang 5+, MSVC 2017+)
  • CMake 3.15+
  • Python 3.11+ (for Python bindings)
  • pybind11 (for Python bindings)
  • Google Test (for testing, optional)

Quick Build

# Make build script executable
chmod +x build.sh

# Build everything
./build.sh

Manual Build

# Create build directory
mkdir build && cd build

# Configure
cmake .. -DCMAKE_BUILD_TYPE=Release

# Build
make -j$(nproc)

# Run tests (optional)
make test

# Run example
./cmsketch_example

Python Package

# Install Python dependencies
pip install pybind11 scikit-build-core

# Build Python package
pip install .

Usage

C++ Example

#include "cmsketch/cmsketch.h"
#include <iostream>
#include <vector>

int main() {
    // Create a sketch with width=1000, depth=5
    cmsketch::CountMinSketch<std::string> sketch(1000, 5);
    
    // Add elements
    sketch.Insert("apple");
    sketch.Insert("apple");
    sketch.Insert("apple");
    sketch.Insert("banana");
    sketch.Insert("banana");
    sketch.Insert("apple");
    
    // Query frequencies
    std::cout << "apple: " << sketch.Count("apple") << std::endl;    // 4
    std::cout << "banana: " << sketch.Count("banana") << std::endl;  // 2
    std::cout << "cherry: " << sketch.Count("cherry") << std::endl;  // 0
    
    // Test TopK functionality
    std::vector<std::string> candidates = {"apple", "banana", "cherry"};
    auto top_k = sketch.TopK(2, candidates);
    for (const auto& pair : top_k) {
        std::cout << pair.first << ": " << pair.second << std::endl;
    }
    
    return 0;
}

Python Example

import cmsketch

# Create a sketch
sketch = cmsketch.CountMinSketch(1000, 5)

# Add elements
sketch.insert("apple")
sketch.insert("apple")
sketch.insert("apple")
sketch.insert("banana")
sketch.insert("banana")
sketch.insert("apple")

# Query frequencies
print(f"apple: {sketch.count('apple')}")    # 4
print(f"banana: {sketch.count('banana')}")  # 2
print(f"cherry: {sketch.count('cherry')}")  # 0

# Test TopK functionality
candidates = ["apple", "banana", "cherry"]
top_k = sketch.top_k(2, candidates)
for item, count in top_k:
    print(f"{item}: {count}")

API Reference

Core Classes

  • CountMinSketch<KeyType>: Template-based sketch implementation
  • HashUtil: Hash utility functions
  • Version: Version information

Key Methods

  • Insert(item): Insert an item into the sketch
  • Count(item): Get estimated count of an item
  • Merge(other): Merge another sketch
  • Clear(): Reset sketch to initial state
  • TopK(k, candidates): Get top k items from candidates
  • GetWidth(): Get sketch width
  • GetDepth(): Get sketch depth

Configuration

The sketch is configured with explicit dimensions:

// String keys
cmsketch::CountMinSketch<std::string> sketch(1000, 5);

// Integer keys
cmsketch::CountMinSketch<int> int_sketch(100, 3);

// Int64 keys
cmsketch::CountMinSketch<int64_t> int64_sketch(500, 4);

Error Bounds

The Count-Min Sketch provides the following guarantees:

  • Overestimate: Estimates are always ≥ actual frequency
  • Error Bound: Error is bounded by the sketch dimensions
  • Memory: O(width × depth) counters
  • Thread Safety: Atomic operations ensure thread-safe concurrent access

Testing

Run the test suite:

cd build
make test
# or
./cmsketch_tests

Documentation

Generate API documentation:

cd build
make docs
# Documentation will be in docs/html/

Contributing

  1. Follow Google C++ Style Guide
  2. Add tests for new features
  3. Update documentation as needed
  4. Ensure all tests pass

License

[Add your license here]

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

cmsketch-0.1.0.tar.gz (43.0 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

cmsketch-0.1.0-cp311-cp311-macosx_14_0_arm64.whl (200.2 kB view details)

Uploaded CPython 3.11macOS 14.0+ ARM64

File details

Details for the file cmsketch-0.1.0.tar.gz.

File metadata

  • Download URL: cmsketch-0.1.0.tar.gz
  • Upload date:
  • Size: 43.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.6.10

File hashes

Hashes for cmsketch-0.1.0.tar.gz
Algorithm Hash digest
SHA256 9a2cec3710d31e8ed1198fe5bab64d7b43b6517fa38aac87549ec039bb0fbf91
MD5 39fa28e1c670642673e9e83f197616c5
BLAKE2b-256 516d1ce982e4af7da102e22a98bff009eae7bd4311f211a5d9bfb66f5cf444ce

See more details on using hashes here.

File details

Details for the file cmsketch-0.1.0-cp311-cp311-macosx_14_0_arm64.whl.

File metadata

File hashes

Hashes for cmsketch-0.1.0-cp311-cp311-macosx_14_0_arm64.whl
Algorithm Hash digest
SHA256 3d5bee28b4b0426ec6a433e60539ff8222485a931649e09aa503d4dcf74ed735
MD5 66d0cb4b8ed5fecb5c59d2f6084bce0a
BLAKE2b-256 30ce236ca02492268e205604bb087275bfda679d0ef327305c6b7572b999ea33

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page