Skip to main content

High-performance Count-Min Sketch implementation with C++ and Python versions

Project description

Count-Min Sketch

A high-performance C++ implementation of the Count-Min Sketch probabilistic data structure with Python bindings.

Python Package License: MIT C++17

Project Purpose

This project serves as an educational exploration of:

  • Python Package Development: Building Python packages with C++ implementations using modern tools (pybind11, scikit-build-core, uv)
  • Performance Comparison: Comparing C++ and Python native implementations of the same algorithm
  • Build & Publishing Pipeline: Complete workflow from C++ development to Python package distribution
  • Modern C++ Features: Template-based design, thread safety, and CMake integration

The implementation is inspired by the CMU 15-445/645 Database Systems course Project #0, which focuses on implementing a Count-Min Sketch data structure. This project extends that educational foundation by exploring how to package C++ implementations for Python consumption and comparing performance characteristics.

What is Count-Min Sketch?

The Count-Min Sketch is a probabilistic data structure that provides approximate frequency counts for items in a stream. It's particularly useful for:

Learn more: Count-Min Sketch on Wikipedia

  • Streaming data analysis - Process large datasets without storing all items
  • Frequency estimation - Get approximate counts with bounded error
  • Memory efficiency - O(width ร— depth) space complexity
  • Real-time applications - Fast insertions and queries

Features

  • โšก High Performance - Optimized C++ with atomic operations for thread safety
  • ๐Ÿ”ง Template-Based - Supports any hashable key type (strings, integers, etc.)
  • ๐Ÿ Python Bindings - Easy-to-use Python interface via pybind11
  • ๐Ÿงต Thread-Safe - Concurrent access with atomic operations
  • ๐ŸŒ Cross-Platform - Works on Linux, macOS, and Windows
  • ๐Ÿ“ฆ Easy Installation - Available on PyPI

Quick Start

Installation

# Using pip
pip install cmsketch

# Using uv (recommended)
uv add cmsketch

Basic Usage

import cmsketch

# Create a sketch for strings
sketch = cmsketch.CountMinSketchStr(1000, 5)

# Add elements
sketch.insert("apple")
sketch.insert("apple")
sketch.insert("banana")

# Query frequencies
print(f"apple: {sketch.count('apple')}")    # 2
print(f"banana: {sketch.count('banana')}")  # 1
print(f"cherry: {sketch.count('cherry')}")  # 0

# Get top-k items
candidates = ["apple", "banana", "cherry"]
top_k = sketch.top_k(2, candidates)
for item, count in top_k:
    print(f"{item}: {count}")

C++ Usage

#include "cmsketch/cmsketch.h"
#include <iostream>

int main() {
    // Create a sketch
    cmsketch::CountMinSketch<std::string> sketch(1000, 5);
    
    // Add elements
    sketch.Insert("apple");
    sketch.Insert("apple");
    sketch.Insert("banana");
    
    // Query frequencies
    std::cout << "apple: " << sketch.Count("apple") << std::endl;    // 2
    std::cout << "banana: " << sketch.Count("banana") << std::endl;  // 1
    std::cout << "cherry: " << sketch.Count("cherry") << std::endl;  // 0
    
    return 0;
}

API Reference

Python Classes

Class Description
CountMinSketchStr String-based sketch
CountMinSketchInt Integer-based sketch

Key Methods

Method Description
insert(item) Insert an item into the sketch
count(item) Get estimated count of an item
top_k(k, candidates) Get top k items from candidates
merge(other) Merge another sketch
clear() Reset sketch to initial state
get_width() Get sketch width
get_depth() Get sketch depth

Configuration

The sketch is configured with two parameters:

  • Width: Number of counters per hash function (higher = more accurate)
  • Depth: Number of hash functions (higher = more accurate)
# More accurate but uses more memory
sketch = cmsketch.CountMinSketchStr(10000, 7)

# Less accurate but uses less memory  
sketch = cmsketch.CountMinSketchStr(1000, 3)

Error Bounds

The Count-Min Sketch provides the following guarantees:

  • Overestimate: Estimates are always โ‰ฅ actual frequency
  • Error Bound: Error is bounded by sketch dimensions
  • Memory: O(width ร— depth) counters
  • Thread Safety: Atomic operations ensure concurrent access

Performance

The C++ implementation provides substantial performance improvements over the Python implementation, as demonstrated by comprehensive benchmarks using 100,000 IP address samples with realistic frequency distributions.

Benchmark Results

Based on actual performance testing with 100k IP address samples:

Insertion Performance

  • C++: 45.79ms (21.8 ops/sec) for 100k threaded inserts
  • Python: 8,751ms (0.11 ops/sec) for 100k threaded inserts
  • Speedup: ~191x faster in C++

Query Performance

  • C++: 4.71ฮผs per count operation (212k ops/sec)
  • Python: 858.58ฮผs per count operation (1.2k ops/sec)
  • Speedup: ~182x faster in C++

Top-K Performance

  • C++: 2.57ฮผs per top-k operation (389k ops/sec)
  • Python: 857.54ฮผs per top-k operation (1.2k ops/sec)
  • Speedup: ~334x faster in C++

Streaming Performance (Insert + Top-K)

  • C++: 46.03ms for complete 100k insert + top-k pipeline
  • Python: 8,890ms for complete 100k insert + top-k pipeline
  • Speedup: ~193x faster in C++

Performance Analysis

The dramatic performance differences stem from several key factors:

1. Memory Access Patterns

  • C++: Direct memory access with optimized data structures
  • Python: Object overhead and dynamic typing add significant overhead

2. Hash Function Efficiency

  • C++: Compiled hash functions with minimal overhead
  • Python: Interpreted hash operations with additional function call costs

3. Thread Safety Implementation

  • C++: Native atomic operations for thread-safe counters
  • Python: GIL limitations and additional synchronization overhead

4. Memory Efficiency

  • C++: Compact memory layout with minimal overhead
  • Python: Python object overhead (~24-40 bytes per object vs 4-8 bytes for primitives)

5. Compilation vs Interpretation

  • C++: Optimized machine code generated by compiler
  • Python: Bytecode interpretation with runtime overhead

The results demonstrate that for high-frequency operations like counting and streaming analytics, the C++ implementation provides substantial performance benefits that scale with data volume and operation frequency.

Benchmark Suite

The project includes a comprehensive benchmark suite that tests real-world scenarios:

Test Data

  • 100,000 IP address samples generated using Faker with weighted distribution (10 unique IPs)
  • Realistic frequency patterns (most frequent IP appears ~10% of the time)
  • Threaded processing with 10 concurrent workers and 1,000-item batches

Benchmark Categories

Category Description Tests
Insert Bulk insertion performance C++ vs Python with 100k threaded inserts
Count Query performance Frequency counting for all unique items
Top-K Top-k retrieval Finding top 3 most frequent items
Streaming End-to-end workflows Complete insert + top-k pipeline

Running Benchmarks

# Run benchmarks with pytest
uv run pytest ./benchmarks

Benchmark Features

  • Realistic data: Uses Faker-generated IP addresses with weighted distributions (10 unique IPs)
  • Threaded testing: Tests concurrent access patterns with 10 workers and 1k-item batches
  • Comparative analysis: Direct C++ vs Python performance comparison across all operations
  • Statistical accuracy: Uses pytest-benchmark for reliable measurements with multiple rounds
  • Automated data generation: Creates test data if missing with consistent distributions
  • Comprehensive coverage: Tests insertion, counting, top-k, and streaming workflows

Building from Source

Prerequisites

  • C++17 compatible compiler
  • CMake 3.15+
  • Python 3.11+ (for Python bindings)
  • pybind11 (for Python bindings)

Quick Build

# Clone the repository
git clone https://github.com/isaac-fate/count-min-sketch.git
cd count-min-sketch

# Build everything
make build

# Run tests
make test

# Run example
make example

Development Setup

# Clone the repository
git clone https://github.com/isaac-fate/count-min-sketch.git
cd count-min-sketch

# Install all dependencies (including dev dependencies)
uv sync --dev

# Build the C++ library and Python bindings
uv run python -m pip install -e .

# Run Python tests
uv run pytest pytests/

# Run C++ tests
make build-dev
cd build && make test

# Run benchmarks
uv run pytest ./benchmarks

GitHub Actions

This project uses GitHub Actions for automated CI/CD workflows:

Workflows

  • test.yml: Runs C++ and Python tests on all platforms
  • wheels.yml: Builds wheels for Windows, Linux, and macOS using cibuildwheel
  • release.yml: Automatically publishes wheels to PyPI on release

Supported Platforms

  • Python Versions: 3.11 and 3.12
  • Architectures:
    • Windows: x86_64
    • Linux: x86_64
    • macOS: Intel (x86_64) and Apple Silicon (arm64)

Triggering Workflows

# Push to trigger tests and wheel builds
git push origin main

# Create a release to upload all wheels to PyPI
git tag v0.1.0
git push origin v0.1.0

Workflow Features

  • Cross-Platform Compilation: Uses cibuildwheel for consistent wheel building
  • Dependency Management: Automated dependency installation and caching
  • Test Coverage: Comprehensive testing across all supported platforms
  • Automated Publishing: PyPI upload on release

Project Structure

count-min-sketch/
โ”œโ”€โ”€ include/cmsketch/                    # C++ header files
โ”‚   โ”œโ”€โ”€ cmsketch.h                      # Main header (include this)
โ”‚   โ”œโ”€โ”€ count_min_sketch.h              # Core Count-Min Sketch template class
โ”‚   โ””โ”€โ”€ hash_util.h                     # Hash utility functions
โ”œโ”€โ”€ src/cmsketchcpp/                    # C++ source files
โ”‚   โ””โ”€โ”€ count_min_sketch.cc             # Core implementation
โ”œโ”€โ”€ src/cmsketch/                       # Python package source
โ”‚   โ”œโ”€โ”€ __init__.py                     # Package initialization
โ”‚   โ”œโ”€โ”€ base.py                         # Base classes and interfaces
โ”‚   โ”œโ”€โ”€ _core.pyi                       # Type stubs for C++ bindings
โ”‚   โ”œโ”€โ”€ _version.py                     # Version information
โ”‚   โ”œโ”€โ”€ py.typed                        # Type checking marker
โ”‚   โ””โ”€โ”€ py/                             # Pure Python implementations
โ”‚       โ”œโ”€โ”€ count_min_sketch.py         # Python Count-Min Sketch implementation
โ”‚       โ””โ”€โ”€ hash_util.py                # Python hash utilities
โ”œโ”€โ”€ src/                                # Additional source files
โ”‚   โ”œโ”€โ”€ main.cc                         # Example C++ application
โ”‚   โ””โ”€โ”€ python_bindings.cc              # Python bindings (pybind11)
โ”œโ”€โ”€ tests/                              # C++ unit tests
โ”‚   โ”œโ”€โ”€ CMakeLists.txt                  # Test configuration
โ”‚   โ”œโ”€โ”€ test_count_min_sketch.cc        # Core functionality tests
โ”‚   โ”œโ”€โ”€ test_hash_functions.cc          # Hash function tests
โ”‚   โ””โ”€โ”€ test_sketch_config.cc           # Configuration tests
โ”œโ”€โ”€ pytests/                            # Python tests
โ”‚   โ”œโ”€โ”€ __init__.py                     # Test package init
โ”‚   โ”œโ”€โ”€ conftest.py                     # Pytest configuration
โ”‚   โ”œโ”€โ”€ test_count_min_sketch.py        # Core Python tests
โ”‚   โ”œโ”€โ”€ test_hash_util.py               # Hash utility tests
โ”‚   โ”œโ”€โ”€ test_mixins.py                  # Mixin class tests
โ”‚   โ””โ”€โ”€ test_py_count_min_sketch.py     # Pure Python implementation tests
โ”œโ”€โ”€ benchmarks/                         # Performance benchmarks
โ”‚   โ”œโ”€โ”€ __init__.py                     # Benchmark package init
โ”‚   โ”œโ”€โ”€ generate_data.py                # Data generation utilities
โ”‚   โ””โ”€โ”€ test_benchmarks.py              # Benchmark validation tests
โ”œโ”€โ”€ examples/                           # Example scripts
โ”‚   โ””โ”€โ”€ example.py                      # Python usage example
โ”œโ”€โ”€ scripts/                            # Build and deployment scripts
โ”‚   โ”œโ”€โ”€ build.sh                        # Production build script
โ”‚   โ””โ”€โ”€ build-dev.sh                    # Development build script
โ”œโ”€โ”€ data/                               # Sample data files
โ”‚   โ”œโ”€โ”€ ips.txt                         # IP address sample data
โ”‚   โ””โ”€โ”€ unique-ips.txt                  # Unique IP sample data
โ”œโ”€โ”€ build/                              # Build artifacts (generated)
โ”‚   โ”œโ”€โ”€ _core.cpython-*.so              # Compiled Python extensions
โ”‚   โ”œโ”€โ”€ cmsketch_example                # Compiled C++ example
โ”‚   โ”œโ”€โ”€ libcmsketch.a                   # Static library
โ”‚   โ””โ”€โ”€ tests/                          # Compiled test binaries
โ”œโ”€โ”€ dist/                               # Distribution packages (generated)
โ”‚   โ””โ”€โ”€ cmsketch-*.whl                  # Python wheel packages
โ”œโ”€โ”€ CMakeLists.txt                      # Main CMake configuration
โ”œโ”€โ”€ pyproject.toml                      # Python package configuration
โ”œโ”€โ”€ uv.lock                             # uv lock file
โ”œโ”€โ”€ Makefile                            # Convenience make targets
โ”œโ”€โ”€ LICENSE                             # MIT License
โ””โ”€โ”€ README.md                           # This file

Educational Value

This project demonstrates several important software engineering concepts:

1. Python Package Development with C++ Extensions

  • pybind11 Integration: Seamless C++ to Python binding generation
  • scikit-build-core: Modern Python build system for C++ extensions
  • uv Package Management: Fast, modern Python package management
  • Type Stubs: Complete type information for Python IDEs

2. Performance Engineering

  • C++ vs Python: Direct performance comparison between implementations
  • Memory Efficiency: Optimized data structures and memory usage patterns
  • Thread Safety: Atomic operations and concurrent access patterns
  • Benchmarking: Comprehensive performance testing and profiling

3. Build System Integration

  • CMake: Cross-platform C++ build configuration
  • Python Packaging: Complete pip-installable package creation
  • CI/CD: Automated testing and publishing workflows
  • Cross-Platform: Support for multiple operating systems and architectures

4. Modern C++ Practices

  • Template Metaprogramming: Generic, type-safe implementations
  • RAII: Resource management and exception safety
  • STL Integration: Standard library containers and algorithms
  • Google Style Guide: Consistent, readable code formatting

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Follow Google C++ Style Guide
  4. Add tests for new features
  5. Ensure all tests pass
  6. Submit a pull request

License

MIT License - see LICENSE file for details.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

cmsketch-0.1.11.tar.gz (35.2 kB view details)

Uploaded Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

cmsketch-0.1.11-cp312-cp312-win_amd64.whl (290.6 kB view details)

Uploaded CPython 3.12Windows x86-64

cmsketch-0.1.11-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (113.9 kB view details)

Uploaded CPython 3.12manylinux: glibc 2.27+ x86-64manylinux: glibc 2.28+ x86-64

cmsketch-0.1.11-cp312-cp312-macosx_11_0_arm64.whl (105.1 kB view details)

Uploaded CPython 3.12macOS 11.0+ ARM64

cmsketch-0.1.11-cp312-cp312-macosx_10_15_x86_64.whl (112.5 kB view details)

Uploaded CPython 3.12macOS 10.15+ x86-64

cmsketch-0.1.11-cp311-cp311-win_amd64.whl (289.8 kB view details)

Uploaded CPython 3.11Windows x86-64

cmsketch-0.1.11-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (113.6 kB view details)

Uploaded CPython 3.11manylinux: glibc 2.27+ x86-64manylinux: glibc 2.28+ x86-64

cmsketch-0.1.11-cp311-cp311-macosx_11_0_arm64.whl (104.3 kB view details)

Uploaded CPython 3.11macOS 11.0+ ARM64

cmsketch-0.1.11-cp311-cp311-macosx_10_15_x86_64.whl (111.0 kB view details)

Uploaded CPython 3.11macOS 10.15+ x86-64

File details

Details for the file cmsketch-0.1.11.tar.gz.

File metadata

  • Download URL: cmsketch-0.1.11.tar.gz
  • Upload date:
  • Size: 35.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for cmsketch-0.1.11.tar.gz
Algorithm Hash digest
SHA256 c8cb329b0eecb6d07bb947889bb0290251460b1467a6cb228831dc9a99e220fd
MD5 75bc0bf5e8382ca582d9e503939f56cc
BLAKE2b-256 b25064605732be51adad22e19135350534d67ceef05b0ee3b41882a3b6b3f2eb

See more details on using hashes here.

File details

Details for the file cmsketch-0.1.11-cp312-cp312-win_amd64.whl.

File metadata

  • Download URL: cmsketch-0.1.11-cp312-cp312-win_amd64.whl
  • Upload date:
  • Size: 290.6 kB
  • Tags: CPython 3.12, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for cmsketch-0.1.11-cp312-cp312-win_amd64.whl
Algorithm Hash digest
SHA256 6b39ff7b86acf4195227d0f841663e117c8b995be39efdeb7a359c5bfad560a1
MD5 b73e817184ba70f4ddcb06cd8f586ee5
BLAKE2b-256 f718322cd26897b05f2eb156c72b515979298f81b6e9618c243a712f86746f59

See more details on using hashes here.

File details

Details for the file cmsketch-0.1.11-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for cmsketch-0.1.11-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 97c7b65506478f8891e9d3ca43579052a9e5d0f1b705824b8ef555b987bc99d0
MD5 947553190d5ac612d36d553b4b2c5835
BLAKE2b-256 b66497acee3ff27789525240ddf4e906be9e951b9e6291228801d75d2224e986

See more details on using hashes here.

File details

Details for the file cmsketch-0.1.11-cp312-cp312-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for cmsketch-0.1.11-cp312-cp312-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 3e3b55437bf2c54803273dd64abe49060ac704ffbe66a7bb8a198cd18220a4c2
MD5 b8660ac08920de9adfbcabea0dbd3ae1
BLAKE2b-256 0df9ffadbe7cbf564899703f880834e94964873cc841e468c51c25ce3c5a31c5

See more details on using hashes here.

File details

Details for the file cmsketch-0.1.11-cp312-cp312-macosx_10_15_x86_64.whl.

File metadata

File hashes

Hashes for cmsketch-0.1.11-cp312-cp312-macosx_10_15_x86_64.whl
Algorithm Hash digest
SHA256 a7d9fa83ad642216d4c99371e0c62de05d72a936f1b0796f2982d07719461285
MD5 875d4dee0e2e9b41fe459b269f522cfc
BLAKE2b-256 3924b444d82daed0daeb073a8fa7f6fb5e93b3d537cc6d30b8e5a89825fc634c

See more details on using hashes here.

File details

Details for the file cmsketch-0.1.11-cp311-cp311-win_amd64.whl.

File metadata

  • Download URL: cmsketch-0.1.11-cp311-cp311-win_amd64.whl
  • Upload date:
  • Size: 289.8 kB
  • Tags: CPython 3.11, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for cmsketch-0.1.11-cp311-cp311-win_amd64.whl
Algorithm Hash digest
SHA256 2d7344dc0b8027b3d5b05f7af763349c9d438d78f12bc5d48d19e5007cd8ea67
MD5 ed25b4178de5bab2249681f6d1539245
BLAKE2b-256 0202ba4353b2f3a74d0dff86a77397addb589556a5a3ac175dc2a80a36177f54

See more details on using hashes here.

File details

Details for the file cmsketch-0.1.11-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for cmsketch-0.1.11-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 16bd47bf49f60ecc033f0060300e4c91447c9c7f2934d370305cbe90bd2bb391
MD5 a96f1180369c20f6a72798171bb14603
BLAKE2b-256 4d6fa5e49c09cf2bf4abda3d085be069d07ec603b8fb8d3da77625721663c5bf

See more details on using hashes here.

File details

Details for the file cmsketch-0.1.11-cp311-cp311-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for cmsketch-0.1.11-cp311-cp311-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 aa380bdf962df3ea965c62f8c9a933547d0e367e0eec577748c8e7d935d52cea
MD5 50a615f4b7ee77910388e7f8e232d253
BLAKE2b-256 43df6bff5ab45fc45954d7c36c2b0ef496c8e55f0d64af33773104e796b0c77b

See more details on using hashes here.

File details

Details for the file cmsketch-0.1.11-cp311-cp311-macosx_10_15_x86_64.whl.

File metadata

File hashes

Hashes for cmsketch-0.1.11-cp311-cp311-macosx_10_15_x86_64.whl
Algorithm Hash digest
SHA256 7d9b8bdc7eb7917cdfe4485f8836a666d5884e0dfeab01aa721ff9d094c215d2
MD5 7f6669a2f261e7e45605984f0be32462
BLAKE2b-256 3891d488e1dc9e97cc545e7ae1680eb240f4388d6cc7490dcb68b0e0b2c16a96

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page