High-performance Count-Min Sketch implementation with C++ and Python versions
Project description
Count-Min Sketch
A high-performance C++ implementation of the Count-Min Sketch probabilistic data structure with Python bindings.
Project Purpose
This project serves as an educational exploration of:
- Python Package Development: Building Python packages with C++ implementations using modern tools (pybind11, scikit-build-core, uv)
- Performance Comparison: Comparing C++ and Python native implementations of the same algorithm
- Build & Publishing Pipeline: Complete workflow from C++ development to Python package distribution
- Modern C++ Features: Template-based design, thread safety, and CMake integration
The implementation is inspired by the CMU 15-445/645 Database Systems course Project #0, which focuses on implementing a Count-Min Sketch data structure. This project extends that educational foundation by exploring how to package C++ implementations for Python consumption and comparing performance characteristics.
What is Count-Min Sketch?
The Count-Min Sketch is a probabilistic data structure that provides approximate frequency counts for items in a stream. It's particularly useful for:
Learn more: Count-Min Sketch on Wikipedia
- Streaming data analysis - Process large datasets without storing all items
- Frequency estimation - Get approximate counts with bounded error
- Memory efficiency - O(width ร depth) space complexity
- Real-time applications - Fast insertions and queries
Features
- โก High Performance - Optimized C++ with atomic operations for thread safety
- ๐ง Template-Based - Supports any hashable key type (strings, integers, etc.)
- ๐ Python Bindings - Easy-to-use Python interface via pybind11
- ๐งต Thread-Safe - Concurrent access with atomic operations
- ๐ Cross-Platform - Works on Linux, macOS, and Windows
- ๐ฆ Easy Installation - Available on PyPI
Quick Start
Installation
# Using pip
pip install cmsketch
# Using uv (recommended)
uv add cmsketch
Basic Usage
import cmsketch
# Create a sketch for strings
sketch = cmsketch.CountMinSketchStr(1000, 5)
# Add elements
sketch.insert("apple")
sketch.insert("apple")
sketch.insert("banana")
# Query frequencies
print(f"apple: {sketch.count('apple')}") # 2
print(f"banana: {sketch.count('banana')}") # 1
print(f"cherry: {sketch.count('cherry')}") # 0
# Get top-k items
candidates = ["apple", "banana", "cherry"]
top_k = sketch.top_k(2, candidates)
for item, count in top_k:
print(f"{item}: {count}")
C++ Usage
#include "cmsketch/cmsketch.h"
#include <iostream>
int main() {
// Create a sketch
cmsketch::CountMinSketch<std::string> sketch(1000, 5);
// Add elements
sketch.Insert("apple");
sketch.Insert("apple");
sketch.Insert("banana");
// Query frequencies
std::cout << "apple: " << sketch.Count("apple") << std::endl; // 2
std::cout << "banana: " << sketch.Count("banana") << std::endl; // 1
std::cout << "cherry: " << sketch.Count("cherry") << std::endl; // 0
return 0;
}
API Reference
Python Classes
| Class | Description |
|---|---|
CountMinSketchStr |
String-based sketch |
CountMinSketchInt |
Integer-based sketch |
Key Methods
| Method | Description |
|---|---|
insert(item) |
Insert an item into the sketch |
count(item) |
Get estimated count of an item |
top_k(k, candidates) |
Get top k items from candidates |
merge(other) |
Merge another sketch |
clear() |
Reset sketch to initial state |
get_width() |
Get sketch width |
get_depth() |
Get sketch depth |
Configuration
The sketch is configured with two parameters:
- Width: Number of counters per hash function (higher = more accurate)
- Depth: Number of hash functions (higher = more accurate)
# More accurate but uses more memory
sketch = cmsketch.CountMinSketchStr(10000, 7)
# Less accurate but uses less memory
sketch = cmsketch.CountMinSketchStr(1000, 3)
Error Bounds
The Count-Min Sketch provides the following guarantees:
- Overestimate: Estimates are always โฅ actual frequency
- Error Bound: Error is bounded by sketch dimensions
- Memory: O(width ร depth) counters
- Thread Safety: Atomic operations ensure concurrent access
Performance
The C++ implementation provides substantial performance improvements over the Python implementation, as demonstrated by comprehensive benchmarks using 100,000 IP address samples with realistic frequency distributions.
Benchmark Results
Based on actual performance testing with 100k IP address samples:
Insertion Performance
- C++: 45.79ms (21.8 ops/sec) for 100k threaded inserts
- Python: 8,751ms (0.11 ops/sec) for 100k threaded inserts
- Speedup: ~191x faster in C++
Query Performance
- C++: 4.71ฮผs per count operation (212k ops/sec)
- Python: 858.58ฮผs per count operation (1.2k ops/sec)
- Speedup: ~182x faster in C++
Top-K Performance
- C++: 2.57ฮผs per top-k operation (389k ops/sec)
- Python: 857.54ฮผs per top-k operation (1.2k ops/sec)
- Speedup: ~334x faster in C++
Streaming Performance (Insert + Top-K)
- C++: 46.03ms for complete 100k insert + top-k pipeline
- Python: 8,890ms for complete 100k insert + top-k pipeline
- Speedup: ~193x faster in C++
Performance Analysis
The dramatic performance differences stem from several key factors:
1. Memory Access Patterns
- C++: Direct memory access with optimized data structures
- Python: Object overhead and dynamic typing add significant overhead
2. Hash Function Efficiency
- C++: Compiled hash functions with minimal overhead
- Python: Interpreted hash operations with additional function call costs
3. Thread Safety Implementation
- C++: Native atomic operations for thread-safe counters
- Python: GIL limitations and additional synchronization overhead
4. Memory Efficiency
- C++: Compact memory layout with minimal overhead
- Python: Python object overhead (~24-40 bytes per object vs 4-8 bytes for primitives)
5. Compilation vs Interpretation
- C++: Optimized machine code generated by compiler
- Python: Bytecode interpretation with runtime overhead
The results demonstrate that for high-frequency operations like counting and streaming analytics, the C++ implementation provides substantial performance benefits that scale with data volume and operation frequency.
Benchmark Suite
The project includes a comprehensive benchmark suite that tests real-world scenarios:
Test Data
- 100,000 IP address samples generated using Faker with weighted distribution (10 unique IPs)
- Realistic frequency patterns (most frequent IP appears ~10% of the time)
- Threaded processing with 10 concurrent workers and 1,000-item batches
Benchmark Categories
| Category | Description | Tests |
|---|---|---|
| Insert | Bulk insertion performance | C++ vs Python with 100k threaded inserts |
| Count | Query performance | Frequency counting for all unique items |
| Top-K | Top-k retrieval | Finding top 3 most frequent items |
| Streaming | End-to-end workflows | Complete insert + top-k pipeline |
Running Benchmarks
# Run benchmarks with pytest
uv run pytest ./benchmarks
Benchmark Features
- Realistic data: Uses Faker-generated IP addresses with weighted distributions (10 unique IPs)
- Threaded testing: Tests concurrent access patterns with 10 workers and 1k-item batches
- Comparative analysis: Direct C++ vs Python performance comparison across all operations
- Statistical accuracy: Uses pytest-benchmark for reliable measurements with multiple rounds
- Automated data generation: Creates test data if missing with consistent distributions
- Comprehensive coverage: Tests insertion, counting, top-k, and streaming workflows
Building from Source
Prerequisites
- C++17 compatible compiler
- CMake 3.15+
- Python 3.11+ (for Python bindings)
- pybind11 (for Python bindings)
Quick Build
# Clone the repository
git clone https://github.com/isaac-fate/count-min-sketch.git
cd count-min-sketch
# Build everything
make build
# Run tests
make test
# Run example
make example
Development Setup
# Clone the repository
git clone https://github.com/isaac-fate/count-min-sketch.git
cd count-min-sketch
# Install all dependencies (including dev dependencies)
uv sync --dev
# Build the C++ library and Python bindings
uv run python -m pip install -e .
# Run Python tests
uv run pytest pytests/
# Run C++ tests
make build-dev
cd build && make test
# Run benchmarks
uv run pytest ./benchmarks
GitHub Actions
This project uses GitHub Actions for automated CI/CD workflows:
Workflows
test.yml: Runs C++ and Python tests on all platformswheels.yml: Builds wheels for Windows, Linux, and macOS using cibuildwheelrelease.yml: Automatically publishes wheels to PyPI on release
Supported Platforms
- Python Versions: 3.11 and 3.12
- Architectures:
- Windows: x86_64
- Linux: x86_64
- macOS: Intel (x86_64) and Apple Silicon (arm64)
Triggering Workflows
# Push to trigger tests and wheel builds
git push origin main
# Create a release to upload all wheels to PyPI
git tag v0.1.0
git push origin v0.1.0
Workflow Features
- Cross-Platform Compilation: Uses cibuildwheel for consistent wheel building
- Dependency Management: Automated dependency installation and caching
- Test Coverage: Comprehensive testing across all supported platforms
- Automated Publishing: PyPI upload on release
Project Structure
count-min-sketch/
โโโ include/cmsketch/ # C++ header files
โ โโโ cmsketch.h # Main header (include this)
โ โโโ count_min_sketch.h # Core Count-Min Sketch template class
โ โโโ hash_util.h # Hash utility functions
โโโ src/cmsketchcpp/ # C++ source files
โ โโโ count_min_sketch.cc # Core implementation
โโโ src/cmsketch/ # Python package source
โ โโโ __init__.py # Package initialization
โ โโโ base.py # Base classes and interfaces
โ โโโ _core.pyi # Type stubs for C++ bindings
โ โโโ _version.py # Version information
โ โโโ py.typed # Type checking marker
โ โโโ py/ # Pure Python implementations
โ โโโ count_min_sketch.py # Python Count-Min Sketch implementation
โ โโโ hash_util.py # Python hash utilities
โโโ src/ # Additional source files
โ โโโ main.cc # Example C++ application
โ โโโ python_bindings.cc # Python bindings (pybind11)
โโโ tests/ # C++ unit tests
โ โโโ CMakeLists.txt # Test configuration
โ โโโ test_count_min_sketch.cc # Core functionality tests
โ โโโ test_hash_functions.cc # Hash function tests
โ โโโ test_sketch_config.cc # Configuration tests
โโโ pytests/ # Python tests
โ โโโ __init__.py # Test package init
โ โโโ conftest.py # Pytest configuration
โ โโโ test_count_min_sketch.py # Core Python tests
โ โโโ test_hash_util.py # Hash utility tests
โ โโโ test_mixins.py # Mixin class tests
โ โโโ test_py_count_min_sketch.py # Pure Python implementation tests
โโโ benchmarks/ # Performance benchmarks
โ โโโ __init__.py # Benchmark package init
โ โโโ generate_data.py # Data generation utilities
โ โโโ test_benchmarks.py # Benchmark validation tests
โโโ examples/ # Example scripts
โ โโโ example.py # Python usage example
โโโ scripts/ # Build and deployment scripts
โ โโโ build.sh # Production build script
โ โโโ build-dev.sh # Development build script
โโโ data/ # Sample data files
โ โโโ ips.txt # IP address sample data
โ โโโ unique-ips.txt # Unique IP sample data
โโโ build/ # Build artifacts (generated)
โ โโโ _core.cpython-*.so # Compiled Python extensions
โ โโโ cmsketch_example # Compiled C++ example
โ โโโ libcmsketch.a # Static library
โ โโโ tests/ # Compiled test binaries
โโโ dist/ # Distribution packages (generated)
โ โโโ cmsketch-*.whl # Python wheel packages
โโโ CMakeLists.txt # Main CMake configuration
โโโ pyproject.toml # Python package configuration
โโโ uv.lock # uv lock file
โโโ Makefile # Convenience make targets
โโโ LICENSE # MIT License
โโโ README.md # This file
Educational Value
This project demonstrates several important software engineering concepts:
1. Python Package Development with C++ Extensions
- pybind11 Integration: Seamless C++ to Python binding generation
- scikit-build-core: Modern Python build system for C++ extensions
- uv Package Management: Fast, modern Python package management
- Type Stubs: Complete type information for Python IDEs
2. Performance Engineering
- C++ vs Python: Direct performance comparison between implementations
- Memory Efficiency: Optimized data structures and memory usage patterns
- Thread Safety: Atomic operations and concurrent access patterns
- Benchmarking: Comprehensive performance testing and profiling
3. Build System Integration
- CMake: Cross-platform C++ build configuration
- Python Packaging: Complete pip-installable package creation
- CI/CD: Automated testing and publishing workflows
- Cross-Platform: Support for multiple operating systems and architectures
4. Modern C++ Practices
- Template Metaprogramming: Generic, type-safe implementations
- RAII: Resource management and exception safety
- STL Integration: Standard library containers and algorithms
- Google Style Guide: Consistent, readable code formatting
Contributing
- Fork the repository
- Create a feature branch
- Follow Google C++ Style Guide
- Add tests for new features
- Ensure all tests pass
- Submit a pull request
License
MIT License - see LICENSE file for details.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distributions
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file cmsketch-0.1.11.tar.gz.
File metadata
- Download URL: cmsketch-0.1.11.tar.gz
- Upload date:
- Size: 35.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
c8cb329b0eecb6d07bb947889bb0290251460b1467a6cb228831dc9a99e220fd
|
|
| MD5 |
75bc0bf5e8382ca582d9e503939f56cc
|
|
| BLAKE2b-256 |
b25064605732be51adad22e19135350534d67ceef05b0ee3b41882a3b6b3f2eb
|
File details
Details for the file cmsketch-0.1.11-cp312-cp312-win_amd64.whl.
File metadata
- Download URL: cmsketch-0.1.11-cp312-cp312-win_amd64.whl
- Upload date:
- Size: 290.6 kB
- Tags: CPython 3.12, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
6b39ff7b86acf4195227d0f841663e117c8b995be39efdeb7a359c5bfad560a1
|
|
| MD5 |
b73e817184ba70f4ddcb06cd8f586ee5
|
|
| BLAKE2b-256 |
f718322cd26897b05f2eb156c72b515979298f81b6e9618c243a712f86746f59
|
File details
Details for the file cmsketch-0.1.11-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.
File metadata
- Download URL: cmsketch-0.1.11-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl
- Upload date:
- Size: 113.9 kB
- Tags: CPython 3.12, manylinux: glibc 2.27+ x86-64, manylinux: glibc 2.28+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
97c7b65506478f8891e9d3ca43579052a9e5d0f1b705824b8ef555b987bc99d0
|
|
| MD5 |
947553190d5ac612d36d553b4b2c5835
|
|
| BLAKE2b-256 |
b66497acee3ff27789525240ddf4e906be9e951b9e6291228801d75d2224e986
|
File details
Details for the file cmsketch-0.1.11-cp312-cp312-macosx_11_0_arm64.whl.
File metadata
- Download URL: cmsketch-0.1.11-cp312-cp312-macosx_11_0_arm64.whl
- Upload date:
- Size: 105.1 kB
- Tags: CPython 3.12, macOS 11.0+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
3e3b55437bf2c54803273dd64abe49060ac704ffbe66a7bb8a198cd18220a4c2
|
|
| MD5 |
b8660ac08920de9adfbcabea0dbd3ae1
|
|
| BLAKE2b-256 |
0df9ffadbe7cbf564899703f880834e94964873cc841e468c51c25ce3c5a31c5
|
File details
Details for the file cmsketch-0.1.11-cp312-cp312-macosx_10_15_x86_64.whl.
File metadata
- Download URL: cmsketch-0.1.11-cp312-cp312-macosx_10_15_x86_64.whl
- Upload date:
- Size: 112.5 kB
- Tags: CPython 3.12, macOS 10.15+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
a7d9fa83ad642216d4c99371e0c62de05d72a936f1b0796f2982d07719461285
|
|
| MD5 |
875d4dee0e2e9b41fe459b269f522cfc
|
|
| BLAKE2b-256 |
3924b444d82daed0daeb073a8fa7f6fb5e93b3d537cc6d30b8e5a89825fc634c
|
File details
Details for the file cmsketch-0.1.11-cp311-cp311-win_amd64.whl.
File metadata
- Download URL: cmsketch-0.1.11-cp311-cp311-win_amd64.whl
- Upload date:
- Size: 289.8 kB
- Tags: CPython 3.11, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
2d7344dc0b8027b3d5b05f7af763349c9d438d78f12bc5d48d19e5007cd8ea67
|
|
| MD5 |
ed25b4178de5bab2249681f6d1539245
|
|
| BLAKE2b-256 |
0202ba4353b2f3a74d0dff86a77397addb589556a5a3ac175dc2a80a36177f54
|
File details
Details for the file cmsketch-0.1.11-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.
File metadata
- Download URL: cmsketch-0.1.11-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl
- Upload date:
- Size: 113.6 kB
- Tags: CPython 3.11, manylinux: glibc 2.27+ x86-64, manylinux: glibc 2.28+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
16bd47bf49f60ecc033f0060300e4c91447c9c7f2934d370305cbe90bd2bb391
|
|
| MD5 |
a96f1180369c20f6a72798171bb14603
|
|
| BLAKE2b-256 |
4d6fa5e49c09cf2bf4abda3d085be069d07ec603b8fb8d3da77625721663c5bf
|
File details
Details for the file cmsketch-0.1.11-cp311-cp311-macosx_11_0_arm64.whl.
File metadata
- Download URL: cmsketch-0.1.11-cp311-cp311-macosx_11_0_arm64.whl
- Upload date:
- Size: 104.3 kB
- Tags: CPython 3.11, macOS 11.0+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
aa380bdf962df3ea965c62f8c9a933547d0e367e0eec577748c8e7d935d52cea
|
|
| MD5 |
50a615f4b7ee77910388e7f8e232d253
|
|
| BLAKE2b-256 |
43df6bff5ab45fc45954d7c36c2b0ef496c8e55f0d64af33773104e796b0c77b
|
File details
Details for the file cmsketch-0.1.11-cp311-cp311-macosx_10_15_x86_64.whl.
File metadata
- Download URL: cmsketch-0.1.11-cp311-cp311-macosx_10_15_x86_64.whl
- Upload date:
- Size: 111.0 kB
- Tags: CPython 3.11, macOS 10.15+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
7d9b8bdc7eb7917cdfe4485f8836a666d5884e0dfeab01aa721ff9d094c215d2
|
|
| MD5 |
7f6669a2f261e7e45605984f0be32462
|
|
| BLAKE2b-256 |
3891d488e1dc9e97cc545e7ae1680eb240f4388d6cc7490dcb68b0e0b2c16a96
|