High-performance Count-Min Sketch implementation with C++ and Python versions
Project description
Count-Min Sketch
A high-performance, template-based C++ implementation of the Count-Min Sketch probabilistic data structure with Python bindings.
Features
- Template-Based Design: Supports any hashable key type (strings, integers, etc.)
- Thread-Safe: Uses atomic operations for concurrent access
- High Performance: Optimized C++ implementation with efficient memory usage
- Python Bindings: Easy-to-use Python interface via pybind11
- Comprehensive Testing: Full test suite with Google Test
- CMake Build System: Modern, cross-platform build configuration
Project Structure
count-min-sketch/
├── include/cmsketch/ # Public header files
│ ├── cmsketch.h # Main header (include this)
│ ├── count_min_sketch.h # Core Count-Min Sketch template class
│ ├── hash_util.h # Hash utility functions
│ └── version.h # Version information
├── src/cmsketch/ # C++ source files
│ ├── count_min_sketch.cc # Core implementation
│ └── version.cc # Version implementation
├── src/ # Additional source files
│ ├── main.cc # Example application
│ └── python_bindings.cc # Python bindings
├── tests/ # Unit tests
│ ├── CMakeLists.txt # Test configuration
│ ├── test_count_min_sketch.cpp
│ ├── test_hash_functions.cpp
│ └── test_sketch_config.cpp
├── docs/ # Documentation
│ ├── CMakeLists.txt # Documentation build
│ └── Doxyfile.in # Doxygen configuration
├── cmake/ # CMake modules
│ └── cmsketchConfig.cmake.in
├── CMakeLists.txt # Main build configuration
├── pyproject.toml # Python package configuration
├── example.py # Python example
└── build.sh # Build script
Building
Prerequisites
- C++17 compatible compiler (GCC 7+, Clang 5+, MSVC 2017+)
- CMake 3.15+
- Python 3.11+ (for Python bindings)
- pybind11 (for Python bindings)
- Google Test (for testing, optional)
Quick Build
# Make build script executable
chmod +x build.sh
# Build everything
./build.sh
Manual Build
# Create build directory
mkdir build && cd build
# Configure
cmake .. -DCMAKE_BUILD_TYPE=Release
# Build
make -j$(nproc)
# Run tests (optional)
make test
# Run example
./cmsketch_example
Python Package
# Install Python dependencies
pip install pybind11 scikit-build-core
# Build Python package
pip install .
Usage
C++ Example
#include "cmsketch/cmsketch.h"
#include <iostream>
#include <vector>
int main() {
// Create a sketch with width=1000, depth=5
cmsketch::CountMinSketch<std::string> sketch(1000, 5);
// Add elements
sketch.Insert("apple");
sketch.Insert("apple");
sketch.Insert("apple");
sketch.Insert("banana");
sketch.Insert("banana");
sketch.Insert("apple");
// Query frequencies
std::cout << "apple: " << sketch.Count("apple") << std::endl; // 4
std::cout << "banana: " << sketch.Count("banana") << std::endl; // 2
std::cout << "cherry: " << sketch.Count("cherry") << std::endl; // 0
// Test TopK functionality
std::vector<std::string> candidates = {"apple", "banana", "cherry"};
auto top_k = sketch.TopK(2, candidates);
for (const auto& pair : top_k) {
std::cout << pair.first << ": " << pair.second << std::endl;
}
return 0;
}
Python Example
import cmsketch
# Create a sketch
sketch = cmsketch.CountMinSketch(1000, 5)
# Add elements
sketch.insert("apple")
sketch.insert("apple")
sketch.insert("apple")
sketch.insert("banana")
sketch.insert("banana")
sketch.insert("apple")
# Query frequencies
print(f"apple: {sketch.count('apple')}") # 4
print(f"banana: {sketch.count('banana')}") # 2
print(f"cherry: {sketch.count('cherry')}") # 0
# Test TopK functionality
candidates = ["apple", "banana", "cherry"]
top_k = sketch.top_k(2, candidates)
for item, count in top_k:
print(f"{item}: {count}")
API Reference
Core Classes
CountMinSketch<KeyType>: Template-based sketch implementationHashUtil: Hash utility functionsVersion: Version information
Key Methods
Insert(item): Insert an item into the sketchCount(item): Get estimated count of an itemMerge(other): Merge another sketchClear(): Reset sketch to initial stateTopK(k, candidates): Get top k items from candidatesGetWidth(): Get sketch widthGetDepth(): Get sketch depth
Configuration
The sketch is configured with explicit dimensions:
// String keys
cmsketch::CountMinSketch<std::string> sketch(1000, 5);
// Integer keys
cmsketch::CountMinSketch<int> int_sketch(100, 3);
// Int64 keys
cmsketch::CountMinSketch<int64_t> int64_sketch(500, 4);
Error Bounds
The Count-Min Sketch provides the following guarantees:
- Overestimate: Estimates are always ≥ actual frequency
- Error Bound: Error is bounded by the sketch dimensions
- Memory: O(width × depth) counters
- Thread Safety: Atomic operations ensure thread-safe concurrent access
Testing
Run the test suite:
cd build
make test
# or
./cmsketch_tests
Documentation
Generate API documentation:
cd build
make docs
# Documentation will be in docs/html/
Contributing
- Follow Google C++ Style Guide
- Add tests for new features
- Update documentation as needed
- Ensure all tests pass
License
[Add your license here]
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file cmsketch-0.1.0.tar.gz.
File metadata
- Download URL: cmsketch-0.1.0.tar.gz
- Upload date:
- Size: 43.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.6.10
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
9a2cec3710d31e8ed1198fe5bab64d7b43b6517fa38aac87549ec039bb0fbf91
|
|
| MD5 |
39fa28e1c670642673e9e83f197616c5
|
|
| BLAKE2b-256 |
516d1ce982e4af7da102e22a98bff009eae7bd4311f211a5d9bfb66f5cf444ce
|
File details
Details for the file cmsketch-0.1.0-cp311-cp311-macosx_14_0_arm64.whl.
File metadata
- Download URL: cmsketch-0.1.0-cp311-cp311-macosx_14_0_arm64.whl
- Upload date:
- Size: 200.2 kB
- Tags: CPython 3.11, macOS 14.0+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.6.10
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
3d5bee28b4b0426ec6a433e60539ff8222485a931649e09aa503d4dcf74ed735
|
|
| MD5 |
66d0cb4b8ed5fecb5c59d2f6084bce0a
|
|
| BLAKE2b-256 |
30ce236ca02492268e205604bb087275bfda679d0ef327305c6b7572b999ea33
|