A lightweight EDA tool inspired by the curious nature of suricates. Built just for fun.
PySuricata 🦦
Lightweight, High-Performance Exploratory Data Analysis for Python
Generate comprehensive, self-contained HTML reports using proven streaming algorithms
Quick Start • Documentation • Examples • Why PySuricata?
Features
- True Streaming Architecture - Process TB datasets in bounded memory (O(1) space per column)
- Lightning Fast - Single-pass O(n) algorithms, about 6x faster than pandas-profiling in the benchmarks below (15s vs 90s)
- Mathematically Proven - Welford/Pébay for exact moments, KMV/Misra-Gries for guarantees
- Minimal Dependencies - Just pandas/polars (~10 MB installed)
- Portable Reports - Self-contained HTML with inline CSS/JS/images
- Framework Flexible - Native pandas and polars support
- Reproducible - Seeded sampling for deterministic results
- Highly Customizable - Extensive configuration without code changes
Quick Start
Installation
pip install pysuricata
Generate Your First Report
import pandas as pd
from pysuricata import profile
# Load data
df = pd.read_csv("your_data.csv")
# Generate report
report = profile(df)
report.save_html("report.html")
That's it! Open report.html in your browser to see a comprehensive analysis.
Why PySuricata?
Comparison with Alternatives
| Feature | PySuricata | pandas-profiling | sweetviz | pandas-eda |
|---|---|---|---|---|
| Memory model | 🟢 Streaming (bounded) | 🔴 In-memory (full) | 🔴 In-memory | 🔴 In-memory |
| Large datasets (>1GB) | ✅ GB to TB | ❌ RAM limited | ❌ RAM limited | ❌ RAM limited |
| Speed (1GB dataset) | 🟢 15s | 🔴 90s | 🟡 75s | 🟡 60s |
| Peak memory (1GB) | 🟢 50 MB | 🔴 1.2 GB | 🔴 1.1 GB | 🔴 1.0 GB |
| Dependencies | 🟢 Minimal (~10 MB) | 🔴 Heavy (100+ MB) | 🟡 Medium (80 MB) | 🟡 Medium |
| Report format | 🟢 Single HTML | 🟡 HTML + assets | 🟡 HTML + assets | 🟡 HTML + assets |
| Polars support | ✅ Native | ❌ No | ❌ No | ❌ No |
| Exact algorithms | ✅ Welford/Pébay | ⚠️ NumPy/SciPy | ⚠️ NumPy/SciPy | ⚠️ NumPy/SciPy |
| Reproducibility | ✅ Seeded | ⚠️ Partial | ⚠️ Partial | ❌ No |
Performance Benchmarks
Processing time (1M rows × 50 columns, mixed types):
- PySuricata: 15s
- pandas-eda: 60s
- sweetviz: 75s
- pandas-profiling: 90s
Memory usage (1GB CSV file):
- PySuricata: 50 MB
- pandas-eda: 1.0 GB
- sweetviz: 1.1 GB
- pandas-profiling: 1.2 GB
What's in a Report?
PySuricata generates comprehensive reports with:
Dataset Overview
- Rows, columns, memory usage
- Missing values summary
- Duplicate rows estimate
- Processing time and throughput
Variable Analysis (4 types)
- Numeric: Mean, variance, skewness, kurtosis, quantiles, histograms, outliers, correlations
- Categorical: Top values, distinct count, entropy, Gini, string statistics
- DateTime: Temporal range, hour/day/month distributions, monotonicity, timeline charts
- Boolean: True/false ratios, entropy, balance scores, imbalance detection
Advanced Analytics
- Streaming correlations (Pearson r)
- Missing value patterns per chunk
- Data quality metrics
- Outlier detection (IQR, MAD, z-score)
All statistics computed using mathematically proven algorithms with error bounds.
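To illustrate how a Pearson correlation can be computed in a single pass, here is a minimal sketch that keeps only running means and co-moments and never stores the rows. The class name `StreamingPearson` is hypothetical and is not part of PySuricata's API; the library's own implementation may differ.

```python
import math


class StreamingPearson:
    """Running Pearson r for one pair of columns (illustrative only)."""

    def __init__(self):
        self.n = 0
        self.mean_x = 0.0
        self.mean_y = 0.0
        self.m2_x = 0.0   # sum of squared deviations of x
        self.m2_y = 0.0   # sum of squared deviations of y
        self.c_xy = 0.0   # sum of cross-deviations of x and y

    def update(self, x, y):
        # Welford-style update: deviations from the old means are combined
        # with deviations from the new means to stay numerically stable.
        self.n += 1
        dx = x - self.mean_x
        dy = y - self.mean_y
        self.mean_x += dx / self.n
        self.mean_y += dy / self.n
        self.m2_x += dx * (x - self.mean_x)
        self.m2_y += dy * (y - self.mean_y)
        self.c_xy += dx * (y - self.mean_y)

    def r(self):
        denom = math.sqrt(self.m2_x * self.m2_y)
        return self.c_xy / denom if denom > 0 else float("nan")


corr = StreamingPearson()
for x in range(1, 11):
    corr.update(float(x), 2.0 * x + 1.0)  # y is an exact linear function of x
r_linear = corr.r()  # close to 1.0 for a perfectly linear relationship
```

Because the state is six numbers per column pair, memory does not grow with the number of rows.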
Examples
Small Dataset (In-Memory)
import pandas as pd
from pysuricata import profile
# Load Iris dataset
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv"
df = pd.read_csv(url)
# Generate report
report = profile(df)
report.save_html("iris_report.html")
Large Dataset (Streaming)
Process datasets larger than RAM with constant memory usage:
from pysuricata import profile, ReportConfig
import pandas as pd
def read_large_dataset():
    """Generator yielding chunks"""
    for i in range(100):
        yield pd.read_parquet(f"data/part-{i}.parquet")
# Configure for large data
config = ReportConfig()
config.compute.chunk_size = 250_000
config.compute.random_seed = 42
# Profile in bounded memory
report = profile(read_large_dataset(), config=config)
report.save_html("large_dataset_report.html")
Polars Support
import polars as pl
from pysuricata import profile
# Works natively with polars
df = pl.read_csv("data.csv")
report = profile(df)
report.save_html("polars_report.html")
# Also supports LazyFrame
lf = pl.scan_csv("large_file.csv").filter(pl.col("value") > 0)
report = profile(lf)
Statistics Only (No HTML)
Perfect for CI/CD data quality checks:
from pysuricata import summarize
stats = summarize(df)
# Check data quality thresholds
assert stats["dataset"]["missing_cells_pct"] < 5.0
assert stats["dataset"]["duplicate_rows_pct_est"] < 1.0
# Access per-column statistics
print(f"Mean age: {stats['columns']['age']['mean']:.1f}")
print(f"Distinct countries: {stats['columns']['country']['distinct']}")
Reproducible Reports
from pysuricata import profile, ReportConfig
# Set random seed for deterministic sampling
config = ReportConfig()
config.compute.random_seed = 42
report = profile(df, config=config)
# Same report every time!
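The determinism comes from seeding the sampler. As a rough illustration, the sketch below shows a seeded reservoir sample (Algorithm R) over a stream of unknown length. `reservoir_sample` is an illustrative helper written for this README, not a PySuricata function, and the library's internal sampler may be implemented differently.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Keep a uniform random sample of up to k items from an iterable."""
    rng = random.Random(seed)  # private generator, so the global state is untouched
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


first = reservoir_sample(range(100_000), k=5, seed=42)
second = reservoir_sample(range(100_000), k=5, seed=42)
same_sample = first == second  # identical seed and input give an identical sample
```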
Custom Description (Markdown Support)
from pysuricata import profile, ReportConfig
config = ReportConfig()
config.render.description = """
# Q4 2024 Customer Analysis
**Dataset**: Production customer transactions
**Period**: October - December 2024
## Key Findings
- Revenue up 15% YoY
- Average transaction: $87.50
"""
report = profile(df, config=config)
Algorithms & Mathematical Rigor
PySuricata uses state-of-the-art streaming algorithms:
Exact Statistics
- Welford's algorithm - Online mean/variance (numerically stable)
- Pébay's formulas - Parallel merging of moments (exact, mergeable)
- Streaming correlations - Sufficient statistics for Pearson r
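For readers unfamiliar with Welford's algorithm, the following is a minimal sketch of the online mean/variance update. The `RunningMoments` class is a hypothetical name used for illustration and does not mirror PySuricata's internal accumulators.

```python
from dataclasses import dataclass


@dataclass
class RunningMoments:
    """Running count, mean, and sum of squared deviations (M2)."""

    n: int = 0
    mean: float = 0.0
    m2: float = 0.0

    def update(self, x: float) -> None:
        # Adjust the mean first, then accumulate M2 from the deviation
        # against the old mean times the deviation against the new mean.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Sample variance (n - 1 denominator); undefined for fewer than 2 points.
        return self.m2 / (self.n - 1) if self.n > 1 else float("nan")


moments = RunningMoments()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    moments.update(value)
```

Each value is seen once and then discarded, which is why the per-column state stays constant in size.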
Approximate Statistics (with guarantees)
- KMV sketch - Distinct count (error ~2% with k=2048)
- Misra-Gries - Top-k heavy hitters (guaranteed for freq > n/k)
- Reservoir sampling - Uniform random sample (exact probability k/n)
All algorithms have proven error bounds and mathematical guarantees. See full documentation for formulas and proofs.
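As a rough illustration of the KMV (k minimum values) idea, the sketch below hashes each item to a number in [0, 1), keeps the k smallest distinct hashes, and estimates the distinct count from the k-th smallest one. The relative error of this estimator is on the order of 1/sqrt(k). `KMVSketch` is a hypothetical class written for this README and is not PySuricata's implementation.

```python
import hashlib
import heapq


class KMVSketch:
    """Distinct-count estimator that stores at most k hash values."""

    def __init__(self, k=256):
        self.k = k
        self._heap = []        # negated hashes, so heap[0] is the largest kept hash
        self._members = set()  # hashes currently kept, to skip duplicates

    @staticmethod
    def _hash01(item):
        digest = hashlib.blake2b(str(item).encode("utf-8"), digest_size=8).digest()
        return int.from_bytes(digest, "big") / 2**64

    def add(self, item):
        h = self._hash01(item)
        if h in self._members:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # New hash is smaller than the largest kept one: swap them.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def estimate(self):
        if len(self._heap) < self.k:
            return float(len(self._heap))  # fewer than k distinct hashes: exact
        return (self.k - 1) / (-self._heap[0])


sketch = KMVSketch(k=256)
for i in range(5000):
    sketch.add(f"user-{i % 1000}")  # 1000 distinct values, each seen 5 times
approx_distinct = sketch.estimate()  # an estimate near 1000, not an exact count
```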
Report Highlights
- Self-contained: Single HTML file, no external dependencies
- Beautiful visualizations: Inline SVG charts and histograms
- Responsive design: Works on desktop and mobile
- Dark mode: Toggle between light and dark themes
- Professional styling: Clean, modern interface
- Shareable: Email, cloud storage, or static hosting
Use Cases
Data Science & ML
- EDA - Understand distributions, correlations, missing patterns
- Feature engineering - Identify high-cardinality, constant columns
- Data validation - Check quality before training
- Reproducibility - Generate consistent reports with seeds
Data Engineering
- Pipeline monitoring - Track data quality over time
- CI/CD checks - Assert quality thresholds
- Documentation - Auto-generate data dictionaries
- Debugging - Quickly profile large production datasets
Business Analytics
- Dashboard generation - Automated reporting
- Data documentation - Share with stakeholders
- Quality assurance - Catch issues early
- Compliance - Document data characteristics
Configuration
Highly customizable via ReportConfig:
from pysuricata import profile, ReportConfig
config = ReportConfig()
# Processing
config.compute.chunk_size = 200_000 # Rows per chunk
config.compute.numeric_sample_size = 20_000 # Sample for quantiles
config.compute.random_seed = 42 # Reproducibility
# Analysis
config.compute.compute_correlations = True # Enable correlations
config.compute.corr_threshold = 0.5 # Min |r| to show
config.compute.top_k_size = 50 # Top values to track
# Rendering
config.render.title = "My Analysis Report"
config.render.description = "Custom markdown description"
config.render.include_sample = True
config.render.sample_rows = 10
report = profile(df, config=config)
See Configuration Guide for all options.
Statistical Methods
Numeric Variables
- Central tendency: mean, median
- Dispersion: variance, std, IQR, MAD, CV
- Shape: skewness, kurtosis
- Quantiles: P1, P5, Q1, Q2, Q3, P95, P99
- Outliers: IQR fences, z-score, MAD-based
- Distribution: histogram (Freedman-Diaconis binning)
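The IQR fences and the Freedman-Diaconis rule both start from the quartiles. The sketch below shows the standard textbook formulas on an in-memory list, as one might apply them to a sample of a column. The helper names are illustrative assumptions, and PySuricata may use a different quantile interpolation.

```python
import math
import statistics


def quartiles(values):
    """Return (Q1, Q3) using linear interpolation between order statistics."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q1, q3


def iqr_fences(values, k=1.5):
    """Tukey fences: points outside [Q1 - k*IQR, Q3 + k*IQR] are flagged."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def freedman_diaconis_bins(values):
    """Histogram bin count from the rule: width = 2 * IQR / n^(1/3)."""
    q1, q3 = quartiles(values)
    width = 2.0 * (q3 - q1) / len(values) ** (1.0 / 3.0)
    if width <= 0:
        return 1  # degenerate case: the IQR is zero
    return max(1, math.ceil((max(values) - min(values)) / width))


data = [float(v) for v in range(1, 101)]
low, high = iqr_fences(data)
outliers = [v for v in data + [500.0] if v < low or v > high]
bins = freedman_diaconis_bins(data)
```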
Categorical Variables
- Frequency table (top-k via Misra-Gries)
- Distinct count (KMV sketch)
- Shannon entropy, Gini impurity
- String length statistics
- Case/trim variant detection
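To show how a top-k frequency table can be tracked in bounded memory, here is a minimal sketch of the Misra-Gries summary. With at most k - 1 counters, every value that occurs more than n/k times in a stream of n items is guaranteed to be among the surviving candidates, although the stored counts are underestimates. The function name is an illustrative assumption, not PySuricata's API.

```python
def misra_gries(stream, k):
    """Return candidate heavy hitters using at most k - 1 counters."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all and drop those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# 100 items: "a" appears 40 times, "b" 30 times, "c", "d", "e" 10 times each.
stream = ["a", "b", "a", "c", "a", "b", "d", "a", "b", "e"] * 10
candidates = misra_gries(stream, k=4)  # threshold n/k = 25, so "a" and "b" survive
```

A second pass over the data (or an exact counter restricted to the candidates) is the usual way to turn these underestimates into exact frequencies.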
DateTime Variables
- Temporal range and span
- Hour distribution (0-23)
- Day-of-week distribution
- Month distribution
- Monotonicity coefficient
- Timeline visualization
Boolean Variables
- True/false counts and percentages
- Shannon entropy
- Balance score
- Imbalance ratio
Full mathematical formulas and derivations in Statistical Methods.
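As a quick reference for the entropy and Gini figures listed above, the sketch below computes both from a list of value counts using the standard definitions. The function names are illustrative and are not taken from PySuricata.

```python
import math


def shannon_entropy(counts):
    """Entropy in bits of the distribution implied by non-negative counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in counts:
        if count > 0:
            p = count / total
            entropy -= p * math.log2(p)
    return entropy


def gini_impurity(counts):
    """Probability that two independent draws land in different categories."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in counts)


balanced = shannon_entropy([50, 50])   # a perfectly balanced boolean column
constant = shannon_entropy([100, 0])   # a column that is always True
impurity = gini_impurity([50, 50])
```

For a boolean column the entropy ranges from 0 bits (constant) to 1 bit (perfectly balanced), which makes it a convenient imbalance indicator.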
Scalability
PySuricata scales roughly linearly with dataset size once fixed startup costs are amortized (from about 1M rows upward in the figures below):
| Dataset Size | Processing Time | Peak Memory |
|---|---|---|
| 10K rows | 1s | 30 MB |
| 100K rows | 5s | 50 MB |
| 1M rows | 15s | 50 MB |
| 10M rows | 150s | 50 MB |
| 100M rows | 1,500s (25 min) | 50 MB |
| 1B rows | 15,000s (4 hrs) | 50 MB |
Peak memory plateaus at about 50 MB from 100K rows upward, regardless of dataset size.
Why "Suricata"?
Inspired by suricatas (meerkats) - small, vigilant animals that work cooperatively to survive in harsh desert environments:
- Watchful - Always scanning for patterns (like data analysis)
- Cooperative - Parallel/distributed processing
- Efficient - Thrive with limited resources (bounded memory)
- Quick - Fast reactions (streaming algorithms)
- Pattern recognition - Identify important signals
Learn more about why suricatas inspired this library.
Documentation
Comprehensive documentation with mathematical formulas, algorithm details, and examples:
- Quick Start Guide - Get started in 5 minutes
- User Guide - Detailed usage patterns
- API Reference - Complete API documentation
- Statistical Methods - Mathematical formulas
- Algorithms - Welford, Pébay, KMV, Misra-Gries
- Performance Tips - Optimization strategies
- Examples Gallery - Real-world use cases
Advanced Features
Distributed Processing
Accumulators are mergeable - compute on multiple machines and combine:
from pysuricata.accumulators import NumericAccumulator
# Worker 1
acc1 = NumericAccumulator("amount")
acc1.update(data_partition_1)
# Worker 2
acc2 = NumericAccumulator("amount")
acc2.update(data_partition_2)
# Merge on coordinator (exact, no approximation)
acc1.merge(acc2)
final_stats = acc1.finalize()
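To give a sense of why such a merge can be exact, the sketch below combines two (count, mean, M2) summaries with the pairwise update formulas described by Pébay and checks the result against a single pass over all the data. The functions `moments_of` and `merge_moments` are hypothetical helpers, not the internals of `NumericAccumulator`.

```python
def moments_of(values):
    """Return (n, mean, M2) for an iterable of numbers, computed in one pass."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return n, mean, m2


def merge_moments(a, b):
    """Combine two (n, mean, M2) summaries without revisiting the raw data."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    if n == 0:
        return 0, 0.0, 0.0
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    # The cross term accounts for the gap between the two partition means.
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return n, mean, m2


left = moments_of([1.0, 2.0, 3.0])      # e.g. computed on worker 1
right = moments_of([10.0, 20.0])        # e.g. computed on worker 2
merged = merge_moments(left, right)
whole = moments_of([1.0, 2.0, 3.0, 10.0, 20.0])
```

Only three numbers per column travel between workers, so the coordinator never needs the raw partitions.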
CI/CD Integration
from pysuricata import summarize
def validate_data_quality(df):
    stats = summarize(df)  # Fast, stats-only
    assert stats["dataset"]["missing_cells_pct"] < 5.0, "Too many missing"
    assert stats["dataset"]["duplicate_rows_pct_est"] < 1.0, "Too many duplicates"
    print("✓ Data quality checks passed")
# In your pipeline
validate_data_quality(df)
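Since `assert` statements are skipped when Python runs with `-O`, a pipeline may prefer to collect violations explicitly. The following is a sketch under the assumption that the statistics dictionary has the `"dataset"` keys used above; `quality_failures` and its threshold parameters are hypothetical names, and the example dictionary is hand-written rather than real `summarize` output.

```python
def quality_failures(stats, max_missing_pct=5.0, max_duplicate_pct=1.0):
    """Return a list of human-readable threshold violations (empty if none)."""
    dataset = stats["dataset"]
    failures = []
    if dataset["missing_cells_pct"] >= max_missing_pct:
        failures.append(
            f"missing cells {dataset['missing_cells_pct']:.1f}% "
            f"(limit {max_missing_pct:.1f}%)"
        )
    if dataset["duplicate_rows_pct_est"] >= max_duplicate_pct:
        failures.append(
            f"duplicate rows {dataset['duplicate_rows_pct_est']:.1f}% "
            f"(limit {max_duplicate_pct:.1f}%)"
        )
    return failures


# Hand-written stand-in for the dictionary returned by summarize(df).
example_stats = {
    "dataset": {"missing_cells_pct": 7.2, "duplicate_rows_pct_est": 0.3}
}
failures = quality_failures(example_stats)
```

A CI job could then exit with a non-zero status whenever the returned list is non-empty.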
Jupyter Notebooks
from pysuricata import profile
report = profile(df)
report # Auto-displays inline
# Or with custom size
report.display_in_notebook(height="800px")
Academic Use
If you use PySuricata in academic research, please reference the algorithms:
Streaming moments:
- Welford, B.P. (1962), "Note on a Method for Calculating Corrected Sums of Squares and Products", Technometrics
- Pébay, P. (2008), "Formulas for Robust, One-Pass Parallel Computation of Covariances and Arbitrary-Order Statistical Moments", Sandia Report
Sketch algorithms:
- Bar-Yossef, Z. et al. (2002), "Counting Distinct Elements in a Data Stream", RANDOM
- Misra, J., Gries, D. (1982), "Finding repeated elements", Science of Computer Programming
See References for complete citations.
Roadmap
Current Version (0.0.11)
- ✅ Streaming architecture
- ✅ 4 variable types (numeric, categorical, datetime, boolean)
- ✅ Streaming correlations
- ✅ Missing value analysis
- ✅ Polars support
- ✅ Comprehensive documentation
Planned Features
- Spearman rank correlation (monotonic relationships)
- Little's MCAR test (missing data mechanism)
- Chi-square uniformity test (categorical distribution)
- Seasonality detection (autocorrelation for datetime)
- Gap analysis (missing time periods)
- Profile comparison (compare two datasets)
- Export to JSON/CSV (structured statistics)
- CLI tool (command-line interface)
- Dask integration (native distributed support)
Contributing
We welcome contributions! See our Contributing Guide to get started.
Ways to contribute:
- Report bugs
- Suggest features
- Improve documentation
- Add tests
- Submit pull requests
- Help others in Discussions
Development Setup
# Clone repository
git clone https://github.com/alvarodiez20/pysuricata.git
cd pysuricata
# Install dependencies
uv sync --dev
# Run tests
uv run pytest
# Build documentation
uv run mkdocs serve
Project Stats
- Lines of code: ~15,000
- Test coverage: 90%+
- Documentation pages: 25+
- Supported Python versions: 3.9, 3.10, 3.11, 3.12, 3.13
- Active development: Regular releases
Support & Community
- Documentation
- GitHub Discussions
- Issue Tracker
- Email: alvarodiez20@gmail.com
- Star on GitHub
License
MIT License - see LICENSE file for details.
Acknowledgments
Built with inspiration from:
- Suricatas (meerkats) - Vigilant, cooperative, efficient 🦦
- Welford & Pébay - Streaming moments algorithms
- Bar-Yossef, Misra & Gries - Sketch algorithms
- Open-source community
Star History
If you find PySuricata useful, please star the repository!
Ready to analyze like a suricata?
Read the Docs • Report Bug • Discussions
"In the Kalahari of big data, be a suricata - vigilant, efficient, and always ready to dig for insights!"