
A lightweight EDA tool inspired by the curious nature of suricates. Built just for fun 🔬.


PySuricata 🦦


Lightweight, High-Performance Exploratory Data Analysis for Python

Generate comprehensive, self-contained HTML reports using proven streaming algorithms



✨ Features

  • 🚀 True Streaming Architecture - Process TB datasets in bounded memory (O(1) space per column)
  • ⚡ Lightning Fast - Single-pass O(n) algorithms, about 6x faster than pandas-profiling in the benchmark below (15s vs 90s)
  • 🎯 Mathematically Proven - Welford/Pébay for exact moments, KMV/Misra-Gries for guarantees
  • 📦 Minimal Dependencies - Just pandas/polars (~10 MB installed)
  • 📄 Portable Reports - Self-contained HTML with inline CSS/JS/images
  • 🔄 Framework Flexible - Native pandas and polars support
  • 🎲 Reproducible - Seeded sampling for deterministic results
  • ⚙️ Highly Customizable - Extensive configuration without code changes

Quick Start

Installation

pip install pysuricata

Generate Your First Report

import pandas as pd
from pysuricata import profile

# Load data
df = pd.read_csv("your_data.csv")

# Generate report
report = profile(df)
report.save_html("report.html")

That's it! Open report.html in your browser to see a comprehensive analysis.

Why PySuricata?

🆚 Comparison with Alternatives

| Feature | PySuricata | pandas-profiling | sweetviz | pandas-eda |
|---|---|---|---|---|
| Memory model | 🟢 Streaming (bounded) | 🔴 In-memory (full) | 🔴 In-memory | 🔴 In-memory |
| Large datasets (>1GB) | ✅ GB to TB | ❌ RAM limited | ❌ RAM limited | ❌ RAM limited |
| Speed (1GB dataset) | 🟢 15s | 🔴 90s | 🟡 75s | 🟡 60s |
| Peak memory (1GB) | 🟢 50 MB | 🔴 1.2 GB | 🔴 1.1 GB | 🔴 1.0 GB |
| Dependencies | 🟢 Minimal (~10 MB) | 🔴 Heavy (100+ MB) | 🟡 Medium (80 MB) | 🟡 Medium |
| Report format | 🟢 Single HTML | 🟡 HTML + assets | 🟡 HTML + assets | 🟡 HTML + assets |
| Polars support | ✅ Native | ❌ No | ❌ No | ❌ No |
| Exact algorithms | ✅ Welford/Pébay | ⚠️ NumPy/SciPy | ⚠️ NumPy/SciPy | ⚠️ NumPy/SciPy |
| Reproducibility | ✅ Seeded | ⚠️ Partial | ⚠️ Partial | ❌ No |

📊 Performance Benchmarks

Processing time (1M rows × 50 columns, mixed types):

PySuricata:       ████░░░░░░░░░░░░░░░░  15s
pandas-eda:       ████████████░░░░░░░░  60s
sweetviz:         ███████████████░░░░░  75s
pandas-profiling: ██████████████████░░  90s

Memory usage (1GB CSV file):

PySuricata:       ██░░░░░░░░░░░░░░░░░░  50 MB
pandas-eda:       ████████████████████  1.0 GB
sweetviz:         █████████████████████ 1.1 GB
pandas-profiling: ██████████████████████ 1.2 GB

🎯 What's in a Report?

PySuricata generates comprehensive reports with:

Dataset Overview

  • Rows, columns, memory usage
  • Missing values summary
  • Duplicate rows estimate
  • Processing time and throughput

Variable Analysis (4 types)

📊 Numeric: Mean, variance, skewness, kurtosis, quantiles, histograms, outliers, correlations
📝 Categorical: Top values, distinct count, entropy, Gini, string statistics
📅 DateTime: Temporal range, hour/day/month distributions, monotonicity, timeline charts
✓ Boolean: True/false ratios, entropy, balance scores, imbalance detection

Advanced Analytics

  • Streaming correlations (Pearson r)
  • Missing value patterns per chunk
  • Data quality metrics
  • Outlier detection (IQR, MAD, z-score)
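
The streaming Pearson correlation listed above can be maintained one row at a time from a handful of running sums. The sketch below is a minimal illustration of that idea, not PySuricata's actual implementation; the class name `StreamingPearson` and its methods are hypothetical.

```python
import math


class StreamingPearson:
    """Running Pearson r from co-moment sufficient statistics (illustrative only)."""

    def __init__(self):
        self.n = 0
        self.mean_x = 0.0
        self.mean_y = 0.0
        self.m2_x = 0.0  # running sum of squared deviations of x
        self.m2_y = 0.0  # running sum of squared deviations of y
        self.c_xy = 0.0  # running sum of co-deviations of x and y

    def update(self, x, y):
        """Fold one (x, y) pair into the running statistics."""
        self.n += 1
        dx = x - self.mean_x  # deviation from the old mean of x
        dy = y - self.mean_y  # deviation from the old mean of y
        self.mean_x += dx / self.n
        self.mean_y += dy / self.n
        # Each product pairs an old-mean deviation with a new-mean deviation.
        self.m2_x += dx * (x - self.mean_x)
        self.m2_y += dy * (y - self.mean_y)
        self.c_xy += dx * (y - self.mean_y)

    def r(self):
        """Return Pearson r, or NaN when either variable has zero variance."""
        denom = math.sqrt(self.m2_x * self.m2_y)
        return self.c_xy / denom if denom > 0 else float("nan")


# Usage: feed pairs chunk by chunk; only six numbers are kept in memory.
pearson = StreamingPearson()
for chunk in ([(1, 2), (2, 1), (3, 4)], [(4, 3), (5, 5)]):
    for x, y in chunk:
        pearson.update(x, y)
print(f"r = {pearson.r():.3f}")
```

Because the state is a fixed set of sums, memory per column pair stays constant no matter how many rows are processed.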

All statistics are computed using mathematically proven algorithms with error bounds.

📚 Examples

Small Dataset (In-Memory)

import pandas as pd
from pysuricata import profile

# Load Iris dataset
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv"
df = pd.read_csv(url)

# Generate report
report = profile(df)
report.save_html("iris_report.html")

Large Dataset (Streaming)

Process datasets larger than RAM with constant memory usage:

from pysuricata import profile, ReportConfig
import pandas as pd

def read_large_dataset():
    """Generator yielding chunks"""
    for i in range(100):
        yield pd.read_parquet(f"data/part-{i}.parquet")

# Configure for large data
config = ReportConfig()
config.compute.chunk_size = 250_000
config.compute.random_seed = 42

# Profile in bounded memory
report = profile(read_large_dataset(), config=config)
report.save_html("large_dataset_report.html")

Polars Support

import polars as pl
from pysuricata import profile

# Works natively with polars
df = pl.read_csv("data.csv")
report = profile(df)
report.save_html("polars_report.html")

# Also supports LazyFrame
lf = pl.scan_csv("large_file.csv").filter(pl.col("value") > 0)
report = profile(lf)

Statistics Only (No HTML)

Perfect for CI/CD data quality checks:

import pandas as pd
from pysuricata import summarize

# Load the data to check
df = pd.read_csv("your_data.csv")

# Compute statistics only; no HTML is rendered
stats = summarize(df)

# Check data quality thresholds
assert stats["dataset"]["missing_cells_pct"] < 5.0
assert stats["dataset"]["duplicate_rows_pct_est"] < 1.0

# Access per-column statistics (assumes "age" and "country" columns exist)
print(f"Mean age: {stats['columns']['age']['mean']:.1f}")
print(f"Distinct countries: {stats['columns']['country']['distinct']}")

Reproducible Reports

from pysuricata import profile, ReportConfig

# Set random seed for deterministic sampling
config = ReportConfig()
config.compute.random_seed = 42

report = profile(df, config=config)
# Same report every time!

Custom Description (Markdown Support)

from pysuricata import profile, ReportConfig

config = ReportConfig()
config.render.description = """
# Q4 2024 Customer Analysis

**Dataset**: Production customer transactions  
**Period**: October - December 2024  

## Key Findings
- Revenue up 15% YoY
- Average transaction: $87.50
"""

report = profile(df, config=config)

🧮 Algorithms & Mathematical Rigor

PySuricata uses state-of-the-art streaming algorithms:

Exact Statistics

  • Welford's algorithm - Online mean/variance (numerically stable)
  • Pébay's formulas - Parallel merging of moments (exact, mergeable)
  • Streaming correlations - Sufficient statistics for Pearson r
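
To make the first two items concrete, here is a minimal sketch of Welford's online update together with the pairwise merge formula for mean and variance. It is an illustration under stated assumptions, not the library's own code: the class name `RunningMoments` is hypothetical, and higher moments (skewness, kurtosis) are left out for brevity.

```python
class RunningMoments:
    """Online mean/variance with a mergeable state (illustrative only)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        """Welford's update: fold a single value into the running state."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def merge(self, other):
        """Combine two partial states as if all values had been seen by one."""
        if other.n == 0:
            return
        if self.n == 0:
            self.n, self.mean, self.m2 = other.n, other.mean, other.m2
            return
        total = self.n + other.n
        delta = other.mean - self.mean
        # The cross term accounts for the gap between the two partial means.
        self.m2 += other.m2 + delta * delta * self.n * other.n / total
        self.mean += delta * other.n / total
        self.n = total

    def variance(self):
        """Sample variance (n - 1 denominator); NaN for fewer than two values."""
        return self.m2 / (self.n - 1) if self.n > 1 else float("nan")


# Usage: two partitions processed independently, then merged.
left, right = RunningMoments(), RunningMoments()
for value in [2, 4, 4, 4]:
    left.update(value)
for value in [5, 5, 7, 9]:
    right.update(value)
left.merge(right)
print(left.n, left.mean, left.variance())
```

The merged state matches what a single pass over all eight values would produce, up to floating-point rounding, which is what makes chunked and distributed processing possible.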

Approximate Statistics (with guarantees)

  • KMV sketch - Distinct count (error ~2% with k=2048)
  • Misra-Gries - Top-k heavy hitters (guaranteed for freq > n/k)
  • Reservoir sampling - Uniform random sample (exact probability k/n)
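
The two sketches named above can be illustrated in a few lines each. The code below is a simplified sketch, not PySuricata's implementation: the function names, the SHA-1 based hash, and the default `k` are assumptions chosen for readability.

```python
import hashlib
import heapq


def misra_gries(stream, k):
    """Return candidate heavy hitters using at most k - 1 counters.

    Any item occurring more than n/k times in a stream of length n is
    guaranteed to survive; each count is underestimated by at most n/k.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all and drop those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


def _hash01(item):
    """Map an item to a pseudo-uniform float in [0, 1] (assumed hash choice)."""
    digest = hashlib.sha1(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def kmv_estimate(stream, k=256):
    """Estimate the distinct count from the k minimum hash values (KMV)."""
    heap = []        # max-heap via negation, holding the k smallest hashes
    members = set()  # hashes currently in the heap, used to skip duplicates
    for item in stream:
        h = _hash01(item)
        if h in members:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            evicted = -heapq.heappushpop(heap, -h)
            members.discard(evicted)
            members.add(h)
    if len(heap) < k:
        return len(heap)  # fewer than k distinct hashes seen: count is exact
    return (k - 1) / (-heap[0])  # estimate from the k-th smallest hash


# Usage on a small stream where "a" is the only heavy hitter.
events = ["a"] * 6 + ["b"] * 3 + ["c", "d", "e"]
print(misra_gries(events, k=3))
print(kmv_estimate(events, k=16))
```

Both structures use memory that depends only on `k`, not on the stream length, which is the property the streaming architecture relies on.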

All algorithms have proven error bounds and mathematical guarantees. See full documentation for formulas and proofs.

🎨 Report Highlights

  • Self-contained: Single HTML file, no external dependencies
  • Beautiful visualizations: Inline SVG charts and histograms
  • Responsive design: Works on desktop and mobile
  • Dark mode: Toggle between light and dark themes
  • Professional styling: Clean, modern interface
  • Shareable: Email, cloud storage, or static hosting

💡 Use Cases

Data Science & ML

  • EDA - Understand distributions, correlations, missing patterns
  • Feature engineering - Identify high-cardinality, constant columns
  • Data validation - Check quality before training
  • Reproducibility - Generate consistent reports with seeds

Data Engineering

  • Pipeline monitoring - Track data quality over time
  • CI/CD checks - Assert quality thresholds
  • Documentation - Auto-generate data dictionaries
  • Debugging - Quickly profile large production datasets

Business Analytics

  • Dashboard generation - Automated reporting
  • Data documentation - Share with stakeholders
  • Quality assurance - Catch issues early
  • Compliance - Document data characteristics

⚙️ Configuration

Highly customizable via ReportConfig:

from pysuricata import profile, ReportConfig

config = ReportConfig()

# Processing
config.compute.chunk_size = 200_000  # Rows per chunk
config.compute.numeric_sample_size = 20_000  # Sample for quantiles
config.compute.random_seed = 42  # Reproducibility

# Analysis
config.compute.compute_correlations = True  # Enable correlations
config.compute.corr_threshold = 0.5  # Min |r| to show
config.compute.top_k_size = 50  # Top values to track

# Rendering
config.render.title = "My Analysis Report"
config.render.description = "Custom markdown description"
config.render.include_sample = True
config.render.sample_rows = 10

report = profile(df, config=config)

See Configuration Guide for all options.

🔬 Statistical Methods

Numeric Variables

  • Central tendency: mean, median
  • Dispersion: variance, std, IQR, MAD, CV
  • Shape: skewness, kurtosis
  • Quantiles: P1, P5, Q1, Q2, Q3, P95, P99
  • Outliers: IQR fences, z-score, MAD-based
  • Distribution: histogram (Freedman-Diaconis binning)
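
As a rough illustration of the outlier and binning rules in this list, the sketch below computes IQR fences, MAD-based robust z-scores, and the Freedman-Diaconis bin width with the standard library. The function names are hypothetical, and the snippet works on an in-memory list; how PySuricata applies these rules to its streamed sample is not shown here.

```python
import statistics


def iqr_fences(values, k=1.5):
    """Return (lower, upper) fences at Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def mad_scores(values):
    """Robust z-scores based on the median absolute deviation (MAD)."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        raise ValueError("MAD is zero; robust z-scores are undefined")
    # 0.6745 rescales the MAD so scores are comparable to normal z-scores.
    return [0.6745 * (v - med) / mad for v in values]


def freedman_diaconis_width(values):
    """Histogram bin width h = 2 * IQR / n^(1/3)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return 2 * (q3 - q1) / len(values) ** (1 / 3)


# Usage on a small sample with one obvious outlier.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 100]
low, high = iqr_fences(sample)
iqr_outliers = [v for v in sample if v < low or v > high]
mad_outliers = [v for v, z in zip(sample, mad_scores(sample)) if abs(z) > 3.5]
print(iqr_outliers, mad_outliers, freedman_diaconis_width(sample))
```

The 3.5 cutoff for robust z-scores is a common convention rather than a value taken from this project.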

Categorical Variables

  • Frequency table (top-k via Misra-Gries)
  • Distinct count (KMV sketch)
  • Shannon entropy, Gini impurity
  • String length statistics
  • Case/trim variant detection
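
For reference, Shannon entropy and Gini impurity can both be derived from a frequency table. This is a small sketch with hypothetical function names; it assumes entropy is measured in bits (log base 2), which the text above does not specify.

```python
import math
from collections import Counter


def shannon_entropy(counts):
    """Entropy in bits: H = -sum(p * log2(p)) over non-empty categories."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)


def gini_impurity(counts):
    """Gini impurity: 1 - sum(p^2), the chance two random draws differ."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


# Usage: build the frequency table once, then derive both measures from it.
labels = ["red", "red", "blue", "blue", "green", "green", "green", "green"]
frequencies = list(Counter(labels).values())
print(shannon_entropy(frequencies), gini_impurity(frequencies))
```

Both measures are zero for a constant column and grow as values spread more evenly across categories.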

DateTime Variables

  • Temporal range and span
  • Hour distribution (0-23)
  • Day-of-week distribution
  • Month distribution
  • Monotonicity coefficient
  • Timeline visualization

Boolean Variables

  • True/false counts and percentages
  • Shannon entropy
  • Balance score
  • Imbalance ratio

Full mathematical formulas and derivations are in Statistical Methods.

📈 Scalability

PySuricata scales linearly with dataset size:

| Dataset Size | Processing Time | Peak Memory |
|---|---|---|
| 10K rows | 1s | 30 MB |
| 100K rows | 5s | 50 MB |
| 1M rows | 15s | 50 MB |
| 10M rows | 150s | 50 MB |
| 100M rows | 1,500s (25 min) | 50 MB |
| 1B rows | 15,000s (4 hrs) | 50 MB |

Memory stays constant regardless of dataset size! 🎉

Why "Suricata"?

Inspired by suricatas (meerkats) - small, vigilant animals that work cooperatively to survive in harsh desert environments:

  • 👀 Watchful - Always scanning for patterns (like data analysis)
  • 🤝 Cooperative - Parallel/distributed processing
  • 🏜️ Efficient - Thrive with limited resources (bounded memory)
  • ⚡ Quick - Fast reactions (streaming algorithms)
  • 🔍 Pattern recognition - Identify important signals

Learn more about why suricatas inspired this library.

📖 Documentation

Comprehensive documentation covers mathematical formulas, algorithm details, and examples.

🛠️ Advanced Features

Distributed Processing

Accumulators are mergeable - compute on multiple machines and combine:

from pysuricata.accumulators import NumericAccumulator

# Worker 1
acc1 = NumericAccumulator("amount")
acc1.update(data_partition_1)

# Worker 2  
acc2 = NumericAccumulator("amount")
acc2.update(data_partition_2)

# Merge on coordinator (exact, no approximation)
acc1.merge(acc2)
final_stats = acc1.finalize()

CI/CD Integration

from pysuricata import summarize

def validate_data_quality(df):
    stats = summarize(df)  # Fast, stats-only
    
    assert stats["dataset"]["missing_cells_pct"] < 5.0, "Too many missing"
    assert stats["dataset"]["duplicate_rows_pct_est"] < 1.0, "Too many duplicates"
    
    print("✓ Data quality checks passed")

# In your pipeline
validate_data_quality(df)

Jupyter Notebooks

from pysuricata import profile

report = profile(df)
report  # Auto-displays inline

# Or with custom size
report.display_in_notebook(height="800px")

🎓 Academic Use

If you use PySuricata in academic research, please reference the algorithms:

Streaming moments:

  • Welford, B.P. (1962), "Note on a Method for Calculating Corrected Sums of Squares and Products", Technometrics
  • Pébay, P. (2008), "Formulas for Robust, One-Pass Parallel Computation of Covariances and Arbitrary-Order Statistical Moments", Sandia Report

Sketch algorithms:

  • Bar-Yossef, Z. et al. (2002), "Counting Distinct Elements in a Data Stream", RANDOM
  • Misra, J., Gries, D. (1982), "Finding repeated elements", Science of Computer Programming

See References for complete citations.

🗺️ Roadmap

Current Version (0.0.11)

  • ✅ Streaming architecture
  • ✅ 4 variable types (numeric, categorical, datetime, boolean)
  • ✅ Streaming correlations
  • ✅ Missing value analysis
  • ✅ Polars support
  • ✅ Comprehensive documentation

Planned Features

  • 🔜 Spearman rank correlation (monotonic relationships)
  • 🔜 Little's MCAR test (missing data mechanism)
  • 🔜 Chi-square uniformity test (categorical distribution)
  • 🔜 Seasonality detection (autocorrelation for datetime)
  • 🔜 Gap analysis (missing time periods)
  • 🔜 Profile comparison (compare two datasets)
  • 🔜 Export to JSON/CSV (structured statistics)
  • 🔜 CLI tool (command-line interface)
  • 🔜 Dask integration (native distributed support)

🤝 Contributing

We welcome contributions! See our Contributing Guide to get started.

Ways to contribute:

  • 🐛 Report bugs
  • 💡 Suggest features
  • 📝 Improve documentation
  • 🧪 Add tests
  • 🔧 Submit pull requests
  • 💬 Help others in Discussions

Development Setup

# Clone repository
git clone https://github.com/alvarodiez20/pysuricata.git
cd pysuricata

# Install dependencies
uv sync --dev

# Run tests
uv run pytest

# Build documentation
uv run mkdocs serve

📊 Project Stats

  • Lines of code: ~15,000
  • Test coverage: 90%+
  • Documentation pages: 25+
  • Supported Python versions: 3.9, 3.10, 3.11, 3.12, 3.13
  • Active development: Regular releases

🆘 Support & Community

📜 License

MIT License - see LICENSE file for details.

🙏 Acknowledgments

Built with inspiration from:

  • Suricatas (meerkats) - Vigilant, cooperative, efficient 🦦
  • Welford & Pébay - Streaming moments algorithms
  • Bar-Yossef, Misra & Gries - Sketch algorithms
  • Open-source community

⭐ Star History

If you find PySuricata useful, please star the repository!



Ready to analyze like a suricata?

📚 Read the Docs • 🐛 Report Bug • 💬 Discussions

"In the Kalahari of big data, be a suricata - vigilant, efficient, and always ready to dig for insights!"
