
A lightweight EDA tool inspired by the curious nature of suricates. Built just for fun 🔬.


PySuricata 🦦


Lightweight, High-Performance Exploratory Data Analysis for Python

Generate comprehensive, self-contained HTML reports using proven streaming algorithms

Quick Start • Documentation • Examples • Why PySuricata?


✨ Features

  • 🚀 True Streaming Architecture - Process datasets larger than RAM with bounded memory (~50 MB regardless of size)
  • 📊 Mathematically Rigorous - Welford/Pébay for exact moments, KMV/Misra-Gries with proven error bounds
  • 📦 Minimal Dependencies - Just pandas/polars + markdown (~10 MB installed)
  • 📄 Portable Reports - Self-contained HTML with inline CSS/JS/images
  • 🔄 Framework Flexible - Native pandas and polars support
  • 🎲 Reproducible - Seeded sampling for deterministic results
  • ⚙️ Highly Customizable - Extensive configuration without code changes
  • 🖥️ CLI Tool - Profile datasets from the command line

Quick Start

Installation

pip install pysuricata

Generate Your First Report

import pandas as pd
from pysuricata import profile

# Load data
df = pd.read_csv("your_data.csv")

# Generate report
report = profile(df)
report.save_html("report.html")

That's it! Open report.html in your browser to see a comprehensive analysis.

Command Line Interface

# Generate HTML report
pysuricata profile data.csv --output report.html

# Get JSON statistics
pysuricata summarize data.csv

Why PySuricata?

🆚 Comparison with Alternatives

| Feature | PySuricata | pandas-profiling | sweetviz | pandas-eda |
|---|---|---|---|---|
| Memory model | 🟢 Streaming (bounded) | 🔴 In-memory (full) | 🔴 In-memory | 🔴 In-memory |
| Large datasets (>1 GB) | ✅ GB to TB | ❌ RAM limited | ❌ RAM limited | ❌ RAM limited |
| Peak memory (1 GB dataset) | 🟢 50 MB | 🔴 1.2 GB | 🔴 1.1 GB | 🔴 1.0 GB |
| Dependencies | 🟢 Minimal (~10 MB) | 🔴 Heavy (100+ MB) | 🟡 Medium (80 MB) | 🟡 Medium |
| Report format | 🟢 Single HTML | 🟡 HTML + assets | 🟡 HTML + assets | 🟡 HTML + assets |
| Polars support | ✅ Native | ❌ No | ❌ No | ❌ No |
| Exact algorithms | ✅ Welford/Pébay | ⚠️ NumPy/SciPy | ⚠️ NumPy/SciPy | ⚠️ NumPy/SciPy |
| Reproducibility | ✅ Seeded | ⚠️ Partial | ⚠️ Partial | ❌ No |

📊 Performance Benchmarks

Memory usage (1M+ rows dataset):

PySuricata:       ██░░░░░░░░░░░░░░░░░░  50 MB (constant!)
pandas-eda:       ████████████████████  1.0 GB
sweetviz:         █████████████████████ 1.1 GB
pandas-profiling: ██████████████████████ 1.2 GB

Key advantage: PySuricata uses ~50MB regardless of dataset size, while competitors require memory proportional to data size.

🎯 What's in a Report?

PySuricata generates comprehensive reports with:

Dataset Overview

  • Rows, columns, memory usage
  • Missing values summary
  • Duplicate rows estimate
  • Processing time and throughput

Variable Analysis (4 types)

📊 Numeric: Mean, variance, skewness, kurtosis, quantiles, histograms, outliers, correlations
📝 Categorical: Top values, distinct count, entropy, Gini, string statistics
📅 DateTime: Temporal range, hour/day/month distributions, monotonicity, timeline charts
✓ Boolean: True/false ratios, entropy, balance scores, imbalance detection

Advanced Analytics

  • Streaming correlations (Pearson r)
  • Missing value patterns per chunk
  • Data quality metrics
  • Outlier detection (IQR, MAD, z-score)

All statistics computed using mathematically proven algorithms with error bounds.
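
As a rough illustration of how a streaming Pearson r can be kept from sufficient statistics, the sketch below maintains running means, squared deviations, and a co-moment, and updates them one pair at a time. The class name `StreamingPearson` and its methods are hypothetical and are not taken from PySuricata's API; the recurrence is the standard Welford-style co-moment update.

```python
import math


class StreamingPearson:
    """Hypothetical accumulator: Pearson r from running co-moments (one pass)."""

    def __init__(self):
        self.n = 0
        self.mean_x = 0.0
        self.mean_y = 0.0
        self.m2_x = 0.0  # running sum of squared deviations of x
        self.m2_y = 0.0  # running sum of squared deviations of y
        self.c_xy = 0.0  # running sum of co-deviations of x and y

    def update(self, x, y):
        self.n += 1
        dx = x - self.mean_x  # deviation from the OLD mean of x
        dy = y - self.mean_y  # deviation from the OLD mean of y
        self.mean_x += dx / self.n
        self.mean_y += dy / self.n
        # Each product pairs an old-mean deviation with a new-mean deviation.
        self.m2_x += dx * (x - self.mean_x)
        self.m2_y += dy * (y - self.mean_y)
        self.c_xy += dx * (y - self.mean_y)

    def correlation(self):
        denom = math.sqrt(self.m2_x * self.m2_y)
        return self.c_xy / denom if denom > 0 else float("nan")


acc = StreamingPearson()
for x, y in zip([1, 2, 3, 4, 5], [2, 4, 5, 4, 5]):
    acc.update(x, y)
print(round(acc.correlation(), 4))  # 6 / sqrt(60), about 0.7746
```

Only six numbers are stored per column pair, which is why memory does not grow with the number of rows.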

📚 Examples

Small Dataset (In-Memory)

import pandas as pd
from pysuricata import profile

# Load Iris dataset
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv"
df = pd.read_csv(url)

# Generate report
report = profile(df)
report.save_html("iris_report.html")

Large Dataset (Streaming)

Process datasets larger than RAM with constant memory usage:

from pysuricata import profile, ReportConfig
import pandas as pd

def read_large_dataset():
    """Generator yielding chunks"""
    for i in range(100):
        yield pd.read_parquet(f"data/part-{i}.parquet")

# Configure for large data
config = ReportConfig()
config.compute.chunk_size = 250_000
config.compute.random_seed = 42

# Profile in bounded memory
report = profile(read_large_dataset(), config=config)
report.save_html("large_dataset_report.html")

Polars Support

import polars as pl
from pysuricata import profile

# Works natively with polars
df = pl.read_csv("data.csv")
report = profile(df)
report.save_html("polars_report.html")

# Also supports LazyFrame
lf = pl.scan_csv("large_file.csv").filter(pl.col("value") > 0)
report = profile(lf)

Statistics Only (No HTML)

Perfect for CI/CD data quality checks:

from pysuricata import summarize

stats = summarize(df)

# Check data quality thresholds
assert stats["dataset"]["missing_cells_pct"] < 5.0
assert stats["dataset"]["duplicate_rows_pct_est"] < 1.0

# Access per-column statistics
print(f"Mean age: {stats['columns']['age']['mean']:.1f}")
print(f"Distinct countries: {stats['columns']['country']['distinct']}")

Reproducible Reports

from pysuricata import profile, ReportConfig

# Set random seed for deterministic sampling
config = ReportConfig()
config.compute.random_seed = 42

report = profile(df, config=config)
# Same report every time!
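
To show why a fixed seed makes sampled statistics repeatable, here is a minimal sketch of seeded reservoir sampling (Algorithm R). The helper `reservoir_sample` is hypothetical and only illustrates the technique; PySuricata's internal sampler may differ.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Uniform sample of up to k items from an iterable of unknown length."""
    rng = random.Random(seed)  # private generator, so the global RNG is untouched
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)  # fill the reservoir first
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item  # keep the new item with probability k / (i + 1)
    return reservoir


first = reservoir_sample(range(1_000_000), k=5, seed=42)
second = reservoir_sample(range(1_000_000), k=5, seed=42)
print(first == second)  # True: same seed and same stream give the same sample
```

The reservoir never holds more than k items, so the memory cost is independent of the stream length.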

Custom Description (Markdown Support)

from pysuricata import profile, ReportConfig

config = ReportConfig()
config.render.description = """
# Q4 2024 Customer Analysis

**Dataset**: Production customer transactions  
**Period**: October - December 2024  

## Key Findings
- Revenue up 15% YoY
- Average transaction: $87.50
"""

report = profile(df, config=config)

🧮 Algorithms & Mathematical Rigor

PySuricata uses state-of-the-art streaming algorithms:

Exact Statistics

  • Welford's algorithm - Online mean/variance (numerically stable)
  • Pébay's formulas - Parallel merging of moments (exact, mergeable)
  • Streaming correlations - Sufficient statistics for Pearson r
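
For readers unfamiliar with Welford's algorithm, the following sketch shows the one-pass mean/variance update. `RunningMoments` is a hypothetical name used for illustration and is not the library's accumulator class.

```python
class RunningMoments:
    """Welford's online update for mean and variance (illustrative sketch)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean  # deviation from the old mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)  # old-mean deviation times new-mean deviation

    @property
    def variance(self):
        """Sample variance (n - 1 in the denominator)."""
        return self.m2 / (self.n - 1) if self.n > 1 else float("nan")


rm = RunningMoments()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:
    rm.update(value)
print(rm.n, rm.mean, rm.variance)
```

Because the update never forms a large sum of squares, it avoids the cancellation errors of the naive "sum of x squared minus square of the sum" formula.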

Approximate Statistics (with guarantees)

  • KMV sketch - Distinct count (error ~2% with k=2048)
  • Misra-Gries - Top-k heavy hitters (guaranteed for freq > n/k)
  • Reservoir sampling - Uniform random sample (exact probability k/n)
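
The distinct-count bullet can be made concrete with a small K-Minimum-Values sketch. This is an illustrative implementation under stated assumptions (a 64-bit BLAKE2b hash of `repr(value)`, and the estimator (k - 1) / k-th smallest normalized hash); the class name `KMVSketch` is hypothetical and PySuricata's own sketch may be built differently. The relative standard error of this estimator is roughly 1 / sqrt(k - 2), about 2.2% at k = 2048.

```python
import hashlib
import heapq


class KMVSketch:
    """K-Minimum-Values sketch: keeps the k smallest distinct 64-bit hashes."""

    HASH_SPACE = 2**64

    def __init__(self, k=2048):
        self.k = k
        self._heap = []     # negated hashes, so -_heap[0] is the largest kept hash
        self._kept = set()  # the same hashes, for fast duplicate checks

    @staticmethod
    def _hash(value):
        digest = hashlib.blake2b(repr(value).encode("utf-8"), digest_size=8).digest()
        return int.from_bytes(digest, "big")

    def add(self, value):
        h = self._hash(value)
        if h in self._kept:
            return  # already counted
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._kept.add(h)
        elif h < -self._heap[0]:
            # Replace the largest kept hash with the new, smaller one.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._kept.discard(evicted)
            self._kept.add(h)

    def estimate(self):
        if len(self._heap) < self.k:
            return float(len(self._heap))  # fewer than k distinct hashes: exact count
        kth_smallest = -self._heap[0]
        return (self.k - 1) * self.HASH_SPACE / (kth_smallest + 1)


sketch = KMVSketch(k=2048)
for i in range(100_000):
    sketch.add(i % 50_000)  # 50,000 distinct values, each seen twice
print(round(sketch.estimate()))  # close to 50000, typically within a few percent
```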

All algorithms have proven error bounds and mathematical guarantees. See full documentation for formulas and proofs.
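
As one example of such a guarantee, the sketch below implements the Misra-Gries summary with k - 1 counters: any item occurring more than n/k times in a stream of length n is guaranteed to survive, and each reported count undershoots the true count by at most n/k. The function name `misra_gries` is illustrative only.

```python
import random


def misra_gries(stream, k):
    """Heavy-hitter summary using at most k - 1 counters."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement every counter and drop those that hit zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


stream = ["a"] * 50 + ["b"] * 30 + [f"rare-{i}" for i in range(20)]
random.Random(0).shuffle(stream)  # the guarantee holds for any ordering
summary = misra_gries(stream, k=4)  # n = 100, so the threshold is n / k = 25
print("a" in summary, "b" in summary)  # True True
```

The counts in `summary` are lower bounds, so a second pass (or an exact counter for the surviving keys) is needed when exact frequencies matter.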

🎨 Report Highlights

  • Self-contained: Single HTML file, no external dependencies
  • Beautiful visualizations: Inline SVG charts and histograms
  • Responsive design: Works on desktop and mobile
  • Dark mode: Toggle between light and dark themes
  • Professional styling: Clean, modern interface
  • Shareable: Email, cloud storage, or static hosting

💡 Use Cases

Data Science & ML

  • EDA - Understand distributions, correlations, missing patterns
  • Feature engineering - Identify high-cardinality, constant columns
  • Data validation - Check quality before training
  • Reproducibility - Generate consistent reports with seeds

Data Engineering

  • Pipeline monitoring - Track data quality over time
  • CI/CD checks - Assert quality thresholds
  • Documentation - Auto-generate data dictionaries
  • Debugging - Quickly profile large production datasets

Business Analytics

  • Dashboard generation - Automated reporting
  • Data documentation - Share with stakeholders
  • Quality assurance - Catch issues early
  • Compliance - Document data characteristics

โš™๏ธ Configuration

Highly customizable via ReportConfig:

from pysuricata import profile, ReportConfig

config = ReportConfig()

# Processing
config.compute.chunk_size = 200_000  # Rows per chunk
config.compute.numeric_sample_size = 20_000  # Sample for quantiles
config.compute.random_seed = 42  # Reproducibility

# Analysis
config.compute.compute_correlations = True  # Enable correlations
config.compute.corr_threshold = 0.5  # Min |r| to show
config.compute.top_k_size = 50  # Top values to track

# Rendering
config.render.title = "My Analysis Report"
config.render.description = "Custom markdown description"
config.render.include_sample = True
config.render.sample_rows = 10

report = profile(df, config=config)

See Configuration Guide for all options.

🔬 Statistical Methods

Numeric Variables

  • Central tendency: mean, median
  • Dispersion: variance, std, IQR, MAD, CV
  • Shape: skewness, kurtosis
  • Quantiles: P1, P5, Q1, Q2, Q3, P95, P99
  • Outliers: IQR fences, z-score, MAD-based
  • Distribution: histogram (Freedman-Diaconis binning)
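
The outlier and binning rules listed above can be sketched with the standard library alone. The helpers below are hypothetical and rest on assumptions not stated in this README: the "inclusive" quantile method of `statistics.quantiles`, Tukey's 1.5 × IQR fences, and the 0.6745 scaling constant for MAD-based modified z-scores. PySuricata may use different quantile estimators or thresholds.

```python
import statistics


def iqr_fences(values, k=1.5):
    """Tukey fences: points outside [Q1 - k*IQR, Q3 + k*IQR] are flagged."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def modified_z_scores(values):
    """MAD-based scores; |score| > 3.5 is a common outlier cut-off."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        return [0.0 for _ in values]  # degenerate case: no spread around the median
    return [0.6745 * (v - med) / mad for v in values]


def freedman_diaconis_width(values):
    """Histogram bin width: 2 * IQR / n^(1/3)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return 2 * (q3 - q1) / len(values) ** (1 / 3)


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(iqr_fences(data))  # (-3.5, 14.5), so 100 lies outside the fences
print(max(modified_z_scores(data)) > 3.5)  # True
```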

Categorical Variables

  • Frequency table (top-k via Misra-Gries)
  • Distinct count (KMV sketch)
  • Shannon entropy, Gini impurity
  • String length statistics
  • Case/trim variant detection
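
Shannon entropy and Gini impurity both follow directly from a frequency table. The sketch below assumes entropy is measured in bits (log base 2); the function names are illustrative, and the library may report entropy in a different base.

```python
import math
from collections import Counter


def shannon_entropy(counts):
    """Entropy in bits: -sum(p * log2(p)) over categories with non-zero count."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def gini_impurity(counts):
    """Gini impurity: 1 - sum(p^2)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


freq = Counter(["a", "a", "b", "c"])  # counts: a=2, b=1, c=1
print(shannon_entropy(freq.values()))  # 1.5
print(gini_impurity(freq.values()))    # 0.625
```

Both measures are 0 for a constant column and grow as the values spread more evenly across categories.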

DateTime Variables

  • Temporal range and span
  • Hour distribution (0-23)
  • Day-of-week distribution
  • Month distribution
  • Monotonicity coefficient
  • Timeline visualization

Boolean Variables

  • True/false counts and percentages
  • Shannon entropy
  • Balance score
  • Imbalance ratio

Full mathematical formulas and derivations in Statistical Methods.

📈 Scalability

PySuricata scales linearly with dataset size:

| Dataset Size | Processing Time | Peak Memory | Throughput |
|---|---|---|---|
| 10K rows | ~3s | 20 MB | ~3,000 rows/s |
| 100K rows | ~13s | 30 MB | ~8,000 rows/s |
| 1M rows | ~3 min | 50 MB | ~5,500 rows/s |
| 10M rows | ~30 min | 50 MB | ~5,500 rows/s |
| 100M rows | ~5 hours | 50 MB | ~5,500 rows/s |

Peak memory levels off at about 50 MB from 1M rows upward, regardless of dataset size! 🎉

Benchmarks measured on Apple Silicon with Python 3.13. Actual times vary by hardware and data complexity.

🤝 Why "Suricata"?

Inspired by suricatas (meerkats) - small, vigilant animals that work cooperatively to survive in harsh desert environments:

  • 👀 Watchful - Always scanning for patterns (like data analysis)
  • 🤝 Cooperative - Parallel/distributed processing
  • 🏜️ Efficient - Thrive with limited resources (bounded memory)
  • ⚡ Quick - Fast reactions (streaming algorithms)
  • 🔍 Pattern recognition - Identify important signals

Learn more about why suricatas inspired this library.

📖 Documentation

Comprehensive documentation is available, with mathematical formulas, algorithm details, and examples.

🛠️ Advanced Features

Distributed Processing

Accumulators are mergeable - compute on multiple machines and combine:

from pysuricata.accumulators import NumericAccumulator

# Worker 1
acc1 = NumericAccumulator("amount")
acc1.update(data_partition_1)

# Worker 2  
acc2 = NumericAccumulator("amount")
acc2.update(data_partition_2)

# Merge on coordinator (exact, no approximation)
acc1.merge(acc2)
final_stats = acc1.finalize()
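
To see why merging partial results gives the same moments as a single pass (up to floating-point rounding), the sketch below combines two partial summaries with the pairwise update formulas for count, mean, and sum of squared deviations. `Moments` is a hypothetical stand-in written for this illustration; it is not the `NumericAccumulator` class shown above, whose internals are not described here.

```python
from dataclasses import dataclass


@dataclass
class Moments:
    """Hypothetical mergeable summary: count, mean, sum of squared deviations."""

    n: int = 0
    mean: float = 0.0
    m2: float = 0.0

    def update(self, x):
        # Welford's single-value update.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def merge(self, other):
        # Pairwise combination of two partial summaries.
        if other.n == 0:
            return self
        if self.n == 0:
            self.n, self.mean, self.m2 = other.n, other.mean, other.m2
            return self
        n = self.n + other.n
        delta = other.mean - self.mean
        self.m2 += other.m2 + delta * delta * self.n * other.n / n
        self.mean += delta * other.n / n
        self.n = n
        return self


worker_1, worker_2 = Moments(), Moments()
for x in [2, 4, 4]:
    worker_1.update(x)
for x in [4, 5, 5, 7, 9]:
    worker_2.update(x)

merged = worker_1.merge(worker_2)
print(merged.n, round(merged.mean, 6), round(merged.m2, 6))  # 8 5.0 32.0
```

Each worker ships only three numbers to the coordinator, so the merge cost does not depend on partition size.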

CI/CD Integration

from pysuricata import summarize

def validate_data_quality(df):
    stats = summarize(df)  # Fast, stats-only
    
    assert stats["dataset"]["missing_cells_pct"] < 5.0, "Too many missing"
    assert stats["dataset"]["duplicate_rows_pct_est"] < 1.0, "Too many duplicates"
    
    print("✓ Data quality checks passed")

# In your pipeline
validate_data_quality(df)

Jupyter Notebooks

from pysuricata import profile

report = profile(df)
report  # Auto-displays inline

# Or with custom size
report.display_in_notebook(height="800px")

🎓 Academic Use

If you use PySuricata in academic research, please reference the algorithms:

Streaming moments:

  • Welford, B.P. (1962), "Note on a Method for Calculating Corrected Sums of Squares and Products", Technometrics
  • Pébay, P. (2008), "Formulas for Robust, One-Pass Parallel Computation of Covariances and Arbitrary-Order Statistical Moments", Sandia Report

Sketch algorithms:

  • Bar-Yossef, Z. et al. (2002), "Counting Distinct Elements in a Data Stream", RANDOM
  • Misra, J., Gries, D. (1982), "Finding repeated elements", Science of Computer Programming

See References for complete citations.

๐Ÿ—บ๏ธ Roadmap

Current Version (0.0.11)

  • โœ… Streaming architecture
  • โœ… 4 variable types (numeric, categorical, datetime, boolean)
  • โœ… Streaming correlations
  • โœ… Missing value analysis
  • โœ… Polars support
  • โœ… Comprehensive documentation

Planned Features

  • ๐Ÿ”œ Spearman rank correlation (monotonic relationships)
  • ๐Ÿ”œ Little's MCAR test (missing data mechanism)
  • ๐Ÿ”œ Chi-square uniformity test (categorical distribution)
  • ๐Ÿ”œ Seasonality detection (autocorrelation for datetime)
  • ๐Ÿ”œ Gap analysis (missing time periods)
  • ๐Ÿ”œ Profile comparison (compare two datasets)
  • ๐Ÿ”œ Export to JSON/CSV (structured statistics)
  • ๐Ÿ”œ CLI tool (command-line interface)
  • ๐Ÿ”œ Dask integration (native distributed support)

🤝 Contributing

We welcome contributions! See our Contributing Guide to get started.

Ways to contribute:

  • 🐛 Report bugs
  • 💡 Suggest features
  • 📝 Improve documentation
  • 🧪 Add tests
  • 🔧 Submit pull requests
  • 💬 Help others in Discussions

Development Setup

# Clone repository
git clone https://github.com/alvarodiez20/pysuricata.git
cd pysuricata

# Install dependencies
uv sync --dev

# Run tests
uv run pytest

# Build documentation
uv run mkdocs serve

📊 Project Stats

  • Lines of code: ~15,000
  • Test coverage: 90%+
  • Documentation pages: 25+
  • Supported Python versions: 3.9, 3.10, 3.11, 3.12, 3.13
  • Active development: Regular releases

🆘 Support & Community

📜 License

MIT License - see LICENSE file for details.

🙏 Acknowledgments

Built with inspiration from:

  • Suricatas (meerkats) - Vigilant, cooperative, efficient 🦦
  • Welford & Pébay - Streaming moments algorithms
  • Bar-Yossef, Misra & Gries - Sketch algorithms
  • Open-source community

⭐ Star History

If you find PySuricata useful, please star the repository!



Ready to analyze like a suricata?

📚 Read the Docs • 🐛 Report Bug • 💬 Discussions

"In the Kalahari of big data, be a suricata - vigilant, efficient, and always ready to dig for insights!"
