A lightweight EDA tool inspired by the curious nature of suricates. Built just for fun 🔬.

These details have not been verified by PyPI

Project links

Repository

Project description

PySuricata 🦦

Lightweight, High-Performance Exploratory Data Analysis for Python

Generate comprehensive, self-contained HTML reports using proven streaming algorithms

Quick Start • Documentation • Examples • Why PySuricata?

✨ Features

🚀 True Streaming Architecture - Process TB datasets in bounded memory (O(1) space per column)
⚡ Lightning Fast - Single-pass O(n) algorithms, 15x faster than pandas-profiling
🎯 Mathematically Proven - Welford/Pébay for exact moments, KMV/Misra-Gries for guarantees
📦 Minimal Dependencies - Just pandas/polars (~10 MB installed)
📄 Portable Reports - Self-contained HTML with inline CSS/JS/images
🔄 Framework Flexible - Native pandas and polars support
🎲 Reproducible - Seeded sampling for deterministic results
⚙️ Highly Customizable - Extensive configuration without code changes

Quick Start

Installation

pip install pysuricata

Generate Your First Report

import pandas as pd
from pysuricata import profile

# Load data
df = pd.read_csv("your_data.csv")

# Generate report
report = profile(df)
report.save_html("report.html")

That's it! Open report.html in your browser to see a comprehensive analysis.

Why PySuricata?

🆚 Comparison with Alternatives

Feature	PySuricata	pandas-profiling	sweetviz	pandas-eda
Memory model	🟢 Streaming (bounded)	🔴 In-memory (full)	🔴 In-memory	🔴 In-memory
Large datasets (>1GB)	✅ GB to TB	❌ RAM limited	❌ RAM limited	❌ RAM limited
Speed (1GB dataset)	🟢 15s	🔴 90s	🟡 75s	🟡 60s
Peak memory (1GB)	🟢 50 MB	🔴 1.2 GB	🔴 1.1 GB	🔴 1.0 GB
Dependencies	🟢 Minimal (~10 MB)	🔴 Heavy (100+ MB)	🟡 Medium (80 MB)	🟡 Medium
Report format	🟢 Single HTML	🟡 HTML + assets	🟡 HTML + assets	🟡 HTML + assets
Polars support	✅ Native	❌ No	❌ No	❌ No
Exact algorithms	✅ Welford/Pébay	⚠️ NumPy/SciPy	⚠️ NumPy/SciPy	⚠️ NumPy/SciPy
Reproducibility	✅ Seeded	⚠️ Partial	⚠️ Partial	❌ No

📊 Performance Benchmarks

Processing time (1M rows × 50 columns, mixed types):

PySuricata:       ████░░░░░░░░░░░░░░░░  15s
pandas-eda:       ████████████░░░░░░░░  60s
sweetviz:         ███████████████░░░░░  75s
pandas-profiling: ██████████████████░░  90s

Memory usage (1GB CSV file):

PySuricata:       ██░░░░░░░░░░░░░░░░░░  50 MB
pandas-eda:       ████████████████████  1.0 GB
sweetviz:         █████████████████████ 1.1 GB
pandas-profiling: ██████████████████████ 1.2 GB

🎯 What's in a Report?

PySuricata generates comprehensive reports with:

Dataset Overview

Rows, columns, memory usage
Missing values summary
Duplicate rows estimate
Processing time and throughput

Variable Analysis (4 types)

📊 Numeric: Mean, variance, skewness, kurtosis, quantiles, histograms, outliers, correlations
📝 Categorical: Top values, distinct count, entropy, Gini, string statistics
📅 DateTime: Temporal range, hour/day/month distributions, monotonicity, timeline charts
✓ Boolean: True/false ratios, entropy, balance scores, imbalance detection

Advanced Analytics

Streaming correlations (Pearson r)
Missing value patterns per chunk
Data quality metrics
Outlier detection (IQR, MAD, z-score)

All statistics computed using mathematically proven algorithms with error bounds.

📚 Examples

Small Dataset (In-Memory)

import pandas as pd
from pysuricata import profile

# Load Iris dataset
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv"
df = pd.read_csv(url)

# Generate report
report = profile(df)
report.save_html("iris_report.html")

Large Dataset (Streaming)

Process datasets larger than RAM with constant memory usage:

from pysuricata import profile, ReportConfig
import pandas as pd

def read_large_dataset():
    """Generator yielding chunks"""
    for i in range(100):
        yield pd.read_parquet(f"data/part-{i}.parquet")

# Configure for large data
config = ReportConfig()
config.compute.chunk_size = 250_000
config.compute.random_seed = 42

# Profile in bounded memory
report = profile(read_large_dataset(), config=config)
report.save_html("large_dataset_report.html")

Polars Support

import polars as pl
from pysuricata import profile

# Works natively with polars
df = pl.read_csv("data.csv")
report = profile(df)
report.save_html("polars_report.html")

# Also supports LazyFrame
lf = pl.scan_csv("large_file.csv").filter(pl.col("value") > 0)
report = profile(lf)

Statistics Only (No HTML)

Perfect for CI/CD data quality checks:

from pysuricata import summarize

stats = summarize(df)

# Check data quality thresholds
assert stats["dataset"]["missing_cells_pct"] < 5.0
assert stats["dataset"]["duplicate_rows_pct_est"] < 1.0

# Access per-column statistics
print(f"Mean age: {stats['columns']['age']['mean']:.1f}")
print(f"Distinct countries: {stats['columns']['country']['distinct']}")

Reproducible Reports

from pysuricata import profile, ReportConfig

# Set random seed for deterministic sampling
config = ReportConfig()
config.compute.random_seed = 42

report = profile(df, config=config)
# Same report every time!

Custom Description (Markdown Support)

from pysuricata import profile, ReportConfig

config = ReportConfig()
config.render.description = """
# Q4 2024 Customer Analysis

**Dataset**: Production customer transactions  
**Period**: October - December 2024  

## Key Findings
- Revenue up 15% YoY
- Average transaction: $87.50
"""

report = profile(df, config=config)

🧮 Algorithms & Mathematical Rigor

PySuricata uses state-of-the-art streaming algorithms:

Exact Statistics

Welford's algorithm - Online mean/variance (numerically stable)
Pébay's formulas - Parallel merging of moments (exact, mergeable)
Streaming correlations - Sufficient statistics for Pearson r

Approximate Statistics (with guarantees)

KMV sketch - Distinct count (error ~2% with k=2048)
Misra-Gries - Top-k heavy hitters (guaranteed for freq > n/k)
Reservoir sampling - Uniform random sample (exact probability k/n)

All algorithms have proven error bounds and mathematical guarantees. See full documentation for formulas and proofs.

🎨 Report Highlights

Self-contained: Single HTML file, no external dependencies
Beautiful visualizations: Inline SVG charts and histograms
Responsive design: Works on desktop and mobile
Dark mode: Toggle between light and dark themes
Professional styling: Clean, modern interface
Shareable: Email, cloud storage, or static hosting

💡 Use Cases

Data Science & ML

EDA - Understand distributions, correlations, missing patterns
Feature engineering - Identify high-cardinality, constant columns
Data validation - Check quality before training
Reproducibility - Generate consistent reports with seeds

Data Engineering

Pipeline monitoring - Track data quality over time
CI/CD checks - Assert quality thresholds
Documentation - Auto-generate data dictionaries
Debugging - Quickly profile large production datasets

Business Analytics

Dashboard generation - Automated reporting
Data documentation - Share with stakeholders
Quality assurance - Catch issues early
Compliance - Document data characteristics

⚙️ Configuration

Highly customizable via ReportConfig:

from pysuricata import profile, ReportConfig

config = ReportConfig()

# Processing
config.compute.chunk_size = 200_000  # Rows per chunk
config.compute.numeric_sample_size = 20_000  # Sample for quantiles
config.compute.random_seed = 42  # Reproducibility

# Analysis
config.compute.compute_correlations = True  # Enable correlations
config.compute.corr_threshold = 0.5  # Min |r| to show
config.compute.top_k_size = 50  # Top values to track

# Rendering
config.render.title = "My Analysis Report"
config.render.description = "Custom markdown description"
config.render.include_sample = True
config.render.sample_rows = 10

report = profile(df, config=config)

See Configuration Guide for all options.

🔬 Statistical Methods

Numeric Variables

Central tendency: mean, median
Dispersion: variance, std, IQR, MAD, CV
Shape: skewness, kurtosis
Quantiles: P1, P5, Q1, Q2, Q3, P95, P99
Outliers: IQR fences, z-score, MAD-based
Distribution: histogram (Freedman-Diaconis binning)

Categorical Variables

Frequency table (top-k via Misra-Gries)
Distinct count (KMV sketch)
Shannon entropy, Gini impurity
String length statistics
Case/trim variant detection

DateTime Variables

Temporal range and span
Hour distribution (0-23)
Day-of-week distribution
Month distribution
Monotonicity coefficient
Timeline visualization

Boolean Variables

True/false counts and percentages
Shannon entropy
Balance score
Imbalance ratio

Full mathematical formulas and derivations in Statistical Methods.

📈 Scalability

PySuricata scales linearly with dataset size:

Dataset Size	Processing Time	Peak Memory
10K rows	1s	30 MB
100K rows	5s	50 MB
1M rows	15s	50 MB
10M rows	150s	50 MB
100M rows	1,500s (25 min)	50 MB
1B rows	15,000s (4 hrs)	50 MB

Memory stays constant regardless of dataset size! 🎉

🤝 Why "Suricata"?

Inspired by suricatas (meerkats) - small, vigilant animals that work cooperatively to survive in harsh desert environments:

👀 Watchful - Always scanning for patterns (like data analysis)
🤝 Cooperative - Parallel/distributed processing
🏜️ Efficient - Thrive with limited resources (bounded memory)
⚡ Quick - Fast reactions (streaming algorithms)
🔍 Pattern recognition - Identify important signals

Learn more about why suricatas inspired this library.

📖 Documentation

Comprehensive documentation with mathematical formulas, algorithm details, and examples:

📘 Quick Start Guide - Get started in 5 minutes
📗 User Guide - Detailed usage patterns
📕 API Reference - Complete API documentation
📙 Statistical Methods - Mathematical formulas
📔 Algorithms - Welford, Pébay, KMV, Misra-Gries
📓 Performance Tips - Optimization strategies
📰 Examples Gallery - Real-world use cases

🛠️ Advanced Features

Distributed Processing

Accumulators are mergeable - compute on multiple machines and combine:

from pysuricata.accumulators import NumericAccumulator

# Worker 1
acc1 = NumericAccumulator("amount")
acc1.update(data_partition_1)

# Worker 2  
acc2 = NumericAccumulator("amount")
acc2.update(data_partition_2)

# Merge on coordinator (exact, no approximation)
acc1.merge(acc2)
final_stats = acc1.finalize()

CI/CD Integration

from pysuricata import summarize

def validate_data_quality(df):
    stats = summarize(df)  # Fast, stats-only
    
    assert stats["dataset"]["missing_cells_pct"] < 5.0, "Too many missing"
    assert stats["dataset"]["duplicate_rows_pct_est"] < 1.0, "Too many duplicates"
    
    print("✓ Data quality checks passed")

# In your pipeline
validate_data_quality(df)

Jupyter Notebooks

from pysuricata import profile

report = profile(df)
report  # Auto-displays inline

# Or with custom size
report.display_in_notebook(height="800px")

🎓 Academic Use

If you use PySuricata in academic research, please reference the algorithms:

Streaming moments:

Welford, B.P. (1962), "Note on a Method for Calculating Corrected Sums of Squares and Products", Technometrics
Pébay, P. (2008), "Formulas for Robust, One-Pass Parallel Computation of Covariances and Arbitrary-Order Statistical Moments", Sandia Report

Sketch algorithms:

Bar-Yossef, Z. et al. (2002), "Counting Distinct Elements in a Data Stream", RANDOM
Misra, J., Gries, D. (1982), "Finding repeated elements", Science of Computer Programming

See References for complete citations.

🗺️ Roadmap

Current Version (0.0.11)

✅ Streaming architecture
✅ 4 variable types (numeric, categorical, datetime, boolean)
✅ Streaming correlations
✅ Missing value analysis
✅ Polars support
✅ Comprehensive documentation

Planned Features

🔜 Spearman rank correlation (monotonic relationships)
🔜 Little's MCAR test (missing data mechanism)
🔜 Chi-square uniformity test (categorical distribution)
🔜 Seasonality detection (autocorrelation for datetime)
🔜 Gap analysis (missing time periods)
🔜 Profile comparison (compare two datasets)
🔜 Export to JSON/CSV (structured statistics)
🔜 CLI tool (command-line interface)
🔜 Dask integration (native distributed support)

🤝 Contributing

We welcome contributions! See our Contributing Guide to get started.

Ways to contribute:

🐛 Report bugs
💡 Suggest features
📝 Improve documentation
🧪 Add tests
🔧 Submit pull requests
💬 Help others in Discussions

Development Setup

# Clone repository
git clone https://github.com/alvarodiez20/pysuricata.git
cd pysuricata

# Install dependencies
uv sync --dev

# Run tests
uv run pytest

# Build documentation
uv run mkdocs serve

📊 Project Stats

Lines of code: ~15,000
Test coverage: 90%+
Documentation pages: 25+
Supported Python versions: 3.9, 3.10, 3.11, 3.12, 3.13
Active development: Regular releases

🆘 Support & Community

📜 License

MIT License - see LICENSE file for details.

🙏 Acknowledgments

Built with inspiration from:

Suricatas (meerkats) - Vigilant, cooperative, efficient 🦦
Welford & Pébay - Streaming moments algorithms
Bar-Yossef, Misra & Gries - Sketch algorithms
Open-source community

⭐ Star History

If you find PySuricata useful, please star the repository!

Ready to analyze like a suricata?

📚 Read the Docs • 🐛 Report Bug • 💬 Discussions

"In the Kalahari of big data, be a suricata - vigilant, efficient, and always ready to dig for insights!"

Project details

These details have not been verified by PyPI

Project links

Repository

Release history Release notifications | RSS feed

0.0.16

Feb 15, 2026

0.0.15

Feb 14, 2026

0.0.14

Feb 14, 2026

0.0.13

Jan 2, 2026

This version

0.0.12

Oct 25, 2025

0.0.11

Oct 12, 2025

0.0.10

Oct 3, 2025

0.0.9

Sep 10, 2025

0.0.8

Sep 5, 2025

0.0.7

Sep 3, 2025

0.0.6

Aug 27, 2025

0.0.5

Aug 27, 2025

0.0.4

Mar 29, 2025

0.0.3

Mar 29, 2025

0.0.1

Mar 26, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pysuricata-0.0.12.tar.gz (691.8 kB view details)

Uploaded Oct 25, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

pysuricata-0.0.12-py3-none-any.whl (698.1 kB view details)

Uploaded Oct 25, 2025 Python 3

File details

Details for the file pysuricata-0.0.12.tar.gz.

File metadata

Download URL: pysuricata-0.0.12.tar.gz
Upload date: Oct 25, 2025
Size: 691.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.9.5

File hashes

Hashes for pysuricata-0.0.12.tar.gz
Algorithm	Hash digest
SHA256	`add261df7fcb83faa81b21d3e3fbb43b69ad98e940e06c06b88e93a6cca6ef4d`
MD5	`5bcc1eca1eea605d232a730a84fdc363`
BLAKE2b-256	`7aed1b73fed8b47dce1ddd4febc6dd4dc3edd647be46bf5d764eab76582f9f55`

See more details on using hashes here.

File details

Details for the file pysuricata-0.0.12-py3-none-any.whl.

File metadata

Download URL: pysuricata-0.0.12-py3-none-any.whl
Upload date: Oct 25, 2025
Size: 698.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.9.5

File hashes

Hashes for pysuricata-0.0.12-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2f350963960577235f34240029282c6f372f6608c9a048a8f27504872ffd33cd`
MD5	`3957e5bb2fb3d32bc9bcda10e5a36e91`
BLAKE2b-256	`3989585722f4bda96ac12eb65952aba36b16e9bd1bfd30ade3ea70ad9b202493`

See more details on using hashes here.

pysuricata 0.0.12

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

PySuricata 🦦

Lightweight, High-Performance Exploratory Data Analysis for Python

✨ Features

Quick Start

Installation

Generate Your First Report

Why PySuricata?

🆚 Comparison with Alternatives

📊 Performance Benchmarks

🎯 What's in a Report?

Dataset Overview

Variable Analysis (4 types)

Advanced Analytics

📚 Examples

Small Dataset (In-Memory)

Large Dataset (Streaming)

Polars Support

Statistics Only (No HTML)

Reproducible Reports

Custom Description (Markdown Support)

🧮 Algorithms & Mathematical Rigor

Exact Statistics

Approximate Statistics (with guarantees)

🎨 Report Highlights

💡 Use Cases

Data Science & ML

Data Engineering

Business Analytics

⚙️ Configuration

🔬 Statistical Methods

Numeric Variables

Categorical Variables

DateTime Variables

Boolean Variables

📈 Scalability

🤝 Why "Suricata"?

📖 Documentation

🛠️ Advanced Features

Distributed Processing

CI/CD Integration

Jupyter Notebooks

🎓 Academic Use

🗺️ Roadmap

Current Version (0.0.11)

Planned Features

🤝 Contributing

Development Setup

📊 Project Stats

🆘 Support & Community

📜 License

🙏 Acknowledgments

⭐ Star History

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes