A lightweight EDA tool inspired by the curious nature of suricates. Built just for fun 🔬.
PySuricata
Exploratory Data Analysis for Python, Built on Streaming Algorithms
What It Does
PySuricata generates self-contained HTML reports from pandas or polars DataFrames. Reports include per-column statistics, histograms, correlation chips, missing value analysis, and outlier detection.
Data is processed in chunks using streaming algorithms, so memory usage stays bounded regardless of dataset size.
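To illustrate why chunked processing keeps memory bounded, here is a minimal sketch of a running mean and variance that holds only three numbers no matter how many values it sees. The names `RunningMoments` and `moments_from_chunks` are hypothetical and are not part of the PySuricata API; the update and merge formulas are the standard Welford and pairwise-merge forms, not necessarily the exact code the library uses.

```python
from dataclasses import dataclass


@dataclass
class RunningMoments:
    """Running count, mean, and sum of squared deviations (M2)."""

    n: int = 0
    mean: float = 0.0
    m2: float = 0.0

    def update(self, x: float) -> None:
        # Welford's update: O(1) time and O(1) memory per value.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def merge(self, other: "RunningMoments") -> None:
        # Combine two partial results without revisiting the raw values.
        if other.n == 0:
            return
        total = self.n + other.n
        delta = other.mean - self.mean
        self.m2 += other.m2 + delta * delta * self.n * other.n / total
        self.mean += delta * other.n / total
        self.n = total

    @property
    def variance(self) -> float:
        # Sample variance (n - 1 denominator); 0.0 when undefined.
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


def moments_from_chunks(chunks):
    """Fold an iterable of chunks into one RunningMoments accumulator."""
    acc = RunningMoments()
    for chunk in chunks:
        part = RunningMoments()
        for x in chunk:
            part.update(x)
        acc.merge(part)  # the chunk itself can be discarded after this
    return acc


if __name__ == "__main__":
    m = moments_from_chunks([[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]])
    print(m.n, m.mean, m.variance)
```

Because each chunk is reduced to a small accumulator before the next one is read, peak memory depends on the chunk size rather than on the total number of rows.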
Quick Start
Installation
pip install pysuricata
With polars support:
pip install pysuricata[polars]
Generate a Report
import pandas as pd
from pysuricata import profile
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(url)
report = profile(df)
report.save_html("titanic_report.html")
Features
- Streaming architecture — Data is processed in configurable chunks, keeping memory bounded. Useful for datasets that don't fit in RAM.
- Pandas and Polars — Works natively with pandas.DataFrame, polars.DataFrame, and polars.LazyFrame.
- Self-contained HTML — Single file with inline CSS, JS, and SVG charts. No external assets needed.
- Configurable — Control chunk sizes, sample sizes, sketch parameters, and correlation thresholds via ReportConfig.
- Reproducible — Seeded random sampling produces deterministic results across runs.
- CLI tool — Profile datasets from the command line.
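As a rough illustration of how seeded sampling can be both memory-bounded and reproducible, the sketch below implements classic reservoir sampling (Algorithm R) with a private random generator. The function name `reservoir_sample` and its `seed` parameter are assumptions made for this example, not PySuricata identifiers.

```python
import random


def reservoir_sample(stream, s, seed=42):
    """Keep a uniform random sample of at most s items from an iterable."""
    rng = random.Random(seed)  # private generator; global random state is untouched
    sample = []
    for i, item in enumerate(stream):
        if i < s:
            # Fill the reservoir with the first s items.
            sample.append(item)
        else:
            # Item i replaces a slot with probability s / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < s:
                sample[j] = item
    return sample


if __name__ == "__main__":
    first = reservoir_sample(range(10_000), s=5, seed=7)
    second = reservoir_sample(range(10_000), s=5, seed=7)
    print(first == second)  # same seed and same input order give the same sample
```

Determinism here depends on both the seed and the order in which values arrive, so reordering the input chunks would change the sample.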
How It Works
PySuricata uses well-known streaming algorithms from the academic literature:
| Algorithm | Purpose | Time | Space |
|---|---|---|---|
| Welford/Pébay | Exact mean, variance, skewness, kurtosis | O(1) per value | O(1) |
| KMV sketch | Distinct count estimation (~2.2% error) | O(log k) per value | O(k) |
| Misra-Gries | Top-k frequent values | O(1) amortized | O(k) |
| Reservoir sampling | Uniform random sample for quantiles | O(1) per value | O(s) |
k = sketch size (default 1024), s = sample size (default 10 000)
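The Misra-Gries row can be made concrete with a short sketch. This is the textbook variant that keeps at most k - 1 counters and is guaranteed to retain every value that occurs more than n/k times in a stream of n items. The function name `misra_gries` is hypothetical, and PySuricata's own implementation may differ in its details.

```python
def misra_gries(stream, k):
    """Return candidate heavy hitters using at most k - 1 counters."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement every counter and drop those at zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


if __name__ == "__main__":
    stream = ["a"] * 50 + ["b"] * 30 + [f"rare-{i}" for i in range(20)]
    # The stored counts are lower bounds on the true frequencies.
    print(misra_gries(stream, k=3))
```

The returned counts underestimate the true frequencies by at most n/k, so a second pass (or a sample) is needed if exact counts matter.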
All statistics are computed in a single pass over the data.
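For the distinct-count row, a k-minimum-values (KMV) estimator can be sketched as follows. It hashes each value to a number between 0 and 1, keeps only the k smallest hashes, and estimates the number of distinct values as (k - 1) divided by the k-th smallest hash. The helper names `_hash01` and `kmv_distinct`, and the choice of BLAKE2b as the hash, are assumptions for illustration only.

```python
import hashlib
import heapq


def _hash01(value):
    """Map a value to a float in [0, 1] via a stable 64-bit hash."""
    digest = hashlib.blake2b(repr(value).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big") / 2**64


def kmv_distinct(stream, k=1024):
    """Estimate the number of distinct values using the k smallest hashes."""
    heap = []        # max-heap via negated values; holds the k smallest hashes
    members = set()  # hashes currently in the heap, used to skip duplicates
    for value in stream:
        h = _hash01(value)
        if h in members:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            # New hash is smaller than the current k-th smallest: swap it in.
            evicted = -heapq.heappushpop(heap, -h)
            members.discard(evicted)
            members.add(h)
    if len(heap) < k:
        return float(len(heap))  # fewer than k distinct hashes: count is exact
    return (k - 1) / -heap[0]    # -heap[0] is the k-th smallest hash


if __name__ == "__main__":
    values = (i % 50_000 for i in range(200_000))
    print(round(kmv_distinct(values, k=1024)))
```

Memory stays at O(k) regardless of stream length, and the relative error shrinks roughly in proportion to 1/sqrt(k) as the sketch grows.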
What's in a Report
Each column is analyzed based on its type:
- Numeric — Mean, variance, skewness, kurtosis, quantiles, histogram, outlier detection (IQR, MAD, z-score), correlations
- Categorical — Top values, distinct count, entropy, Gini impurity, string length statistics
- DateTime — Temporal range, hour/day/month distributions, monotonicity detection
- Boolean — True/false ratios, entropy, balance score
Plus dataset-level metrics: row/column counts, memory usage, missing value percentages, and duplicate row estimates.
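To make two of the categorical metrics concrete, the sketch below computes Shannon entropy and Gini impurity from value counts. It assumes entropy is measured in bits (base-2 logarithm) and uses exact counts for clarity; the report's own units and counting strategy may differ. The function name `entropy_and_gini` is hypothetical.

```python
import math
from collections import Counter


def entropy_and_gini(values):
    """Return (Shannon entropy in bits, Gini impurity) for a sequence of labels."""
    counts = Counter(values)
    n = sum(counts.values())
    if n == 0:
        return 0.0, 0.0  # convention chosen here for empty input
    probs = [c / n for c in counts.values()]
    entropy = -sum(p * math.log2(p) for p in probs)
    gini = 1.0 - sum(p * p for p in probs)
    return entropy, gini


if __name__ == "__main__":
    # Two equally likely categories: 1 bit of entropy, impurity 0.5.
    print(entropy_and_gini(["a", "a", "b", "b"]))
```

Both metrics are 0 for a column with a single value and grow as the values spread more evenly across categories.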
Streaming Large Datasets
Process datasets larger than RAM by passing a generator:
import pandas as pd
from pysuricata import profile
def read_in_chunks():
    for i in range(100):
        yield pd.read_parquet(f"data/part-{i}.parquet")
report = profile(read_in_chunks())
report.save_html("large_report.html")
Statistics Only (No HTML)
Use summarize() for CI/CD quality checks:
from pysuricata import summarize
# df is a DataFrame you have already loaded; the last line assumes it has an "age" column
stats = summarize(df)
assert stats["dataset"]["missing_cells_pct"] < 5.0
assert stats["dataset"]["duplicate_rows_pct_est"] < 1.0
print(f"Mean age: {stats['columns']['age']['mean']:.1f}")
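Bare assertions stop at the first failure, which can hide other problems in a CI log. One possible alternative is a small helper that collects every failed check. The sketch below works on a plain dictionary with the two `dataset` keys shown above; the function name `quality_failures` and its threshold parameters are hypothetical and not part of PySuricata.

```python
def quality_failures(stats, max_missing_pct=5.0, max_duplicate_pct=1.0):
    """Return human-readable failures; an empty list means all checks passed."""
    dataset = stats["dataset"]
    failures = []
    missing = dataset["missing_cells_pct"]
    if missing >= max_missing_pct:
        failures.append(f"missing cells {missing:.2f}% >= {max_missing_pct:.2f}%")
    duplicates = dataset["duplicate_rows_pct_est"]
    if duplicates >= max_duplicate_pct:
        failures.append(
            f"estimated duplicate rows {duplicates:.2f}% >= {max_duplicate_pct:.2f}%"
        )
    return failures


if __name__ == "__main__":
    # Hand-written stand-in for the dictionary returned by summarize(df).
    example = {"dataset": {"missing_cells_pct": 7.5, "duplicate_rows_pct_est": 0.2}}
    for line in quality_failures(example):
        print(line)
```

In a CI job, the list could be built from `summarize(df)` and the process exited with a non-zero status whenever it is not empty.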
Configuration
from pysuricata import profile, ReportConfig
config = ReportConfig()
config.compute.chunk_size = 250_000
config.compute.random_seed = 42
config.compute.compute_correlations = True
config.compute.corr_threshold = 0.5
config.render.title = "My Analysis"
report = profile(df, config=config)
See the Configuration Guide for all options.
CLI
# Generate HTML report
pysuricata profile data.csv --output report.html
# Get JSON statistics
pysuricata summarize data.csv
Documentation
Contributing
Contributions are welcome. See the Contributing Guide.
git clone https://github.com/alvarodiez20/pysuricata.git
cd pysuricata
uv sync --dev
uv run pytest
License
MIT License. See LICENSE for details.
Acknowledgments
Built using algorithms from:
- Welford, B.P. (1962) — Streaming moments
- Pébay, P. (2008) — Parallel merging of moments
- Bar-Yossef, Z. et al. (2002) — KMV distinct count estimation
- Misra, J. & Gries, D. (1982) — Streaming heavy hitters
Named after suricatas (meerkats) — small, vigilant animals that work cooperatively and thrive in harsh environments with limited resources.
File details
Details for the file pysuricata-0.0.15.tar.gz.
File metadata
- Download URL: pysuricata-0.0.15.tar.gz
- Upload date:
- Size: 776.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.10.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 51b6709b3bd2c1fb3124c6f1f6d07229a5d58551ab3b86f244d05eb5e558d0bd |
| MD5 | fbcc9ea78d0ed53dc840f11f849901ea |
| BLAKE2b-256 | 7668ef9af5d0bf43779e239ff58ae56dd2917c1590efc2bab9882424e819fa81 |
File details
Details for the file pysuricata-0.0.15-py3-none-any.whl.
File metadata
- Download URL: pysuricata-0.0.15-py3-none-any.whl
- Upload date:
- Size: 772.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.10.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 36f7af4e9259a6675591c423e4abf99bfc3110a3e568020bff82315ccf6ef92e |
| MD5 | 814ea734cad0c5d61b1bb622bb135fe9 |
| BLAKE2b-256 | 79a5d3db243393151c8491573481f2351f4c8b58138683e9ba4208935ac8a78c |