A lightweight EDA tool inspired by the curious nature of suricates. Built just for fun 🔬.

PySuricata

Exploratory Data Analysis for Python, Built on Streaming Algorithms


What It Does

PySuricata generates self-contained HTML reports from pandas or polars DataFrames. Reports include per-column statistics, histograms, correlation chips, missing value analysis, and outlier detection.

Data is processed in chunks using streaming algorithms, so memory usage stays bounded regardless of dataset size.
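To make the chunked, bounded-memory idea concrete, here is a minimal sketch of how a mean and variance can be accumulated chunk by chunk and then merged. It is an illustration of the Welford update and the pairwise merge, not PySuricata's actual code; the names `Moments` and `moments_over_chunks` are made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Moments:
    """Running count, mean, and sum of squared deviations (M2)."""

    n: int = 0
    mean: float = 0.0
    m2: float = 0.0

    def update(self, x: float) -> None:
        # Welford's update: one value at a time, O(1) memory.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def merge(self, other: "Moments") -> None:
        # Pairwise merge of two partial results, so chunks can be
        # summarized independently and combined afterwards.
        if other.n == 0:
            return
        total = self.n + other.n
        delta = other.mean - self.mean
        self.m2 += other.m2 + delta * delta * self.n * other.n / total
        self.mean += delta * other.n / total
        self.n = total

    @property
    def variance(self) -> float:
        # Sample variance (n - 1 in the denominator).
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


def moments_over_chunks(chunks) -> Moments:
    """Fold an iterable of numeric chunks into a single accumulator."""
    acc = Moments()
    for chunk in chunks:
        part = Moments()
        for x in chunk:
            part.update(x)
        acc.merge(part)
    return acc
```

Only the three numbers in `Moments` survive between chunks, which is why memory does not grow with the number of rows.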

Quick Start

Installation

pip install pysuricata

With polars support:

pip install pysuricata[polars]

Generate a Report

import pandas as pd
from pysuricata import profile

url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(url)

report = profile(df)
report.save_html("titanic_report.html")

▶ See a live example report →

Features

  • Streaming architecture — Data is processed in configurable chunks, keeping memory bounded. Useful for datasets that don't fit in RAM.
  • Pandas and Polars — Works natively with pandas.DataFrame, polars.DataFrame, and polars.LazyFrame.
  • Self-contained HTML — Single file with inline CSS, JS, and SVG charts. No external assets needed.
  • Configurable — Control chunk sizes, sample sizes, sketch parameters, and correlation thresholds via ReportConfig.
  • Reproducible — Seeded random sampling produces deterministic results across runs.
  • CLI tool — Profile datasets from the command line.
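The "Reproducible" point above can be illustrated with a seeded reservoir sample. This is a sketch of the classic Algorithm R under the assumption that a private, seeded generator drives the replacement decisions; the function name `reservoir_sample` and its parameters are illustrative and are not part of the PySuricata API.

```python
import random


def reservoir_sample(stream, s, seed):
    """Keep a uniform random sample of at most s items from a stream."""
    rng = random.Random(seed)  # private generator; global random state is untouched
    sample = []
    for i, item in enumerate(stream):
        if i < s:
            # Fill the reservoir with the first s items.
            sample.append(item)
        else:
            # Replace a slot with probability s / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < s:
                sample[j] = item
    return sample
```

Calling the function twice with the same stream and the same seed returns the same sample, which is the property that makes sampled statistics repeatable across runs.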

How It Works

PySuricata uses well-known streaming algorithms from the academic literature:

| Algorithm | Purpose | Time | Space |
|---|---|---|---|
| Welford/Pébay | Exact mean, variance, skewness, kurtosis | O(1) per value | O(1) |
| KMV sketch | Distinct count estimation (~2.2% error) | O(log k) per value | O(k) |
| Misra-Gries | Top-k frequent values | O(1) amortized | O(k) |
| Reservoir sampling | Uniform random sample for quantiles | O(1) per value | O(s) |

k = sketch size (default 1024), s = sample size (default 10 000)

All statistics are computed in a single pass over the data.
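For readers unfamiliar with the two sketches in the table, the following is a rough, self-contained illustration of a KMV distinct-count sketch and the Misra-Gries summary. It assumes one common form of the KMV estimator, (k - 1) divided by the k-th smallest hash, and a BLAKE2b-based hash chosen only for this example; the names `KMV`, `_unit_hash`, and `misra_gries` are hypothetical and may differ from what the library does internally.

```python
import hashlib
import heapq


def _unit_hash(value) -> float:
    """Map a value to a roughly uniform float in the unit interval."""
    digest = hashlib.blake2b(repr(value).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big") / 2**64


class KMV:
    """Keep the k smallest hash values seen so far."""

    def __init__(self, k=1024):
        self.k = k
        self._heap = []        # negated hashes: a max-heap of the k smallest
        self._members = set()  # hashes currently held, to skip repeats

    def add(self, value):
        h = _unit_hash(value)
        if h in self._members:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # New hash beats the current k-th smallest: swap it in.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def estimate(self) -> float:
        if len(self._heap) < self.k:
            return float(len(self._heap))  # exact while under capacity
        return (self.k - 1) / -self._heap[0]


def misra_gries(stream, k):
    """Return candidate frequent items using at most k - 1 counters."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all, dropping those that hit zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

Any item that occurs in more than a 1/k fraction of the stream is guaranteed to survive in the Misra-Gries counters, although the stored counts are lower bounds rather than exact frequencies.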

What's in a Report

Each column is analyzed based on its type:

  • Numeric — Mean, variance, skewness, kurtosis, quantiles, histogram, outlier detection (IQR, MAD, z-score), correlations
  • Categorical — Top values, distinct count, entropy, Gini impurity, string length statistics
  • DateTime — Temporal range, hour/day/month distributions, monotonicity detection
  • Boolean — True/false ratios, entropy, balance score

Plus dataset-level metrics: row/column counts, memory usage, missing value percentages, and duplicate row estimates.
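Several of the metrics listed above have short textbook definitions. The sketch below shows them on plain in-memory lists using only the standard library; it is not the library's streaming implementation, and the thresholds (1.5 for IQR, 3.5 for MAD, 3.0 for z-score) are conventional defaults assumed here, not values confirmed by PySuricata.

```python
import math
import statistics
from collections import Counter


def iqr_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < lo or x > hi]


def mad_outliers(values, threshold=3.5):
    """Values whose MAD-based modified z-score exceeds the threshold."""
    med = statistics.median(values)
    mad = statistics.median(abs(x - med) for x in values)
    if mad == 0:
        return []
    return [x for x in values if 0.6745 * abs(x - med) / mad > threshold]


def zscore_outliers(values, threshold=3.0):
    """Values more than `threshold` population standard deviations from the mean."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return []
    return [x for x in values if abs(x - mean) / sd > threshold]


def entropy_bits(labels):
    """Shannon entropy of the label distribution, in bits."""
    counts = Counter(labels)
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def gini_impurity(labels):
    """Probability that two independent draws have different labels."""
    counts = Counter(labels)
    n = sum(counts.values())
    return 1.0 - sum((c / n) ** 2 for c in counts.values())
```

The three outlier rules often disagree on skewed data, which is one reason a report might show all of them side by side.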

Streaming Large Datasets

Process datasets larger than RAM by passing a generator that yields one DataFrame per chunk:

import pandas as pd
from pysuricata import profile

def read_in_chunks():
    # Yield one DataFrame per file so only one part is in memory at a time
    for i in range(100):
        yield pd.read_parquet(f"data/part-{i}.parquet")

report = profile(read_in_chunks())
report.save_html("large_report.html")
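The same pattern works for a single large CSV file. The sketch below relies on the `chunksize` argument of `pandas.read_csv`, which returns an iterator of DataFrames; the helper name `read_csv_in_chunks` and its default chunk size are assumptions made for this example.

```python
import pandas as pd


def read_csv_in_chunks(path, rows_per_chunk=250_000):
    """Yield successive DataFrames of at most rows_per_chunk rows each."""
    # With chunksize set, read_csv returns an iterator instead of one frame.
    yield from pd.read_csv(path, chunksize=rows_per_chunk)
```

The resulting generator can be handed to `profile()` in the same way as `read_in_chunks()` above, for example `profile(read_csv_in_chunks("data/big.csv"))` with a path of your own.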

Statistics Only (No HTML)

Use summarize() to get the computed statistics without rendering HTML, for example in CI/CD quality checks:

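from pysuricata import summarize

# df is a DataFrame loaded beforehand, e.g. with pd.read_csv(...)
stats = summarize(df)

# Fail the pipeline when data quality crosses the thresholds
assert stats["dataset"]["missing_cells_pct"] < 5.0
assert stats["dataset"]["duplicate_rows_pct_est"] < 1.0

# Per-column statistics are keyed by column name ("age" must exist in df)
print(f"Mean age: {stats['columns']['age']['mean']:.1f}")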
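In a CI job, a bare assert stops at the first failure and gives little context. A small helper that collects every violation can be friendlier. The sketch below assumes only the two `stats["dataset"]` keys shown in the example above; the function name `quality_violations` and its default thresholds are hypothetical.

```python
def quality_violations(stats, max_missing_pct=5.0, max_duplicate_pct=1.0):
    """Return a list of threshold violations (empty when the data passes)."""
    dataset = stats["dataset"]
    problems = []
    if dataset["missing_cells_pct"] >= max_missing_pct:
        problems.append(
            f"missing cells: {dataset['missing_cells_pct']:.2f}% "
            f"(limit {max_missing_pct}%)"
        )
    if dataset["duplicate_rows_pct_est"] >= max_duplicate_pct:
        problems.append(
            f"estimated duplicate rows: {dataset['duplicate_rows_pct_est']:.2f}% "
            f"(limit {max_duplicate_pct}%)"
        )
    return problems
```

A CI script could print the returned messages and exit with a non-zero status whenever the list is not empty.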

Configuration

from pysuricata import profile, ReportConfig

config = ReportConfig()
config.compute.chunk_size = 250_000
config.compute.random_seed = 42
config.compute.compute_correlations = True
config.compute.corr_threshold = 0.5
config.render.title = "My Analysis"

report = profile(df, config=config)

See the Configuration Guide for all options.

CLI

# Generate HTML report
pysuricata profile data.csv --output report.html

# Get JSON statistics
pysuricata summarize data.csv

Contributing

Contributions are welcome. See the Contributing Guide.

git clone https://github.com/alvarodiez20/pysuricata.git
cd pysuricata
uv sync --dev
uv run pytest

License

MIT License. See LICENSE for details.

Acknowledgments

Built using algorithms from:

  • Welford, B.P. (1962) — Streaming moments
  • Pébay, P. (2008) — Parallel merging of moments
  • Bar-Yossef, Z. et al. (2002) — KMV distinct count estimation
  • Misra, J. & Gries, D. (1982) — Streaming heavy hitters

Named after suricates (meerkats): small, vigilant animals that work cooperatively and thrive in harsh environments with limited resources.


