Dataset dissector: polars-native profiling, HTML + JSON reports, optional multi-provider LLM insight pass, live Flask viewer


saturn

Dataset dissector. Point it at a HuggingFace repo, a local file, or a slice of either and it produces a terminal summary, a self-contained HTML report, and a machine-readable JSON findings file. The stats pass is always free and deterministic; language-model insight and topic clustering are opt-in.

Generic across domains: alt-text, Bluesky firehose posts, census tables, and VQA annotations all work out of the box.

Primary use case: lukeslp/bluesky-alt-text, 404,841 image descriptions, profiled in 46 s, compared across the curated/firehose split in 18 s.

Install

python3.10 -m venv venv
source venv/bin/activate
pip install -e '.[nlp]'
# optional, enables true full-corpus language detection:
mkdir -p .cache/saturn
curl -sL -o .cache/saturn/lid.176.bin https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin

Python 3.10+ required.

Use

# Full-corpus profile of a HuggingFace dataset
saturn huggingface lukeslp/bluesky-alt-text

# Local file (CSV, JSONL, Parquet, SQLite)
saturn analyze path/to/data.csv

# Compare two slices of one dataset by a column value
saturn compare lukeslp/bluesky-alt-text --by source_mode \
    --label-a curated --label-b firehose

# Compare two independent sources
saturn compare hf://user/a hf://user/b

# Opt-in streaming sample when full load is too heavy
saturn analyze big-dataset.parquet --sample 5000

# Opt-in LLM insight pass (primary, with optional catfish critic)
saturn analyze data.csv --llm anthropic
saturn analyze data.csv --llm anthropic:claude-sonnet-4-6 --llm openai:gpt-4o-mini

Output

  1. Terminal: row count, per-column type, null %, unique count, alerts (duplicates, high_skew, outliers, multilingual, near_unique, boilerplate, allcaps, one_word, url_heavy, and so on).
  2. HTML report: one self-contained file, TOC with per-column charts and stats tables. Compare mode adds a "most divergent columns" summary driven by a composite score.
  3. JSON findings: every number and string in the HTML, ready to feed a notebook generator or the live viewer.
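The findings JSON is straightforward to consume downstream; a minimal sketch, assuming an illustrative payload shape (the per-column field names `columns`, `alerts`, and `null_pct` are assumptions, not the documented schema):

```python
import json

# Illustrative findings payload; field names here are assumptions.
raw = '{"columns": {"alt_text": {"null_pct": 0.0, "alerts": ["boilerplate"]}}}'
findings = json.loads(raw)

# Collect columns that raised alerts, e.g. to gate a CI check.
flagged = {
    col: info["alerts"]
    for col, info in findings.get("columns", {}).items()
    if info.get("alerts")
}
print(flagged)  # {'alt_text': ['boilerplate']}
```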

Viewer

pip install -e '.[web]'
saturn serve --dir path/to/findings/ --port 5043
open http://127.0.0.1:5043

A WCAG 2.2 AA compliant live alternative to the static HTML report. Drop findings JSON files into a directory; refreshing the index picks up new runs without restarting. /api/findings/<id> returns the raw JSON for scripting.
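Scripting against the viewer is a one-liner; a sketch with a hypothetical findings id:

```python
import json
from urllib.request import urlopen

BASE = "http://127.0.0.1:5043"
run_id = "example-run"  # hypothetical id; use one listed on the viewer index

url = f"{BASE}/api/findings/{run_id}"
print(url)

# With the viewer running, fetch the raw findings JSON:
# findings = json.loads(urlopen(url).read().decode())
```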

Design

  • Polars-native, full-corpus by default. 404K rows of 21-column Bluesky data profiled in under a minute. Opt-in --sample N for quick peeks on anything bigger.
  • Bounded memory. duplicate_counter and vocab expansion get skipped on JSON-blob-shaped columns; near-unique columns skip value-counts that would return 400K singletons; vocab tokenisation is capped to a 20K-row subsample truncated to 500 chars per row.
  • fasttext lid.176 for true full-corpus language detection when the model is present (~1M docs/sec). Falls back to a bounded langdetect sample otherwise.
  • Two passes. A free deterministic stats pass (always runs) and an opt-in language-model insight pass (Phase 2, shipped).
  • Schema inference uses absolute and relative cardinality: a 489-value column in a 404K-row corpus is categorical, not text, even though 489 > the absolute threshold.
  • Compare mode is the flagship feature. Diff two slices column-by-column; every delta (null drift, mean/length delta, entropy delta, top-value jaccard, language-mix jaccard) is both visible in the HTML and machine-readable in the JSON.
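The cardinality rule above can be sketched as follows; the threshold values are illustrative assumptions, not saturn's actual defaults:

```python
def infer_column_kind(n_unique: int, n_rows: int,
                      abs_threshold: int = 200,
                      rel_threshold: float = 0.01) -> str:
    """Categorical if the column is small in absolute terms OR
    small relative to the corpus; otherwise treat it as text."""
    if n_unique <= abs_threshold:
        return "categorical"
    if n_unique / n_rows <= rel_threshold:
        return "categorical"
    return "text"

# 489 values in 404,841 rows: relative cardinality ~0.0012, so the
# column is categorical even though 489 exceeds the absolute cap.
print(infer_column_kind(489, 404_841))      # categorical
print(infer_column_kind(300_000, 404_841))  # text
```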

Real output

Running saturn compare lukeslp/bluesky-alt-text --by source_mode on the full 404,841-row corpus (12 to 18 s):

signal                     curated (279K)   firehose (125K)   Δ
alt_text mean length       202 chars        281 chars         +79 chars
alt_text duplicate rate    8.5%             20.0%             +11.5pp
language jaccard           –                –                 0.35 (wide divergence)
author_handle null rate    0%               100%              +100pp (firehose is anonymised)
cursor duplicate rate      75%              17%               −58pp

The +79 chars on firehose vs curated was the non-obvious finding: the curated 489-account population writes shorter alt text than the broader stream. Worth a Concadia-style readability follow-up.

LLM insight pass (opt-in)

Pass --llm provider[:model] on analyze or huggingface to layer a narrated insight pass on top of the deterministic stats. Pass the flag twice and the second provider plays catfish critic: it reviews the first model's narrative against the same evidence and returns agree/disagree/partial. Insights land in both the HTML report and the JSON findings (key: insights). The pass fails open: provider errors or missing API keys never block the deterministic output; they are recorded in insights.errors and saturn exits 0.
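Because the pass fails open, downstream consumers should check for recorded errors before trusting the narrative; a minimal sketch (only the insights and insights.errors keys come from the docs above, the rest is assumed):

```python
def insight_errors(findings: dict) -> list:
    """Return any provider errors recorded by the fail-open LLM pass.
    An empty list means the insight pass (if requested) ran clean."""
    return findings.get("insights", {}).get("errors", [])

# Hypothetical findings where the provider key was missing:
findings = {"insights": {"errors": ["anthropic: missing ANTHROPIC_API_KEY"]}}
print(insight_errors(findings))  # ['anthropic: missing ANTHROPIC_API_KEY']
print(insight_errors({}))        # []
```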

Requires ~/shared/llm_providers on PYTHONPATH (the unified provider gateway). Supported providers: anthropic, openai, groq, gemini, mistral, cohere, xai, perplexity, huggingface, ollama.

Status

Phase 1 (stats + HTML + JSON), Phase 2 (--llm insight pass with catfish critic, analyze/huggingface), Phase 4 (compare mode, curated-vs-firehose diff), and Phase 5 (Flask viewer on port 5043) are all shipping. Phase 2.5 (compare-mode insights) and Phase 3 (BERTopic clustering) are on the roadmap.

License

MIT. © Luke Steuber.
