Dataset dissector: polars-native profiling, HTML + JSON reports, optional multi-provider LLM insight pass, live Flask viewer


saturn

Dataset dissector. Point it at a HuggingFace repo, a local file, or a slice of either and it produces a terminal summary, a self-contained HTML report, and a machine-readable JSON findings file. The stats pass is always free and deterministic; language-model insight and topic clustering are opt-in.

Generic across domains: alt-text, Bluesky firehose posts, census tables, and VQA annotations all work out of the box.

Primary use case: lukeslp/bluesky-alt-text, 404,841 image descriptions, profiled in 46 s, compared across the curated/firehose split in 18 s.

Install

python3.10 -m venv venv
source venv/bin/activate
pip install -e '.[nlp]'
# optional, enables true full-corpus language detection:
mkdir -p .cache/saturn
curl -sL -o .cache/saturn/lid.176.bin https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin

Python 3.10+ required.

Use

# Full-corpus profile of a HuggingFace dataset
saturn huggingface lukeslp/bluesky-alt-text

# Local file (CSV, JSONL, Parquet, SQLite)
saturn analyze path/to/data.csv

# Compare two slices of one dataset by a column value
saturn compare lukeslp/bluesky-alt-text --by source_mode \
    --label-a curated --label-b firehose

# Compare two independent sources
saturn compare hf://user/a hf://user/b

# Opt-in streaming sample when full load is too heavy
saturn analyze big-dataset.parquet --sample 5000

# Opt-in LLM insight pass (primary, with optional catfish critic)
saturn analyze data.csv --llm anthropic
saturn analyze data.csv --llm anthropic:claude-sonnet-4-6 --llm openai:gpt-4o-mini

Output

  1. Terminal: row count, per-column type, null %, unique count, alerts (duplicates, high_skew, outliers, multilingual, near_unique, boilerplate, allcaps, one_word, url_heavy, and so on).
  2. HTML report: one self-contained file, TOC with per-column charts and stats tables. Compare mode adds a "most divergent columns" summary driven by a composite score.
  3. JSON findings: every number and string in the HTML, ready to feed a notebook generator or the live viewer.
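The findings JSON is straightforward to consume downstream; a minimal sketch, assuming an illustrative payload shape (the per-column field names `columns`, `alerts`, and `null_pct` are assumptions, not the documented schema):

```python
import json

# Illustrative findings payload; field names here are assumptions.
raw = '{"columns": {"alt_text": {"null_pct": 0.0, "alerts": ["boilerplate"]}}}'
findings = json.loads(raw)

# Collect columns that raised alerts, e.g. to gate a CI check.
flagged = {
    col: info["alerts"]
    for col, info in findings.get("columns", {}).items()
    if info.get("alerts")
}
print(flagged)  # {'alt_text': ['boilerplate']}
```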

Viewer

pip install -e '.[web]'
saturn serve --dir path/to/findings/ --port 5043
open http://127.0.0.1:5043

A WCAG 2.2 AA compliant live alternative to the static HTML report. Drop findings JSON files into a directory; refreshing the index picks up new runs without restarting. /api/findings/<id> returns the raw JSON for scripting.
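Scripting against the viewer is a one-liner; a sketch with a hypothetical findings id:

```python
import json
from urllib.request import urlopen

BASE = "http://127.0.0.1:5043"
run_id = "example-run"  # hypothetical id; use one listed on the viewer index

url = f"{BASE}/api/findings/{run_id}"
print(url)

# With the viewer running, fetch the raw findings JSON:
# findings = json.loads(urlopen(url).read().decode())
```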

Design

  • Polars-native, full-corpus by default. 404K rows of 21-column Bluesky data profiled in under a minute. Opt-in --sample N for quick peeks on anything bigger.
  • Bounded memory. duplicate_counter and vocab expansion get skipped on JSON-blob-shaped columns; near-unique columns skip value-counts that would return 400K singletons; vocab tokenisation is capped to a 20K-row subsample truncated to 500 chars per row.
  • fasttext lid.176 for true full-corpus language detection when the model is present (~1M docs/sec). Falls back to a bounded langdetect sample otherwise.
  • Two passes. A free deterministic stats pass (always runs) and an opt-in language-model insight pass (Phase 2, shipped).
  • Schema inference uses absolute and relative cardinality: a 489-value column in a 404K-row corpus is categorical, not text, even though 489 > the absolute threshold.
  • Compare mode is the flagship feature. Diff two slices column-by-column; every delta (null drift, mean/length delta, entropy delta, top-value jaccard, language-mix jaccard) is both visible in the HTML and machine-readable in the JSON.
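The cardinality rule above can be sketched as follows; the threshold values are illustrative assumptions, not saturn's actual defaults:

```python
def infer_column_kind(n_unique: int, n_rows: int,
                      abs_threshold: int = 200,
                      rel_threshold: float = 0.01) -> str:
    """Categorical if the column is small in absolute terms OR
    small relative to the corpus; otherwise treat it as text."""
    if n_unique <= abs_threshold:
        return "categorical"
    if n_unique / n_rows <= rel_threshold:
        return "categorical"
    return "text"

# 489 values in 404,841 rows: relative cardinality ~0.0012, so the
# column is categorical even though 489 exceeds the absolute cap.
print(infer_column_kind(489, 404_841))      # categorical
print(infer_column_kind(300_000, 404_841))  # text
```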

Real output

Running saturn compare lukeslp/bluesky-alt-text --by source_mode on the full 404,841-row corpus (12 to 18 s):

signal                     curated (279K)   firehose (125K)   Δ
alt_text mean length       202 chars        281 chars         +79 chars
alt_text duplicate rate    8.5%             20.0%             +11.5pp
language jaccard           –                –                 0.35 (wide divergence)
author_handle null rate    0%               100%              +100pp (firehose is anonymised)
cursor duplicate rate      75%              17%               −58pp

The +79 chars on firehose vs curated was the non-obvious finding: the curated 489-account population writes shorter alt text than the broader stream. Worth a Concadia-style readability follow-up.

LLM insight pass (opt-in)

Pass --llm provider[:model] on analyze or huggingface to layer a narrated insight pass on top of the deterministic stats. Pass the flag twice and the second provider plays catfish critic: it reviews the first model's narrative against the same evidence and returns agree/disagree/partial. Insights land in both the HTML report and the JSON findings (key: insights). The pass fails open: provider errors or missing API keys never block the deterministic output; they are recorded in insights.errors and saturn exits 0.
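Because the pass fails open, downstream consumers should check for recorded errors before trusting the narrative; a minimal sketch (only the insights and insights.errors keys come from the docs above, the rest is assumed):

```python
def insight_errors(findings: dict) -> list:
    """Return any provider errors recorded by the fail-open LLM pass.
    An empty list means the insight pass (if requested) ran clean."""
    return findings.get("insights", {}).get("errors", [])

# Hypothetical findings where the provider key was missing:
findings = {"insights": {"errors": ["anthropic: missing ANTHROPIC_API_KEY"]}}
print(insight_errors(findings))  # ['anthropic: missing ANTHROPIC_API_KEY']
print(insight_errors({}))        # []
```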

Requires ~/shared/llm_providers on PYTHONPATH (the unified provider gateway). Supported providers: anthropic, openai, groq, gemini, mistral, cohere, xai, perplexity, huggingface, ollama.

Status

Phase 1 (stats + HTML + JSON), Phase 2 (--llm insight pass with catfish critic, analyze/huggingface), Phase 4 (compare mode, curated-vs-firehose diff), and Phase 5 (Flask viewer on port 5043) are all shipping. Phase 2.5 (compare-mode insights) and Phase 3 (BERTopic clustering) are on the roadmap.

License

MIT. © Luke Steuber.
