Dataset dissector: polars-native profiling, HTML + JSON reports, optional multi-provider LLM insight pass, live Flask viewer
saturn
Dataset dissector. Point it at a HuggingFace repo, a local file, or a slice of either and it produces a terminal summary, a self-contained HTML report, and a machine-readable JSON findings file. The stats pass is always free and deterministic. Language-model insight and topic clustering are opt-in.
Generic across domains: alt-text, Bluesky firehose, census tables, VQA annotations all work out of the box.
Primary use case: lukeslp/bluesky-alt-text, 404,841 image descriptions, profiled in 46 s, compared across the curated/firehose split in 18 s.
Install
python3.10 -m venv venv
source venv/bin/activate
pip install -e '.[nlp]'
# optional, enables true full-corpus language detection:
mkdir -p .cache/saturn
curl -sL -o .cache/saturn/lid.176.bin https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin
Python 3.10+ required.
Use
# Full-corpus profile of a HuggingFace dataset
saturn huggingface lukeslp/bluesky-alt-text
# Local file (CSV, JSONL, Parquet, SQLite)
saturn analyze path/to/data.csv
# Compare two slices of one dataset by a column value
saturn compare lukeslp/bluesky-alt-text --by source_mode \
--label-a curated --label-b firehose
# Compare two independent sources
saturn compare hf://user/a hf://user/b
# Opt-in streaming sample when full load is too heavy
saturn analyze big-dataset.parquet --sample 5000
# Opt-in LLM insight pass (primary, with optional catfish critic)
saturn analyze data.csv --llm anthropic
saturn analyze data.csv --llm anthropic:claude-sonnet-4-6 --llm openai:gpt-4o-mini
Output
- Terminal: row count, per-column type, null %, unique count, alerts (duplicates, high_skew, outliers, multilingual, near_unique, boilerplate, allcaps, one_word, url_heavy, and so on).
- HTML report: one self-contained file, TOC with per-column charts and stats tables. Compare mode adds a "most divergent columns" summary driven by a composite score.
- JSON findings: every number and string in the HTML, ready to feed a notebook generator or the live viewer.
Viewer
pip install -e '.[web]'
saturn serve --dir path/to/findings/ --port 5043
open http://127.0.0.1:5043
A WCAG 2.2 AA compliant live alternative to the static HTML report. Drop findings JSON files into a directory; refreshing the index picks up new runs without restarting. /api/findings/<id> returns the raw JSON for scripting.
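For scripting against the viewer, a minimal standard-library sketch might look like the following. It assumes the server from the command above is running on 127.0.0.1:5043. The helper names are hypothetical, and how a findings id maps to a file name is not specified here, so treat the id as whatever the index page shows.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Host and port taken from the `saturn serve` example above.
BASE_URL = "http://127.0.0.1:5043"


def findings_url(findings_id: str, base_url: str = BASE_URL) -> str:
    """Build the /api/findings/<id> URL, percent-encoding the id."""
    return f"{base_url.rstrip('/')}/api/findings/{quote(findings_id, safe='')}"


def fetch_findings(findings_id: str, base_url: str = BASE_URL, timeout: float = 10.0):
    """Fetch and decode the raw findings JSON for one run.

    Requires a running viewer; raises urllib.error.URLError otherwise.
    """
    with urlopen(findings_url(findings_id, base_url), timeout=timeout) as resp:
        return json.load(resp)
```

Keeping URL construction separate from the network call makes the former easy to check without a live server.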
Design
- Polars-native, full-corpus by default. 404K rows of 21-column Bluesky data profiled in under a minute. Opt-in --sample N for quick peeks on anything bigger.
- Bounded memory. duplicate_counter and vocab expansion get skipped on JSON-blob-shaped columns; near-unique columns skip value-counts that would return 400K singletons; vocab tokenisation is capped to a 20K-row subsample truncated to 500 chars per row.
- fasttext lid.176 for true full-corpus language detection when the model is present (~1M docs/sec). Falls back to a bounded langdetect sample otherwise.
- Two passes. A free deterministic stats pass (always runs) and an opt-in language-model insight pass (Phase 2, shipped).
- Schema inference uses absolute and relative cardinality: a 489-value column in a 404K-row corpus is categorical, not text, even though 489 > the absolute threshold.
- Compare mode is a core feature. It diffs two slices column by column; every delta (null drift, mean/length delta, entropy delta, top-value jaccard, language-mix jaccard) is both visible in the HTML and machine-readable in the JSON.
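The cardinality rule from the schema-inference bullet can be sketched as follows. The threshold values and the function name are assumptions chosen for illustration, not saturn's actual constants. The point is only that a column may exceed the absolute limit and still count as categorical when its distinct values are a tiny fraction of the rows.

```python
def infer_string_kind(
    n_unique: int,
    n_rows: int,
    abs_threshold: int = 50,      # assumed absolute cap on distinct values
    rel_threshold: float = 0.01,  # assumed cap on the distinct/rows ratio
) -> str:
    """Classify a string column as 'categorical' or 'text' by cardinality."""
    if n_rows <= 0:
        return "unknown"  # nothing to infer from an empty column
    if n_unique <= abs_threshold:
        return "categorical"  # few distinct values in absolute terms
    if n_unique / n_rows <= rel_threshold:
        return "categorical"  # many values, but rare relative to corpus size
    return "text"


# 489 distinct values across 404,841 rows is about 0.12% of the rows.
kind = infer_string_kind(489, 404_841)
```

With these assumed thresholds the same 489 distinct values in a 1,000-row file would be classified as text, which is the behaviour a purely absolute threshold cannot express.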
Real output
Running saturn compare lukeslp/bluesky-alt-text --by source_mode on the full 404,841-row corpus (12 to 18 s):
| signal | curated (279K) | firehose (125K) | Δ |
|---|---|---|---|
| alt_text mean length | 202 chars | 281 chars | +79 chars |
| alt_text duplicate rate | 8.5% | 20.0% | +11.5pp |
| language jaccard | . | . | 0.35 (wide divergence) |
| author_handle null | 0% | 100% | +100% (firehose is anonymised) |
| cursor duplicate rate | 75% | 17% | −58pp |
The +79 chars on firehose vs curated was the non-obvious finding: the curated 489-account population writes shorter alt text than the broader stream. Worth a Concadia-style readability follow-up.
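To help read a set-overlap signal such as the language jaccard in the table, the sketch below computes plain Jaccard similarity between two label sets. This is the textbook definition only; whether saturn weights languages by frequency or drops rare ones is not stated here, and the example language sets are made up.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: |A ∩ B| / |A ∪ B|, with two empty sets treated as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical language sets for two slices (not real saturn output).
curated_langs = {"en", "de", "ja"}
firehose_langs = {"en", "ja", "pt", "es"}

# 2 shared languages out of 5 distinct languages overall.
score = jaccard(curated_langs, firehose_langs)
```

A score near 1 means the two slices see nearly the same set of values; a low score like the 0.35 reported above means most values appear in only one slice.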
LLM insight pass (opt-in)
Pass --llm provider[:model] on analyze or huggingface to layer a narrated insight pass on top of the deterministic stats. Pass the flag twice and the second provider plays catfish critic: it reviews the first model's narrative against the same evidence and returns agree/disagree/partial. Insights land in both the HTML report and the JSON findings (key: insights). The pass fails open: provider errors or missing API keys never block the deterministic output; they are recorded in insights.errors and saturn exits 0.
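Because the pass fails open and saturn exits 0 either way, a script consuming the findings file should check for recorded errors instead of relying on the exit code. A minimal sketch, assuming only what is stated above (a top-level insights key with an errors entry inside it); the exact shape of errors, the helper names, and any file path are assumptions.

```python
import json
from pathlib import Path


def load_findings(path: str) -> dict:
    """Read a findings JSON file from disk."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def insight_errors(findings: dict) -> list:
    """Return recorded insight-pass errors, or an empty list if there are none.

    Assumes errors live under findings["insights"]["errors"]; a missing
    insights key (no --llm flag) is treated the same as no errors.
    """
    insights = findings.get("insights")
    if not isinstance(insights, dict):
        return []
    errors = insights.get("errors")
    if not errors:
        return []
    # Normalise a single error value into a one-item list (shape is assumed).
    return list(errors) if isinstance(errors, (list, tuple)) else [errors]
```

A caller could then decide to warn, retry with another provider, or ignore the insight section entirely while still trusting the deterministic stats.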
Requires ~/shared/llm_providers on PYTHONPATH (the unified provider gateway). Supported providers: anthropic, openai, groq, gemini, mistral, cohere, xai, perplexity, huggingface, ollama.
Status
Phase 1 (stats + HTML + JSON), Phase 2 (--llm insight pass with catfish critic, analyze/huggingface), Phase 4 (compare mode, curated-vs-firehose diff), and Phase 5 (Flask viewer on port 5043) have all shipped. Phase 2.5 (compare-mode insights) and Phase 3 (BERTopic clustering) are on the roadmap.
License
MIT. © Luke Steuber.