Minimal-tuning CLI tool for data stats and clarity checks

Project description

sancheck

What is this?

sancheck is a minimal-tuning CLI tool for quickly assessing the statistical cleanliness of numeric columns in CSV datasets.
It provides a fast, high-level overview before deeper analysis or modeling.

When should I use it?

- Before exploratory data analysis (EDA)
- Before training statistical or machine learning models
- When you want a quick sanity check without manual inspection

What it does NOT do

- It does not clean or modify data
- It does not model relationships
- It does not replace proper EDA or data validation pipelines

Quick start

Run the tool on a CSV file:

sancheck [csv_data] [n_feature_per_plot or 'all']
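
For scripted use, the command above can be wrapped from Python. This is only a sketch: `build_sancheck_command` and `run_sancheck` are hypothetical helper names, not part of sancheck. It assumes the `sancheck` executable is on your PATH, that both positional arguments are passed as in the usage line, and that the report is written to standard output.

```python
import subprocess


def build_sancheck_command(csv_path, features_per_plot="all"):
    """Build the argument list for the usage line shown above.

    Hypothetical helper: features_per_plot is either an integer or the
    literal string 'all', mirroring the second positional argument.
    """
    return ["sancheck", str(csv_path), str(features_per_plot)]


def run_sancheck(csv_path, features_per_plot="all"):
    """Run sancheck and return whatever it printed to standard output.

    Assumes the sancheck executable is installed and on PATH.
    """
    command = build_sancheck_command(csv_path, features_per_plot)
    result = subprocess.run(command, capture_output=True, text=True, check=False)
    return result.stdout


# Example (not executed here; "students.csv" is a placeholder file name):
# report = run_sancheck("students.csv", 4)
```

Keeping the command construction in its own function makes the argument order easy to check without actually launching the tool.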

Example output

Summary of columns

Valid numeric columns: 9
Ignored non-numeric columns: 7

📌 Column problems

Column with NaN/Inf/invalid: 0
- Age: invalid=0/5000 (0.000)
- Class: invalid=0/5000 (0.000)
- Study_Hours_Per_Day: invalid=0/5000 (0.000)
- Attendance_Percentage: invalid=0/5000 (0.000)
- Math_Score: invalid=0/5000 (0.000)
- Science_Score: invalid=0/5000 (0.000)
- English_Score: invalid=0/5000 (0.000)
- Previous_Year_Score: invalid=0/5000 (0.000)
- Final_Percentage: invalid=0/5000 (0.000)
Type inconsistency column: 0
- Age: bad_type=0 (0.000)
- Class: bad_type=0 (0.000)
- Study_Hours_Per_Day: bad_type=0 (0.000)
- Attendance_Percentage: bad_type=0 (0.000)
- Math_Score: bad_type=0 (0.000)
- Science_Score: bad_type=0 (0.000)
- English_Score: bad_type=0 (0.000)
- Previous_Year_Score: bad_type=0 (0.000)
- Final_Percentage: bad_type=0 (0.000)
Similar feature pairs (|corr| >= 0.95):
- Severity similarity: 0.000
- English_Score <-> Final_Percentage: |corr|=0.592
- Science_Score <-> Final_Percentage: |corr|=0.572
- Math_Score <-> Final_Percentage: |corr|=0.564
- Study_Hours_Per_Day <-> Science_Score: |corr|=0.038
- Class <-> Attendance_Percentage: |corr|=0.035
- Study_Hours_Per_Day <-> Attendance_Percentage: |corr|=0.027
- Class <-> Math_Score: |corr|=0.021
- Math_Score <-> Science_Score: |corr|=0.020
- Study_Hours_Per_Day <-> Previous_Year_Score: |corr|=0.020
- Age <-> Attendance_Percentage: |corr|=0.019

📌 Row problems

Problematic rows (NaN/Inf): 0/5000
Severity row: 0.132
- row 4867: score=0.625, invalid=False
- row 82: score=0.624, invalid=False
- row 1364: score=0.622, invalid=False
- row 1482: score=0.619, invalid=False
- row 4913: score=0.611, invalid=False

📌 Distribution / interpretation

High entropy means the distribution is more even/complex; it's not automatically 'noise', it can also be multimodal.
High spread score means the data is more dispersed robustly compared to its central tendency.

Top entropy:
- Attendance_Percentage: entropy=1.000 (very spread / more uniform or complex distribution)
- Science_Score: entropy=0.998 (very spread / more uniform or complex distribution)
- English_Score: entropy=0.998 (very spread / more uniform or complex distribution)
- Math_Score: entropy=0.997 (very spread / more uniform or complex distribution)
- Study_Hours_Per_Day: entropy=0.997 (very spread / more uniform or complex distribution)
Top spread:
- Class: spread_score=0.690 (wide / large variation), var=1.225, iqr=1.000
- Final_Percentage: spread_score=0.471 (moderate / moderate variation), var=120.211, iqr=15.660
- Math_Score: spread_score=0.384 (moderate / moderate variation), var=350.606, iqr=32.000
- Science_Score: spread_score=0.380 (moderate / moderate variation), var=366.385, iqr=33.000
- Previous_Year_Score: spread_score=0.377 (moderate / moderate variation), var=261.065, iqr=28.000

📌 Normality (per feature)

Age: Shapiro=0.000 | KS=0.000
Class: Shapiro=0.000 | KS=0.000
Study_Hours_Per_Day: Shapiro=0.000 | KS=0.000
Attendance_Percentage: Shapiro=0.000 | KS=0.000
Math_Score: Shapiro=0.000 | KS=0.000
Science_Score: Shapiro=0.000 | KS=0.000
English_Score: Shapiro=0.000 | KS=0.000
Previous_Year_Score: Shapiro=0.000 | KS=0.000
Final_Percentage: Shapiro=0.000 | KS=0.038

🔨 Final status

clarity score: 0.961 / 1.000
clarity label: very clean
missing severity: 0.000
type severity: 0.000
similarity severity: 0.000
row severity: 0.132

📊 Dataset-level distribution summary

avg entropy: 0.979
avg spread score: 0.420
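
To make the entropy and spread figures in the output above more concrete, here is a minimal sketch of one common way to compute a normalized entropy and a robust spread measure. It is not sancheck's actual formula, which the output does not show: the equal-width histogram, the bin count of 10, and the function names are all assumptions chosen for illustration.

```python
import math
import statistics


def normalized_entropy(values, bins=10):
    """Shannon entropy of an equal-width histogram, scaled to [0, 1].

    Assumption: the bin count and the equal-width binning are
    illustrative choices, not taken from sancheck.
    """
    lo, hi = min(values), max(values)
    if hi == lo or bins < 2:
        return 0.0  # a constant column has no spread across bins
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        index = min(int((v - lo) / width), bins - 1)  # clamp the maximum value
        counts[index] += 1
    total = len(values)
    entropy = -sum((c / total) * math.log(c / total) for c in counts if c)
    return entropy / math.log(bins)


def interquartile_range(values):
    """IQR (Q3 - Q1), a spread measure that is robust to outliers."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1
```

With this definition, an evenly filled column such as `list(range(100))` scores 1.0, while a column whose values almost all fall into a single bin scores close to 0. A high value therefore signals an even or complex distribution rather than noise.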

Interpretation tips

- Higher clarity scores (maximum 1.000) indicate cleaner numeric data
- Anomalous rows are ranked, not classified; use them for inspection
- Non-numeric columns are ignored by design
- This tool is best used as a fast pre-analysis step
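
The tips above mention that non-numeric columns are ignored and that invalid values are counted per column. The sketch below shows how such a split could work using only the standard library. It is an assumption-laden illustration, not sancheck's implementation: the function names, the 0.5 threshold, and the rule for deciding that a column is numeric are all hypothetical.

```python
import csv
import io
import math


def parse_number(cell):
    """Return the cell as a float, or None if it cannot be parsed."""
    try:
        return float(cell)
    except (TypeError, ValueError):
        return None


def summarize_columns(csv_text, numeric_threshold=0.5):
    """Split columns into numeric (with invalid counts) and ignored.

    Assumption: a column counts as numeric when more than
    numeric_threshold of its cells are finite numbers; sancheck's real
    rule may differ.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    invalid_counts, ignored = {}, []
    for name in reader.fieldnames or []:
        parsed = [parse_number(row[name]) for row in rows]
        finite = sum(1 for v in parsed if v is not None and math.isfinite(v))
        if rows and finite / len(rows) > numeric_threshold:
            # Empty, unparsable, NaN, and Inf cells all count as invalid.
            invalid_counts[name] = len(rows) - finite
        else:
            ignored.append(name)
    return invalid_counts, ignored


sample = "age,name,score\n20,Ann,0.5\n21,Bob,nan\n22,Cy,0.7\n23,Di,0.9\n24,Ed,\n"
invalid_counts, ignored = summarize_columns(sample)
print(invalid_counts)  # {'age': 0, 'score': 2}
print(ignored)         # ['name']
```

In the sample, `score` is kept as numeric with two invalid cells (one NaN, one empty), while `name` is ignored because none of its cells parse as numbers.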

Project details

Release history

0.1.1 (this version): Apr 13, 2026
0.1.0: Apr 12, 2026

Download files


Source Distribution

sancheck-0.1.1.tar.gz (9.9 kB)

Uploaded Apr 13, 2026 Source

Built Distribution


sancheck-0.1.1-py3-none-any.whl (9.5 kB)

Uploaded Apr 13, 2026 Python 3

File details

Details for the file sancheck-0.1.1.tar.gz.

File metadata

Download URL: sancheck-0.1.1.tar.gz
Upload date: Apr 13, 2026
Size: 9.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.9

File hashes

Hashes for sancheck-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`902dafed795097b2fb6423300680640b604d793a7b802b5d256e5626a12a30e3`
MD5	`6052730ef2ddf1a7a3e6c9cfe646d797`
BLAKE2b-256	`29d54319c6d5ae150df009382a36159684b4c854306de54f5e19b52be8036194`


File details

Details for the file sancheck-0.1.1-py3-none-any.whl.

File metadata

Download URL: sancheck-0.1.1-py3-none-any.whl
Upload date: Apr 13, 2026
Size: 9.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.9

File hashes

Hashes for sancheck-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`57c2b0cf8f6e8fdde6813f79277d0bdaaa007160e56db520d6914d077f4fcd4d`
MD5	`a05f33903353d0b752f46532c0b60f4a`
BLAKE2b-256	`9d4b4f65b259f225ef1a64a1abd861600419fcfbfcf5d9f7e8d900227640488b`

