
sancheck

Minimal-tuning CLI tool for data stats and clarity checks

What is this?

sancheck is a minimal-tuning CLI tool for quickly assessing the statistical cleanliness of numeric columns in CSV datasets.
It provides a fast, high-level overview before deeper analysis or modeling.

When should I use it?

  • Before exploratory data analysis (EDA)
  • Before training statistical or machine learning models
  • When you want a quick sanity check without manual inspection

What it does NOT do

  • It does not clean or modify data
  • It does not model relationships
  • It does not replace proper EDA or data validation pipelines

Quick start

Run the tool on a CSV file:

sancheck [csv_data] [n_feature_per_plot or 'all']
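For example (assuming installation from PyPI; the file name students.csv is a placeholder):

```shell
# install the tool
pip install sancheck

# summarize every numeric column in one report
sancheck students.csv all

# or limit output to 4 features per plot
sancheck students.csv 4
```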

Example output

Summary of columns

  • Valid numeric columns: 9
  • Ignored non-numeric columns: 7

📌 Column problems

  • Column with NaN/Inf/invalid: 0
    • Age: invalid=0/5000 (0.000)
    • Class: invalid=0/5000 (0.000)
    • Study_Hours_Per_Day: invalid=0/5000 (0.000)
    • Attendance_Percentage: invalid=0/5000 (0.000)
    • Math_Score: invalid=0/5000 (0.000)
    • Science_Score: invalid=0/5000 (0.000)
    • English_Score: invalid=0/5000 (0.000)
    • Previous_Year_Score: invalid=0/5000 (0.000)
    • Final_Percentage: invalid=0/5000 (0.000)
  • Type inconsistency column: 0
    • Age: bad_type=0 (0.000)
    • Class: bad_type=0 (0.000)
    • Study_Hours_Per_Day: bad_type=0 (0.000)
    • Attendance_Percentage: bad_type=0 (0.000)
    • Math_Score: bad_type=0 (0.000)
    • Science_Score: bad_type=0 (0.000)
    • English_Score: bad_type=0 (0.000)
    • Previous_Year_Score: bad_type=0 (0.000)
    • Final_Percentage: bad_type=0 (0.000)
  • Similar feature pairs (|corr| >= 0.95):
    • Severity similarity: 0.000
    • English_Score <-> Final_Percentage: |corr|=0.592
    • Science_Score <-> Final_Percentage: |corr|=0.572
    • Math_Score <-> Final_Percentage: |corr|=0.564
    • Study_Hours_Per_Day <-> Science_Score: |corr|=0.038
    • Class <-> Attendance_Percentage: |corr|=0.035
    • Study_Hours_Per_Day <-> Attendance_Percentage: |corr|=0.027
    • Class <-> Math_Score: |corr|=0.021
    • Math_Score <-> Science_Score: |corr|=0.020
    • Study_Hours_Per_Day <-> Previous_Year_Score: |corr|=0.020
    • Age <-> Attendance_Percentage: |corr|=0.019
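The column checks above (invalid-value fractions and the |corr| >= 0.95 pair screen) can be approximated in a few lines of pandas. This is a sketch, not sancheck's own implementation; the 0.95 threshold comes from the report header, everything else is an assumption:

```python
import numpy as np
import pandas as pd

def column_report(df: pd.DataFrame, corr_threshold: float = 0.95):
    """Per-column invalid fractions and highly correlated numeric pairs.

    A sketch of the checks shown in the report, not sancheck's own code.
    """
    num = df.select_dtypes(include="number")
    # fraction of NaN/Inf values per numeric column
    invalid = {c: float((~np.isfinite(num[c])).mean()) for c in num.columns}
    corr = num.corr().abs()
    cols = list(corr.columns)
    pairs = [
        (a, b, float(corr.loc[a, b]))
        for i, a in enumerate(cols)
        for b in cols[i + 1:]
        if corr.loc[a, b] >= corr_threshold
    ]
    return invalid, pairs

# tiny demo frame: one NaN in "x", and "x2" is an exact multiple of "x"
df = pd.DataFrame({"x": [1.0, 2.0, np.nan, 4.0],
                   "y": [5.0, 1.0, 6.0, 2.0],
                   "label": list("abcd")})        # non-numeric, ignored
df["x2"] = df["x"] * 3
invalid, pairs = column_report(df)
```

Here `invalid["x"]` is 0.25 (one NaN in four values) and the only pair reported is `x <-> x2`, mirroring the report's layout.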

📌 Row problems

  • Problematic rows (NaN/Inf): 0/5000
  • Severity row: 0.132
    • row 4867: score=0.625, invalid=False
    • row 82: score=0.624, invalid=False
    • row 1364: score=0.622, invalid=False
    • row 1482: score=0.619, invalid=False
    • row 4913: score=0.611, invalid=False
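sancheck does not document how the per-row score is computed; a common stand-in is the mean absolute z-score across numeric columns, which yields the same kind of ranked (not classified) list:

```python
import numpy as np
import pandas as pd

def rank_rows(df: pd.DataFrame, top: int = 5) -> pd.Series:
    """Rank rows by mean |z-score| over numeric columns.

    An assumed proxy for sancheck's per-row score; the real formula
    is not documented. Higher = more anomalous.
    """
    num = df.select_dtypes(include="number")
    z = (num - num.mean()) / num.std(ddof=0)
    return z.abs().mean(axis=1).sort_values(ascending=False).head(top)

rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
data.loc[7] = [9.0, -9.0, 9.0]   # plant one obvious outlier row
top = rank_rows(data)            # the planted row ranks first
```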

📌 Distribution / interpretation

  • High entropy means the distribution is more even or complex; it is not automatically 'noise', since the data may simply be multimodal.

  • A high spread score means the data is widely dispersed relative to its central tendency, as measured by robust statistics.

    Top entropy:

    • Attendance_Percentage: entropy=1.000 (very spread / more uniform or complex distribution)
    • Science_Score: entropy=0.998 (very spread / more uniform or complex distribution)
    • English_Score: entropy=0.998 (very spread / more uniform or complex distribution)
    • Math_Score: entropy=0.997 (very spread / more uniform or complex distribution)
    • Study_Hours_Per_Day: entropy=0.997 (very spread / more uniform or complex distribution)

    Top spread:

    • Class: spread_score=0.690 (wide / large variation), var=1.225, iqr=1.000
    • Final_Percentage: spread_score=0.471 (moderate / moderate variation), var=120.211, iqr=15.660
    • Math_Score: spread_score=0.384 (moderate / moderate variation), var=350.606, iqr=32.000
    • Science_Score: spread_score=0.380 (moderate / moderate variation), var=366.385, iqr=33.000
    • Previous_Year_Score: spread_score=0.377 (moderate / moderate variation), var=261.065, iqr=28.000
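The entropy column can be reproduced as normalized histogram entropy: Shannon entropy of the bin frequencies divided by log of the bin count, so 1.0 means near-uniform. The bin count below is an assumption, and the spread score's exact formula is not documented, so it is not reconstructed here:

```python
import numpy as np

def norm_entropy(x: np.ndarray, bins: int = 20) -> float:
    """Histogram entropy scaled to [0, 1].

    1.0 = near-uniform (or otherwise complex) distribution,
    0.0 = all mass in a single bin. The bin count is an assumption.
    """
    counts, _ = np.histogram(x, bins=bins)
    p = counts / counts.sum()
    p = p[p > 0]                       # treat 0 * log(0) as 0
    return float(-(p * np.log(p)).sum() / np.log(bins))

rng = np.random.default_rng(1)
flat = rng.uniform(size=10_000)        # entropy close to 1
spike = np.zeros(10_000)               # entropy exactly 0
```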

📌 Normality (per feature)

  • Age: Shapiro=0.000 | KS=0.000
  • Class: Shapiro=0.000 | KS=0.000
  • Study_Hours_Per_Day: Shapiro=0.000 | KS=0.000
  • Attendance_Percentage: Shapiro=0.000 | KS=0.000
  • Math_Score: Shapiro=0.000 | KS=0.000
  • Science_Score: Shapiro=0.000 | KS=0.000
  • English_Score: Shapiro=0.000 | KS=0.000
  • Previous_Year_Score: Shapiro=0.000 | KS=0.000
  • Final_Percentage: Shapiro=0.000 | KS=0.038
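The Shapiro and KS numbers read like p-values from scipy.stats (an assumption; the report does not say). Values near 0.000, as in every row above, reject the normality hypothesis:

```python
import numpy as np
from scipy import stats

def normality_pvalues(x: np.ndarray) -> tuple:
    """Shapiro-Wilk and Kolmogorov-Smirnov p-values for normality.

    Assumed to correspond to the report's 'Shapiro' / 'KS' columns.
    The KS test here is against a normal fitted to the sample.
    """
    shapiro_p = stats.shapiro(x[:5000]).pvalue   # Shapiro is meant for n <= ~5000
    ks_p = stats.kstest(x, "norm", args=(x.mean(), x.std())).pvalue
    return float(shapiro_p), float(ks_p)

skewed = np.random.default_rng(2).exponential(size=2000)
s_p, k_p = normality_pvalues(skewed)             # both near 0: not normal
```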

🔨 Final status

  • clarity score: 0.961 / 1.000
  • clarity label: very clean
  • missing severity: 0.000
  • type severity: 0.000
  • similarity severity: 0.000
  • row severity: 0.132
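The final clarity score appears to fold the four severities into one number in [0, 1]. The weights below are purely illustrative, not sancheck's actual coefficients:

```python
def clarity(missing: float, types: float, similarity: float, rows: float,
            weights=(0.25, 0.25, 0.2, 0.3)) -> float:
    """Combine the four severity scores (each in [0, 1]) into one clarity score.

    Illustrative only: sancheck's real weighting is not documented.
    """
    sev = (missing, types, similarity, rows)
    return 1.0 - sum(w * s for w, s in zip(weights, sev))

score = clarity(0.0, 0.0, 0.0, 0.132)   # only row severity is nonzero
```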

📊 Dataset-level distribution summary

  • avg entropy: 0.979
  • avg spread score: 0.420

Interpretation tips

  • Higher clarity scores indicate cleaner numeric data

  • Anomalous rows are ranked, not classified — use them for inspection

  • Non-numeric columns are ignored by design

  • This tool is best used as a fast pre-analysis step

Download files

Download the file for your platform.

Source Distribution

sancheck-0.1.1.tar.gz (9.9 kB)


Built Distribution


sancheck-0.1.1-py3-none-any.whl (9.5 kB)


File details

Details for the file sancheck-0.1.1.tar.gz.

File metadata

  • Download URL: sancheck-0.1.1.tar.gz
  • Upload date:
  • Size: 9.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.9

File hashes

Hashes for sancheck-0.1.1.tar.gz

  • SHA256: 902dafed795097b2fb6423300680640b604d793a7b802b5d256e5626a12a30e3
  • MD5: 6052730ef2ddf1a7a3e6c9cfe646d797
  • BLAKE2b-256: 29d54319c6d5ae150df009382a36159684b4c854306de54f5e19b52be8036194


File details

Details for the file sancheck-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: sancheck-0.1.1-py3-none-any.whl
  • Upload date:
  • Size: 9.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.9

File hashes

Hashes for sancheck-0.1.1-py3-none-any.whl

  • SHA256: 57c2b0cf8f6e8fdde6813f79277d0bdaaa007160e56db520d6914d077f4fcd4d
  • MD5: a05f33903353d0b752f46532c0b60f4a
  • BLAKE2b-256: 9d4b4f65b259f225ef1a64a1abd861600419fcfbfcf5d9f7e8d900227640488b

