Automated dataset analyzer and HTML report generator

DovaLens – Automated Data Profiling & Drift Detection

DovaLens is a command-line tool that turns a raw CSV into a clean, visual HTML report. It provides:

  • Dataset profiling (schema, preview, missing values)
  • Summary statistics for numeric and categorical features
  • Distribution breakdowns (by state, county, date, etc., when present)
  • Bimodality checks on numeric targets (Pearson's coefficient)
  • Unsupervised clustering (K-Means) for quick segmentation
  • Anomaly detection (Isolation Forest) on multivariate signals
  • Drift signals via two-sample Kolmogorov–Smirnov tests
  • A single, shareable report.html

Built for fast EDA on small to very large CSV files.
Works out of the box, no notebook required.


## Installation

pip install dovalens

If you are developing locally from the repo:

pip install -e .

## Quick Start

# Basic
dovalens path/to/your_dataset.csv

# Custom output path
dovalens path/to/your_dataset.csv --output path/to/report.html

If --output is omitted, the report is saved as ./report.html in the current working directory.

Works from any folder: pass either a relative path (.\examples\german_credit_data.csv on Windows) or an absolute one.

## CLI
usage: dovalens [-h] [--output OUTPUT] input

DovaLens — Automated dataset analyzer

positional arguments:
  input            Input CSV file

options:
  -h, --help       Show help and exit
  --output OUTPUT  Output HTML report path (default: ./report.html)
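
The interface in the usage text above can be reproduced with Python's standard argparse module. The following is a minimal sketch of such a parser, not DovaLens's actual source: the function names `build_parser` and `resolve_paths` are hypothetical, and only the arguments and default shown in the usage text are modeled.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Mirror the documented interface: one positional input and --output."""
    parser = argparse.ArgumentParser(
        prog="dovalens",
        description="DovaLens - Automated dataset analyzer",
    )
    parser.add_argument("input", help="Input CSV file")
    parser.add_argument(
        "--output",
        default="./report.html",
        help="Output HTML report path (default: ./report.html)",
    )
    return parser


def resolve_paths(argv):
    """Parse an argument list and return (input_path, output_path)."""
    args = build_parser().parse_args(argv)
    return Path(args.input), Path(args.output)
```

Calling `resolve_paths(["data.csv"])` falls back to `report.html` in the current working directory, which matches the documented default.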

## Examples
# From the project root (Windows PowerShell)
dovalens .\examples\german_credit_data.csv

# Custom name and folder
dovalens .\examples\covid_de.csv --output .\covid_report.html

# From anywhere with an absolute path
dovalens D:\data\sales_2024.csv --output D:\reports\sales_2024_report.html

## What's in the Report

  • Dataset preview (head, dtypes, inferred categorical columns)
  • Cleaning rules applied (remove Unnamed:*, numeric coercion)
  • Distributions for main fields (value counts / histograms)
  • Bimodality coefficients for selected numeric columns
  • Correlations (Pearson) on numeric features
  • Unsupervised clustering (K-Means, k auto-selected heuristically)
  • Anomalies via IsolationForest (top outliers)
  • Drift checks (two-sample KS) across common grouping keys when present (e.g., by state/county/date)

## How It Works (Technical Overview)

## Loading & Cleaning
  • Drops columns like Unnamed:*
  • Safe numeric coercion for string-encoded numbers
  • Low-cardinality columns are treated as categorical
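
The three cleaning rules above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the tool's own code: the helper names, the `max_categories` threshold, and the choice to test for numeric content before cardinality are all hypothetical.

```python
import csv


def to_number(text):
    """Return a float for a numeric string, or None when it is not numeric."""
    try:
        return float(text.strip())
    except ValueError:
        return None


def clean_rows(rows, max_categories=10):
    """Apply the cleaning rules to a list of dict rows (e.g. from csv.DictReader).

    Returns (cleaned_rows, kinds), where kinds maps each kept column to
    "numeric", "categorical", or "text".
    """
    if not rows:
        return [], {}
    # Rule 1: drop index-like columns such as "Unnamed: 0".
    columns = [c for c in rows[0] if not c.startswith("Unnamed:")]
    kinds = {}
    for col in columns:
        values = [r[col] for r in rows if r[col] not in ("", None)]
        # Rule 2: a column is numeric only if every non-empty cell parses.
        if values and all(to_number(v) is not None for v in values):
            kinds[col] = "numeric"
        # Rule 3: few distinct values means categorical (assumed threshold).
        elif len(set(values)) <= max_categories:
            kinds[col] = "categorical"
        else:
            kinds[col] = "text"
    cleaned = []
    for r in rows:
        out = {}
        for col in columns:
            raw = r[col]
            if kinds[col] == "numeric":
                out[col] = to_number(raw) if raw else None  # empty cell -> missing
            else:
                out[col] = raw
        cleaned.append(out)
    return cleaned, kinds
```

Rows can be obtained with `list(csv.DictReader(open(path, newline="")))`; coercion is "safe" here in the sense that a single non-numeric cell keeps the whole column as text instead of raising.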

## Profiling & Statistics
  • Head/preview, dtypes, missingness
  • Summary stats for numeric & categorical features
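
As a rough illustration of what per-column profiling involves, the sketch below counts missing cells and produces either numeric summary statistics or the most frequent categories. The function name `profile_column`, the use of `None` as the missing marker, and the choice of statistics are assumptions made for the example, not a description of the report's exact contents.

```python
import statistics
from collections import Counter


def profile_column(values):
    """Summarize one column; None marks a missing cell."""
    present = [v for v in values if v is not None]
    summary = {"count": len(values), "missing": len(values) - len(present)}
    if present and all(isinstance(v, (int, float)) for v in present):
        summary.update(
            kind="numeric",
            mean=statistics.fmean(present),
            median=statistics.median(present),
            min=min(present),
            max=max(present),
        )
    else:
        # Non-numeric columns: report the three most frequent values.
        summary.update(kind="categorical", top=Counter(present).most_common(3))
    return summary
```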

## Signals & Metrics
  • Distributions / value counts
  • Pearson correlations for numeric pairs
  • Bimodality coefficient to flag multi-modal shapes
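
One common definition of a bimodality coefficient is (skewness² + 1) / kurtosis, where kurtosis is the non-excess form (3 for a normal distribution). The sketch below assumes that definition with population moments; the source does not state which variant or threshold DovaLens uses. A uniform distribution gives 5/9 ≈ 0.555, so values above that are often read as a hint of more than one mode.

```python
def bimodality_coefficient(values):
    """Return (skewness**2 + 1) / kurtosis using population moments."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance
    if m2 == 0:
        raise ValueError("a constant column has no defined coefficient")
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2  # non-excess kurtosis
    return (skewness ** 2 + 1) / kurtosis
```

A sample split evenly between two values reaches the maximum of 1.0, while a single-peaked sample such as `[1, 2, 2, 3, 3, 3, 4, 4, 5]` stays below 5/9.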

## Unsupervised Structure
  • K-Means on standardized numeric subsets to expose coarse segments
  • Cluster sizes reported to highlight dominant patterns
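
To make the idea concrete, here is a small pure-Python sketch of standardization followed by Lloyd's K-Means, ending with the cluster sizes. It is not the tool's implementation: `k` is passed in explicitly because the heuristic DovaLens uses to pick it is not documented here, and the farthest-first initialization is a choice made only to keep the example deterministic.

```python
import math
from collections import Counter


def standardize(rows):
    """Scale each column to zero mean and unit (population) standard deviation."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0  # constant column
        for c, m in zip(cols, means)
    ]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm; returns one cluster label per point."""
    # Farthest-first initialization: start at the first point, then keep
    # adding the point that is farthest from all centers chosen so far.
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(far))
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest center by squared Euclidean distance.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points
        ]
        if new_labels == labels:
            break  # converged
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels


def cluster_sizes(labels):
    """Map each cluster label to the number of rows assigned to it."""
    return dict(Counter(labels))
```

Standardizing first keeps a column with large units from dominating the distance computation.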

## Anomalies
  • Isolation Forest surfaces atypical rows based on multivariate behavior
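
The intuition behind Isolation Forest is that unusual rows are separated from the rest by fewer random splits. The sketch below is a simplified, standard-library version of that idea for illustration only; it is not DovaLens's code, and in practice a library implementation such as scikit-learn's IsolationForest is the usual choice. The parameter names and defaults are assumptions.

```python
import math
import random


def c_factor(n):
    """Average path length of an unsuccessful binary-search-tree lookup."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + 0.5772156649  # approximates H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def build_tree(rows, depth, max_depth, rng):
    """Split rows on a random feature at a random threshold, recursively."""
    if depth >= max_depth or len(rows) <= 1:
        return {"size": len(rows)}
    feature = rng.randrange(len(rows[0]))
    lo = min(r[feature] for r in rows)
    hi = max(r[feature] for r in rows)
    if lo == hi:  # simplification: stop instead of trying another feature
        return {"size": len(rows)}
    split = rng.uniform(lo, hi)
    left = [r for r in rows if r[feature] < split]
    right = [r for r in rows if r[feature] >= split]
    return {
        "feature": feature,
        "split": split,
        "left": build_tree(left, depth + 1, max_depth, rng),
        "right": build_tree(right, depth + 1, max_depth, rng),
    }


def path_length(row, node, depth=0):
    """Depth at which row lands, plus a correction for unsplit leaf rows."""
    if "size" in node:
        return depth + c_factor(node["size"])
    child = node["left"] if row[node["feature"]] < node["split"] else node["right"]
    return path_length(row, child, depth + 1)


def anomaly_scores(rows, n_trees=100, sample_size=256, seed=0):
    """Return a score in (0, 1) per row; higher means easier to isolate."""
    if len(rows) < 2:
        raise ValueError("need at least two rows")
    rng = random.Random(seed)
    size = min(sample_size, len(rows))
    max_depth = math.ceil(math.log2(size))
    trees = [
        build_tree(rng.sample(rows, size), 0, max_depth, rng)
        for _ in range(n_trees)
    ]
    norm = c_factor(size)
    return [
        2.0 ** (-sum(path_length(r, t) for t in trees) / (n_trees * norm))
        for r in rows
    ]
```

Sorting rows by descending score gives a "top outliers" list of the kind the report describes.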

## Drift
  • Two-sample KS tests compare distributions across groups (when sensible grouping keys exist)
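
The two-sample KS statistic is the largest vertical gap between the empirical CDFs of two samples: 0 means identical samples, 1 means no overlap. The sketch below computes that statistic for every pair of groups; the function names and the pairwise comparison scheme are assumptions, and no p-value is computed (a statistics library such as SciPy's `ks_2samp` would normally supply one).

```python
from collections import defaultdict
from itertools import combinations


def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    i = j = 0
    gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both pointers past every value <= x, then compare CDFs at x.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        gap = max(gap, abs(i / len(a) - j / len(b)))
    return gap


def drift_by_group(rows, group_key, value_key):
    """KS statistic of value_key for every pair of groups in group_key."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    return {
        (g1, g2): ks_statistic(groups[g1], groups[g2])
        for g1, g2 in combinations(sorted(groups), 2)
    }
```

For example, `ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])` is 0.5, since half of the first sample lies below every value of the second.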

## Report
Everything is assembled into a single, portable HTML file you can open and share.

## Performance Notes
Handles very large CSVs; if you hit memory limits, consider:

  • Running on a machine with more RAM
  • Pre-filtering columns not needed for EDA
  • Sampling rows for a quick first look
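
Row sampling can be done in a single pass without loading the whole file, using reservoir sampling. The sketch below is one way to prepare a smaller CSV before running `dovalens` on it; the helper names and the `k` and `seed` parameters are hypothetical and not part of the tool.

```python
import csv
import random


def sample_csv_rows(lines, k, seed=0):
    """Reservoir-sample k data rows from an iterable of CSV lines.

    The header row is always kept. Every data row has the same chance of
    being selected, and only k rows are held in memory at a time.
    """
    reader = csv.reader(lines)
    header = next(reader)
    rng = random.Random(seed)
    reservoir = []
    for index, row in enumerate(reader):
        if index < k:
            reservoir.append(row)
        else:
            # Replace a random slot with probability k / (index + 1).
            j = rng.randint(0, index)
            if j < k:
                reservoir[j] = row
    return header, reservoir


def write_sample(src_path, dst_path, k, seed=0):
    """Write a k-row random sample of src_path to dst_path."""
    with open(src_path, newline="") as src:
        header, rows = sample_csv_rows(src, k, seed)
    with open(dst_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(header)
        writer.writerows(rows)
```

After `write_sample("big.csv", "big_sample.csv", k=50000)`, the sample can be passed to the CLI as usual: `dovalens big_sample.csv`.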

## Limitations
  • CSV schema inference may need manual cleanup for exotic formats
  • KS drift checks rely on meaningful grouping keys

## License
MIT; see LICENSE.

## DovaLens – Automated Dataset Profiling & Drift Detection (IT)
DovaLens is a command-line tool that turns a raw CSV into a readable HTML report.

  • Dataset profiling (schema, preview, missing values)
  • Descriptive statistics for numeric and categorical features
  • Distributions (by state/province/date, when present)
  • Bimodality checks on numeric targets (Pearson's coefficient)
  • Unsupervised clustering (K-Means) for quick segmentation
  • Anomaly detection (Isolation Forest) on multivariate signals
  • Drift via two-sample Kolmogorov–Smirnov tests
  • A single, shareable report.html


## Installation
pip install dovalens

For local development from the repository:

pip install -e .

## Quick Start
# Basic
dovalens path/to/your_dataset.csv

# Custom output
dovalens path/to/your_dataset.csv --output path/to/report.html

If --output is not specified, the report is saved as ./report.html in the current folder.

You can use a relative path (.\examples\german_credit_data.csv) or an absolute one.

## What the Report Contains
  • Dataset preview (head, dtypes, inferred categorical columns)
  • Cleaning rules (removal of Unnamed:*, numeric coercion)
  • Distributions of the main fields (counts / histograms)
  • Bimodality for selected numeric columns
  • Correlations (Pearson)
  • Clustering (K-Means, k chosen heuristically)
  • Anomalies via IsolationForest (top outliers)
  • Drift (KS test) across grouping keys when present

## How It Works (Overview)
Loading & cleaning → Profiling → Signals (distributions, correlations, bimodality) → Clustering (K-Means) → Anomalies (Isolation Forest) → Drift (KS) → single HTML report.

## Performance Notes
Handles very large CSVs; if you hit memory limits, consider:

  • A machine with more RAM
  • Selecting only the columns you need
  • Sampling rows for a quick first look

## Limitations
  • Schema inference may require manual fixes for atypical formats
  • Drift checks require meaningful grouping keys

## License
MIT; see LICENSE.
