Automated dataset analyzer and HTML report generator
# DovaLens – Automated Data Profiling & Drift Detection
DovaLens is a command-line tool that turns a raw CSV into a clean, visual HTML report. It provides:
- Dataset profiling (schema, preview, missing values)
- Summary statistics for numeric and categorical features
- Distribution breakdowns (by state/county/date, etc. when present)
- Bimodality checks on numeric targets (Pearson's coefficient)
- Unsupervised clustering (K-Means) for quick segmentation
- Anomaly detection (Isolation Forest) on multivariate signals
- Drift signals via two-sample Kolmogorov–Smirnov tests
- A single, shareable report.html
Built for fast exploratory data analysis (EDA) on CSV files, from small to very large.
Works out of the box, no notebook required.
## Installation

    pip install dovalens

If you are developing locally from the repo:

    pip install -e .

## Quick Start

    # Basic
    dovalens path/to/your_dataset.csv

    # Custom output path
    dovalens path/to/your_dataset.csv --output path/to/report.html

If --output is omitted, the report is saved as ./report.html in the current working directory.
DovaLens works from any folder: pass either a relative path (for example, .\examples\german_credit_data.csv on Windows) or an absolute one.
## CLI

    usage: dovalens [-h] [--output OUTPUT] input

DovaLens — Automated dataset analyzer

    positional arguments:
      input            Input CSV file

    options:
      -h, --help       Show help and exit
      --output OUTPUT  Output HTML report path (default: ./report.html)
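
For readers who want to script around the tool, the usage text above can be reproduced with a few lines of `argparse`. This is only a sketch of a parser that matches the documented interface; `build_parser` is a hypothetical name and this is not taken from the DovaLens source.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser matching the documented usage (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="dovalens",
        description="DovaLens - Automated dataset analyzer",
    )
    # One required positional argument: the CSV to analyze.
    parser.add_argument("input", help="Input CSV file")
    # Optional report path; the default mirrors the documented behavior.
    parser.add_argument(
        "--output",
        default="./report.html",
        help="Output HTML report path (default: ./report.html)",
    )
    return parser


# Parse an explicit argument list instead of sys.argv so the example is repeatable.
args = build_parser().parse_args(["data.csv"])
print(args.input, args.output)  # data.csv ./report.html
```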
## Examples

    # From the project root (Windows PowerShell)
    dovalens .\examples\german_credit_data.csv

    # Custom name and folder
    dovalens .\examples\covid_de.csv --output .\covid_report.html

    # From anywhere with an absolute path
    dovalens D:\data\sales_2024.csv --output D:\reports\sales_2024_report.html

## What's in the Report
- Dataset preview (head, dtypes, inferred categorical columns)
- Cleaning rules applied (removal of Unnamed:* columns, numeric coercion)
- Distributions for main fields (value counts / histograms)
- Bimodality coefficients for selected numeric columns
- Correlations (Pearson) on numeric features
- Unsupervised clustering (K-Means, k auto-selected heuristically)
- Anomalies via IsolationForest (top outliers)
- Drift checks (two-sample KS) across common grouping keys when present (e.g., by state/county/date)
## How It Works (Technical Overview)
## Loading & Cleaning
- Drops columns like Unnamed:*
- Safe numeric coercion for string-encoded numbers
- Low-cardinality columns are treated as categorical
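
The three cleaning rules can be illustrated with a small dependency-free sketch. The function names and the `max_unique` threshold are assumptions made for this example; the README does not say which cutoff DovaLens uses or how it implements these steps.

```python
import csv
import io


def drop_unnamed(rows):
    """Remove columns whose header starts with 'Unnamed:'."""
    return [
        {k: v for k, v in row.items() if not k.startswith("Unnamed:")}
        for row in rows
    ]


def coerce_number(value):
    """Return a float for numeric-looking strings, None for empty cells."""
    text = value.strip()
    if text == "":
        return None  # treat empty cells as missing
    try:
        return float(text)
    except ValueError:
        return value  # leave non-numeric text unchanged


def categorical_columns(rows, max_unique=10):
    """Columns with at most `max_unique` distinct values (hypothetical cutoff)."""
    names = list(rows[0].keys()) if rows else []
    result = []
    for name in names:
        distinct = {row[name] for row in rows if row[name] is not None}
        if len(distinct) <= max_unique:
            result.append(name)
    return result


raw = "Unnamed: 0,age,sex\n0,23,male\n1,41,female\n2,35,male\n"
rows = drop_unnamed(list(csv.DictReader(io.StringIO(raw))))
rows = [{k: coerce_number(v) for k, v in row.items()} for row in rows]
print(rows[0])                                  # {'age': 23.0, 'sex': 'male'}
print(categorical_columns(rows, max_unique=2))  # ['sex']
```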
## Profiling & Statistics
- Head/preview, dtypes, missingness
- Summary stats for numeric & categorical features
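
As a rough illustration of per-column profiling, the sketch below reports missingness plus either numeric or categorical summary statistics. `profile_column` and the exact fields it returns are hypothetical; the actual report may show different statistics.

```python
import statistics
from collections import Counter


def profile_column(values):
    """Summarize one column: missing count plus numeric or categorical stats."""
    present = [v for v in values if v is not None]
    summary = {"count": len(values), "missing": len(values) - len(present)}
    if present and all(isinstance(v, (int, float)) for v in present):
        summary.update(
            kind="numeric",
            mean=statistics.fmean(present),
            median=statistics.median(present),
            min=min(present),
            max=max(present),
        )
    else:
        top = Counter(present).most_common(1)
        summary.update(
            kind="categorical",
            unique=len(set(present)),
            top=top[0][0] if top else None,  # most frequent value, if any
        )
    return summary


print(profile_column([1, 2, None, 3]))
print(profile_column(["a", "b", "a", None]))
```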
## Signals & Metrics
- Distributions / value counts
- Pearson correlations for numeric pairs
- Bimodality coefficient to flag multi-modal shapes
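
The README names the bimodality measure only as "Pearson's coefficient". One commonly used definition, assumed here, is (skewness² + 1) / kurtosis, which is bounded above by 1 and equals 5/9 for a uniform distribution; values above 5/9 are usually read as a hint of bimodality. The sketch below uses population moments and may differ from what DovaLens computes.

```python
def bimodality_coefficient(values):
    """Return (skewness**2 + 1) / kurtosis using population moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    if m2 == 0:
        raise ValueError("bimodality is undefined for constant data")
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2  # non-excess kurtosis (3 for a normal distribution)
    return (skewness ** 2 + 1) / kurtosis


THRESHOLD = 5 / 9  # value for a uniform distribution

two_peaks = [0, 0, 1, 1]
one_peak = [1, 2, 2, 3, 3, 3, 4, 4, 5]
print(bimodality_coefficient(two_peaks) > THRESHOLD)  # True
print(bimodality_coefficient(one_peak) > THRESHOLD)   # False
```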
## Unsupervised Structure
- K-Means on standardized numeric subsets to expose coarse segments
- Cluster sizes reported to highlight dominant patterns
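
To make the idea concrete, here is a dependency-free sketch of standardization followed by plain Lloyd's K-Means, ending with the cluster sizes. In practice a library such as scikit-learn would normally be used. The function names are hypothetical, and `k` is passed explicitly because the README does not describe the heuristic DovaLens uses to pick it.

```python
import random
import statistics
from collections import Counter


def standardize(rows):
    """Scale each column to zero mean and unit variance."""
    columns = list(zip(*rows))
    means = [statistics.fmean(c) for c in columns]
    stdevs = [statistics.pstdev(c) or 1.0 for c in columns]  # guard constant columns
    return [
        tuple((x - m) / s for x, m, s in zip(row, means, stdevs))
        for row in rows
    ]


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centers.append(tuple(statistics.fmean(c) for c in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return labels


data = [(1.0, 10.0), (1.2, 11.0), (0.8, 9.0),
        (8.0, 95.0), (8.3, 100.0), (7.9, 98.0), (8.1, 97.0)]
labels = kmeans(standardize(data), k=2)
print(sorted(Counter(labels).values()))  # [3, 4]
```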
## Anomalies
- Isolation Forest surfaces atypical rows based on multivariate behavior
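
The intuition behind Isolation Forest is that atypical rows are separated from the rest by fewer random splits. The sketch below is a compact, dependency-free version of the algorithm for illustration only; the README's reference to IsolationForest suggests DovaLens uses a library implementation, and the names and defaults here (`anomaly_scores`, 100 trees, sample size 256) are assumptions.

```python
import math
import random


def average_path(n):
    """Expected path length of an unsuccessful search in a BST of n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + 0.5772156649  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def build_tree(data, rng, depth, max_depth):
    """Recursively split on a random feature at a random threshold."""
    if len(data) <= 1 or depth >= max_depth:
        return {"size": len(data)}
    feature = rng.randrange(len(data[0]))
    low = min(row[feature] for row in data)
    high = max(row[feature] for row in data)
    if low == high:
        return {"size": len(data)}
    split = rng.uniform(low, high)
    return {
        "feature": feature,
        "split": split,
        "left": build_tree([r for r in data if r[feature] < split], rng, depth + 1, max_depth),
        "right": build_tree([r for r in data if r[feature] >= split], rng, depth + 1, max_depth),
    }


def path_length(point, node, depth=0):
    """Depth of the leaf reached by `point`, adjusted for unsplit leaf size."""
    if "feature" not in node:
        return depth + average_path(node["size"])
    branch = "left" if point[node["feature"]] < node["split"] else "right"
    return path_length(point, node[branch], depth + 1)


def anomaly_scores(data, trees=100, sample_size=256, seed=0):
    """Score each row in (0, 1); higher means easier to isolate."""
    if len(data) < 2:
        raise ValueError("need at least two rows")
    rng = random.Random(seed)
    size = min(sample_size, len(data))
    max_depth = math.ceil(math.log2(size))
    forest = [build_tree(rng.sample(data, size), rng, 0, max_depth) for _ in range(trees)]
    norm = average_path(size)
    return [
        2.0 ** (-sum(path_length(p, t) for t in forest) / (trees * norm))
        for p in data
    ]


inliers = [(x / 10, y / 10) for x in range(7) for y in range(7)]
data = inliers + [(10.0, 10.0)]  # one obvious outlier
scores = anomaly_scores(data)
top = max(range(len(data)), key=scores.__getitem__)
print(data[top])  # (10.0, 10.0)
```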
## Drift
- Two-sample KS tests compare distributions across groups (when sensible grouping keys exist)
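
The two-sample KS statistic is the largest vertical gap between the empirical CDFs of two groups. A minimal sketch follows, using the standard large-sample critical value for the decision; the function names are hypothetical, and DovaLens may rely on a library routine and report p-values instead.

```python
import math
from bisect import bisect_right


def ks_statistic(a, b):
    """Maximum gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    gap = 0.0
    # The supremum is attained at one of the observed values.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def rejects_same_distribution(a, b, alpha=0.05):
    """Large-sample test: compare D with c(alpha) * sqrt((n + m) / (n * m))."""
    n, m = len(a), len(b)
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2))  # about 1.358 for alpha = 0.05
    return ks_statistic(a, b) > c_alpha * math.sqrt((n + m) / (n * m))


group_a = list(range(100))
group_b = [x + 50 for x in range(100)]  # same shape, shifted by 50
print(ks_statistic(group_a, group_b))               # 0.5
print(rejects_same_distribution(group_a, group_b))  # True
```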
## Report
Everything is assembled into a single, portable HTML file you can open and share.
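
A single-file report can be as simple as one HTML string built from the computed sections. The sketch below shows the general shape under that assumption; `render_report` and the table layout are hypothetical and much plainer than the visual report DovaLens produces.

```python
import html


def render_report(title, sections):
    """Build one self-contained HTML page from (heading, rows) sections."""
    parts = [
        "<!DOCTYPE html>",
        "<html><head><meta charset='utf-8'>",
        f"<title>{html.escape(title)}</title></head><body>",
        f"<h1>{html.escape(title)}</h1>",
    ]
    for heading, rows in sections:
        parts.append(f"<h2>{html.escape(heading)}</h2>")
        parts.append("<table>")
        for key, value in rows:
            # Escape every value so raw CSV content cannot break the markup.
            parts.append(
                f"<tr><th>{html.escape(str(key))}</th>"
                f"<td>{html.escape(str(value))}</td></tr>"
            )
        parts.append("</table>")
    parts.append("</body></html>")
    return "\n".join(parts)


page = render_report("Report <demo>", [("Missing values", [("age", 3), ("sex", 0)])])
print(page.splitlines()[0])  # <!DOCTYPE html>
```

Writing `page` to disk with UTF-8 encoding yields a file that opens in any browser without extra assets.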
## Performance Notes
Handles very large CSVs; if you hit memory limits, consider:
- Running on a machine with more RAM
- Pre-filtering columns not needed for EDA
- Sampling rows for a quick first look
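
For the sampling option, reservoir sampling draws a uniform random subset of rows in one pass without loading the whole file. The sketch below assumes a CSV with a header row; `sample_csv_rows` is a hypothetical helper, not part of DovaLens. The sampled rows can be written to a smaller CSV and passed to dovalens as usual.

```python
import csv
import os
import random
import tempfile


def sample_csv_rows(path, k, seed=0):
    """Return the header and a uniform random sample of up to k data rows."""
    rng = random.Random(seed)
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader)  # assumes the first line is a header
        reservoir = []
        for index, row in enumerate(reader):
            if index < k:
                reservoir.append(row)
            else:
                # Keep the new row with probability k / (index + 1).
                slot = rng.randint(0, index)
                if slot < k:
                    reservoir[slot] = row
    return header, reservoir


# Demo on a temporary 1000-row file.
with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "big.csv")
    with open(source, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["id", "value"])
        writer.writerows([i, i * i] for i in range(1000))
    header, rows = sample_csv_rows(source, k=10)
    print(header, len(rows))  # ['id', 'value'] 10
```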
## Limitations
- CSV schema inference may need manual cleanup for exotic formats
- KS drift checks rely on meaningful grouping keys
## License
MIT. See the LICENSE file.
## DovaLens – Automated Dataset Profiling & Drift Detection (IT)
DovaLens is a command-line tool that turns a raw CSV into a readable HTML report.
- Dataset profiling (schema, preview, missing values)
- Descriptive statistics for numeric and categorical features
- Distributions (by state/province/date, when present)
- Bimodality checks on numeric targets (Pearson's coefficient)
- Unsupervised clustering (K-Means) for quick segmentation
- Anomaly detection (Isolation Forest) on multivariate signals
- Drift via two-sample Kolmogorov–Smirnov tests
- A single, shareable report.html
## Installation

    pip install dovalens

For local development from the repository:

    pip install -e .

## Quick Start

    # Basic
    dovalens path/to/your_dataset.csv

    # Custom output
    dovalens path/to/your_dataset.csv --output path/to/report.html

If --output is not specified, the report is saved as ./report.html in the current folder.
You can use a relative path (.\examples\german_credit_data.csv) or an absolute one.
## What's in the Report
- Dataset preview (head, dtypes, inferred categorical columns)
- Cleaning rules (removal of Unnamed:* columns, numeric coercion)
- Distributions of the main fields (counts / histograms)
- Bimodality for selected numeric columns
- Correlations (Pearson)
- Clustering (K-Means, k chosen heuristically)
- Anomalies with IsolationForest (top outliers)
- Drift (KS test) across grouping keys when present
## How It Works (Overview)
Loading & cleaning → Profiling → Signals (distributions, correlations, bimodality) → Clustering (K-Means) → Anomalies (Isolation Forest) → Drift (KS) → single HTML report.
## Performance Notes
Handles very large CSVs; if you hit memory limits, consider:
- A machine with more RAM
- Selecting only the columns you need
- Sampling rows for a first look
## Limitations
- Schema inference may require manual fixes for atypical formats
- Drift checks require meaningful grouping keys
## License
MIT. See the LICENSE file.