
Automated dataset analyzer and HTML report generator


DovaLens – Automated Data Profiling & Drift Detection


DovaLens is a command-line tool that turns a raw CSV into a clean, visual HTML report.

  • Dataset profiling (schema, preview, missing values)
  • Summary statistics for numeric and categorical features
  • Distribution breakdowns (by state/county/date, etc. when present)
  • Bimodality checks on numeric targets (Pearson's coefficient)
  • Unsupervised clustering (K-Means) for quick segmentation
  • Anomaly detection (Isolation Forest) on multivariate signals
  • Drift signals via two-sample Kolmogorov–Smirnov tests
  • A single, shareable report.html

Built for fast EDA on small to very large CSV files.
Works out of the box, no notebook required.


## Installation

pip install dovalens

If you are developing locally from the repo:

pip install -e .

## Quick Start

# Basic
dovalens path/to/your_dataset.csv

# Custom output path
dovalens path/to/your_dataset.csv --output path/to/report.html

If --output is omitted, the report is saved as ./report.html in the current working directory.

Works from any folder: pass either a relative path (.\examples\german_credit_data.csv on Windows) or an absolute one.

## CLI
usage: dovalens [-h] [--output OUTPUT] input

DovaLens — Automated dataset analyzer

positional arguments:
  input            Input CSV file

options:
  -h, --help       Show help and exit
  --output OUTPUT  Output HTML report path (default: ./report.html)
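
The interface above is small enough to mirror with `argparse`. The following is a minimal sketch reconstructed from the help text only; it is not taken from the DovaLens source, and `build_parser` is a hypothetical name used for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser matching the documented usage (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="dovalens",
        description="Automated dataset analyzer",
    )
    # One required positional argument: the CSV to analyze.
    parser.add_argument("input", help="Input CSV file")
    # One optional flag with the documented default.
    parser.add_argument(
        "--output",
        default="./report.html",
        help="Output HTML report path (default: ./report.html)",
    )
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["data.csv"])
print(args.input, args.output)
```

Omitting `--output` falls back to `./report.html`, which matches the default described in the help text.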

## Examples
# From the project root (Windows PowerShell)
dovalens .\examples\german_credit_data.csv

# Custom name and folder
dovalens .\examples\covid_de.csv --output .\covid_report.html

# From anywhere with absolute path
dovalens D:\data\sales_2024.csv --output D:\reports\sales_2024_report.html
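
To generate one report per file in a folder, the same command can be driven from a short Python script. This is a sketch that relies only on the documented `input` argument and `--output` flag; `build_command`, `run_batch`, and the `_report.html` naming pattern are hypothetical choices, and `dovalens` is assumed to be on the PATH.

```python
import subprocess
from pathlib import Path


def build_command(csv_path: Path, report_dir: Path) -> list:
    """Return the dovalens command line for one CSV file."""
    report_path = report_dir / f"{csv_path.stem}_report.html"
    return ["dovalens", str(csv_path), "--output", str(report_path)]


def run_batch(data_dir: Path, report_dir: Path) -> None:
    """Run dovalens once for every *.csv file found in data_dir."""
    report_dir.mkdir(parents=True, exist_ok=True)
    for csv_path in sorted(data_dir.glob("*.csv")):
        # check=True raises CalledProcessError if dovalens exits non-zero.
        subprocess.run(build_command(csv_path, report_dir), check=True)
```

Calling `run_batch(Path("examples"), Path("reports"))` would write one HTML file per CSV into `reports`.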

## What's in the Report

  • Dataset preview (head, dtypes, inferred categorical columns)
  • Cleaning rules applied (removal of Unnamed:* columns, numeric coercion)
  • Distributions for main fields (value counts / histograms)
  • Bimodality coefficients for selected numeric columns
  • Correlations (Pearson) on numeric features
  • Unsupervised clustering (K-Means, k auto-selected heuristically)
  • Anomalies via IsolationForest (top outliers)
  • Drift checks (two-sample KS) across common grouping keys when present (e.g., by state/county/date)

## How It Works (Technical Overview)

### Loading & Cleaning
  • Drops columns like Unnamed:*
  • Safe numeric coercion for string-encoded numbers
  • Low-cardinality columns are treated as categorical
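
The three cleaning rules can be illustrated with the standard library alone. This is a sketch, not the DovaLens implementation: the `max_categories=10` threshold, the rule that fully numeric columns are never treated as categorical, and the names `to_number` and `clean_rows` are all assumptions. pandas labels header-less columns `Unnamed: N`, while `csv.DictReader` yields an empty name, so both are dropped here.

```python
import csv
import io


def to_number(text):
    """Return float(text) if it parses, otherwise None."""
    try:
        return float(text.strip())
    except (ValueError, AttributeError):
        return None


def clean_rows(rows, max_categories=10):
    """Drop unnamed columns, coerce numeric strings, tag column kinds."""
    if not rows:
        return [], {}
    columns = [c for c in rows[0] if c and not c.startswith("Unnamed:")]
    kinds = {}
    for col in columns:
        present = [r[col] for r in rows if r[col] not in ("", None)]
        if present and all(to_number(v) is not None for v in present):
            kinds[col] = "numeric"
        elif len(set(present)) <= max_categories:
            kinds[col] = "categorical"  # low cardinality (assumed threshold)
        else:
            kinds[col] = "text"
    cleaned = []
    for r in rows:
        out = {}
        for col in columns:
            raw = r[col] if r[col] not in ("", None) else None
            numeric = kinds[col] == "numeric" and raw is not None
            out[col] = to_number(raw) if numeric else raw
        cleaned.append(out)
    return cleaned, kinds


# The first header is empty, as happens when a saved index column has no name.
sample = io.StringIO(
    ",age,sex,purpose\n"
    "0,67,male,radio/TV\n"
    "1,22,female,radio/TV\n"
    "2,49,male,education\n"
)
cleaned, kinds = clean_rows(list(csv.DictReader(sample)))
print(kinds)
```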

### Profiling & Statistics
  • Head/preview, dtypes, missingness
  • Summary stats for numeric & categorical features
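
As an illustration of what a per-column profile might contain, the sketch below reports missingness plus either numeric or categorical summaries. The exact statistics DovaLens reports are not listed in this README, so the fields chosen here (and the name `profile_column`) are assumptions.

```python
import collections
import statistics


def profile_column(values):
    """Summarize one column; None marks a missing entry."""
    present = [v for v in values if v is not None]
    missing = len(values) - len(present)
    summary = {
        "count": len(values),
        "missing": missing,
        "missing_share": missing / len(values) if values else 0.0,
    }
    if present and all(isinstance(v, (int, float)) for v in present):
        # Numeric column: location and range.
        summary.update(
            min=min(present),
            max=max(present),
            mean=statistics.fmean(present),
            median=statistics.median(present),
        )
    else:
        # Categorical column: cardinality and most frequent value.
        counts = collections.Counter(present)
        summary.update(
            unique=len(counts),
            top=counts.most_common(1)[0] if counts else None,
        )
    return summary
```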

### Signals & Metrics
  • Distributions / value counts
  • Pearson correlations for numeric pairs
  • Bimodality coefficient to flag multi-modal shapes
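
The README does not say which formula is used for the bimodality coefficient. A common choice is Sarle's coefficient, BC = (skewness² + 1) / kurtosis, where values above 5/9 (the value for a uniform distribution) hint at more than one mode. The sketch below assumes that definition in its large-sample form, without the finite-sample correction.

```python
import statistics


def bimodality_coefficient(values):
    """Sarle's bimodality coefficient from central moments (assumed formula)."""
    n = len(values)
    if n < 4:
        raise ValueError("need at least 4 values")
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    if m2 == 0:
        raise ValueError("values are constant")
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2  # non-excess kurtosis (3 for a normal distribution)
    return (skewness ** 2 + 1) / kurtosis


two_peaks = [0] * 50 + [1] * 50
one_peak = [1, 2, 2, 3, 3, 3, 4, 4, 5]
print(bimodality_coefficient(two_peaks), bimodality_coefficient(one_peak))
```

The two-peak sample lands well above the 5/9 threshold and the single-peak sample below it.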

### Unsupervised Structure
  • K-Means on standardized numeric subsets to expose coarse segments
  • Cluster sizes reported to highlight dominant patterns
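
To make the two steps concrete, here is a dependency-free sketch of z-score standardization followed by Lloyd's K-Means iteration. It is an illustration, not the DovaLens code: `k` is passed in explicitly because the README does not describe the heuristic that selects it, and the farthest-point initialization is a choice made here to keep the example deterministic.

```python
import math
import statistics


def standardize(rows):
    """Scale every column to zero mean and unit variance."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) or 1.0 for col in columns]  # guard constants
    return [[(v - mu) / sd for v, mu, sd in zip(row, means, stds)] for row in rows]


def farthest_point_init(rows, k):
    """Start from the first row, then repeatedly add the farthest row."""
    centers = [list(rows[0])]
    while len(centers) < k:
        farthest = max(rows, key=lambda r: min(math.dist(r, c) for c in centers))
        centers.append(list(farthest))
    return centers


def kmeans(rows, k, max_iter=100):
    """Lloyd's algorithm: assign to nearest center, then recompute centers."""
    centers = farthest_point_init(rows, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: math.dist(row, centers[j])) for row in rows
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == j]
            if members:
                centers[j] = [statistics.fmean(col) for col in zip(*members)]
    return labels, centers


points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centers = kmeans(standardize(points), k=2)
sizes = sorted(labels.count(lab) for lab in set(labels))
print(labels, sizes)
```

Standardizing first keeps a column with a large numeric range from dominating the distance computation.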

### Anomalies
  • Isolation Forest surfaces atypical rows based on multivariate behavior
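
The idea behind Isolation Forest is that atypical rows are separated from the rest by fewer random splits. The sketch below is a small from-scratch version for illustration only; DovaLens presumably relies on a library implementation, and the parameters shown (`n_trees`, `sample_size`, `seed`) are assumptions rather than its actual settings.

```python
import math
import random

EULER_GAMMA = 0.5772156649


def average_path_length(n):
    """Expected depth of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def build_tree(rows, depth, max_depth, rng):
    """Recursively split on a random feature at a random threshold."""
    if depth >= max_depth or len(rows) <= 1:
        return ("leaf", len(rows))
    dim = rng.randrange(len(rows[0]))
    low = min(r[dim] for r in rows)
    high = max(r[dim] for r in rows)
    if low == high:
        return ("leaf", len(rows))
    split = rng.uniform(low, high)
    left = [r for r in rows if r[dim] < split]
    right = [r for r in rows if r[dim] >= split]
    return (
        "node", dim, split,
        build_tree(left, depth + 1, max_depth, rng),
        build_tree(right, depth + 1, max_depth, rng),
    )


def path_length(row, tree, depth=0):
    """Depth at which the row reaches a leaf, adjusted for unsplit leaf size."""
    if tree[0] == "leaf":
        return depth + average_path_length(tree[1])
    _, dim, split, left, right = tree
    return path_length(row, left if row[dim] < split else right, depth + 1)


def anomaly_scores(rows, n_trees=100, sample_size=256, seed=0):
    """Return one score per row in (0, 1]; higher means more anomalous."""
    if len(rows) < 2:
        raise ValueError("need at least 2 rows")
    rng = random.Random(seed)
    size = min(sample_size, len(rows))
    max_depth = math.ceil(math.log2(size))
    trees = [build_tree(rng.sample(rows, size), 0, max_depth, rng) for _ in range(n_trees)]
    norm = average_path_length(size)
    scores = []
    for row in rows:
        mean_depth = sum(path_length(row, t) for t in trees) / n_trees
        scores.append(2.0 ** (-mean_depth / norm))
    return scores


# A tight 5x5 grid plus one distant point appended last.
grid = [[i * 0.1, j * 0.1] for i in range(5) for j in range(5)]
scores = anomaly_scores(grid + [[10.0, 10.0]])
print(scores.index(max(scores)))
```

Sorting rows by descending score gives a "top outliers" list of the kind the report describes.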

### Drift
  • Two-sample KS tests compare distributions across groups (when sensible grouping keys exist)
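
The two-sample KS statistic is the largest vertical gap between the empirical CDFs of two groups. Below is a minimal sketch of that computation together with the asymptotic p-value approximation; it is not the DovaLens code, and for real work a library routine such as `scipy.stats.ks_2samp` is the safer choice because the approximation is rough for small samples.

```python
import bisect
import math


def ks_statistic(sample_a, sample_b):
    """Maximum absolute difference between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    d = 0.0
    for x in a + b:
        # bisect_right counts the values <= x, giving the ECDF at x.
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def ks_pvalue(d, n, m):
    """Asymptotic p-value from the Kolmogorov distribution (approximation)."""
    lam = d * math.sqrt(n * m / (n + m))
    if lam == 0:
        return 1.0
    total = 0.0
    for k in range(1, 100001):
        term = math.exp(-2.0 * k * k * lam * lam)
        total += term if k % 2 else -term  # alternating series
        if term < 1e-12:
            break
    return max(0.0, min(1.0, 2.0 * total))


group_a = [1, 2, 3, 4]
group_b = [3, 4, 5, 6]
d = ks_statistic(group_a, group_b)
print(d, ks_pvalue(d, len(group_a), len(group_b)))
```

A small p-value suggests the two groups were not drawn from the same distribution, which is the drift signal the report refers to.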

### Report
Everything is assembled into a single, portable HTML file you can open and share.

## Performance Notes
Handles very large CSVs; if you hit memory limits, consider:

  • Running on a machine with more RAM
  • Pre-filtering columns not needed for EDA
  • Sampling rows for a quick first look
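
For the row-sampling option, reservoir sampling draws a uniform random subset in a single pass without loading the whole file. This is a sketch of a preprocessing step you could run before DovaLens, not a feature of the tool; `sample_rows` and `sample_csv` are hypothetical helper names.

```python
import csv
import random


def sample_rows(rows, k, seed=0):
    """Reservoir sampling (Algorithm R): keep k rows chosen uniformly at random."""
    rng = random.Random(seed)
    reservoir = []
    for i, row in enumerate(rows):
        if i < k:
            reservoir.append(row)
        else:
            # Row i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = row
    return reservoir


def sample_csv(src_path, dst_path, k, seed=0):
    """Write the header plus k sampled rows of src_path to dst_path."""
    with open(src_path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)
        rows = sample_rows(reader, k, seed)
    with open(dst_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(header)
        writer.writerows(rows)
```

The sampled file can then be passed to `dovalens` like any other CSV.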

## Limitations
  • CSV schema inference may need manual cleanup for exotic formats
  • KS drift checks rely on meaningful grouping keys

## License
MIT. See the LICENSE file.




Download files


Source Distribution

dovalens-1.0.5.tar.gz (14.5 kB)

Uploaded Source

Built Distribution


dovalens-1.0.5-py3-none-any.whl (12.8 kB)

Uploaded Python 3

File details

Details for the file dovalens-1.0.5.tar.gz.

File metadata

  • Download URL: dovalens-1.0.5.tar.gz
  • Upload date:
  • Size: 14.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.10

File hashes

Hashes for dovalens-1.0.5.tar.gz
  • SHA256: 71128e1016ceb0a80d8d48dfbdb5b195cf092bd7cda4ce16fbc8cf67f522d8e6
  • MD5: 2d7171f5c37a765fadda847b419db11f
  • BLAKE2b-256: 87e60d23b295ed28f4cec5c91786e81c5034eadd8fe9e6349fd6ec0294953b13


File details

Details for the file dovalens-1.0.5-py3-none-any.whl.

File metadata

  • Download URL: dovalens-1.0.5-py3-none-any.whl
  • Upload date:
  • Size: 12.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.10

File hashes

Hashes for dovalens-1.0.5-py3-none-any.whl
  • SHA256: ea5eb5cfc6ee739cdba7bc5e60fd227c11bb11a739ce476e1031f2a56037c2e0
  • MD5: 227a7c6a14f1566a91dc6d1945cffb60
  • BLAKE2b-256: a32158f302e1385308ce4b561d1e84ad020965cbfd5150b9f987070451efef12

