Skip to main content

Automated dataset analyzer and HTML report generator

Project description

DovaLens – Automated Data Profiling & Drift Detection

PyPI version Python versions License: MIT

DovaLens is a command-line tool that turns a raw CSV into a clean, visual HTML report.

  • Dataset profiling (schema, preview, missing values)
  • Summary statistics for numeric and categorical features
  • Distribution breakdowns (by state/county/date, etc. when present)
  • Bimodality checks on numeric targets (Pearson's coefficient)
  • Unsupervised clustering (K-Means) for quick segmentation
  • Anomaly detection (Isolation Forest) on multivariate signals
  • Drift signals via two-sample Kolmogorov–Smirnov tests
  • A single, shareable report.html

Built for fast EDA on small to very large CSV files.
Works out of the box, no notebook required.


Installation

pip install dovalens

##If you are developing locally from the repo:

pip install -e .

##Quick Start

# Basic
dovalens path/to/your_dataset.csv

# Custom output path
dovalens path/to/your_dataset.csv --output path/to/report.html

If --output is omitted, the report is saved as ./report.html in the current working directory.

Works from any folder: pass either a relative path (.\examples\german_credit_data.csv on Windows) or an absolute one.

##CLI
usage: dovalens [-h] [--output OUTPUT] input

DovaLens  Automated dataset analyzer

positional arguments:
  input            Input CSV file

options:
  -h, --help       Show help and exit
  --output OUTPUT  Output HTML report path (default: ./report.html)

##Examples
# From the project root (Windows PowerShell)
dovalens .\examples\german_credit_data.csv

# Custom name and folder
dovalens .\examples\covid_de.csv --output .\covid_report.html

# From anywhere with absolute path

dovalens D:\data\sales_2024.csv --output D:\reports\sales_2024_report.html

##What's in the Report

Dataset preview (head, dtypes, inferred categorical columns)
Cleaning rules applied (remove Unnamed:*, numeric coercion)
Distributions for main fields (value counts / histograms)
Bimodality coefficients for selected numeric columns
Correlations (Pearson) on numeric features
Unsupervised clustering (K-Means, k auto-selected heuristically)
Anomalies via IsolationForest (top outliers)
Drift checks (two-sample KS) across common grouping keys when present (e.g., by state/county/date)

##How It Works (Technical Overview)
Loading & Cleaning
Drops columns like Unnamed:*
Safe numeric coercion for string-encoded numbers
Low-cardinality columns are treated as categorical

##Profiling & Statistics
Head/preview, dtypes, missingness
Summary stats for numeric & categorical features

##Signals & Metrics
Distributions / value counts
Pearson correlations for numeric pairs
Bimodality coefficient to flag multi-modal shapes

##Unsupervised Structure
K-Means on standardized numeric subsets to expose coarse segments
Cluster sizes reported to highlight dominant patterns

##Anomalies
Isolation Forest surfaces atypical rows based on multivariate behavior

##Drift
Two-sample KS tests compare distributions across groups (when sensible grouping keys exist)

##Report
Everything is assembled into a single, portable HTML file you can open and share.

##Performance Notes
Handles very large CSVs; if you hit memory limits, consider:

Running on a machine with more RAM
Pre-filtering columns not needed for EDA
Sampling rows for a quick first look

##Limitations
CSV schema inference may need manual cleanup for exotic formats
KS drift checks rely on meaningful grouping keys

##License
MIT  see LICENSE [blocked].

## DovaLens – Profilazione automatica del dataset & Rilevamento del Drift (IT)
#DovaLens è un tool da riga di comando che trasforma un CSV grezzo in un report HTML leggibile.

Profilazione dataset (schema, anteprima, valori mancanti)
Statistiche descrittive per feature numeriche e categoriche
Distribuzioni (per stato/provincia/data, quando presenti)
Controlli di bimodalità su target numerici (coeff. di Pearson)
Clustering non supervisionato (K-Means) per segmentazioni rapide
Rilevamento anomalie (Isolation Forest) su segnali multivariati
Drift con test Kolmogorov–Smirnov a due campioni
Un unico report.html condivisibile


##Installazione
pip install dovalens

##Per sviluppo locale dal repository:

pip install -e .

##Avvio Rapido
# Base
dovalens path/al/tuo_dataset.csv

# Output personalizzato
dovalens path/al/tuo_dataset.csv --output path/al/report.html

Se --output non è specificato, il report viene salvato come ./report.html nella cartella corrente.

Puoi usare un percorso relativo (.\examples\german_credit_data.csv) o assoluto.

##Cosa Contiene il Report
Anteprima dataset (head, dtypes, colonne categoriche inferite)
Regole di pulizia (rimozione Unnamed:*, coercizione numerica)
Distribuzioni dei campi principali (conteggi / istogrammi)
Bimodalità per colonne numeriche selezionate
Correlazioni (Pearson)
Clustering (K-Means, k scelto euristicamente)
Anomalie con IsolationForest (outlier principali)
Drift (test KS) su chiavi di raggruppamento quando presenti

##Come Funziona (Overview)
Caricamento & pulizia  Profilazione  Segnali (distribuzioni, correlazioni, bimodalità)  Clustering (K-Means)  Anomalie (Isolation Forest)  Drift (KS)  report HTML unico.

##Note di Performance
Gestisce CSV molto grandi; in caso di limiti di memoria valuta:

Macchina con più RAM
Selezione delle sole colonne utili
Campionamento righe per una prima occhiata

##Limitazioni
L'inferenza dello schema può richiedere fix manuali per formati atipici
I controlli di drift richiedono chiavi di gruppo significative

##Licenza
MIT  vedi LICENSE [blocked].

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dovalens-1.0.2.tar.gz (8.4 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

dovalens-1.0.2-py3-none-any.whl (9.1 kB view details)

Uploaded Python 3

File details

Details for the file dovalens-1.0.2.tar.gz.

File metadata

  • Download URL: dovalens-1.0.2.tar.gz
  • Upload date:
  • Size: 8.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.10

File hashes

Hashes for dovalens-1.0.2.tar.gz
Algorithm Hash digest
SHA256 c576fd0dd8398fb4cc285c0ccf2cb9aac01be4a6ad08976053a533a4d3cc621b
MD5 a05946a8caee4146e004a4510370649f
BLAKE2b-256 2c551cf11f16389b2d8525dbabbd6c281725e2b044f6e5a8682c8654b880bc6c

See more details on using hashes here.

File details

Details for the file dovalens-1.0.2-py3-none-any.whl.

File metadata

  • Download URL: dovalens-1.0.2-py3-none-any.whl
  • Upload date:
  • Size: 9.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.10

File hashes

Hashes for dovalens-1.0.2-py3-none-any.whl
Algorithm Hash digest
SHA256 ea283a990a021e156bade63090bbfb34667a1d2399373540ecee07ed0bbb22b3
MD5 8a8c6fb53712be1bf6b02d1e9553356c
BLAKE2b-256 933a69837c4d91b17a6ded4717c5c037e7fe0eed483db8a6a5df1389a24c296a

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page