Skip to main content

Ultra-high-performance data profiling natively for Ibis

Project description

Ibis Profiling Logo

Ibis Profiling

PyPI version License: MIT

An ultra-high-performance data profiling system built natively for Ibis.

Core Principle: Profiling as Query Compilation

Unlike traditional profiling tools (e.g., ydata-profiling) that load data into local memory (Pandas), Ibis Profiling treats profiling as a query planning problem.

It compiles dozens of statistical metrics into a minimal set of optimized SQL queries that execute directly in your remote backend (DuckDB, BigQuery, Snowflake, ClickHouse, etc.). This ensures that computation happens where the data lives, enabling the profiling of multi-billion row datasets in seconds rather than hours.


🖼️ Preview

Overview Dashboard

Overview Screenshot

Variable Detail View

Variables Screenshot

Missing Values Analysis

Missing Screenshot


🚀 Key Features

  • Backend Pushdown: 100% of the heavy lifting is done by the database engine.
  • Multi-Pass Execution: Intelligently splits computation into optimized passes to handle complex moments (Skewness, MAD) without backend "nested aggregation" errors.
  • JSON Schema Parity: Achieves full structural and statistical parity with ydata-profiling, allowing drop-in replacement for downstream automated pipelines.
  • Modern SPA Report: Generates a lightweight Single Page Application (SPA) with a modern React-based UI.
  • Adjustable Themes: Includes built-in support for Dark, Light, and High Contrast modes with persistent user settings.
  • Auto-Categorical Detection: Intelligent heuristics automatically reclassify low-cardinality integers (e.g., status codes, term months) as categorical for better visualization.
  • DateTime Distribution: Full support for temporal histograms and distribution analysis.
  • Excel Support: Directly profile Excel files (.xlsx, .xls, .xlsb) using high-performance Rust-based parsing.
  • Scalability: Profile 5 million rows in < 12 seconds (Minimal mode) and < 22 seconds (Full mode).
  • Python Compatibility: Fully tested on Python 3.11 through 3.14.3 (Core functionality).

🛡️ Backend Stability & NaN Handling

A critical challenge in database-native profiling is the handling of NaN (Not-a-Number) values in floating-point columns. Traditional database aggregations (like STDDEV_SAMP in DuckDB) often throw OutOfRange errors when encountering NaNs.

Ibis Profiling implements a Safe-Aggregation layer that automatically treats NaN values as NULL during statistical computation. This ensures:

  1. Zero Crash Policy: Profiles complete successfully even on messy synthetic or sensor data.
  2. Mathematical Consistency: Statistics (mean, std, variance) are computed on the subset of valid numeric values, matching the behavior of high-level tools like Pandas while staying within the database.

📈 Performance Benchmarks

Benchmarks were conducted using a synthetic dataset with 20 columns (mix of numeric, categorical, text, and boolean) on a standard Linux environment using the DuckDB backend.

Dataset Size Ibis (Min) Ibis (Full) ydata (Min) ydata (Full) Mem Ibis (Min) Mem ydata (Min) Mem Ibis (Full) Mem ydata (Full)
10k Rows 0.89s 1.40s 9.94s 28.38s ~2.4 MB ~74 MB ~4.5 MB ~107 MB
25k Rows 1.03s 1.57s 12.20s 30.47s ~2.1 MB ~154 MB ~4.4 MB ~188 MB
50k Rows 1.22s 1.82s 16.63s 35.10s ~2.0 MB ~284 MB ~4.4 MB ~324 MB
500k Rows 2.29s 3.56s 91.93s ~3m (est) ~2.0 MB ~2.5 GB ~4.4 MB ~2.8 GB (est)
1M Rows 3.14s 5.44s 166.31s ~6m (est) ~2.1 MB ~4.9 GB ~4.4 MB ~5.3 GB (est)
5M Rows 10.69s 17.88s ~14m (est) ~45m (est) ~2.0 MB >20 GB (est) ~4.4 MB >25 GB (est)
10M Rows 20.64s 21.29s* ~28m (est) ~1.5h (est) ~2.4 MB >40 GB (est) ~2.6 MB* >50 GB (est)
20M Rows 43.98s 8.77s** >1h (est) >3h (est) ~2.4 MB >80 GB (est) ~1.0 MB** >100 GB (est)

*Notes:

  • 10M Full (21.29s) used 10 columns.
  • 20M Full (8.77s) used 5 columns.
  • All other benchmarks use 20 columns.
  • Ibis memory usage is nearly constant and extremely low compared to ydata-profiling due to database-native pushdown.*

🔍 Estimation Methodology

Projections for ydata-profiling on larger datasets are derived from observed scaling trends:

  • Time (Minimal): Scaled linearly based on the jump from 500k (92s) to 1M (166s) rows.
  • Time (Full): Scaled with a factor of ~2.5x - 3x over Minimal mode, consistent with small-sample ratios.
  • Memory: Scaled linearly based on observed peak usage (~2.5 GB at 500k, ~4.9 GB at 1M), reflecting the overhead of loading the full dataset into Pandas DataFrames.

🛠 Installation

Install Ibis Profiling directly from PyPI:

Using uv (Recommended)

uv add ibis-profiling

Using pip

pip install ibis-profiling

💻 Usage

Command Line Interface (CLI)

You can profile datasets directly from the terminal without writing any Python code.

Basic Usage

# If installed locally
uv run ProfileReport --file-path data.csv --output report.html

# One-off run (no installation required)
uv run --with ibis-profiling,ibis-framework[duckdb] ProfileReport --file-path data.parquet

CLI Options

Option Shortcut Description
--file-path -f (Required) Path to input file (CSV, Parquet, Excel).
--output -o Path to output file (default: report.html).
--title -t Custom report title.
--minimal Generate a minimal report (faster).
--theme Report theme: default or ydata-like.
--format Force output format: html or json.
--correlations Explicitly enable or disable correlations (--correlations / --no-correlations).
--monotonicity Explicitly enable or disable monotonicity checks (--monotonicity / --no-monotonicity).
--duplicates Explicitly enable or disable duplicate row checks (--duplicates / --no-duplicates).

Quick Start (ydata-style API)

import ibis
from ibis_profiling import ProfileReport

# 1. Connect to any Ibis-supported backend
con = ibis.duckdb.connect()
table = con.read_parquet("large_dataset.parquet")

# 2. Generate the report with custom title
report = ProfileReport(table, title="Loan Analysis Report")

# 3. Export results
report.to_file("report.html")

Excel Ingestion

from ibis_profiling import ProfileReport

# Directly profile Excel files with high-performance parsing
report = ProfileReport.from_excel("data.xlsx")
report.to_file("excel_report.html")

Advanced Usage

from ibis_profiling import profile

# Get the raw description dictionary
report = profile(table)
stats = report.to_dict()

print(f"Dataset Skewness: {stats['variables']['income']['skewness']}")

Minimal vs. Full Profiling

The ProfileReport supports a minimal flag (default False) to toggle between fast exploratory profiling and deep statistical analysis.

Feature Minimal Mode (minimal=True) Full Mode (minimal=False)
Core Stats Count, Mean, Std, Min/Max, Zeros, Nullity. All Minimal stats.
Table Metadata Estimated Memory/Record Size. Same as Minimal.
Advanced Moments Skipped. Skewness, Kurtosis, MAD.
Correlations Skipped. Pearson and Spearman matrices.
Advanced Analysis Skipped. Extreme Values, Monotonicity, Text Lengths.
Visualizations Histograms (Numeric/DateTime), Summary only. Nullity Matrix (SVG), Heatmap, Scatter Plots.
Duplicates Skipped. Dataset-wide duplicate row count.
Performance Ultra-Fast. Recommended for datasets > 50M rows. Detailed. Recommended for deep data quality audits.

📦 Report Export & Layouts

By default, ibis-profiling minifies the generated HTML report to reduce file size (typically by 15-20%) without compromising functionality.

Custom Themes & Themes

The to_file method supports a theme parameter to choose between different report layouts:

# Modern React SPA (Default)
report.to_file("report.html", theme="default")

# Classic layout for ydata-profiling parity
report.to_file("report.html", theme="ydata-like")

Minification

To generate a human-readable (non-minified) report, set minify=False:

# Save as formatted HTML
report.to_file("report.html", minify=False)

🎨 Interactive React SPA

The default report is a fully interactive React Single Page Application (SPA) providing a modern user experience:

  • Instant Search: Quickly filter variables by name or type.
  • Theme Toggle: Switch between Dark, Light, and High Contrast modes with persistent settings.
  • Alert Filtering: Interactive dashboard to filter data quality alerts by severity (Warning vs. Info).
  • Responsive Charts: High-fidelity SVG and Canvas-based visualizations (Histograms, Heatmaps, Scatter Plots).

🏗 Architecture & Backend Support

The system is decoupled into five core modules designed for maximum backend compatibility:

  1. Dataset Inspector: Zero-execution schema analysis.
  2. Metric Registry: Declarative metric definitions as Ibis expressions.
  3. Query Planner: The "compiler" that batches compatible expressions into minimal execution plans.
  4. Execution Engine: Multi-pass dispatcher that handles simple vs. complex aggregations.
  5. Report Builder: Aggregates and formats raw backend results into high-fidelity JSON/HTML.

Supported Backends: 100% compatibility with all Ibis backends including DuckDB, Snowflake, BigQuery, ClickHouse, Postgres, Polars, and more.


📊 Missing Values Analysis

Move beyond simple counts with advanced pattern detection and visualization:

  • Missing Matrix: A vertical sparkline grid (SVG) visualizing the exact location of missing values across rows, allowing you to spot temporal or structural gaps.
  • Nullity Heatmap: Pearson correlation of "nullity" between variables, revealing structural dependencies (e.g., when Column A is missing, Column B is also 90% likely to be missing).

🔍 Pairwise Interactions

Explore dependencies between numeric variables with high-performance scatter plots:

  • Automatic Selection: Intelligently samples the dataset to maintain 60FPS interactivity even with millions of rows.
  • Correlation-Driven: Highlights pairs with high statistical significance to surface hidden patterns.
  • Canvas-Optimized: Uses HTML5 Canvas for smooth rendering of thousands of data points directly in the browser.

📏 Metrics & Calculation Reference

1. Variable Calculations

The profiler uses a multi-pass execution engine to compute statistics efficiently across massive datasets while remaining compatible with SQL-based backends.

Core Statistics (Pass 1)

Metric Calculation Type
n Total number of observations (rows) in the table. All
n_missing Count of NULL or NaN values. All
p_missing n_missing / n All
n_distinct Count of unique values (excluding NULL). All
mean sum(x) / count (NaNs treated as NULL) Numeric
std Sample standard deviation (Bessel's correction). Numeric
min / max Minimum and maximum values. Numeric, DateTime
histogram Binned distribution (Numeric/DateTime) or Top Values (Categorical). All

Advanced Statistics (Pass 2)

To avoid "Nested Aggregation" errors in SQL backends, these are computed using values from Pass 1 as constants.

Metric Calculation Logic
skewness mean( ((x - μ) / σ)^3 ) Standardized 3rd moment.
mad mean( abs(x - μ) ) Mean Absolute Deviation.
n_duplicates n - count(distinct_rows) Dataset-wide duplicate row count.

2. Alert Engine Logic

Alert Type Logic / Threshold Severity
CONSTANT n_distinct == 1 warning
UNIQUE n_distinct == n warning
HIGH_CARDINALITY p_distinct > 0.5 (Categorical only) warning
MISSING p_missing > 0.05 info
ZEROS p_zeros > 0.10 info
SKEWED abs(skewness) > 10 info

📄 License

Ibis Profiling is licensed under the MIT License. See LICENSE for details.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ibis_profiling-0.1.6.tar.gz (799.2 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

ibis_profiling-0.1.6-py3-none-any.whl (804.5 kB view details)

Uploaded Python 3

File details

Details for the file ibis_profiling-0.1.6.tar.gz.

File metadata

  • Download URL: ibis_profiling-0.1.6.tar.gz
  • Upload date:
  • Size: 799.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for ibis_profiling-0.1.6.tar.gz
Algorithm Hash digest
SHA256 86591139dd9036560ffd2f213ab2e115795146ebe08328964c91b57a08339d28
MD5 fbdc80f2d12e885794fe65fc03b2da0d
BLAKE2b-256 6fc9f39f242b62a935de45c6af0fba3d45d8f58e753196c37e829b33f8d55dc2

See more details on using hashes here.

Provenance

The following attestation bundles were made for ibis_profiling-0.1.6.tar.gz:

Publisher: publish.yml on beallio/ibis-profiling

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file ibis_profiling-0.1.6-py3-none-any.whl.

File metadata

  • Download URL: ibis_profiling-0.1.6-py3-none-any.whl
  • Upload date:
  • Size: 804.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for ibis_profiling-0.1.6-py3-none-any.whl
Algorithm Hash digest
SHA256 703b5c59f6783de47be5532cd4b288d115d4e419423f55237f134535d2adb878
MD5 cc087e102bb208713bc2697d45e49d7a
BLAKE2b-256 565f8bd73ec3141628155bcd2cf9db49eab824d989fe89177daeee8b15e90fc8

See more details on using hashes here.

Provenance

The following attestation bundles were made for ibis_profiling-0.1.6-py3-none-any.whl:

Publisher: publish.yml on beallio/ibis-profiling

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page