# Netra Profiler

High-performance profiling and data quality tool built with Polars.
Netra Profiler is a high-performance data profiling tool and diagnostic engine built on top of Polars. It maximizes single-node hardware utilization by leveraging Polars' Rust-based query optimizer and columnar Apache Arrow memory model.
The profiler ships with a configurable diagnostic engine to detect data quality issues in your EDA or ELT workflows. It automatically flags anomalies like extreme zero-inflation, high cardinality, severe data skew, and corrupted primary keys. Netra includes an information-dense, zero-configuration CLI designed to instantly profile your CSV, JSON, IPC/Arrow, and Parquet files directly from the terminal.
## Performance Benchmarks
Note: All the scripts used to fetch the dataset, run the benchmarks, and generate the results can be found in the `benchmarks/` directory.
### Dataset
To ensure the benchmarks reflect the real-world friction of a typical data workload, we use the New York City TLC Yellow Taxi Trip Records dataset. It contains high-cardinality columns, columns with substantial null or missing data, and shifting schemas.
When reviewing the metrics below, please keep the following nuances in mind regarding the data:
- Data Organization: The TLC publishes the data as individual `.parquet` files for every month of the year. To test raw I/O and schema harmonization, we process these files as-is, without combining files or pre-processing the data.
- The Timeline: We restricted the benchmarks to the years 2018–2024, as the schema remains relatively stable in this interval and provides sufficient volume for the local benchmarks.
- The COVID-19 Data Cliff: Pre-pandemic files (2018–2019) are significantly larger, containing 8 to 10 million trips per month compared to the 2 to 3 million trips in post-2020 files. For our local tests, we are predominantly using the older files.
- Parquet Compression: All disk sizes referenced in these benchmarks represent the heavily compressed Snappy Parquet files. Once decompressed in memory, the data is roughly 6x to 10x larger than its on-disk size.
### The Data Envelope
The Data Envelope is the maximum size and complexity of data your pipeline can process within your hardware limitations or cloud budget ceiling. Netra Profiler is designed to be a value multiplier for your existing hardware. This allows you to:
- Stay Local Longer: Process larger workloads directly on your laptop or workstation without needing to migrate to an HPC or cloud platform.
- Scale Vertically: Fully saturate a single heavy compute node (like an AWS EC2 instance) to bypass the overhead of complex, multi-node distributed frameworks like Apache Spark.
- Preserve Productivity: Near-interactive profiling at Polars speed leaves no time to get up and grab a coffee while your profiler is spinning up!
### A. Single-Node Workstation
All local benchmarks were executed on a consumer laptop with the following specifications:
- CPU: Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz (12 Cores)
- RAM: 32 GB
- OS: Ubuntu 24.04.4 LTS
- Storage: 512GB NVMe SSD
We begin the benchmarks by determining the Data Envelope for this machine. To find the exact hardware redline, we fed each tool additional months/files of data incrementally until an Out-Of-Memory (OOM) crash occurred.
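This probing loop can be sketched as follows (an illustrative harness, not the actual `benchmarks/` script; `profile_fn` is a hypothetical stand-in for either profiler):

```python
# Illustrative sketch: grow the workload one monthly file at a time until
# the profiling step runs out of memory. In the real benchmarks, an
# OS-level OOM kill is detected via subprocess exit codes rather than a
# Python MemoryError.
from typing import Callable, List, Sequence


def find_safe_envelope(
    files: Sequence[str], profile_fn: Callable[[List[str]], None]
) -> List[str]:
    """Return the largest prefix of `files` that profiles without an OOM."""
    safe: List[str] = []
    for f in files:
        try:
            profile_fn(safe + [f])  # profile the cumulative set of files
        except MemoryError:  # hardware redline reached
            break
        safe.append(f)
    return safe
```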
#### 1. The Local Envelope
| Profiler | Execution Mode | Maximum Safe Envelope |
|---|---|---|
| ydata-profiling | Standard | 359.2 MB (~26.7 Million rows / 3 Months) |
| ydata-profiling | minimal ‡ | 728.0 MB (~54 Million rows / 6 Months) |
| netra-profiler | Standard | 3.58 GB (~255.6 Million rows / 52 Months) |
| netra-profiler | low-memory † | 3.58 GB (~255.6 Million rows / 52 Months) |
‡ minimal (ydata-profiling): Turns off the most expensive computations, including the correlations.
† low-memory (netra-profiler): Replaces exact unique counts with an approximate method (HyperLogLog) and skips global sorts (skew, kurtosis and quantiles). Computes the Pearson/Spearman correlation matrices by using a 100,000-row systematic sample.
Netra expands the local data envelope by nearly 5x, allowing developers to profile roughly 4.5 years of continuous NYC Taxi data directly on their laptop without migrating to the cloud.
#### 2. Head-to-Head Performance
Having established the ~359.2 MB (3 Months) ceiling where both modes of ydata-profiling can successfully execute, we conduct a head-to-head performance comparison of the tools across the standard and efficiency modes.
Results below are averaged over 5 consecutive runs + 1 warmup run.
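A minimal version of this timing scheme (illustrative; the real scripts live in `benchmarks/`):

```python
import statistics
import time
from typing import Callable


def benchmark(fn: Callable[[], None], runs: int = 5, warmup: int = 1) -> float:
    """Average wall-clock seconds over `runs` measured runs after `warmup` unmeasured ones."""
    for _ in range(warmup):
        fn()  # absorb one-time startup costs; not timed
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings)
```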
| Execution Mode | netra-profiler | ydata-profiling |
|---|---|---|
| Standard (Full Stats) | 12.48s (6.1 GB RAM) | 572.39s (28.7 GB RAM) |
| low-memory / minimal | 5.14s (4.7 GB RAM) | 75.77s (15.3 GB RAM) |
The standard run of ydata-profiling takes more than 9 minutes because Pandas loads all 26.7 million rows of the dataset into memory at once, exhausting the available physical memory and forcing the operating system to use the swap space on the hard drive to keep the process alive. Polars' lazy execution model and streaming data ingestion allow netra-profiler to profile the same data 45x faster (14x faster for the efficiency mode, with correlations) while using a fraction of the RAM.
### B. Cloud Scale-Up (Vertical Scaling)
When scaling up to Cloud or HPC infrastructure to handle larger datasets, netra-profiler enables you to maximize the capacity of a single compute node by minimizing processing time and memory overhead. An expanded single-node data envelope allows your team to avoid complex distributed setups like Apache Spark for routine data profiling.
To demonstrate this, we benchmarked the engine on a standard Enterprise HPC node:
- Machine: AWS EC2 r6id.8xlarge (Memory Optimized)
- CPU: 32 vCPUs
- RAM: 256 GB
- OS: Ubuntu 24.04.4 LTS
- Storage: Attached NVMe SSD
#### 1. The Baseline Run (Full Dataset)
We first process the complete 84-month (7 years) dataset to establish the baseline memory requirement to handle the true cardinality of the data. The total file size on disk is 5.13 GB (362.2 Million Rows).
| Execution Mode | Execution Time | Peak RAM |
|---|---|---|
| Standard | 216.75s | 80.4 GB |
| low-memory | 50.75s | 55.7 GB |
#### 2. The Throughput Stress Test
To push the hardware to its limits, we created larger datasets with bounded cardinality by using the Polars' Lazy API to replicate the available data. This simulates contexts like IoT telemetry or server logs, where the volume of data is effectively infinite, but the number of unique identifiers (sensors, IP addresses) is fixed. This benchmark effectively tests the maximum I/O throughput.
| Scale | Compressed Size | Execution Mode | Execution Time | Peak RAM |
|---|---|---|---|---|
| 4x | 20.51 GB (1.45 Billion Rows) | Standard | 829.68s (~14m) | 158.6 GB |
| 4x | 20.51 GB (1.45 Billion Rows) | low-memory | 164.39s (~2.7m) | 105.1 GB |
| 10x | 51.28 GB (3.62 Billion Rows) | Standard | 2029.46s (~34m) | 241.5 GB |
| 10x | 51.28 GB (3.62 Billion Rows) | low-memory | 417.27s (~7m) | 171.2 GB |
The 10x Standard run saturated ~94% (241.5 GB) of the instance's 256 GB physical memory to profile 51.28 GB of compressed data without triggering an OOM crash. This establishes the data envelope for this specific node size and dataset. Because compressed Parquet typically expands 5-10x in size in-memory, processing a 51 GB workload normally exceeds the physical limits of a 256 GB machine. By expanding the envelope of a vertically scaled single-node to safely process this volume in 33 minutes (or 7 minutes in low-memory mode), netra-profiler avoids the forced transition to costlier, higher-tier instances or the complexity of a distributed cluster.
### C. Distributed Multi-Node (Horizontal Scaling)
Because the core engine of netra-profiler is built entirely on the Polars Lazy API, it is natively compatible with the Polars Distributed Layer out-of-the-box. Moving from a vertically scaled single-node workload to a horizontally scaled multi-node cluster will essentially be a low-friction configuration option.
Note: Native support for Polars Cloud & Distributed, and multi-node benchmarks are currently on the roadmap.
## Features
- Multi-Core Streaming Engine: Built on Polars, the profiling engine completely bypasses the Python GIL and utilizes 100% of your CPU cores for maximum performance. Unlike legacy tools that must load the entire dataset into memory for profiling, Netra processes data in streaming batches.
- Low-Memory Mode: Process larger datasets safely. By passing the `--low-memory` flag, the profiler switches to approximate counting and sampling techniques to keep RAM usage low.
- Comprehensive Profiling: Automatically extracts scalar statistics, distributions, and correlation matrices based on column data types. (See the Metrics Table below.)
- Complex Type Support: Automatically flattens nested JSON/Parquet Structs and computes length statistics for Lists and Arrays. Zero configuration required.
- Built-in Configurable Quality Rules: Stop bad data before it enters your pipeline. Netra's diagnostic engine automatically flags anomalies like zero-inflation, corrupted primary keys, and extreme skewness. All detection thresholds can be customized globally or on a per-column basis via YAML (See Data Quality Rules below).
- CI/CD Pipeline Gatekeeper: Use strict exit codes (`--fail-on-critical` or `--fail-on-warnings`) to automatically act as a Data Firewall, breaking your CI/CD builds (GitHub Actions, Airflow, GitLab CI) if corrupted data enters the pipeline.
- Terminal UI: Includes an information-dense, highly readable CLI dashboard to profile and check your data health directly in the terminal.
- Strictly Typed Profile Output: Access the complete mathematical state of your data via a strictly typed JSON export (`--json`) or native Python dictionary. Because the output schema is immutable, you can safely program against it to power custom CI/CD quality gates, feed metadata catalogs, or provide context to LLM data agents.
- Python API: Integrate seamlessly into Airflow, Dagster, Marimo/Jupyter Notebooks, and custom pipelines with a clean, expressive programmatic interface.
## Supported Metrics & Roadmap
| Metric Category | Feature | Target Data Types | Status |
|---|---|---|---|
| Universal | Null Count | All Types | ✅ Active |
| | Exact Cardinality (n_unique) | All Types | ✅ Active |
| | Approximate Cardinality (HyperLogLog) | All Types (`--low-memory`) | ✅ Active |
| Numeric | Min, Max, Mean | Integers, Floats | ✅ Active |
| | Standard Deviation | Integers, Floats | ✅ Active |
| | Skewness & Kurtosis | Integers, Floats | ✅ Active |
| | Exact Quantiles (p25, p50, p75) | Integers, Floats | ✅ Active |
| | Zero Count Detection | Integers, Floats | ✅ Active |
| | Streaming Histograms | Integers, Floats | ✅ Active |
| Categorical / Text | Min / Max (Lexicographical) | Strings, Categoricals, Enums | ✅ Active |
| | String Lengths (Min, Max, Mean) | Strings, Categoricals, Enums | ✅ Active |
| | Top-K Frequent Values | Strings, Categoricals, Enums | ✅ Active |
| | Regex / Pattern Matching | Strings | 🔮 Future |
| Temporal | Min, Max, Span | Datetime, Date | 🔮 Future |
| | Distribution by Time/Day | Datetime, Date | 🔮 Future |
| Multivariate | Pearson Correlation Matrix | Integers, Floats | ✅ Active |
| | Spearman Rank Correlation | Integers, Floats | ✅ Active |
| | Cramer's V (Categorical) | Strings, Categoricals | 🔮 Future |
| Complex Types | Automatic Struct Flattening | Structs | ✅ Active |
| | Array / List Length Distributions | Lists, Arrays | ✅ Active |
## Installation
Netra Profiler is built for speed. We recommend installing it with uv, the blazing-fast Python package installer:
```shell
uv pip install netra-profiler
```

(Or use standard `pip install netra-profiler`.)
## Quickstart
### 1. Command Line
The fastest way to profile your data is right from the command line. netra-profiler natively supports .csv, .parquet, .json, and .arrow files.
```shell
netra profile path/to/your/dataset.csv
```
#### Advanced Execution Options
You can combine flags to handle massive or messy datasets with ease:
- `--fail-on-critical`: Enables the Active Quality Gate. Breaks the pipeline (exits with code 1) if any CRITICAL anomalies are found.
- `--fail-on-warnings`: Stricter Quality Gate. Breaks the pipeline if ANY anomalies (Warning or Critical) are found.
- `--low-memory`: Triggers the low-memory execution path (approximate counting and sampled correlations).
- `-i, --ignore <column>`: Skip profiling for a specific column (perfect for highly cardinal IDs, hashes, or PII).
- `--full-inference`: Forces full-file schema inference. Crucial for messy CSVs where data types might silently change deep in the file.
- `--json`: Disables the visual CLI output and generates the raw profile payload as a JSON string. Ideal for piping to `jq` or redirecting to a file: `> profile.json`.
### 2. Python API
Netra Profiler exposes a fully typed Python API that accepts Polars DataFrames natively. The output is a rigidly typed Data Contract, making it perfect for programmatic quality gates.
```python
import polars as pl

from netra_profiler import Profiler

# 1. Load your data using Polars (Eager or Lazy)
df = pl.scan_parquet("sales_data.parquet")

# 2. Initialize the Profiler with the configuration
profiler = Profiler(
    df=df,
    dataset_name="Q3_Sales",
    ignore_columns=["transaction_id", "customer_hash"],  # Drop highly cardinal IDs to save RAM
)

# 3. Execute the profiling graph
profile = profiler.run(bins=20, top_k=10)

# 4. Access the strictly typed metrics
print(f"Total Rows Profiled: {profile['dataset']['row_count']:,}")

if "revenue" in profile["columns"]:
    mean_revenue = profile["columns"]["revenue"].get("mean")
    print(f"Revenue Mean: ${mean_revenue:.2f}")

# 5. Programmatic Data Quality Gates
# Alerts are categorized by severity (CRITICAL, WARNING, INFO)
alerts = profile.get("alerts", [])
critical_issues = [a for a in alerts if a["level"] == "CRITICAL"]

if critical_issues:
    print(f"\n[PIPELINE HALTED] Found {len(critical_issues)} critical data issues!")
    for issue in critical_issues:
        print(f" - [{issue['column_name']}] {issue['type']}: {issue['message']}")
    raise ValueError("Data quality checks failed. Upstream data contract violated.")
```
### 3. Data Quality Rules
The rules for each column are resolved using a cascading method, where global rules are overridden by column-specific rules. If a check is too noisy for your dataset or analysis, you can explicitly disable it by setting its threshold to false.
To configure the data quality engine, use a netra_config.yaml file:
```yaml
diagnostics:
  # ---------------------------------------------------------
  # GLOBAL THRESHOLDS: Apply to all columns by default
  # ---------------------------------------------------------
  global_thresholds:
    # Null & Missing Data
    null_critical_threshold: 0.95           # Alert CRITICAL if > 95% null (Empty Column)
    null_warning_threshold: 0.50            # Alert WARNING if > 50% null (High Nulls)

    # Variance & Entropy
    constant_check_enabled: true            # Alert CRITICAL if column has only 1 unique value
    zero_inflated_threshold: 0.10           # Alert WARNING if > 10% of numeric values are zero

    # Statistical Distribution
    skew_threshold: 2.0                     # Alert WARNING if absolute skewness exceeds 2.0
    outlier_iqr_multiplier: 3.0             # Alert WARNING for extreme outliers (Tukey IQR method)

    # Strings & Categoricals
    high_cardinality_threshold: 10000       # Alert WARNING if unique strings > 10,000
    string_length_anomaly_multiplier: 50.0  # Alert WARNING if max/min string length deviates from mean

    # Identifiers & Primary Keys
    id_uniqueness_threshold: 0.99           # Alert INFO if > 99% unique (Likely Primary Key)
    min_rows_for_pk_check: 100              # Skip ID checks for tables smaller than 100 rows

    # Schema & Correlation
    possible_numeric_sample_size: 5         # Top-K sample size to detect Strings acting as Numbers
    high_correlation_threshold: 0.95        # Alert WARNING if two numeric columns are > 95% correlated

  # ---------------------------------------------------------
  # COLUMN OVERRIDES: Surgical exceptions to global rules
  # ---------------------------------------------------------
  column_overrides:
    # Example: 'middle_name' is expected to be mostly empty
    middle_name:
      null_critical_threshold: false        # Disable critical null check completely
      null_warning_threshold: 0.99          # Only warn if it's 99% empty

    # Example: 'is_active' is a heavily imbalanced boolean flag
    is_active:
      constant_check_enabled: false         # Prevent alerts if all users happen to be active
      zero_inflated_threshold: false        # Prevent alerts if most values are 0 (False)

    # Example: 'customer_hash' is naturally highly cardinal
    customer_hash:
      high_cardinality_threshold: false     # Disable cardinality warning for this specific ID
```
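The cascading resolution can be sketched in a few lines (an illustration of the lookup order, not Netra's internal implementation):

```python
# Illustrative sketch: a column-specific override wins over the global
# threshold, and a value of False disables the check entirely.
def resolve_threshold(diagnostics: dict, column: str, rule: str):
    overrides = diagnostics.get("column_overrides", {}).get(column, {})
    if rule in overrides:
        return overrides[rule]  # may be a number, or False to disable
    return diagnostics.get("global_thresholds", {}).get(rule)


# Toy configuration mirroring the YAML example above.
config = {
    "global_thresholds": {"null_critical_threshold": 0.95},
    "column_overrides": {"middle_name": {"null_critical_threshold": False}},
}
```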
### Integration
Netra will automatically look for `netra_config.yaml` in your current working directory. You can also explicitly pass it to the engine in three ways:

Via CLI Flag:

```shell
netra profile dataset.parquet --config path/to/custom_config.yaml
```
Via Environment Variable (ideal for Docker/CI/CD):

```shell
export NETRA_CONFIG="/etc/netra/production_rules.yaml"
netra profile dataset.parquet
```
Via Python API:

```python
import yaml

from netra_profiler import Profiler

# Option A: Load from a YAML file
with open("rules.yaml", "r") as f:
    config = yaml.safe_load(f)

# Option B: Pass a dictionary directly
config = {
    "diagnostics": {
        "global_thresholds": {"null_critical_threshold": 0.80},
        "column_overrides": {"status": {"constant_check_enabled": False}},
    }
}

profiler = Profiler(df, config=config)
```