# Netra Profiler

High-performance profiling and data quality tool built with Polars.
Netra Profiler is a high-performance data profiling tool and diagnostic engine built on top of Polars. It maximizes single-node hardware utilization by leveraging Polars' Rust-based query optimizer and columnar Apache Arrow memory model.
The profiler ships with a configurable diagnostic engine to detect data quality issues in your EDA or ELT workflows. It automatically flags anomalies like extreme zero-inflation, high cardinality, severe data skew, and corrupted primary keys. Netra includes an information-dense, zero-configuration CLI designed to instantly profile your CSV, JSON, IPC/Arrow, and Parquet files directly from the terminal.
## Performance Benchmarks
Note: All the scripts used to fetch the dataset, run the benchmarks, and generate the results can be found in the `benchmarks/` directory.
### Dataset
To ensure the benchmarks reflect the real-world friction of a typical data workload, we use the New York City TLC Yellow Taxi Trip Records dataset. It contains high-cardinality columns, columns with substantial null or missing data, and shifting schemas.
When reviewing the metrics below, please keep the following nuances in mind regarding the data:
- Data Organization: The TLC publishes the data as individual `.parquet` files for every month of the year. To test raw I/O and schema harmonization, we process these files as-is, without combining files or pre-processing the data.
- The Timeline: We restricted the benchmarks to the years 2018–2024, as the schema remains relatively stable in this interval and provides sufficient volume for the local benchmarks.
- The COVID-19 Data Cliff: Pre-pandemic files (2018–2019) are significantly larger, containing 8 to 10 million trips per month compared to the 2 to 3 million trips in post-2020 files. For our local tests, we are predominantly using the older files.
- Parquet Compression: All disk sizes referenced in these benchmarks represent the heavily compressed Snappy Parquet files. Once decompressed in memory, the data is roughly 6x to 10x larger than its on-disk size.
### The Data Envelope
The Data Envelope is the maximum size and complexity of data your pipeline can process within your hardware limitations or cloud budget ceiling. Netra Profiler is designed to be a value multiplier for your existing hardware. This allows you to:
- Stay Local Longer: Process larger workloads directly on your laptop or workstation without needing to migrate to an HPC or cloud platform.
- Scale Vertically: Fully saturate a single heavy compute node (like an AWS EC2 instance) to bypass the overhead of complex, multi-node distributed frameworks like Apache Spark.
- Preserve Productivity: Near-interactive profiling at Polars speed leaves no time to get up and grab a coffee while your profiler is spinning up!
### A. Single-Node Workstation
All local benchmarks were executed on a consumer laptop with the following specifications:
- CPU: Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz (12 Cores)
- RAM: 32 GB
- OS: Ubuntu 24.04.4 LTS
- Storage: 512GB NVMe SSD
We begin the benchmarks by determining the Data Envelope for this machine. To find the exact hardware redline, we fed each tool additional months/files of data incrementally until an Out-Of-Memory (OOM) crash occurred.
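This probing loop can be sketched as follows (an illustrative harness, not the actual `benchmarks/` script; `profile_fn` is a hypothetical stand-in for either profiler):

```python
# Illustrative sketch: grow the workload one monthly file at a time until
# the profiling step runs out of memory. In the real benchmarks, an
# OS-level OOM kill is detected via subprocess exit codes rather than a
# Python MemoryError.
from typing import Callable, List, Sequence


def find_safe_envelope(
    files: Sequence[str], profile_fn: Callable[[List[str]], None]
) -> List[str]:
    """Return the largest prefix of `files` that profiles without an OOM."""
    safe: List[str] = []
    for f in files:
        try:
            profile_fn(safe + [f])  # profile the cumulative set of files
        except MemoryError:  # hardware redline reached
            break
        safe.append(f)
    return safe
```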
#### 1. The Local Envelope
| Profiler | Execution Mode | Maximum Safe Envelope |
|---|---|---|
| ydata-profiling | Standard | 359.2 MB (~26.7 Million rows / 3 Months) |
| ydata-profiling | minimal ‡ | 728.0 MB (~54 Million rows / 6 Months) |
| netra-profiler | Standard | 3.58 GB (~255.6 Million rows / 52 Months) |
| netra-profiler | low-memory † | 3.58 GB (~255.6 Million rows / 52 Months) |
‡ minimal (ydata-profiling): Turns off the most expensive computations, including the correlations.
† low-memory (netra-profiler): Replaces exact unique counts with an approximate method (HyperLogLog) and skips global sorts (skew, kurtosis and quantiles). Computes the Pearson/Spearman correlation matrices by using a 100,000-row systematic sample.
Netra expands the local data envelope by nearly 5x, allowing developers to profile roughly 4.5 years of continuous NYC Taxi data directly on their laptop without migrating to the cloud.
#### 2. Head-to-Head Performance
Having established the ~359.2 MB (3 Months) ceiling where both modes of ydata-profiling can successfully execute, we conduct a head-to-head performance comparison of the tools across the standard and efficiency modes.
Results below are averaged over 5 consecutive runs + 1 warmup run.
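A minimal version of this timing scheme (illustrative; the real scripts live in `benchmarks/`):

```python
import statistics
import time
from typing import Callable


def benchmark(fn: Callable[[], None], runs: int = 5, warmup: int = 1) -> float:
    """Average wall-clock seconds over `runs` measured runs after `warmup` unmeasured ones."""
    for _ in range(warmup):
        fn()  # absorb one-time startup costs; not timed
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings)
```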
| Execution Mode | netra-profiler | ydata-profiling |
|---|---|---|
| Standard (Full Stats) | 12.48s (6.1 GB RAM) | 572.39s (28.7 GB RAM) |
| low-memory / minimal | 5.14s (4.7 GB RAM) | 75.77s (15.3 GB RAM) |
The standard run of ydata-profiling takes more than 9 minutes because Pandas loads all 26.7 million rows of the dataset into memory at once, exhausting the available physical memory and forcing the operating system to use the swap space on the hard drive to keep the process alive. Polars' lazy execution model and streaming data ingestion allow netra-profiler to profile the same data 45x faster (14x faster for the efficiency mode, with correlations) while using a fraction of the RAM.
### B. Cloud Scale-Up (Vertical Scaling)
When scaling up to Cloud or HPC infrastructure to handle larger datasets, netra-profiler enables you to maximize the capacity of a single compute node by minimizing processing time and memory overhead. An expanded single-node data envelope allows your team to avoid complex distributed setups like Apache Spark for routine data profiling.
To demonstrate this, we benchmarked the engine on a standard Enterprise HPC node:
- Machine: AWS EC2 r6id.8xlarge (Memory Optimized)
- CPU: 32 vCPUs
- RAM: 256 GB
- OS: Ubuntu 24.04.4 LTS
- Storage: Attached NVMe SSD
#### 1. The Baseline Run (Full Dataset)
We first process the complete 84-month (7 years) dataset to establish the baseline memory requirement to handle the true cardinality of the data. The total file size on disk is 5.13 GB (362.2 Million Rows).
| Execution Mode | Execution Time | Peak RAM |
|---|---|---|
| Standard | 216.75s | 80.4 GB |
| low-memory | 50.75s | 55.7 GB |
#### 2. The Throughput Stress Test
To push the hardware to its limits, we created larger datasets with bounded cardinality by using the Polars' Lazy API to replicate the available data. This simulates contexts like IoT telemetry or server logs, where the volume of data is effectively infinite, but the number of unique identifiers (sensors, IP addresses) is fixed. This benchmark effectively tests the maximum I/O throughput.
| Scale | Compressed Size | Execution Mode | Execution Time | Peak RAM |
|---|---|---|---|---|
| 4x | 20.51 GB (1.45 Billion Rows) | Standard | 829.68s (~14m) | 158.6 GB |
| 4x | 20.51 GB (1.45 Billion Rows) | low-memory | 164.39s (~2.7m) | 105.1 GB |
| 10x | 51.28 GB (3.62 Billion Rows) | Standard | 2029.46s (~34m) | 241.5 GB |
| 10x | 51.28 GB (3.62 Billion Rows) | low-memory | 417.27s (~7m) | 171.2 GB |
The 10x Standard run saturated ~94% (241.5 GB) of the instance's 256 GB physical memory to profile 51.28 GB of compressed data without triggering an OOM crash. This establishes the data envelope for this specific node size and dataset. Because compressed Parquet typically expands 5-10x in size in-memory, processing a 51 GB workload normally exceeds the physical limits of a 256 GB machine. By expanding the envelope of a vertically scaled single-node to safely process this volume in 33 minutes (or 7 minutes in low-memory mode), netra-profiler avoids the forced transition to costlier, higher-tier instances or the complexity of a distributed cluster.
### C. Distributed Multi-Node (Horizontal Scaling)
Because the core engine of netra-profiler is built entirely on the Polars Lazy API, it is natively compatible with the Polars Distributed Layer out-of-the-box. Moving from a vertically scaled single-node workload to a horizontally scaled multi-node cluster will essentially be a low-friction configuration option.
Note: Native support for Polars Cloud & Distributed, and multi-node benchmarks are currently on the roadmap.
## Features
- Multi-Core Streaming Engine: Built on Polars, the profiling engine completely bypasses the Python GIL and utilizes 100% of your CPU cores for maximum performance. Unlike legacy tools that must load the entire dataset into memory for profiling, Netra processes data in streaming batches.
- Low-Memory Mode: Process larger datasets safely. By passing the `--low-memory` flag, the profiler switches to approximate counting and sampling techniques to keep RAM usage low.
- Comprehensive Profiling: Automatically extracts scalar statistics, distributions, and correlation matrices based on column data types. (See the Metrics Table below.)
- Complex Type Support: Automatically flattens nested JSON/Parquet Structs and computes length statistics for Lists and Arrays. Zero configuration required.
- Built-in Configurable Quality Rules: Stop bad data before it enters your pipeline. Netra's diagnostic engine automatically flags anomalies like zero-inflation, corrupted primary keys, and extreme skewness. All detection thresholds can be customized globally or on a per-column basis via YAML (See Data Quality Rules below).
- CI/CD Pipeline Gatekeeper: Use strict exit codes (`--fail-on-critical` or `--fail-on-warnings`) to automatically act as a Data Firewall, breaking your CI/CD builds (GitHub Actions, Airflow, GitLab CI) if corrupted data enters the pipeline.
- Terminal UI: Includes an information-dense, highly readable CLI dashboard to profile and check your data health directly in the terminal.
- Strictly Typed Profile Output: Access the complete mathematical state of your data via a strictly typed JSON export (`--json`) or native Python dictionary. Because the output schema is immutable, you can safely program against it to power custom CI/CD quality gates, feed metadata catalogs, or provide context to LLM data agents.
- Python API: Integrate seamlessly into Airflow, Dagster, Marimo/Jupyter Notebooks, and custom pipelines with a clean, expressive programmatic interface.
## Supported Metrics & Roadmap
| Metric Category | Feature | Target Data Types | Status |
|---|---|---|---|
| Universal | Null Count | All Types | ✅ Active |
| | Exact Cardinality (n_unique) | All Types | ✅ Active |
| | Approximate Cardinality (HyperLogLog) | All Types (`--low-memory`) | ✅ Active |
| Numeric | Min, Max, Mean | Integers, Floats | ✅ Active |
| | Standard Deviation | Integers, Floats | ✅ Active |
| | Skewness & Kurtosis | Integers, Floats | ✅ Active |
| | Exact Quantiles (p25, p50, p75) | Integers, Floats | ✅ Active |
| | Zero Count Detection | Integers, Floats | ✅ Active |
| | Streaming Histograms | Integers, Floats | ✅ Active |
| Categorical / Text | Min / Max (Lexicographical) | Strings, Categoricals, Enums | ✅ Active |
| | String Lengths (Min, Max, Mean) | Strings, Categoricals, Enums | ✅ Active |
| | Top-K Frequent Values | Strings, Categoricals, Enums | ✅ Active |
| | Regex / Pattern Matching | Strings | 🔮 Future |
| Temporal | Min, Max, Span | Datetime, Date | 🔮 Future |
| | Distribution by Time/Day | Datetime, Date | 🔮 Future |
| Multivariate | Pearson Correlation Matrix | Integers, Floats | ✅ Active |
| | Spearman Rank Correlation | Integers, Floats | ✅ Active |
| | Cramer's V (Categorical) | Strings, Categoricals | 🔮 Future |
| Complex Types | Automatic Struct Flattening | Structs | ✅ Active |
| | Array / List Length Distributions | Lists, Arrays | ✅ Active |
## Installation
Netra Profiler is built for speed. We recommend installing it with uv, the blazing-fast Python package installer:
```shell
uv pip install netra-profiler
```

(Or use standard `pip install netra-profiler`.)
## Quickstart
### 1. Command Line
The fastest way to profile your data is right from the command line. netra-profiler natively supports .csv, .parquet, .json, and .arrow files.
```shell
netra profile path/to/your/dataset.csv
```
#### Advanced Execution Options
You can combine flags to handle massive or messy datasets with ease:
- `--fail-on-critical`: Enables the Active Quality Gate. Breaks the pipeline (exits with code 1) if any CRITICAL anomalies are found.
- `--fail-on-warnings`: Stricter Quality Gate. Breaks the pipeline if ANY anomalies (Warning or Critical) are found.
- `--low-memory`: Triggers the low-memory execution path (approximate counting and sampled correlations).
- `-i, --ignore <column>`: Skip profiling for a specific column (perfect for highly cardinal IDs, hashes, or PII).
- `--full-inference`: Forces full-file schema inference. Crucial for messy CSVs where data types might silently change deep in the file.
- `--json`: Disables the visual CLI output and generates the raw profile payload as a JSON string. Ideal for piping to `jq` or redirecting to a file: `> profile.json`.
### 2. Python API
Netra Profiler exposes a fully typed Python API that accepts Polars DataFrames natively. The output is a rigidly typed Data Contract, making it perfect for programmatic quality gates.
```python
import polars as pl

from netra_profiler import Profiler

# 1. Load your data using Polars (Eager or Lazy)
df = pl.scan_parquet("sales_data.parquet")

# 2. Initialize the Profiler with the configuration
profiler = Profiler(
    df=df,
    dataset_name="Q3_Sales",
    ignore_columns=["transaction_id", "customer_hash"],  # Drop highly cardinal IDs to save RAM
)

# 3. Execute the profiling graph
profile = profiler.run(bins=20, top_k=10)

# 4. Access the strictly typed metrics
print(f"Total Rows Profiled: {profile['dataset']['row_count']:,}")

if "revenue" in profile["columns"]:
    mean_revenue = profile["columns"]["revenue"].get("mean")
    print(f"Revenue Mean: ${mean_revenue:.2f}")

# 5. Programmatic Data Quality Gates
# Alerts are categorized by severity (CRITICAL, WARNING, INFO)
alerts = profile.get("alerts", [])
critical_issues = [a for a in alerts if a["level"] == "CRITICAL"]

if critical_issues:
    print(f"\n[PIPELINE HALTED] Found {len(critical_issues)} critical data issues!")
    for issue in critical_issues:
        print(f" - [{issue['column_name']}] {issue['type']}: {issue['message']}")
    raise ValueError("Data quality checks failed. Upstream data contract violated.")
```
### 3. Data Quality Rules
The rules for each column are resolved using a cascading method, where global rules are overridden by column-specific rules. If a check is too noisy for your dataset or analysis, you can explicitly disable it by setting its threshold to false.
To configure the data quality engine, use a netra_config.yaml file:
```yaml
diagnostics:
  # ---------------------------------------------------------
  # GLOBAL THRESHOLDS: Apply to all columns by default
  # ---------------------------------------------------------
  global_thresholds:
    # Null & Missing Data
    null_critical_threshold: 0.95           # Alert CRITICAL if > 95% null (Empty Column)
    null_warning_threshold: 0.50            # Alert WARNING if > 50% null (High Nulls)

    # Variance & Entropy
    constant_check_enabled: true            # Alert CRITICAL if column has only 1 unique value
    zero_inflated_threshold: 0.10           # Alert WARNING if > 10% of numeric values are zero

    # Statistical Distribution
    skew_threshold: 2.0                     # Alert WARNING if absolute skewness exceeds 2.0
    outlier_iqr_multiplier: 3.0             # Alert WARNING for extreme outliers (Tukey IQR method)

    # Strings & Categoricals
    high_cardinality_threshold: 10000       # Alert WARNING if unique strings > 10,000
    string_length_anomaly_multiplier: 50.0  # Alert WARNING if max/min string length deviates from mean

    # Identifiers & Primary Keys
    id_uniqueness_threshold: 0.99           # Alert INFO if > 99% unique (Likely Primary Key)
    min_rows_for_pk_check: 100              # Skip ID checks for tables smaller than 100 rows

    # Schema & Correlation
    possible_numeric_sample_size: 5         # Top-K sample size to detect Strings acting as Numbers
    high_correlation_threshold: 0.95        # Alert WARNING if two numeric columns are > 95% correlated

  # ---------------------------------------------------------
  # COLUMN OVERRIDES: Surgical exceptions to global rules
  # ---------------------------------------------------------
  column_overrides:
    # Example: 'middle_name' is expected to be mostly empty
    middle_name:
      null_critical_threshold: false        # Disable critical null check completely
      null_warning_threshold: 0.99          # Only warn if it's 99% empty

    # Example: 'is_active' is a heavily imbalanced boolean flag
    is_active:
      constant_check_enabled: false         # Prevent alerts if all users happen to be active
      zero_inflated_threshold: false        # Prevent alerts if most values are 0 (False)

    # Example: 'customer_hash' is naturally highly cardinal
    customer_hash:
      high_cardinality_threshold: false     # Disable cardinality warning for this specific ID
```
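The cascading resolution can be sketched in a few lines (an illustration of the lookup order, not Netra's internal implementation):

```python
# Illustrative sketch: a column-specific override wins over the global
# threshold, and a value of False disables the check entirely.
def resolve_threshold(diagnostics: dict, column: str, rule: str):
    overrides = diagnostics.get("column_overrides", {}).get(column, {})
    if rule in overrides:
        return overrides[rule]  # may be a number, or False to disable
    return diagnostics.get("global_thresholds", {}).get(rule)


# Toy configuration mirroring the YAML example above.
config = {
    "global_thresholds": {"null_critical_threshold": 0.95},
    "column_overrides": {"middle_name": {"null_critical_threshold": False}},
}
```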
### Integration
Netra will automatically look for `netra_config.yaml` in your current working directory. You can also explicitly pass it to the engine in three ways:

Via CLI Flag:

```shell
netra profile dataset.parquet --config path/to/custom_config.yaml
```
Via Environment Variable (ideal for Docker/CI/CD):

```shell
export NETRA_CONFIG="/etc/netra/production_rules.yaml"
netra profile dataset.parquet
```
Via Python API:

```python
import yaml

from netra_profiler import Profiler

# Option A: Load from a YAML file
with open("rules.yaml", "r") as f:
    config = yaml.safe_load(f)

# Option B: Pass a dictionary directly
config = {
    "diagnostics": {
        "global_thresholds": {"null_critical_threshold": 0.80},
        "column_overrides": {"status": {"constant_check_enabled": False}},
    }
}

profiler = Profiler(df, config=config)
```