
Netra Profiler

High-performance profiling and data quality tool built with Polars

Netra Profiler is a next-generation data profiling tool and diagnostic engine built on top of Polars. Designed to operate at the speed of your disk I/O, it leverages Polars' Rust-based query optimizer and zero-copy Apache Arrow memory model to get the most profiling throughput out of your local hardware. Netra processes massive datasets with predictable, linear RAM usage, eliminating the sudden memory spikes and crashes associated with traditional Python tools.

The profiler ships with a comprehensive diagnostic engine that detects column-wise data quality issues early in your analysis or modeling workflows, such as high zero/null counts, high cardinality, data skew, and more. The tool includes a detailed, zero-configuration CLI for quickly profiling your CSV, JSON, Arrow/IPC, or Parquet files.

Performance & The Data Envelope

The Data Envelope is the maximum size and complexity of data your organization can process within its hardware or cloud cost limits. Netra Profiler is designed to be a value multiplier for your existing hardware: it expands your data envelope to cover larger workloads and optimizes your current workflow with faster, more efficient processing, which means less time and money spent on profiling tasks.

A. Single-Node Workstation

Tested locally on a consumer laptop with the following specifications:

  • CPU: Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz (12 Cores)
  • RAM: 32 GB
  • OS: Ubuntu 24.04.4 LTS
  • Storage: 512GB NVMe SSD

Dataset Schema: 11 Columns (3 Int, 4 String, 2 Float, 1 Bool, 1 Date), 12 Million Rows. Details can be found in the generation script here.

File Size: 1.04 GB (CSV) / 225 MB (Parquet)

| Execution Mode | netra-profiler (Parquet) | ydata-profiling (Parquet) | netra-profiler (CSV) | ydata-profiling (CSV) |
|---|---|---|---|---|
| All stats, all columns | 3.20s (3.9 GB RAM) | 181.26s (13.7 GB RAM) | 6.23s (7.1 GB RAM) | 222.05s (11.5 GB RAM) |
| Ignore primary key | 2.78s (2.1 GB RAM) | 134.97s (10.4 GB RAM) | 5.60s (6.4 GB RAM) | 161.74s (8.6 GB RAM) |
| low-memory / minimal | 1.41s (1.9 GB RAM) | 24.54s (5.0 GB RAM) | 3.92s (5.4 GB RAM) | 33.25s (3.9 GB RAM) |

low-memory (netra-profiler): Replaces exact unique counts with an approximate method (HyperLogLog) and skips global sorts (skew, kurtosis, and quantiles). Crucially, it retains the Pearson/Spearman correlation matrices by computing them on a 100,000-row systematic sample.

minimal (ydata-profiling): Turns off the most expensive computations and entirely disables the correlation matrices.

Performance Takeaways

  • Netra Profiler is 35x to 56x faster than traditional Pandas-based profiling out of the box, and uses up to 71% less memory. Standard workloads that take minutes now finish in seconds.
  • When dropping high-cardinality primary keys, the engine maintains a ~48x speed advantage while operating on just 2.1 GB of RAM (an 80% reduction vs. Pandas).
  • The low-memory mode maximizes your hardware's Data Envelope, allowing consumer hardware to handle large workloads that would otherwise require a scaled-up cloud node or distributed compute.

To determine the actual extent of the Data Envelope for the test hardware, we ran the profiler on increasing row counts of the benchmark dataset, in Parquet format:

| Rows | Engine Time | Peak RAM Usage |
|---|---|---|
| 12M | 1.41s | 1.9 GB |
| 100M | 11.08s | 5.5 GB |
| 300M | 33.76s | 13.2 GB |
| 500M | 57.38s | 20.9 GB |
| 700M | 83.16s | 25.2 GB |
| 900M | 104.00s | 29.7 GB |
| 950M | 114.35s | 30.3 GB |

The netra-profiler engine scales linearly and predictably thanks to its streaming-first architecture and the sampling strategy for correlations, which together keep the memory footprint proportional to the dataset size. This predictable scaling eliminates sudden Out-Of-Memory (OOM) crashes and lets you accurately forecast your hardware limits.
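Forecasting works because the RAM curve is (approximately) a straight line. A quick sanity check, fitting a plain least-squares line to the table above to estimate the envelope of the 32 GB test machine:

```python
# Peak RAM (GB) vs. rows (millions), taken from the benchmark table
rows = [12, 100, 300, 500, 700, 900, 950]
ram = [1.9, 5.5, 13.2, 20.9, 25.2, 29.7, 30.3]

n = len(rows)
mean_x = sum(rows) / n
mean_y = sum(ram) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(rows, ram)) / sum(
    (x - mean_x) ** 2 for x in rows
)
intercept = mean_y - slope * mean_x

# Rows (in millions) expected to fit within the machine's 32 GB of RAM
envelope_m_rows = (32 - intercept) / slope
```

The fitted limit lands near the 950M rows actually observed, which is what "accurately forecast your hardware limits" means in practice.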

Netra Profiler Linear Scaling: 12M to 950M Rows

B. Cloud Scale-Up (Vertical Scaling)

For multi-billion row datasets, deploying the profiler on a single heavy cloud instance bypasses the network-shuffle and orchestration bottlenecks of distributed systems, offering extreme performance without the cluster management overhead.

(Benchmarks are in development)

C. Distributed Multi-Node (Horizontal Scaling)

Netra Profiler’s core engine is built purely on the Polars Lazy API, which means it is natively compatible with the Polars Distributed Layer out-of-the-box. Moving from a local 1-Billion row workload to a multi-node 100-Billion row cloud workload requires zero code rewrites.

(Benchmarks are in development)

Features

  • Multi-Core Streaming Engine: Built on Polars to completely bypass the Python GIL and utilize 100% of your CPU cores. By leveraging zero-copy Apache Arrow memory, Netra streams data directly from disk, eliminating the massive intermediate RAM spikes associated with traditional Pandas-based data processing.
  • Low-Memory Mode: Process large datasets without crashing your machine. By passing the --low-memory flag, Netra intelligently switches to approximate counting and sampling techniques to keep RAM usage low.
  • Comprehensive Profiling: Automatically extracts scalar statistics (min, max, mean, skew, kurtosis), streaming distributions (histograms), Top-K frequent values, and Pearson/Spearman correlation matrices.
  • Complex Type Support: Automatically flattens nested Structs and computes length statistics for Lists and Arrays, allowing you to profile complex JSON or Parquet files with zero configuration.
  • Built-in Quality Alerts: Stop bad data before it enters your pipeline. Netra's diagnostics engine automatically flags critical issues like zero-inflation, corrupted primary keys, extreme skewness, and high null percentages.
  • CI/CD Pipeline Gatekeeper: Use strict exit codes (--fail-on-critical or --fail-on-warnings) to automatically act as a Data Firewall, breaking your CI/CD builds (GitHub Actions, Airflow, GitLab CI) if corrupted data enters the pipeline.
  • Beautiful Terminal UI: Includes an information-dense, highly readable CLI dashboard to profile and check your data health directly in the terminal.
  • JSON Data Contracts: Export the full diagnostic profile to a strictly typed JSON artifact (netra profile data.parquet --json) for CI/CD data quality gates, a metadata feed for data catalogs, or context for LLM-based data agents.
  • Python API: Integrate seamlessly into your data engineering pipelines (Airflow DAGs, Marimo/Jupyter Notebooks, CI/CD) with a clean, expressive programmatic interface.

Installation

Netra Profiler is built for speed. We recommend installing it with uv, the blazing-fast Python package installer:

```shell
uv pip install netra-profiler
```

(Or use the standard `pip install netra-profiler`.)

Quickstart

1. The CLI

The fastest way to profile your data is right from the command line. netra-profiler natively supports .csv, .parquet, .json, and .arrow files.

```shell
netra profile path/to/your/dataset.csv
```

Netra Profiler CLI

Advanced Execution Options

You can combine flags to handle massive or messy datasets with ease:

  • --fail-on-critical: Enables the Active Quality Gate. Breaks the pipeline (exits with code 1) if any CRITICAL anomalies are found.
  • --fail-on-warnings: Stricter Quality Gate. Breaks the pipeline if ANY anomalies (Warning or Critical) are found.
  • --low-memory: Triggers the low-memory execution path (approximate counting and sampled correlations).
  • -i, --ignore <column>: Skip profiling for a specific column (perfect for high-cardinality IDs, hashes, or PII).
  • --full-inference: Forces full-file schema inference. Crucial for messy CSVs where data types might silently change deep in the file.
  • --json: Disables the visual CLI output and generates the raw profile payload as a JSON string. Ideal for piping to jq or redirecting to a file: > profile.json.
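As a sketch of the quality-gate flags in CI, a hypothetical GitHub Actions step (the job name and file path are placeholders):

```yaml
# Hypothetical CI step: a nonzero exit from the profiler fails the build
- name: Data quality gate
  run: |
    uv pip install netra-profiler
    netra profile data/daily_extract.parquet --fail-on-critical
```

Because `--fail-on-critical` exits with code 1 on CRITICAL anomalies, no extra scripting is needed: the CI runner halts the pipeline on its own.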

2. The Python API

Netra Profiler exposes a fully typed Python API that accepts Polars DataFrames natively. The output is a rigidly typed Data Contract, making it perfect for programmatic quality gates.

```python
import polars as pl
from netra_profiler import Profiler

# 1. Load your data using Polars (eager or lazy)
df = pl.scan_parquet("sales_data.parquet")

# 2. Initialize the Profiler with the configuration
profiler = Profiler(
    df=df,
    dataset_name="Q3_Sales",
    ignore_columns=["transaction_id", "customer_hash"],  # Drop high-cardinality IDs to save RAM
)

# 3. Execute the profiling graph
profile = profiler.run(bins=20, top_k=10)

# 4. Access the strictly typed metrics
print(f"Total Rows Profiled: {profile['dataset']['row_count']:,}")

if "revenue" in profile["columns"]:
    mean_revenue = profile["columns"]["revenue"].get("mean")
    print(f"Revenue Mean: ${mean_revenue:.2f}")

# 5. Programmatic data quality gates
# Alerts are categorized by severity (CRITICAL, WARNING, INFO)
alerts = profile.get("alerts", [])
critical_issues = [a for a in alerts if a["level"] == "CRITICAL"]

if critical_issues:
    print(f"\n[PIPELINE HALTED] Found {len(critical_issues)} critical data issues!")
    for issue in critical_issues:
        print(f" - [{issue['column_name']}] {issue['type']}: {issue['message']}")
    raise ValueError("Data quality checks failed. Upstream data contract violated.")
```
