High-performance profiling and data quality tool built with Polars.

Netra Profiler

Netra Profiler is a next-generation data profiling tool and diagnostic engine built on top of Polars. Designed to operate at the speed of your disk I/O, it leverages Polars' Rust-based query optimizer and zero-copy Apache Arrow memory model to maximize the profiling capabilities of your local hardware. Netra processes massive datasets with predictable, linear RAM usage, eliminating the sudden memory spikes and crashes associated with traditional Python tools.

The profiler ships with a comprehensive diagnostic engine that detects column-wise data quality issues early in your analysis or modeling workflows, such as high zero or null counts, high cardinality, and data skew. The tool includes a detailed, zero-configuration CLI for quickly profiling your CSV, JSON, Arrow/IPC, or Parquet files.

Performance & The Data Envelope

The Data Envelope is the maximum size and complexity of data your organization can process within its hardware or cloud cost limits. Netra Profiler is designed to act as a value multiplier for your existing hardware: it expands your data envelope to include larger workloads and speeds up your current workflow, so profiling tasks take less time and cost less.

A. Single-Node Workstation

Tested locally on a consumer laptop with the following specifications:

  • CPU: Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz (6 cores / 12 threads)
  • RAM: 32 GB
  • OS: Ubuntu 24.04.4 LTS
  • Storage: 512GB NVMe SSD

Dataset Schema: 11 Columns (3 Int, 4 String, 2 Float, 1 Bool, 1 Date), 12 Million Rows. Details can be found in the generation script here.

File Size: 1.04 GB (CSV) / 225 MB (Parquet)

Execution Mode         | netra-profiler (Parquet) | ydata-profiling (Parquet) | netra-profiler (CSV) | ydata-profiling (CSV)
All stats, All columns | 3.20s (3.9 GB RAM)       | 181.26s (13.7 GB RAM)     | 6.23s (7.1 GB RAM)   | 222.05s (11.5 GB RAM)
Ignore Primary Key     | 2.78s (2.1 GB RAM)       | 134.97s (10.4 GB RAM)     | 5.60s (6.4 GB RAM)   | 161.74s (8.6 GB RAM)
low-memory / minimal   | 1.41s (1.9 GB RAM)       | 24.54s (5.0 GB RAM)       | 3.92s (5.4 GB RAM)   | 33.25s (3.9 GB RAM)

low-memory (netra-profiler): Replaces exact unique counts with an approximate method (HyperLogLog) and skips global sorts (skew, kurtosis and quantiles). Crucially, it retains the Pearson/Spearman correlation matrices by using a 100,000-row systematic sample.

minimal (ydata-profiling): Turns off the most expensive computations, and entirely disables the correlation matrices.
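The 100,000-row systematic sample mentioned for low-memory mode can be pictured as taking rows at a fixed stride across the dataset. The sketch below only illustrates that idea: the helper name `systematic_sample_indices` and the stride rule are assumptions made for this example, not Netra Profiler's actual implementation.

```python
def systematic_sample_indices(n_rows: int, sample_size: int, offset: int = 0) -> range:
    """Evenly spaced row indices (hypothetical helper, not part of netra-profiler)."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    if n_rows <= sample_size:
        return range(n_rows)  # small inputs: keep every row
    step = n_rows // sample_size  # stride between sampled rows
    if not 0 <= offset < step:
        raise ValueError("offset must be in [0, step)")
    # Slice so that at most sample_size indices are returned.
    return range(offset, n_rows, step)[:sample_size]


# 12 million rows (the benchmark size) reduced to 100,000 sampled rows.
indices = systematic_sample_indices(12_000_000, 100_000)
print(len(indices), indices[0], indices[1], indices[-1])
```

Because the indices are evenly spaced, the sample covers the whole file rather than only its head, which matters when row order correlates with time or ID.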

Performance Takeaways

  • On this benchmark, Netra Profiler is 35x to 56x faster than the Pandas-based ydata-profiling out of the box, and uses up to 71% less memory (3.9 GB vs 13.7 GB on Parquet). Standard workloads that take minutes now finish in seconds.
  • When high-cardinality primary keys are ignored, the engine maintains a ~48x speed advantage on Parquet while operating on just 2.1 GB of RAM (an 80% reduction vs ydata-profiling's 10.4 GB).
  • The low-memory mode maximizes your hardware's Data Envelope, allowing consumer hardware to handle large data workloads that otherwise require a scaled-up cloud node or distributed compute.
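The ratios quoted in these takeaways can be re-derived from the benchmark table. The snippet below is a small worked calculation using only the published numbers; the dictionary layout and function names are illustrative choices.

```python
# Timings (seconds) and peak RAM (GB) copied from the benchmark table above.
# Each value: (netra time, netra RAM, ydata time, ydata RAM)
results = {
    ("All stats, All columns", "Parquet"): (3.20, 3.9, 181.26, 13.7),
    ("All stats, All columns", "CSV"): (6.23, 7.1, 222.05, 11.5),
    ("Ignore Primary Key", "Parquet"): (2.78, 2.1, 134.97, 10.4),
    ("Ignore Primary Key", "CSV"): (5.60, 6.4, 161.74, 8.6),
    ("low-memory / minimal", "Parquet"): (1.41, 1.9, 24.54, 5.0),
    ("low-memory / minimal", "CSV"): (3.92, 5.4, 33.25, 3.9),
}


def speedup(netra_s: float, ydata_s: float) -> float:
    """How many times faster netra-profiler finished."""
    return ydata_s / netra_s


def ram_reduction_pct(netra_gb: float, ydata_gb: float) -> float:
    """Percent less peak RAM used by netra-profiler (negative means more)."""
    return 100.0 * (1.0 - netra_gb / ydata_gb)


summary = {}
for (mode, fmt), (n_s, n_gb, y_s, y_gb) in results.items():
    summary[(mode, fmt)] = (speedup(n_s, y_s), ram_reduction_pct(n_gb, y_gb))
    x, pct = summary[(mode, fmt)]
    print(f"{mode:<24} {fmt:<8} {x:5.1f}x faster, {pct:6.1f}% less RAM")
```

The last row is not a like-for-like comparison, since low-memory (netra-profiler) and minimal (ydata-profiling) disable different computations; a negative RAM figure there means netra-profiler used more memory in that configuration.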

To determine the actual extent of the Data Envelope for the test hardware, we ran the profiler on progressively larger versions of the benchmark dataset, in Parquet format:

Rows | Engine Time | Peak RAM Usage
12M  | 1.41s       | 1.9 GB
100M | 11.08s      | 5.5 GB
300M | 33.76s      | 13.2 GB
500M | 57.38s      | 20.9 GB
700M | 83.16s      | 25.2 GB
900M | 104.00s     | 29.7 GB
950M | 114.35s     | 30.3 GB

The netra-profiler engine scales predictably thanks to its streaming-first architecture and its sampling strategy for calculating correlations. In the table above, engine time grows roughly linearly with row count (about 0.11 to 0.12 seconds per million rows), and peak RAM grows steadily with dataset size instead of spiking. This predictable scaling avoids sudden Out-Of-Memory (OOM) crashes and allows you to forecast your hardware limits.
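One way to turn the scaling table into a forecast is an ordinary least-squares line through the published points. This is a sketch on the seven measurements above, not a performance model from the project; the function names are illustrative, and any extrapolation beyond 950M rows or to other hardware is an assumption.

```python
def least_squares(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Data copied from the scaling table (rows in millions).
rows_m = [12, 100, 300, 500, 700, 900, 950]
time_s = [1.41, 11.08, 33.76, 57.38, 83.16, 104.00, 114.35]
ram_gb = [1.9, 5.5, 13.2, 20.9, 25.2, 29.7, 30.3]

time_slope, time_intercept = least_squares(rows_m, time_s)
ram_slope, ram_intercept = least_squares(rows_m, ram_gb)


def predict(slope: float, intercept: float, rows_millions: float) -> float:
    """Evaluate the fitted line at a given row count (in millions)."""
    return intercept + slope * rows_millions


print(f"time ~ {time_intercept:.2f} s + {time_slope:.4f} s per million rows")
print(f"RAM  ~ {ram_intercept:.2f} GB + {ram_slope:.4f} GB per million rows")
print(f"forecast at 600M rows: {predict(time_slope, time_intercept, 600):.1f} s, "
      f"{predict(ram_slope, ram_intercept, 600):.1f} GB")
```

On these points the fit comes out at roughly 0.12 s and 0.03 GB per million rows, with a RAM baseline of roughly 3 GB. Treat it as a planning aid for similar hardware and schemas, not a guarantee.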

[Figure: Netra Profiler Linear Scaling: 12M to 950M Rows]

B. Cloud Scale-Up (Vertical Scaling)

For multi-billion row datasets, deploying the profiler on a single heavy cloud instance bypasses the network-shuffle and orchestration bottlenecks of distributed systems, offering extreme performance without the cluster management overhead.

(Benchmarks are in development)

C. Distributed Multi-Node (Horizontal Scaling)

Netra Profiler’s core engine is built purely on the Polars Lazy API, which means it is natively compatible with the Polars Distributed Layer out of the box. Moving from a local 1-billion-row workload to a multi-node 100-billion-row cloud workload requires zero code rewrites.

(Benchmarks are in development)

Features

  • Multi-Core Streaming Engine: Built on Polars to completely bypass the Python GIL and utilize 100% of your CPU cores. By leveraging zero-copy Apache Arrow memory, Netra streams data directly from disk, eliminating the massive intermediate RAM spikes associated with traditional Pandas-based data processing.
  • Low-Memory Mode: Process large datasets without crashing your machine. By passing the --low-memory flag, Netra intelligently switches to approximate counting and sampling techniques to keep RAM usage low.
  • Comprehensive Profiling: Automatically extracts scalar statistics (min, max, mean, skew, kurtosis), streaming distributions (histograms), Top-K frequent values, and Pearson/Spearman correlation matrices.
  • Complex Type Support: Automatically flattens nested Structs and computes length statistics for Lists and Arrays, allowing you to profile complex JSON or Parquet files with zero configuration.
  • Built-in Quality Alerts: Stop bad data before it enters your pipeline. Netra's diagnostics engine automatically flags critical issues like zero-inflation, corrupted primary keys, extreme skewness, and high null percentages.
  • CI/CD Pipeline Gatekeeper: Use strict exit codes (--fail-on-critical or --fail-on-warnings) to automatically act as a Data Firewall, breaking your CI/CD builds (GitHub Actions, Airflow, GitLab CI) if corrupted data enters the pipeline.
  • Beautiful Terminal UI: Includes an information-dense, highly readable CLI dashboard to profile and check your data health directly in the terminal.
  • JSON Data Contracts: Export the full diagnostic profile to a strictly typed JSON artifact (netra profile data.parquet --json) for CI/CD data quality gates, a metadata feed for data catalogs, or context for LLM-based data agents.
  • Python API: Integrate seamlessly into your data engineering pipelines (Airflow DAGs, Marimo/Jupyter Notebooks, CI/CD) with a clean, expressive programmatic interface.
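The approximate counting behind Low-Memory Mode is described earlier as HyperLogLog. To show why that keeps memory bounded, here is a minimal pure-Python sketch of the HyperLogLog idea. It is illustrative only: Netra Profiler relies on Polars for this, and the function name `hll_estimate`, the SHA-1 hash, and the register count are assumptions of this example.

```python
import hashlib
import math


def hll_estimate(values, p: int = 12) -> float:
    """Approximate distinct count using 2**p small registers (illustrative sketch)."""
    m = 1 << p
    registers = [0] * m
    for v in values:
        # 64-bit hash of the value's string form.
        h = int.from_bytes(hashlib.sha1(str(v).encode()).digest()[:8], "big")
        idx = h >> (64 - p)                # first p bits pick a register
        rest = h & ((1 << (64 - p)) - 1)   # remaining bits
        # Position of the leftmost 1-bit in the remaining (64 - p) bits.
        rank = (64 - p) - rest.bit_length() + 1
        if rank > registers[idx]:
            registers[idx] = rank
    alpha = 0.7213 / (1 + 1.079 / m)       # bias constant for m >= 128
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        return m * math.log(m / zeros)     # small-range (linear counting) correction
    return raw


# 200,000 values with 50,000 distinct ones.
estimate = hll_estimate(i % 50_000 for i in range(200_000))
print(f"exact=50000 approx={estimate:.0f}")
```

Memory stays fixed at 4,096 registers no matter how many rows stream through, which is the property that makes this kind of counter attractive for high-cardinality columns. The trade-off is a small relative error, on the order of a few percent at this register count.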

Installation

Netra Profiler is built for speed. We recommend installing it with uv, the blazing-fast Python package installer:

uv pip install netra-profiler

(Or use standard pip install netra-profiler)

Quickstart

1. The CLI

The fastest way to profile your data is right from the command line. netra-profiler natively supports .csv, .parquet, .json, and .arrow files.

netra profile path/to/your/dataset.csv

[Image: Netra Profiler CLI]

Advanced Execution Options

You can combine flags to handle massive or messy datasets with ease:

  • --fail-on-critical: Enables the Active Quality Gate. Breaks the pipeline (exits with code 1) if any CRITICAL anomalies are found.
  • --fail-on-warnings: Stricter Quality Gate. Breaks the pipeline if ANY anomalies (Warning or Critical) are found.
  • --low-memory: Triggers the low-memory execution path (approximate counting and sampled correlations).
  • -i, --ignore <column>: Skip profiling for a specific column (perfect for high-cardinality IDs, hashes, or PII).
  • --full-inference: Forces full-file schema inference. Crucial for messy CSVs where data types might silently change deep in the file.
  • --json: Disables the visual CLI output and generates the raw profile payload as a JSON string. Ideal for piping to jq or redirecting to a file: > profile.json.
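When the built-in --fail-on-critical and --fail-on-warnings gates are not flexible enough, the --json output can be fed to a small script that applies a custom policy. The sketch below assumes the JSON payload mirrors the dictionary used in the Python API example (a top-level "alerts" list whose items carry "level", "column_name", and "message"); that structure, the function name `gate_exit_code`, and the sample payload are assumptions of this example, not documented behavior.

```python
import json


def gate_exit_code(payload: str, fail_on_warnings: bool = False) -> int:
    """Return 1 if the profile contains blocking alerts, else 0 (hypothetical helper)."""
    profile = json.loads(payload)
    blocking = {"CRITICAL", "WARNING"} if fail_on_warnings else {"CRITICAL"}
    hits = [a for a in profile.get("alerts", []) if a.get("level") in blocking]
    for alert in hits:
        print(f"[{alert.get('level')}] {alert.get('column_name')}: {alert.get('message')}")
    return 1 if hits else 0


# Hand-written sample payload for illustration; not real profiler output.
sample = json.dumps({
    "dataset": {"row_count": 1000},
    "alerts": [
        {"level": "WARNING", "column_name": "revenue", "message": "high null percentage"},
    ],
})

code_default = gate_exit_code(sample)
code_strict = gate_exit_code(sample, fail_on_warnings=True)
print(code_default, code_strict)
```

In a pipeline, the same function could read the payload from standard input and pass its result to `sys.exit`, so the CI job fails exactly when the custom policy says it should.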

2. The Python API

Netra Profiler exposes a fully typed Python API that accepts Polars DataFrames natively. The output is a rigidly typed Data Contract, making it perfect for programmatic quality gates.

import polars as pl
from netra_profiler import Profiler

# 1. Load your data using Polars (Eager or Lazy)
df = pl.scan_parquet("sales_data.parquet")

# 2. Initialize the Profiler with the configuration
profiler = Profiler(
    df=df,
    dataset_name="Q3_Sales",
    ignore_columns=["transaction_id", "customer_hash"], # Drop highly cardinal IDs to save RAM 
)

# 3. Execute the profiling graph
profile = profiler.run(bins=20, top_k=10)

# 4. Access the strictly typed metrics
print(f"Total Rows Profiled: {profile['dataset']['row_count']:,}")

if "revenue" in profile["columns"]:
    mean_revenue = profile["columns"]["revenue"].get("mean")
    if mean_revenue is not None:  # .get() returns None when the metric is absent
        print(f"Revenue Mean: ${mean_revenue:.2f}")

# 5. Programmatic Data Quality Gates
# Alerts are categorized by severity (CRITICAL, WARNING, INFO)
alerts = profile.get("alerts", [])
critical_issues = [a for a in alerts if a["level"] == "CRITICAL"]

if critical_issues:
    print(f"\n[PIPELINE HALTED] Found {len(critical_issues)} critical data issues!")
    for issue in critical_issues:
        print(f" - [{issue['column_name']}] {issue['type']}: {issue['message']}")
    raise ValueError("Data quality checks failed. Upstream data contract violated.")
