Netra Profiler

High-performance profiling and data quality tool built with Polars

Netra Profiler is a next-generation data profiling tool and diagnostic engine built on top of Polars. Designed to operate at the speed of your disk I/O, it leverages Polars' Rust-based query optimizer and zero-copy Apache Arrow memory model to extract the maximum profiling throughput from your local hardware. Netra processes massive datasets with predictable, linear RAM usage, eliminating the sudden memory spikes and crashes associated with traditional Python tools.

The profiler ships with a comprehensive diagnostic engine that detects column-wise data quality issues early in your analysis or modeling workflows, such as high zero/null counts, high cardinality, data skew, and more. The tool includes a detailed, zero-configuration CLI for quickly profiling your CSV, JSON, Arrow/IPC, or Parquet files.
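
As a rough illustration (not Netra's internal implementation), the null-percentage and cardinality checks mentioned above reduce to a handful of Polars expressions. File name and thresholds below are made up:

import polars as pl

# Illustrative sketch only: approximate Polars equivalents of two of the
# column-wise checks described above. File name and thresholds are hypothetical.
df = pl.read_csv("dataset.csv")

for name in df.columns:
    col = pl.col(name)
    null_pct, distinct_ratio = df.select(
        (col.null_count() / pl.len()).alias("null_pct"),        # share of missing values
        (col.n_unique() / pl.len()).alias("distinct_ratio"),    # cardinality vs. row count
    ).row(0)
    if null_pct > 0.5:
        print(f"{name}: high null percentage ({null_pct:.1%})")
    if distinct_ratio > 0.95:
        print(f"{name}: high cardinality ({distinct_ratio:.1%} distinct)")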

Performance & The Data Envelope

The Data Envelope is the maximum size and complexity of data your organization can process within its hardware limitations or cloud cost limits. Netra Profiler is designed to be a value multiplier for your existing hardware: it expands your data envelope to cover larger workloads and makes your current workflow faster and more efficient, which means less time and money spent running profiling tasks.

A. Single-Node Workstation

Tested locally on a consumer laptop with the following specifications:

  • CPU: Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz (12 Cores)
  • RAM: 32 GB
  • OS: Ubuntu 24.04.4 LTS
  • Storage: 512GB NVMe SSD

Dataset Schema: 11 Columns (3 Int, 4 String, 2 Float, 1 Bool, 1 Date), 12 Million Rows. Details can be found in the dataset generation script.

File Size: 1.04 GB (CSV) / 225 MB (Parquet)

| Execution Mode | netra-profiler (Parquet) | ydata-profiling (Parquet) | netra-profiler (CSV) | ydata-profiling (CSV) |
| --- | --- | --- | --- | --- |
| All stats, All columns | 3.20s (3.9 GB RAM) | 181.26s (13.7 GB RAM) | 6.23s (7.1 GB RAM) | 222.05s (11.5 GB RAM) |
| Ignore Primary Key | 2.78s (2.1 GB RAM) | 134.97s (10.4 GB RAM) | 5.60s (6.4 GB RAM) | 161.74s (8.6 GB RAM) |
| low-memory / minimal | 1.41s (1.9 GB RAM) | 24.54s (5.0 GB RAM) | 3.92s (5.4 GB RAM) | 33.25s (3.9 GB RAM) |

low-memory (netra-profiler): Replaces exact unique counts with an approximate method (HyperLogLog) and skips computations that require global sorts (skew, kurtosis, and quantiles). Crucially, it retains the Pearson/Spearman correlation matrices by computing them on a 100,000-row systematic sample.

minimal (ydata-profiling): Turns off the most expensive computations and disables the correlation matrices entirely.
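
For intuition, netra-profiler's two low-memory techniques map onto standard Polars primitives. A minimal sketch (file name hypothetical; this is not Netra's internal code):

import polars as pl

lf = pl.scan_parquet("dataset.parquet")  # hypothetical file

# Approximate distinct counts: Polars' approx_n_unique() uses a
# HyperLogLog-style sketch instead of materializing every value.
approx_uniques = lf.select(pl.all().approx_n_unique()).collect()

# Systematic sample: take every k-th row so downstream correlations
# run on a bounded (~100,000-row) subset regardless of file size.
n_rows = lf.select(pl.len()).collect().item()
k = max(n_rows // 100_000, 1)
sample = lf.gather_every(k).collect()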

Performance Takeaways

  • Netra Profiler is 35x to 56x faster than traditional Pandas-based profiling out of the box, and uses up to 71% less memory. Standard workloads that take minutes now finish in seconds.
  • When dropping high-cardinality primary keys, the engine maintains a ~48x speed advantage while operating on just 2.1 GB of RAM (an 80% reduction vs. Pandas).
  • The low-memory mode maximizes your hardware's Data Envelope, allowing consumer hardware to handle large data workloads that otherwise require a scaled-up cloud node or distributed compute.

To determine the actual extent of the Data Envelope for the test hardware, we ran the profiler on progressively larger row counts of the benchmark dataset, in Parquet format:

| Rows | Engine Time | Peak RAM Usage |
| --- | --- | --- |
| 12M | 1.41s | 1.9 GB |
| 100M | 11.08s | 5.5 GB |
| 300M | 33.76s | 13.2 GB |
| 500M | 57.38s | 20.9 GB |
| 700M | 83.16s | 25.2 GB |
| 900M | 104.00s | 29.7 GB |
| 950M | 114.35s | 30.3 GB |

The netra-profiler engine scales linearly and predictably thanks to its streaming-first architecture and the sampling strategy for calculating correlations, which together keep memory growth linear in the dataset size. This predictable scaling eliminates sudden Out-Of-Memory (OOM) crashes and allows you to accurately forecast your hardware limits.
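
The mechanics behind this are visible in plain Polars: a lazy query collected through the streaming engine executes in batches rather than after a full load. A minimal sketch (file name hypothetical):

import polars as pl

# The streaming engine executes this plan in batches, so peak RAM is
# governed by batch size plus aggregation state, not by the file size.
stats = (
    pl.scan_parquet("big_dataset.parquet")  # hypothetical file
    .select(
        pl.all().min().name.suffix("_min"),
        pl.all().max().name.suffix("_max"),
    )
    .collect(engine="streaming")  # older Polars versions: streaming=True
)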

Netra Profiler Linear Scaling: 12M to 950M Rows

B. Cloud Scale-Up (Vertical Scaling)

For multi-billion row datasets, deploying the profiler on a single heavy cloud instance bypasses the network-shuffle and orchestration bottlenecks of distributed systems, offering extreme performance without the cluster management overhead.

(Benchmarks are in development)

C. Distributed Multi-Node (Horizontal Scaling)

Netra Profiler’s core engine is built purely on the Polars Lazy API, which makes it natively compatible with the Polars Distributed Layer out of the box. Moving from a local 1-billion-row workload to a multi-node 100-billion-row cloud workload requires zero code rewrites.
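
The claim is easiest to see in code: a profiling plan is just a LazyFrame query graph, and the graph itself is execution-target agnostic. A hedged sketch (dataset path hypothetical; the actual distributed submission depends on your Polars Cloud setup, so it is left as a comment):

import polars as pl

# The plan below is a pure LazyFrame: nothing executes until collect().
plan = (
    pl.scan_parquet("events/*.parquet")  # hypothetical dataset
    .select(pl.all().null_count())
)

local_result = plan.collect()  # local multi-core execution
# The identical `plan` could be handed to the Polars distributed layer;
# only the execution target changes, the query graph does not.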

(Benchmarks are in development)

Features

  • Multi-Core Streaming Engine: Built on Polars to completely bypass the Python GIL and utilize 100% of your CPU cores. By leveraging zero-copy Apache Arrow memory, Netra streams data directly from disk, eliminating the massive intermediate RAM spikes associated with traditional Pandas-based data processing.
  • Low-Memory Mode: Process large datasets without crashing your machine. By passing the --low-memory flag, Netra intelligently switches to approximate counting and sampling techniques to keep RAM usage low.
  • Comprehensive Profiling: Automatically extracts scalar statistics (min, max, mean, skew, kurtosis), streaming distributions (histograms), Top-K frequent values, and Pearson/Spearman correlation matrices.
  • Complex Type Support: Automatically flattens nested Structs and computes length statistics for Lists and Arrays, allowing you to profile complex JSON or Parquet files with zero configuration (see the sketch after this list).
  • Built-in Quality Alerts: Stop bad data before it enters your pipeline. Netra's diagnostics engine automatically flags critical issues like zero-inflation, corrupted primary keys, extreme skewness, and high null percentages.
  • Beautiful Terminal UI: Includes an information-dense, highly readable CLI dashboard to profile and check your data health directly in the terminal.
  • JSON Data Contracts: Export the full diagnostic profile to a strictly typed JSON artifact (netra profile data.parquet --json) for CI/CD data quality gates, a metadata feed for data catalogs, or context for LLM-based data agents.
  • Python API: Integrate seamlessly into your data engineering pipelines (Airflow DAGs, Marimo/Jupyter Notebooks, CI/CD) with a clean, expressive programmatic interface.
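
To make the Complex Type Support bullet concrete, here is a rough sketch of the underlying Polars operations, using a made-up nested schema (this illustrates the technique, not Netra's internals):

import polars as pl

# Made-up nested data: a struct column and a list column.
df = pl.DataFrame({
    "user": [{"id": 1, "country": "DE"}, {"id": 2, "country": "FR"}],
    "clicks": [[1, 2, 3], [4]],
})

# Struct fields become ordinary top-level columns...
flat = df.unnest("user")

# ...and list columns reduce to per-row length statistics.
lengths = df.select(pl.col("clicks").list.len().alias("clicks_len"))
print(flat)
print(lengths.describe())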

Installation

Netra Profiler is built for speed. We recommend installing it with uv, the blazing-fast Python package installer:

uv pip install netra-profiler

(Or use standard pip install netra-profiler)

Quickstart

1. The CLI

The fastest way to profile your data is right from the command line. netra-profiler natively supports .csv, .parquet, .json, and .arrow files.

netra profile path/to/your/dataset.csv

Netra Profiler CLI

Advanced Execution Options

You can combine flags to handle massive or messy datasets with ease:

  • --low-memory: Triggers the bounded-memory execution path (approximate counting and map-reduce sampling) to profile out-of-core datasets without exhausting your RAM.
  • -i, --ignore <column>: Skip profiling for a specific column (perfect for high-cardinality IDs, hashes, or PII).
  • --full-inference: Forces full-file schema inference. Crucial for messy CSVs where data types might silently change deep in the file.
  • --json: Disables the visual CLI output and generates the raw profile payload as a JSON string. Ideal for piping to jq or redirecting to a file: > profile.json.
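
For example, the flags above compose into a single command (file and column names hypothetical):

netra profile events.csv --low-memory -i session_id --full-inference --json > profile.json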

2. The Python API

Netra Profiler exposes a fully typed Python API that accepts Polars DataFrames natively. The output is a strictly typed Data Contract, making it perfect for programmatic quality gates.

import polars as pl
from netra_profiler import Profiler

# 1. Load your data using Polars (Eager or Lazy)
df = pl.scan_parquet("sales_data.parquet")

# 2. Initialize the Profiler with the configuration
profiler = Profiler(
    df=df,
    dataset_name="Q3_Sales",
    ignore_columns=["transaction_id", "customer_hash"], # Drop high-cardinality IDs to save RAM
)

# 3. Execute the profiling graph
profile = profiler.run(bins=20, top_k=10)

# 4. Access the strictly typed metrics
print(f"Total Rows Profiled: {profile['dataset']['row_count']:,}")

if "revenue" in profile["columns"]:
    mean_revenue = profile["columns"]["revenue"].get("mean")
    print(f"Revenue Mean: ${mean_revenue:.2f}")

# 5. Programmatic Data Quality Gates
# Alerts are categorized by severity (CRITICAL, WARNING, INFO)
alerts = profile.get("alerts", [])
critical_issues = [a for a in alerts if a["level"] == "CRITICAL"]

if critical_issues:
    print(f"\n[PIPELINE HALTED] Found {len(critical_issues)} critical data issues!")
    for issue in critical_issues:
        print(f" - [{issue['column_name']}] {issue['type']}: {issue['message']}")
    raise ValueError("Data quality checks failed. Upstream data contract violated.")
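
Since the profile behaves as a plain dictionary in the example above, persisting the same JSON artifact the CLI emits is a one-liner (assuming the object is JSON-serializable):

import json

# Persist the data contract for CI/CD gates or a data catalog feed.
with open("q3_sales_profile.json", "w") as f:
    json.dump(profile, f, indent=2)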

License

This software is licensed under the MIT License.
