High-Performance DataFrame Engine powered by Rust (The PardoX Project)

PardoX — High-Performance DataFrame Engine

PyPI version · License: MIT · Python 3.8+ · Powered By Rust

The Speed of Rust. The Simplicity of Python.

PardoX is a next-generation DataFrame engine for high-performance ETL, data analysis, and database integration. A single Rust core powers the entire computation layer — Python is just the interface.

v0.3.4 is now available. It adds PRDX streaming to PostgreSQL (validated on 150M rows), GroupBy, string and date operations, window functions, a lazy pipeline, SQL over DataFrames, encryption, data contracts, time travel, Arrow Flight, a distributed cluster, linear algebra, a REST connector, and cloud storage: 29 features in total.


⚡ Why PardoX?

Capability | How
Zero-copy ingestion | Multi-threaded Rust CSV parser, data flows directly into HyperBlock buffers
SIMD arithmetic | AVX2 / NEON instructions, 5x–20x faster than Python loops
Native database I/O | Connect to PostgreSQL, MySQL, SQL Server, MongoDB with no psycopg2, no pymysql, no ORM
PRDX Streaming | Stream 150M-row files to PostgreSQL at ~306k rows/s with O(block) RAM
GPU sort | WebGPU Bitonic sort with transparent CPU fallback
GroupBy + Window | Aggregations, rolling, rank, lag/lead in pure Rust
NumPy bridge | Zero-copy np.array(df['col']), a direct pointer into the Rust buffer
Zero dependencies | Pure ctypes, no pip dependencies required
Cross-platform | Linux x64 · Windows x64 · macOS Intel · macOS Apple Silicon
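
The GPU sort row refers to bitonic sort, a sorting network whose compare-and-swap steps within a stage are independent of each other, which is why it suits GPUs. The block below is only a CPU sketch of the algorithm in pure Python, not PardoX's WebGPU kernel. The function names are hypothetical, and it assumes the input length is a power of two.

```python
def _bitonic_merge(values, ascending):
    """Merge a bitonic sequence into a fully sorted one."""
    n = len(values)
    if n <= 1:
        return list(values)
    values = list(values)
    half = n // 2
    # Every compare-and-swap in this loop is independent of the others,
    # so a GPU could run them all in parallel.
    for i in range(half):
        if (values[i] > values[i + half]) == ascending:
            values[i], values[i + half] = values[i + half], values[i]
    return (_bitonic_merge(values[:half], ascending)
            + _bitonic_merge(values[half:], ascending))


def bitonic_sort(values, ascending=True):
    """Sort a list whose length is a power of two (assumption of this sketch)."""
    n = len(values)
    if n <= 1:
        return list(values)
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    half = n // 2
    # Sort the halves in opposite directions to form a bitonic sequence.
    first = bitonic_sort(values[:half], True)
    second = bitonic_sort(values[half:], False)
    return _bitonic_merge(first + second, ascending)


print(bitonic_sort([3, 7, 4, 8, 6, 2, 1, 5]))
```

A production implementation would pad inputs that are not a power of two and run each stage as one GPU dispatch.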

📦 Installation

pip install pardox

No Rust compiler. No C extensions to build. No database drivers to install.

Requirements: Python 3.8+


🚀 Quick Start

import pardox as px
from pardox.io import execute_sql

# Load 100,000 rows — parallel Rust CSV parser
df = px.read_csv("sales_data.csv")
print(f"Loaded {df.shape[0]:,} rows × {df.shape[1]} columns")

# SIMD-accelerated arithmetic
df.cast("quantity", "Float64")
df['revenue'] = df['price'] * df['quantity']

# Statistics — pure Rust, no NumPy needed
print(f"Total revenue : ${df['revenue'].sum():,.2f}")
print(f"Avg ticket    : ${df['revenue'].mean():,.2f}")

# GroupBy aggregation
grouped = df.groupby("state", {"revenue": "sum", "quantity": "mean"})

# Write to PostgreSQL — COPY FROM STDIN auto-activated for > 10k rows
CONN = "postgresql://user:password@localhost:5432/mydb"
execute_sql(CONN, "CREATE TABLE IF NOT EXISTS sales (price FLOAT, quantity FLOAT, revenue FLOAT)")
rows = df.to_sql(CONN, "sales", mode="append")
print(f"Written {rows:,} rows to PostgreSQL")

# Save locally — 4.6 GB/s read throughput
df.to_prdx("sales_processed.prdx")

🗄️ What's New in v0.3.4

PRDX Streaming to PostgreSQL

Stream a .prdx file directly to PostgreSQL — without loading the file into RAM. O(block) memory regardless of file size.

from pardox import write_sql_prdx

rows = write_sql_prdx(
    "ventas_150m.prdx",                         # .prdx file
    "postgresql://user:pass@localhost:5432/db",  # connection string
    "ventas",                                   # table (must already exist)
    mode="append",
    conflict_cols=[],
    batch_rows=1_000_000
)
print(f"Streamed {rows:,} rows")
# Validated: 150,000,000 rows / 3.8 GB in ~490s at ~306k rows/s

Approach | RAM used
px.read_prdx() then df.to_sql() | Entire file (3.8 GB for 150M rows)
write_sql_prdx() | O(one block), typically < 200 MB
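
The validated figures above can be checked with simple arithmetic. This sketch assumes `batch_rows` is the number of rows sent per batch, which the text does not state explicitly. The helper names are hypothetical.

```python
import math


def rows_per_second(total_rows: int, seconds: float) -> float:
    """Average throughput over the whole run."""
    return total_rows / seconds


def batch_count(total_rows: int, batch_rows: int) -> int:
    """Number of batches needed if each batch carries batch_rows rows."""
    return math.ceil(total_rows / batch_rows)


rate = rows_per_second(150_000_000, 490)
batches = batch_count(150_000_000, 1_000_000)
print(f"{rate:,.0f} rows/s over {batches} batches")
```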

GroupBy Aggregation

# Single aggregation
grouped = df.groupby("category", {"revenue": "sum"})

# Multiple aggregations
grouped = df.groupby("state", {
    "revenue":  "sum",
    "price":    "mean",
    "quantity": "count",
})
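
To make the aggregation spec concrete, here is a plain-Python reference of what a `{column: aggregation}` dictionary means. It is a sketch of the semantics only, not PardoX's Rust implementation. The function name, the sample rows, and the shape of the result are assumptions.

```python
from collections import defaultdict


def groupby_reference(rows, key, aggs):
    """rows: list of dicts; aggs: {column: "sum" | "mean" | "count" | "min" | "max"}."""
    reducers = {
        "sum": sum,
        "mean": lambda values: sum(values) / len(values),
        "count": len,
        "min": min,
        "max": max,
    }
    # Collect the values of each aggregated column per group.
    buckets = defaultdict(lambda: defaultdict(list))
    for row in rows:
        for column in aggs:
            buckets[row[key]][column].append(row[column])
    # Reduce each column with the aggregation named in the spec.
    return {
        group: {col: reducers[aggs[col]](values) for col, values in cols.items()}
        for group, cols in buckets.items()
    }


rows = [
    {"state": "TX", "revenue": 10.0, "price": 5.0, "quantity": 2},
    {"state": "TX", "revenue": 30.0, "price": 15.0, "quantity": 2},
    {"state": "CA", "revenue": 8.0, "price": 4.0, "quantity": 2},
]
result = groupby_reference(
    rows, "state", {"revenue": "sum", "price": "mean", "quantity": "count"}
)
print(result)
```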

String & Date Operations

# String ops
df.str_upper("name")
df.str_lower("email")
df.str_trim("description")
df.str_contains("tag", "python")
df.str_replace("status", "old", "new")

# Date ops
df.date_extract("created_at", "year")   # → 'result_year'
df.date_format("created_at", "%Y-%m")
df.date_diff("end_date", "start_date")
df.date_add("created_at", 30, "day")

Window Functions

df.row_number("price")
df.rank("revenue", method="dense")
df.lag("price", 1)
df.lead("price", 1)
df.rolling_mean("price", 7)    # 7-period moving average
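
For readers unfamiliar with window functions, the following plain-Python sketch shows the usual definitions of lag, lead, rolling mean, and dense rank. How PardoX fills positions with no value, and whether ranks are ascending, is not stated above. Using `None` and ascending order here is an assumption, and the function names are illustrative.

```python
def lag(values, n=1):
    """Shift values down by n rows (0 <= n <= len); the first n rows have no predecessor."""
    return [None] * n + list(values[: len(values) - n])


def lead(values, n=1):
    """Shift values up by n rows; the last n rows have no successor."""
    return list(values[n:]) + [None] * n


def rolling_mean(values, window):
    """Mean of the current row and the window - 1 rows before it."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)  # not enough history yet
        else:
            chunk = values[i + 1 - window : i + 1]
            out.append(sum(chunk) / window)
    return out


def dense_rank(values):
    """Ascending rank with no gaps: equal values share a rank."""
    order = {v: r for r, v in enumerate(sorted(set(values)), start=1)}
    return [order[v] for v in values]


prices = [10, 20, 30, 40]
print(lag(prices), lead(prices), rolling_mean(prices, 2))
print(dense_rank([50, 20, 50, 30]))
```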

Lazy Pipeline

import pardox as px

# Scan without loading — filter and collect on demand
result = (
    px.scan_csv("large_file.csv")
    .select(["id", "price", "state"])
    .filter("price", ">", 100.0)
    .limit(10_000)
    .collect()
)
print(f"{result.shape[0]} rows")
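
The idea behind a lazy pipeline can be sketched with standard-library generators: each stage pulls rows from the previous one only when asked, and the limit stops reading early. This illustrates the concept, not how `scan_csv` is implemented. The helper names and the sample data are hypothetical.

```python
import csv
import io
from itertools import islice


def scan_rows(handle):
    """Yield one dict per CSV row; nothing is parsed until iteration starts."""
    yield from csv.DictReader(handle)


def select(rows, columns):
    """Keep only the requested columns."""
    for row in rows:
        yield {name: row[name] for name in columns}


def where_gt(rows, column, threshold):
    """Keep rows whose numeric column exceeds the threshold."""
    for row in rows:
        if float(row[column]) > threshold:
            yield row


sample = io.StringIO(
    "id,price,state,notes\n"
    "1,250.0,TX,a\n"
    "2,80.0,CA,b\n"
    "3,120.5,NY,c\n"
    "4,300.0,TX,d\n"
)

# select -> filter -> limit; list() plays the role of collect().
pipeline = islice(
    where_gt(select(scan_rows(sample), ["id", "price", "state"]), "price", 100.0),
    2,
)
result = list(pipeline)
print(result)
```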

SQL over DataFrames

# Run SQL directly on a DataFrame in memory
result = df.sql("SELECT state, SUM(revenue) as total FROM df GROUP BY state")
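
To show what that query computes, the same statement can be run against an in-memory SQLite table named `df`. This uses the standard-library `sqlite3` module as a stand-in. It says nothing about PardoX's SQL dialect or engine, and the sample rows are made up. An `ORDER BY` is added only to make the output order deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE df (state TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO df VALUES (?, ?)",
    [("TX", 100.0), ("CA", 40.0), ("TX", 60.0), ("CA", 10.0)],
)

# One output row per state, with revenue summed within each group.
totals = conn.execute(
    "SELECT state, SUM(revenue) AS total FROM df GROUP BY state ORDER BY state"
).fetchall()
conn.close()
print(totals)
```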

Out-of-Core Processing (Large Datasets > RAM)

Work with a .prdx file on disk without ever loading the full dataset: stream it to PostgreSQL with O(batch) RAM and run analytics on it directly. This is ideal for 100M+ row workloads.

import pardox as px

# Step 1: Stream orders.prdx into the existing PostgreSQL table "orders" (O(batch) RAM, ~200 MB regardless of file size)
rows = px.write_sql_prdx(
    "orders.prdx",
    "postgresql://user:pass@localhost:5432/db",
    "orders",
    mode="append",
    conflict_cols=[],
    batch_rows=1_000_000
)
print(f"Streamed {rows:,} rows")   # validated: 150M rows / ~490s / ~306k rows/s

# Step 2: GroupBy directly on .prdx — memory = O(distinct groups), not O(rows)
result = px.prdx_groupby(
    "orders.prdx",
    ["product_id", "region"],
    {"amount": "sum", "qty": "count"}
)
result.show(20)

# Aggregate statistics without loading the file
total  = px.prdx_count("orders.prdx")
avg    = px.prdx_mean("orders.prdx",  "amount")
top    = px.prdx_max("orders.prdx",   "amount")
bottom = px.prdx_min("orders.prdx",   "amount")

In-memory chunked processing (when data is already loaded):

# GroupBy in configurable chunks — peak RAM = chunk_size × row_bytes
result = df.chunked_groupby(
    "product_id",
    {"amount": "sum", "qty": "count"},
    chunk_size=1_000_000
)

# External merge sort — handles datasets larger than RAM
sorted_df = df.external_sort("amount", ascending=False, chunk_size=1_000_000)

# Spill the DataFrame to disk, then restore it
df.spill_to_disk("/tmp/orders_spill")
df2 = px.DataFrame.spill_from_disk("/tmp/orders_spill")

# Current memory usage
print(f"RSS: {px.DataFrame.memory_usage() / 1e6:.1f} MB")

Approach | RAM used
read_sql() on 100M rows | ~2–4 GB (full result in memory)
write_sql_prdx() + prdx_groupby() | O(batch) write + O(groups) query
chunked_groupby() | O(chunk_size × row_bytes)
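
The O(groups) claim follows from how a streaming aggregation works: only one chunk of rows and one running total per group are held at a time. The sketch below shows the idea in plain Python. It is not PardoX's implementation, and the helper names and the generated rows are hypothetical.

```python
from collections import defaultdict
from itertools import islice


def chunks(iterable, size):
    """Yield lists of at most `size` items without materializing the input."""
    iterator = iter(iterable)
    while True:
        block = list(islice(iterator, size))
        if not block:
            return
        yield block


def chunked_sum_count(rows, key, value, chunk_size):
    """Running sum and count per group; state grows with groups, not rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for block in chunks(rows, chunk_size):  # only one block is resident at a time
        for row in block:
            totals[row[key]] += row[value]
            counts[row[key]] += 1
    return dict(totals), dict(counts)


def fake_orders(n):
    """Hypothetical row source standing in for a file or table scan."""
    for i in range(n):
        yield {"product_id": i % 3, "amount": float(i)}


totals, counts = chunked_sum_count(fake_orders(10), "product_id", "amount", chunk_size=4)
print(totals, counts)
```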

Encryption

import pardox as px

# Write encrypted PRDX
px.write_prdx_encrypted("secure.prdx", df, "my-secret-key")

# Read back
df = px.read_prdx_encrypted("secure.prdx", "my-secret-key")

Data Contracts

import json

contract = json.dumps({
    "source": "orders",
    "columns": {
        "price":    {"min": 0.0, "max": 10000.0},
        "status":   {"allowed_values": ["active", "pending", "closed"]},
        "customer": {"nullable": False},
    }
})

# Returns a new DataFrame with only conforming rows
clean = df.validate_contract(contract)
violations = df.contract_violation_count()
print(f"{violations} rows quarantined")
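
As an illustration of what such a contract expresses, here is a plain-Python check that splits rows into conforming and quarantined sets using the same rule names (`min`, `max`, `allowed_values`, `nullable`). It is a sketch, not PardoX's validator. In particular, letting a null value pass unless `nullable` is False is an assumption, and the function names are hypothetical.

```python
import json


def row_conforms(row, columns):
    """Return True when the row satisfies every column rule."""
    for name, rule in columns.items():
        value = row.get(name)
        if value is None:
            if rule.get("nullable") is False:
                return False
            continue  # assumption: nulls skip range and membership checks
        if "min" in rule and value < rule["min"]:
            return False
        if "max" in rule and value > rule["max"]:
            return False
        if "allowed_values" in rule and value not in rule["allowed_values"]:
            return False
    return True


def split_by_contract(rows, contract_json):
    """Partition rows into (conforming, quarantined)."""
    columns = json.loads(contract_json)["columns"]
    clean, quarantined = [], []
    for row in rows:
        (clean if row_conforms(row, columns) else quarantined).append(row)
    return clean, quarantined


contract = json.dumps({
    "source": "orders",
    "columns": {
        "price": {"min": 0.0, "max": 10000.0},
        "status": {"allowed_values": ["active", "pending", "closed"]},
        "customer": {"nullable": False},
    },
})
rows = [
    {"price": 19.99, "status": "active", "customer": "acme"},
    {"price": -5.0, "status": "active", "customer": "acme"},     # below min
    {"price": 10.0, "status": "archived", "customer": "acme"},   # value not allowed
    {"price": 10.0, "status": "closed", "customer": None},       # null not permitted
]
clean, quarantined = split_by_contract(rows, contract)
print(f"{len(clean)} conforming, {len(quarantined)} quarantined")
```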

Time Travel

import pardox as px

# Save a versioned snapshot
px.version_write(df, "/data/snapshots", "v1", timestamp=0)

# Restore a snapshot by label
df_v1 = px.version_read("/data/snapshots", "v1")

# List available versions
versions = px.version_list("/data/snapshots")

Linear Algebra

# Cosine similarity between two DataFrame columns
sim = df.cosine_sim("embeddings", df2, "embeddings")

# L2 normalization
normed = df.l2_normalize("features")

# Matrix multiplication
result = df.matmul("A", df2, "B")

# PCA — reduce to N components
pca_df = df.pca("features", n_components=3)
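
For reference, the two simplest operations above have short textbook definitions. The sketch below implements them in plain Python on lists of floats. It shows the math only. How PardoX lays out vectors inside a column is not described here, and the function names are illustrative.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def cosine_similarity(a, b):
    """Dot product divided by the product of the two Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(l2_normalize([3.0, 4.0]))
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
```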

Cloud Storage

import pardox as px

# Read CSV from S3, GCS, or Azure
df = px.read_cloud_csv(
    "s3://my-bucket/data.csv",
    schema={},
    config={},
    credentials={"access_key_id": "...", "secret_access_key": "..."}
)

REST Connector

import pardox as px

# Read from a REST endpoint directly into a DataFrame
df = px.read_rest("https://api.example.com/records", "GET", "{}")

🗄️ Database I/O

from pardox.io import (
    read_sql, execute_sql,                    # PostgreSQL
    read_mysql, execute_mysql,                # MySQL
    read_sqlserver, execute_sqlserver,        # SQL Server
    read_mongodb, execute_mongodb,            # MongoDB
)

# ── PostgreSQL ───────────────────────────────────────────────
PG = "postgresql://user:pass@localhost:5432/db"

df = read_sql(PG, "SELECT * FROM orders WHERE status = 'active'")
execute_sql(PG, "CREATE TABLE orders_archive (id BIGINT, amount FLOAT, region TEXT)")
rows = df.to_sql(PG, "orders_archive", mode="append")           # COPY FROM STDIN for > 10k
rows = df.to_sql(PG, "orders_archive", mode="upsert", conflict_cols=["id"])

# ── MySQL ────────────────────────────────────────────────────
MY = "mysql://user:pass@localhost:3306/db"

df = read_mysql(MY, "SELECT * FROM products WHERE active = 1")
execute_mysql(MY, "CREATE TABLE IF NOT EXISTS products_bak (id BIGINT, price DOUBLE)")
rows = df.to_mysql(MY, "products_bak", mode="append")
rows = df.to_mysql(MY, "products_bak", mode="upsert", conflict_cols=["id"])

# ── SQL Server ───────────────────────────────────────────────
MS = "Server=localhost,1433;Database=mydb;UID=sa;PWD=MyPwd;TrustServerCertificate=Yes"

df = read_sqlserver(MS, "SELECT TOP 5000 * FROM dbo.transactions")
rows = df.to_sqlserver(MS, "dbo.transactions_bak", mode="upsert", conflict_cols=["id"])

# ── MongoDB ──────────────────────────────────────────────────
MG = "mongodb://admin:pass@localhost:27017"

df = read_mongodb(MG, "mydb.orders")
rows = df.to_mongodb(MG, "mydb.orders_archive", mode="append")

Write modes:

Database | append | replace | upsert
PostgreSQL | INSERT (COPY for >10k) | | ON CONFLICT DO UPDATE
MySQL | INSERT 1k/stmt (LOAD DATA for >10k) | REPLACE INTO | ON DUPLICATE KEY UPDATE
SQL Server | INSERT 500/stmt | INSERT 500/stmt | MERGE INTO
MongoDB | insert_many 10k/batch | drop + insert_many |

Note on SQL Server passwords: Avoid using ! in SQL Server passwords. A known issue in the tiberius v0.12 Rust driver causes authentication failure when ! is present. Use only [A-Za-z0-9_\-@#$].
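
A password can be checked against that character set before building the connection string. This is a small standard-library sketch. The function name is hypothetical, and treating an empty password as unsafe is an assumption.

```python
import re

# Same character class as in the note above.
SAFE_PASSWORD = re.compile(r"[A-Za-z0-9_\-@#$]+")


def is_safe_sqlserver_password(password: str) -> bool:
    """True when every character is in the allowed set."""
    return SAFE_PASSWORD.fullmatch(password) is not None


print(is_safe_sqlserver_password("MyPwd_2024#"))
print(is_safe_sqlserver_password("MyPwd!"))
```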


📋 Full API Overview

Top-level functions

import pardox as px

df = px.read_csv("file.csv", schema={"price": "Float64"})
df = px.read_prdx("file.prdx")
df = px.from_arrow(arrow_table)            # zero-copy from PyArrow
df = px.scan_csv("file.csv").collect()     # lazy load
df = px.read_cloud_csv(url, schema, config, credentials)
df = px.read_rest(url, method, headers_json)
df = px.read_prdx_encrypted("file.prdx", "key")

rows = px.write_sql_prdx(path, conn, table, mode, conflict_cols, batch_rows)
px.write_prdx_encrypted("file.prdx", df, "key")

df = px.version_read(path, label)
labels = px.version_list(path)
px.version_write(df, path, label)

# Out-of-core / streaming PRDX
result  = px.prdx_groupby("file.prdx", ["col1"], {"col2": "sum"})
total   = px.prdx_count("file.prdx")
avg     = px.prdx_mean("file.prdx", "col")
maximum = px.prdx_max("file.prdx", "col")
minimum = px.prdx_min("file.prdx", "col")

DataFrame — Properties & Inspection

df.shape          # (rows, cols)
df.columns        # ['col1', 'col2', ...]
df.dtypes         # {'col1': 'Float64', ...}
df.show(10)       # ASCII table preview
df.head(5)        # → DataFrame
df.tail(5)        # → DataFrame
df.iloc(0, 100)   # → DataFrame (rows 0-99)

DataFrame — Arithmetic & Transform

df['revenue'] = df['price'] * df['quantity']   # Series operators
df.cast("col", "Float64")
df.fillna(0.0)
df.round(2)
df.mul("price", "quantity")       # → DataFrame with 'result_mul'
df.sub("revenue", "cost")         # → DataFrame with 'result_sub'
df.min_max_scale("price")         # → DataFrame with 'result_minmax'
df.std("price")                   # float
df.sort_values("price", ascending=True, gpu=False)

DataFrame — Out-of-Core

df.chunked_groupby("col", {"val": "sum"}, chunk_size=1_000_000)
df.external_sort("col", ascending=True, chunk_size=1_000_000)
df.spill_to_disk("/tmp/spill_path")
px.DataFrame.spill_from_disk("/tmp/spill_path")    # → DataFrame
px.DataFrame.memory_usage()                         # → bytes (RSS)
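
An external merge sort generally works in two phases: sort fixed-size chunks into runs, then merge the runs with a k-way merge. The sketch below shows that structure with the standard library. It is not PardoX's implementation, and for brevity the sorted runs are kept in lists, whereas a real external sort would write each run to disk.

```python
import heapq
from itertools import islice


def external_sort_sketch(values, chunk_size, ascending=True):
    """Sort by producing sorted runs of chunk_size items, then k-way merging them."""
    iterator = iter(values)
    runs = []
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            break
        # Phase 1: each run is sorted independently (on disk in a real system).
        runs.append(sorted(chunk, reverse=not ascending))
    # Phase 2: heapq.merge reads the runs lazily, one item from each at a time.
    return list(heapq.merge(*runs, reverse=not ascending))


print(external_sort_sketch([5, 1, 4, 2, 3, 9, 0], chunk_size=3))
```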

DataFrame — GroupBy & Aggregation

df.groupby("category", {"revenue": "sum", "price": "mean"})
df.groupby("state", {"quantity": "count", "revenue": "max"})

DataFrame — Window Functions

df.row_number("price")
df.rank("revenue", method="dense")
df.lag("price", 1)
df.lead("price", 1)
df.rolling_mean("price", 7)

DataFrame — String & Date

df.str_upper("col")
df.str_lower("col")
df.str_trim("col")
df.str_contains("col", "pattern")
df.str_replace("col", "old", "new")

df.date_extract("col", "year")
df.date_format("col", "%Y-%m-%d")
df.date_diff("end", "start")
df.date_add("col", 30, "day")

DataFrame — Filtering & Join

mask = df['price'].gt(100.0)
df_filtered = df.filter(mask)

result = df.join(df2, on="customer_id")
result = df.join(df2, left_on="cust_id", right_on="id")

Series — Aggregations

df['col'].sum()    # float
df['col'].mean()   # float
df['col'].min()    # float
df['col'].max()    # float
df['col'].std()    # float
df['col'].count()  # int

Observer

df.value_counts("col")   # dict[str, int]
df.unique("col")         # list
df.to_dict()             # list[dict]
df.to_json()             # str

Write

df.to_prdx("out.prdx")
df.to_csv("out.csv")
df.to_sql(conn, "table", mode="append", conflict_cols=[])
df.to_mysql(conn, "table", mode="upsert", conflict_cols=["id"])
df.to_sqlserver(conn, "dbo.table", mode="append")
df.to_mongodb(conn, "db.collection", mode="append")
px.write_sql_prdx("file.prdx", conn, "table", mode="append", conflict_cols=[], batch_rows=1_000_000)

NumPy Zero-Copy Bridge

import numpy as np

arr = np.array(df["price"])   # dtype: float64 — direct pointer into Rust buffer

# Compatible with Scikit-Learn out of the box
from sklearn.linear_model import LinearRegression
X = np.column_stack([np.array(df["price"]), np.array(df["quantity"])])
y = np.array(df["revenue"])
model = LinearRegression().fit(X, y)
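
What "zero-copy" means can be shown with the standard library alone: a view shares memory with its source, while a copy does not. Here an `array.array` stands in for an engine-owned column buffer. This illustrates the concept only and does not show how PardoX exposes its buffers to NumPy.

```python
from array import array

buffer = array("d", [1.0, 2.0, 3.0])  # stand-in for an engine-owned float64 column
view = memoryview(buffer)              # zero-copy: shares the same memory
copy = list(buffer)                    # independent copy of the values

buffer[0] = 99.0                       # write through the owner

print(view[0])   # the view sees the new value
print(copy[0])   # the copy still holds the old one
```

The practical consequence of a zero-copy bridge is that the source buffer must stay alive and unchanged for as long as the view is in use.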

📊 Benchmarks

Hardware: MacBook Pro M2, 16 GB RAM.

Operation | Pandas v2.x | PardoX v0.3.4 | Speedup
Read CSV (1 GB) | 4.2s | 0.8s | 5.2x
Column multiply | 0.15s | 0.02s | 7.5x
Fill NA | 0.30s | 0.04s | 7.5x
Read binary | 0.9s (Parquet) | 0.2s (.prdx) | 4.5x
PostgreSQL write 50k rows | ~18s (psycopg2) | ~0.6s (COPY) | ~30x
MySQL write 50k rows | ~22s (pymysql) | ~3s (batch INSERT) | ~7x
PRDX → PostgreSQL 150M rows | N/A | ~490s | 306k rows/s
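
The Speedup column appears to be the Pandas time divided by the PardoX time. A short check of the first four rows under that assumption:

```python
def speedup(baseline_seconds: float, engine_seconds: float) -> float:
    """Ratio of baseline time to engine time; values above 1 mean faster."""
    return baseline_seconds / engine_seconds


# (Pandas seconds, PardoX seconds) taken from the table above.
timings = {
    "Read CSV (1 GB)": (4.2, 0.8),
    "Column multiply": (0.15, 0.02),
    "Fill NA": (0.30, 0.04),
    "Read binary": (0.9, 0.2),
}
for name, (pandas_s, pardox_s) in timings.items():
    print(f"{name}: {speedup(pandas_s, pardox_s):.2f}x")
```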

🗺️ Roadmap

Version | Status | Highlights
v0.3.4 | ✅ Current | PRDX Streaming, GroupBy, Window, String/Date, Lazy, SQL over DF, Encryption, Data Contracts, Time Travel, Arrow Flight, Distributed Cluster, Linear Algebra, REST Connector, Cloud Storage, Out-of-Core Processing (29 features)

🌐 Platform Support

OS | Architecture | Status
Linux | x86_64 | ✅ Stable
Windows | x86_64 | ✅ Stable
macOS | ARM64 (M1/M2/M3) | ✅ Stable
macOS | x86_64 (Intel) | ✅ Stable

📘 Documentation

Full Documentation →


📄 License

MIT License — free for commercial and personal use.


by Alberto Cardenas
www.albertocardenas.com · www.pardox.io
