
High-Performance DataFrame Engine powered by Rust (The PardoX Project)


PardoX — High-Performance DataFrame Engine

PyPI version · License: MIT · Python 3.8+ · Powered by Rust

The Speed of Rust. The Simplicity of Python.

PardoX is a next-generation DataFrame engine for high-performance ETL, data analysis, and database integration. A single Rust core powers the entire computation layer — Python is just the interface.

v0.3.2 is now available. It adds PRDX streaming to PostgreSQL (validated on 150M rows), GroupBy, string and date operations, window functions, a lazy pipeline, SQL over DataFrames, encryption, data contracts, time travel, Arrow Flight, a distributed cluster, linear algebra, a REST connector, and cloud storage: 29 features in total.


⚡ Why PardoX?

Capability | How
--- | ---
Zero-copy ingestion | Multi-threaded Rust CSV parser; data flows directly into HyperBlock buffers
SIMD arithmetic | AVX2 / NEON instructions, 5x–20x faster than Python loops
Native database I/O | Connect to PostgreSQL, MySQL, SQL Server, MongoDB; no psycopg2, no pymysql, no ORM
PRDX Streaming | Stream 150M-row files to PostgreSQL at ~306k rows/s with O(block) RAM
GPU sort | WebGPU Bitonic sort with transparent CPU fallback
GroupBy + Window | Aggregations, rolling, rank, lag/lead in pure Rust
NumPy bridge | Zero-copy np.array(df['col']): direct pointer into Rust buffer
Zero dependencies | Pure ctypes; no pip dependencies required
Cross-platform | Linux x64 · Windows x64 · macOS Intel · macOS Apple Silicon

📦 Installation

pip install pardox

No Rust compiler. No C extensions to build. No database drivers to install.

Requirements: Python 3.8+


🚀 Quick Start

import pardox as px
from pardox.io import execute_sql

# Load 100,000 rows — parallel Rust CSV parser
df = px.read_csv("sales_data.csv")
print(f"Loaded {df.shape[0]:,} rows × {df.shape[1]} columns")

# SIMD-accelerated arithmetic
df.cast("quantity", "Float64")
df['revenue'] = df['price'] * df['quantity']

# Statistics — pure Rust, no NumPy needed
print(f"Total revenue : ${df['revenue'].sum():,.2f}")
print(f"Avg ticket    : ${df['revenue'].mean():,.2f}")

# GroupBy aggregation
grouped = df.groupby("state", {"revenue": "sum", "quantity": "mean"})

# Write to PostgreSQL — COPY FROM STDIN auto-activated for > 10k rows
CONN = "postgresql://user:password@localhost:5432/mydb"
execute_sql(CONN, "CREATE TABLE IF NOT EXISTS sales (price FLOAT, quantity FLOAT, revenue FLOAT)")
rows = df.to_sql(CONN, "sales", mode="append")
print(f"Written {rows:,} rows to PostgreSQL")

# Save locally — 4.6 GB/s read throughput
df.to_prdx("sales_processed.prdx")

🗄️ What's New in v0.3.2

PRDX Streaming to PostgreSQL

Stream a .prdx file directly to PostgreSQL — without loading the file into RAM. O(block) memory regardless of file size.

from pardox import write_sql_prdx

rows = write_sql_prdx(
    "ventas_150m.prdx",                         # .prdx file
    "postgresql://user:pass@localhost:5432/db",  # connection string
    "ventas",                                   # table (must already exist)
    mode="append",
    conflict_cols=[],
    batch_rows=1_000_000
)
print(f"Streamed {rows:,} rows")
# Validated: 150,000,000 rows / 3.8 GB in ~490s at ~306k rows/s

Approach | RAM used
--- | ---
px.read_prdx() then df.to_sql() | Entire file (3.8 GB for 150M rows)
write_sql_prdx() | O(one block), typically < 200 MB
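
The O(block) memory figure follows from never holding more than one batch at a time. Here is a minimal sketch of that pattern, not the PardoX implementation: the standard-library sqlite3 module stands in for PostgreSQL, a generator stands in for the .prdx block reader, and `iter_blocks` and `stream_to_table` are hypothetical helper names.

```python
import sqlite3
from itertools import islice

def iter_blocks(rows, batch_rows):
    """Yield lists of at most batch_rows rows from any row iterator."""
    it = iter(rows)
    while True:
        block = list(islice(it, batch_rows))
        if not block:
            return
        yield block

def stream_to_table(conn, table, rows, batch_rows):
    """Insert rows block by block; only one block is in memory at a time."""
    written = 0
    for block in iter_blocks(rows, batch_rows):
        # Table name is interpolated for brevity; fine for a sketch, not for untrusted input.
        conn.executemany(f"INSERT INTO {table} VALUES (?, ?)", block)
        conn.commit()
        written += len(block)
    return written

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ventas (id INTEGER, price REAL)")

# A generator stands in for the on-disk file: rows are produced lazily.
source = ((i, i * 1.5) for i in range(10_000))
total = stream_to_table(conn, "ventas", source, batch_rows=1_000)
stored = conn.execute("SELECT COUNT(*) FROM ventas").fetchone()[0]
print(f"Streamed {total:,} rows")
```

Peak memory here depends on `batch_rows`, not on the total row count, which is the property the table above describes.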

GroupBy Aggregation

# Single aggregation
grouped = df.groupby("category", {"revenue": "sum"})

# Multiple aggregations
grouped = df.groupby("state", {
    "revenue":  "sum",
    "price":    "mean",
    "quantity": "count",
})
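
To make the aggregation dictionary concrete, the following pure-Python sketch computes what a `{column: aggregation}` spec means for a list of row dicts. It is an illustration of the semantics only; the `groupby` helper and the output column names are assumptions, and PardoX's own result layout may differ.

```python
from collections import defaultdict
from statistics import mean

# Aggregation names mapped to plain Python callables (assumed set).
AGGS = {"sum": sum, "mean": mean, "count": len, "min": min, "max": max}

def groupby(rows, key, spec):
    """Group dict rows by `key` and apply one aggregation per column in `spec`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    out = []
    for group_key, members in groups.items():
        record = {key: group_key}
        for col, agg in spec.items():
            record[col] = AGGS[agg]([m[col] for m in members])
        out.append(record)
    return out

rows = [
    {"state": "TX", "revenue": 10.0, "price": 5.0, "quantity": 2},
    {"state": "TX", "revenue": 30.0, "price": 15.0, "quantity": 2},
    {"state": "CA", "revenue": 8.0, "price": 4.0, "quantity": 2},
]
grouped = groupby(rows, "state", {"revenue": "sum", "price": "mean", "quantity": "count"})
for record in grouped:
    print(record)
```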

String & Date Operations

# String ops
df.str_upper("name")
df.str_lower("email")
df.str_trim("description")
df.str_contains("tag", "python")
df.str_replace("status", "old", "new")

# Date ops
df.date_extract("created_at", "year")   # → 'result_year'
df.date_format("created_at", "%Y-%m")
df.date_diff("end_date", "start_date")
df.date_add("created_at", 30, "day")
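
For reference, the same four date operations on a single value look like this with the standard-library datetime module. This is only a sketch of the intended semantics; in particular, the README does not state the unit returned by `date_diff`, so days are assumed here.

```python
from datetime import date, timedelta

created_at = date(2024, 1, 31)
end_date = date(2024, 3, 1)

year = created_at.year                       # like date_extract(..., "year")
month_label = created_at.strftime("%Y-%m")   # like date_format(..., "%Y-%m")
diff_days = (end_date - created_at).days     # like date_diff(end, start), unit assumed
plus_30 = created_at + timedelta(days=30)    # like date_add(..., 30, "day")

print(year, month_label, diff_days, plus_30.isoformat())
```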

Window Functions

df.row_number("price")
df.rank("revenue", method="dense")
df.lag("price", 1)
df.lead("price", 1)
df.rolling_mean("price", 7)    # 7-period moving average
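
The window functions above are easiest to reason about on a small list. The sketch below shows one plausible reading of each in pure Python; the handling of edge positions (None where no value exists) and the ascending order of the dense rank are assumptions, not documented PardoX behavior.

```python
def lag(values, n=1):
    """Value n positions earlier; None where there is no predecessor."""
    return [values[i - n] if i >= n else None for i in range(len(values))]

def lead(values, n=1):
    """Value n positions later; None where there is no successor."""
    return [values[i + n] if i + n < len(values) else None for i in range(len(values))]

def rolling_mean(values, window):
    """Mean of the trailing `window` values; None until the window is full."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            chunk = values[i + 1 - window : i + 1]
            out.append(sum(chunk) / window)
    return out

def dense_rank(values):
    """Ascending rank with no gaps after ties."""
    order = {v: r for r, v in enumerate(sorted(set(values)), start=1)}
    return [order[v] for v in values]

prices = [10.0, 20.0, 20.0, 40.0]
print(lag(prices), lead(prices), rolling_mean(prices, 2), dense_rank(prices))
```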

Lazy Pipeline

import pardox as px

# Scan without loading — filter and collect on demand
result = (
    px.scan_csv("large_file.csv")
    .select(["id", "price", "state"])
    .filter("price", ">", 100.0)
    .limit(10_000)
    .collect()
)
print(f"{result.shape[0]} rows")
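
The idea behind a lazy pipeline is that each step only describes work, and nothing is read until the final collect. A rough stdlib analogue using generators is sketched below; `scan_csv`, `select`, and `where_gt` here are hypothetical stand-ins written for this example, not the PardoX functions of the same name.

```python
import csv
import io
from itertools import islice

def scan_csv(fileobj):
    """Lazily yield one dict per CSV row; nothing is read until iterated."""
    yield from csv.DictReader(fileobj)

def select(rows, cols):
    """Keep only the requested columns."""
    return ({c: r[c] for c in cols} for r in rows)

def where_gt(rows, col, threshold):
    """Keep rows whose numeric column exceeds the threshold."""
    return (r for r in rows if float(r[col]) > threshold)

data = io.StringIO(
    "id,price,state,note\n"
    "1,50.0,TX,a\n"
    "2,150.0,CA,b\n"
    "3,300.0,TX,c\n"
    "4,900.0,NY,d\n"
)

# Building the pipeline reads no rows yet.
pipeline = islice(
    where_gt(select(scan_csv(data), ["id", "price", "state"]), "price", 100.0),
    2,  # limit
)
result = list(pipeline)  # "collect": rows are pulled through the chain here
print(f"{len(result)} rows")
```

Because the limit sits at the end of the chain, iteration stops as soon as two matching rows have been produced.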

SQL over DataFrames

# Run SQL directly on a DataFrame in memory
result = df.sql("SELECT state, SUM(revenue) as total FROM df GROUP BY state")

Encryption

import pardox as px

# Write encrypted PRDX
px.write_prdx_encrypted("secure.prdx", df, "my-secret-key")

# Read back
df = px.read_prdx_encrypted("secure.prdx", "my-secret-key")

Data Contracts

import json

contract = json.dumps({
    "source": "orders",
    "columns": {
        "price":    {"min": 0.0, "max": 10000.0},
        "status":   {"allowed_values": ["active", "pending", "closed"]},
        "customer": {"nullable": False},
    }
})

# Returns a new DataFrame with only conforming rows
clean = df.validate_contract(contract)
violations = df.contract_violation_count()
print(f"{violations} rows quarantined")
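
A contract like the one above can be read as a per-row predicate. The sketch below applies the three rule kinds from the example (`min`/`max`, `allowed_values`, `nullable`) to plain dicts; the `violates` helper is hypothetical, and treating a missing value as acceptable unless `nullable` is False is an assumption.

```python
import json

def violates(row, rules):
    """Return True if any column in `row` breaks its rule."""
    for col, rule in rules.items():
        value = row.get(col)
        if value is None:
            # Assumed: nulls are allowed unless the rule says nullable: False.
            if rule.get("nullable", True) is False:
                return True
            continue
        if "min" in rule and value < rule["min"]:
            return True
        if "max" in rule and value > rule["max"]:
            return True
        if "allowed_values" in rule and value not in rule["allowed_values"]:
            return True
    return False

contract = json.loads(json.dumps({
    "source": "orders",
    "columns": {
        "price":    {"min": 0.0, "max": 10000.0},
        "status":   {"allowed_values": ["active", "pending", "closed"]},
        "customer": {"nullable": False},
    }
}))

rows = [
    {"price": 10.0, "status": "active", "customer": "ann"},
    {"price": -1.0, "status": "active", "customer": "bob"},   # below min
    {"price": 5.0, "status": "unknown", "customer": "cat"},   # value not allowed
    {"price": 5.0, "status": "closed", "customer": None},     # null not allowed
]
clean = [r for r in rows if not violates(r, contract["columns"])]
violations = len(rows) - len(clean)
print(f"{violations} rows quarantined")
```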

Time Travel

import pardox as px

# Save a versioned snapshot
px.version_write(df, "/data/snapshots", "v1", timestamp=0)

# Restore a snapshot by label
df_v1 = px.version_read("/data/snapshots", "v1")

# List available versions
versions = px.version_list("/data/snapshots")

Linear Algebra

# Cosine similarity between two DataFrame columns
sim = df.cosine_sim("embeddings", df2, "embeddings")

# L2 normalization
normed = df.l2_normalize("features")

# Matrix multiplication
result = df.matmul("A", df2, "B")

# PCA — reduce to N components
pca_df = df.pca("features", n_components=3)
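
For readers who want the underlying formulas, here are the textbook definitions of three of these operations on plain Python lists. This is a reference sketch only; how PardoX lays out vectors inside a column is not specified in this README, so the list-based inputs are an assumption.

```python
import math

def l2_normalize(v):
    """Scale v to unit Euclidean length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def cosine_sim(a, b):
    """dot(a, b) / (|a| * |b|); both vectors must be non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def matmul(A, B):
    """Row-by-column product of two nested-list matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

unit = l2_normalize([3.0, 4.0])
parallel = cosine_sim([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_sim([1.0, 0.0], [0.0, 1.0])
product = matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(unit, parallel, orthogonal, product)
```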

Cloud Storage

import pardox as px

# Read CSV from S3, GCS, or Azure
df = px.read_cloud_csv(
    "s3://my-bucket/data.csv",
    schema={},
    config={},
    credentials={"access_key_id": "...", "secret_access_key": "..."}
)

REST Connector

import pardox as px

# Read from a REST endpoint directly into a DataFrame
df = px.read_rest("https://api.example.com/records", "GET", "{}")

🗄️ Database I/O

from pardox.io import (
    read_sql, execute_sql,                    # PostgreSQL
    read_mysql, execute_mysql,                # MySQL
    read_sqlserver, execute_sqlserver,        # SQL Server
    read_mongodb, execute_mongodb,            # MongoDB
)

# ── PostgreSQL ───────────────────────────────────────────────
PG = "postgresql://user:pass@localhost:5432/db"

df = read_sql(PG, "SELECT * FROM orders WHERE status = 'active'")
execute_sql(PG, "CREATE TABLE orders_archive (id BIGINT, amount FLOAT, region TEXT)")
rows = df.to_sql(PG, "orders_archive", mode="append")           # COPY FROM STDIN for > 10k
rows = df.to_sql(PG, "orders_archive", mode="upsert", conflict_cols=["id"])

# ── MySQL ────────────────────────────────────────────────────
MY = "mysql://user:pass@localhost:3306/db"

df = read_mysql(MY, "SELECT * FROM products WHERE active = 1")
execute_mysql(MY, "CREATE TABLE IF NOT EXISTS products_bak (id BIGINT, price DOUBLE)")
rows = df.to_mysql(MY, "products_bak", mode="append")
rows = df.to_mysql(MY, "products_bak", mode="upsert", conflict_cols=["id"])

# ── SQL Server ───────────────────────────────────────────────
MS = "Server=localhost,1433;Database=mydb;UID=sa;PWD=MyPwd;TrustServerCertificate=Yes"

df = read_sqlserver(MS, "SELECT TOP 5000 * FROM dbo.transactions")
rows = df.to_sqlserver(MS, "dbo.transactions_bak", mode="upsert", conflict_cols=["id"])

# ── MongoDB ──────────────────────────────────────────────────
MG = "mongodb://admin:pass@localhost:27017"

df = read_mongodb(MG, "mydb.orders")
rows = df.to_mongodb(MG, "mydb.orders_archive", mode="append")

Write modes:

Database | append | replace | upsert
--- | --- | --- | ---
PostgreSQL | INSERT (COPY for >10k) | | ON CONFLICT DO UPDATE
MySQL | INSERT 1k/stmt (LOAD DATA for >10k) | REPLACE INTO | ON DUPLICATE KEY UPDATE
SQL Server | INSERT 500/stmt | INSERT 500/stmt | MERGE INTO
MongoDB | insert_many 10k/batch | drop + insert_many |
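
To illustrate what upsert mode means at the SQL level, the sketch below runs an `ON CONFLICT ... DO UPDATE` statement against an in-memory SQLite database from the standard library. SQLite is used only because it needs no server (it requires SQLite 3.24 or newer); the exact statements PardoX generates for each database are not shown in this README, and the table and values are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders_archive (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany("INSERT INTO orders_archive VALUES (?, ?)", [(1, 10.0), (2, 20.0)])

# id 2 already exists and is updated; id 3 is new and is inserted.
incoming = [(2, 25.0), (3, 30.0)]
conn.executemany(
    "INSERT INTO orders_archive (id, amount) VALUES (?, ?) "
    "ON CONFLICT (id) DO UPDATE SET amount = excluded.amount",
    incoming,
)
result = conn.execute("SELECT id, amount FROM orders_archive ORDER BY id").fetchall()
print(result)
```

The conflict target `(id)` plays the role of `conflict_cols=["id"]` in the calls above.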

Note on SQL Server passwords: Avoid using ! in SQL Server passwords. A known issue in the tiberius v0.12 Rust driver causes authentication failure when ! is present. Use only [A-Za-z0-9_\-@#$]. Fix planned for v0.4.0.
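
Until that fix lands, a connection string can be checked up front. The following sketch validates a password against the character set listed in the note; `is_safe_sqlserver_password` is a hypothetical helper, not part of PardoX.

```python
import re

# Character set taken from the note above: letters, digits, and _ - @ # $
SAFE_PASSWORD = re.compile(r"[A-Za-z0-9_\-@#$]+")

def is_safe_sqlserver_password(password):
    """True if the password is non-empty and uses only the allowed characters."""
    return SAFE_PASSWORD.fullmatch(password) is not None

print(is_safe_sqlserver_password("MyPwd_2024#"))
print(is_safe_sqlserver_password("Secret!"))
```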


📋 Full API Overview

Top-level functions

import pardox as px

df = px.read_csv("file.csv", schema={"price": "Float64"})
df = px.read_prdx("file.prdx")
df = px.from_arrow(arrow_table)            # zero-copy from PyArrow
df = px.scan_csv("file.csv").collect()     # lazy load
df = px.read_cloud_csv(url, schema, config, credentials)
df = px.read_rest(url, method, headers_json)
df = px.read_prdx_encrypted("file.prdx", "key")

rows = px.write_sql_prdx(path, conn, table, mode, conflict_cols, batch_rows)
px.write_prdx_encrypted("file.prdx", df, "key")

df = px.version_read(path, label)
labels = px.version_list(path)
px.version_write(df, path, label)

DataFrame — Properties & Inspection

df.shape          # (rows, cols)
df.columns        # ['col1', 'col2', ...]
df.dtypes         # {'col1': 'Float64', ...}
df.show(10)       # ASCII table preview
df.head(5)        # → DataFrame
df.tail(5)        # → DataFrame
df.iloc(0, 100)   # → DataFrame (rows 0-99)

DataFrame — Arithmetic & Transform

df['revenue'] = df['price'] * df['quantity']   # Series operators
df.cast("col", "Float64")
df.fillna(0.0)
df.round(2)
df.mul("price", "quantity")       # → DataFrame with 'result_mul'
df.sub("revenue", "cost")         # → DataFrame with 'result_sub'
df.min_max_scale("price")         # → DataFrame with 'result_minmax'
df.std("price")                   # float
df.sort_values("price", ascending=True, gpu=False)

DataFrame — GroupBy & Aggregation

df.groupby("category", {"revenue": "sum", "price": "mean"})
df.groupby("state", {"quantity": "count", "revenue": "max"})

DataFrame — Window Functions

df.row_number("price")
df.rank("revenue", method="dense")
df.lag("price", 1)
df.lead("price", 1)
df.rolling_mean("price", 7)

DataFrame — String & Date

df.str_upper("col")
df.str_lower("col")
df.str_trim("col")
df.str_contains("col", "pattern")
df.str_replace("col", "old", "new")

df.date_extract("col", "year")
df.date_format("col", "%Y-%m-%d")
df.date_diff("end", "start")
df.date_add("col", 30, "day")

DataFrame — Filtering & Join

mask = df['price'].gt(100.0)
df_filtered = df.filter(mask)

result = df.join(df2, on="customer_id")
result = df.join(df2, left_on="cust_id", right_on="id")

Series — Aggregations

df['col'].sum()    # float
df['col'].mean()   # float
df['col'].min()    # float
df['col'].max()    # float
df['col'].std()    # float
df['col'].count()  # int

Observer

df.value_counts("col")   # dict[str, int]
df.unique("col")         # list
df.to_dict()             # list[dict]
df.to_json()             # str

Write

df.to_prdx("out.prdx")
df.to_csv("out.csv")
df.to_sql(conn, "table", mode="append", conflict_cols=[])
df.to_mysql(conn, "table", mode="upsert", conflict_cols=["id"])
df.to_sqlserver(conn, "dbo.table", mode="append")
df.to_mongodb(conn, "db.collection", mode="append")
px.write_sql_prdx("file.prdx", conn, "table", mode="append", conflict_cols=[], batch_rows=1_000_000)

NumPy Zero-Copy Bridge

import numpy as np

arr = np.array(df["price"])   # dtype: float64 — direct pointer into Rust buffer

# Compatible with Scikit-Learn out of the box
from sklearn.linear_model import LinearRegression
X = np.column_stack([np.array(df["price"]), np.array(df["quantity"])])
y = np.array(df["revenue"])
model = LinearRegression().fit(X, y)
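
"Zero-copy" means the consumer sees the producer's memory rather than a duplicate of it. The stdlib sketch below shows the difference between a view and a copy using `array` and `memoryview`; it illustrates the concept only and makes no claim about how PardoX exposes its Rust buffers.

```python
from array import array

buffer = array("d", [1.0, 2.0, 3.0])  # owns the memory, like an engine-side column
view = memoryview(buffer)             # zero-copy: shares the same bytes
copy = list(buffer)                   # an ordinary copy, for contrast

buffer[0] = 99.0                      # mutate the owner

print(view[0])  # the view reflects the change
print(copy[0])  # the copy does not
```

The practical consequence is that a zero-copy view costs no extra memory, but it is only valid while the owning buffer is alive and unchanged in size.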

📊 Benchmarks

Hardware: MacBook Pro M2, 16 GB RAM.

Operation | Pandas v2.x | PardoX v0.3.2 | Speedup
--- | --- | --- | ---
Read CSV (1 GB) | 4.2s | 0.8s | 5.2x
Column multiply | 0.15s | 0.02s | 7.5x
Fill NA | 0.30s | 0.04s | 7.5x
Read binary | 0.9s (Parquet) | 0.2s (.prdx) | 4.5x
PostgreSQL write 50k rows | ~18s (psycopg2) | ~0.6s (COPY) | ~30x
MySQL write 50k rows | ~22s (pymysql) | ~3s (batch INSERT) | ~7x
PRDX → PostgreSQL 150M rows | N/A | ~490s | 306k rows/s
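
The speedup column is simply the baseline time divided by the PardoX time, and the streaming throughput is rows divided by seconds. The short script below recomputes both from the timings in the table, so the figures can be checked or extended with new measurements.

```python
# (baseline seconds, PardoX seconds) copied from the benchmark table
timings = {
    "Read CSV (1 GB)": (4.2, 0.8),
    "Column multiply": (0.15, 0.02),
    "Fill NA": (0.30, 0.04),
    "Read binary": (0.9, 0.2),
    "PostgreSQL write 50k rows": (18.0, 0.6),
    "MySQL write 50k rows": (22.0, 3.0),
}

speedups = {op: base / pardox for op, (base, pardox) in timings.items()}
for op, ratio in speedups.items():
    print(f"{op:<28} {ratio:5.2f}x")

# Streaming throughput: 150M rows in roughly 490 seconds
rows_per_second = 150_000_000 / 490
print(f"PRDX streaming: {rows_per_second:,.0f} rows/s")
```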

🗺️ Roadmap

Version | Status | Highlights
--- | --- | ---
v0.1 | ✅ Released | CSV, arithmetic, aggregations, .prdx format
v0.3.1 | ✅ Released | Databases (PG/MySQL/MSSQL/MongoDB), Observer, Math, GPU sort, NumPy bridge
v0.3.2 | ✅ Released | PRDX Streaming, GroupBy, Window, String/Date, Lazy, SQL over DF, Encryption, Data Contracts, Time Travel, Arrow Flight, Distributed Cluster, Linear Algebra, REST Connector, Cloud Storage (29 features)
v0.4.0 | 🔜 Planned | SQL Server ! password fix, structured error codes, Apache Parquet, Kafka, S3

🌐 Platform Support

OS | Architecture | Status
--- | --- | ---
Linux | x86_64 | ✅ Stable
Windows | x86_64 | ✅ Stable
macOS | ARM64 (M1/M2/M3) | ✅ Stable
macOS | x86_64 (Intel) | ✅ Stable

📘 Documentation

Full Documentation →


📄 License

MIT License — free for commercial and personal use.


by Alberto Cardenas
www.albertocardenas.com · www.pardox.io
