Enterprise-grade Headless ETL Engine with Interactive UI

⚡ PyQuery: The Main Character of Data Stacks 💫

ETL. EDA. ML. SQL. IDE.

🚩 Stop letting Pandas hold you back.
The single-threaded era is over.

PyQuery is a local-first data operating system that auto-heals broken CSVs, includes a native Code Editor, and processes 100GB+ files without breaking a sweat. ⚡

Feature Request · Report Bug


🎮 The Ecosystem (Choose Your Path)

We built a suite of tools so perfect it hurts.

Path | Vibe | Description | Link
CLI | 🏎️ Speedrun | The Headless Beast. Run data pipelines in your sleep. | CLI Manual
UI | 🎨 Creative | The Visual Studio. Drag, drop, analyze, visualize. | UI Guide
API | 📡 Backbone | The Server. Build your own apps on our engine. | API Docs
SDK | 🐍 Sorcery | The Python Library. For the code wizards. | SDK Guide

🧠 TL;DR (For the goldfish attention spans)

✨ New Drop: Headless Ghost Mode 👻

PyQuery now supports total Headless Automation. Run massive pipelines in CI/CD, schedule tasks, and bypass the UI entirely with the re-architected run command.

  1. Install it: pip install pyquery-polars (Don't be basic).
  2. Run it: pyquery ui (Visuals) or pyquery run (Speedrun/Headless).
  3. The Flex: It's a local-first, privacy-focused engine that eats Excel sheets and CSVs for breakfast using Rust.

⛩️ The Awakening (Lore)

Long ago, the Data World was mid. Analysts lived in fear of the MemoryError. They bowed before the single-threaded tyranny of the Old Gods (Pandas). They accepted their fate of freezing screens, crashing kernels, and waiting 4 hours for a simple groupby.

But I refused.

From the depths of the Rusty abyss, PyQuery has awakened. I am not just an ETL tool anymore. I am the entire war room. I am here to obliterate your bottlenecks and ratio your old benchmarks.

The Core Philosophy (Our Ninja Way) 🥷

  • Lazy Execution: Nothing computes until you say "Export". This optimizes memory and speed so your hardware doesn't scream.
  • Zero-Copy: Data is processed efficiently without redundant copies. We don't waste bits.
  • Strict & Clean: Enforces strict typing and argument validation. No ambiguous magic, just pure logic.
  • Automation First: While the UI is gorgeous, PyQuery is built to run alone in the dark.

Welcome to your Villain Arc. 👹


🧾 PyQuery vs. Power Query: The Roast

We don't usually punch down, but you handed us the gloves.

Feature | PyQuery (The Chad) | 🐢 Power Query (The Virgin)
Speed | Rust-Powered. Processes millions of rows before you blink. | Single-Threaded. Spends 20 mins saying "Loading Data..." just to crash.
Language | Python/SQL/Polars. The languages of gods. | M-Code. A language invented to punish humanity.
AI/ML | Built-in. Random Forests, Clustering, & Monte Carlo Sims. | Non-existent. You need a generic "AI Plugin" that costs extra.
Vibe | Dark Mode CLI & Streamlit. Cyberpunk aesthetic. | Corporate Grey. It sucks the soul out of your body.
Price | Free & Open Source. | Requires an Office 365 License (Subscription L).
Boot XP | Cinematic CLI with Themes & Logs | Static Spinner of Doom
Broken CSVs | Auto-healed at ingest | Crashes silently
One Bad File | Isolated & corrected | Pipeline dead
Headless | Full CLI Automation. Designed for CI/CD pipelines. | UI Dependent. Good luck automating that in a Linux shell.

🖥️ The Main Character CLI (The Experience)

This is not a command line. This is a startup ritual.

Every time PyQuery boots, it behaves like a data OS coming online.

⚡ Adaptive Theme Engine

The CLI dynamically switches color gradients, borders, and mood based on your selected boot mode. Each theme announces itself during startup. You feel it before you run anything.

  • Cyberpunk: (Default) Neon main-character energy.
  • Rustacean: Pure Polars lore.
  • Matrix: Hacker-core, green text supremacy.
  • Villain Arc: Purple & gold. No mercy.

👻 Headless Revamp: The run Command

The CLI has been completely re-architected for Automation Supremacy. The run command is your primary entry point for headless operations.

# Basic Speedrun
pyquery run --source data.csv --output results.parquet

# Project Mode (Load the whole squad)
pyquery run --project daily_report.pyquery --output dist/

🛠️ Execution Modes:
  • Source Mode (--source): Quick ad-hoc processing of single files, SQL queries, or APIs.
  • Project Mode (--project): Load a predefined .pyquery project file containing multiple datasets and recipes.

Note: These flags are mutually exclusive. Choose your path.
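
A minimal sketch of how a pair of mutually exclusive flags like these can be enforced, using Python's argparse. This is not PyQuery's actual parser: build_parser is a hypothetical name, and making one of the two flags required is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical stand-in for the `pyquery run` argument parser."""
    parser = argparse.ArgumentParser(prog="pyquery run")
    # Assumption: exactly one of --source / --project must be supplied.
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("--source", help="single file, SQL query, or API")
    mode.add_argument("--project", help="path to a .pyquery project file")
    parser.add_argument("--output", required=True, help="output file or folder")
    return parser


args = build_parser().parse_args(
    ["--source", "data.csv", "--output", "results.parquet"]
)
# Decide which execution mode was requested.
mode = "project" if args.project else "source"
print(mode, args.output)
```

With this layout, passing both --source and --project makes argparse exit with a usage error before any pipeline work starts.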

📟 Sequential Boot Logs

Real-time kernel-style logs with cinematic pacing. It doesn’t say "loading"... It declares intent.

  • Timestamped steps.
  • Module icons (⚡ Engine, 💾 IO, 🧠 Planner).
  • Your terminal doesn’t just start PyQuery. It witnesses it.

🧩 Focused UI (Modal Upgrade)

Sidebars are for tourists. PyQuery loads data through dedicated modal dialogs—because loading data is a moment, not a side quest.

  • Blazing-Fast & Optimistic: The dialog opens instantly.
  • Lazy Preview: We scan 100k+ files without freezing the UI.
  • Recent Paths: We remember so you don't have to.
  • Preview Before Commit: See matched files and sheets before you import. You don't guess anymore; you confirm with intent.

💪 The Flex (Capabilities)

We built an empire so you can rule yours. This isn't just software; it's a lifestyle.

🎯 EDA: The Crystal Ball (Expanded)

"Most tools describe the past. PyQuery predicts the future."

EDA is no longer just "looking at data". It's hunting.

1. 🧬 Dataset DNA & Health Check

We scan your data's soul.

  • Missing Cells: We don't just count nulls; we judge them. (<1% is excellence, >10% is sloppy).
  • Cardinality Checks: Instantly know if a column is categorical or continuous.
  • Duplicate Detection: We find the clones and eliminate them.
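
The three checks above can be sketched in plain Python. This is an illustration only, not PyQuery's implementation: health_report is a hypothetical helper, rows are plain dicts instead of a Polars frame, and the "acceptable" label for the 1% to 10% band is an assumption (the README only names the two extremes).

```python
from collections import Counter


def health_report(rows: list[dict]) -> dict:
    """Summarize missing cells, per-column cardinality, and duplicate rows."""
    columns = list(rows[0]) if rows else []
    total_cells = len(rows) * len(columns)
    missing = sum(1 for r in rows for c in columns if r.get(c) is None)
    missing_pct = 100 * missing / total_cells if total_cells else 0.0

    # Thresholds from the README: <1% is excellent, >10% is sloppy.
    if missing_pct < 1:
        grade = "excellent"
    elif missing_pct > 10:
        grade = "sloppy"
    else:
        grade = "acceptable"  # assumed label for the middle band

    # Cardinality: number of distinct non-null values per column.
    cardinality = {
        c: len({r.get(c) for r in rows if r.get(c) is not None}) for c in columns
    }
    # Duplicates: every repeat of an already-seen row counts once.
    counts = Counter(tuple(r.get(c) for c in columns) for r in rows)
    duplicates = sum(n - 1 for n in counts.values())

    return {
        "missing_pct": missing_pct,
        "grade": grade,
        "cardinality": cardinality,
        "duplicates": duplicates,
    }


rows = [
    {"id": 1, "city": "Pune", "spend": 120.0},
    {"id": 2, "city": "Pune", "spend": None},
    {"id": 1, "city": "Pune", "spend": 120.0},
]
report = health_report(rows)
print(report)
```

A low cardinality relative to the row count (like city here) is the usual hint that a column is categorical.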

2. 🚀 The Action Engine (ML Strategist)

  • Strategic Brief: A "Top 3 Insights" card that ranks every signal in your data. It whispers: "The money is here."
  • Automated Drivers: It finds the hidden variables controlling your target.
    • "Why is Churn high? It's not Price. It's Customer Support Wait Time > 5m." -> Boom. Solved.
  • Correlation Matrix: Pearson, Cramer’s V, and F-Tests calculated automatically. We know the relationships better than you know your own situationship.

3. 🧪 ML Laboratory (The Brain)

  • Auto-Pilot Mode: Trains an army of models (Random Forest, Lasso, Ridge) to find the best fit. You sit back and look busy.
  • Clustering (Unsupervised Rizz): The cluster count is tuned with Elbow Plots and Silhouette Scores. We even name the segments for you ("Cluster 1 = High Spend, Low Age").
  • Explainable Anomalies: Uses Isolation Forests to catch the weirdos and fraudsters instantly, with a Contextual Profiler to tell you why they are weird.

4. 🎮 Decision Simulator (The Time Machine)

  • "What-If" Sliders: Change variables in real-time. "If I raise Price by 10% and lower ad spend, do I still profit?"
  • Monte Carlo Sims: Run 1,000+ simulations. We don't guess; we calculate the probability of your success.
  • Waterfall Analysis: The Model breaks down exactly why the prediction changed.
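
To make the Monte Carlo idea concrete, here is a small sketch of a what-if comparison. The demand model (a noisy baseline that drops as price rises) and every number in it are made up for illustration; simulate_profit is a hypothetical function, not part of PyQuery.

```python
import random


def simulate_profit(price: float, unit_cost: float, ad_spend: float,
                    runs: int = 1000, seed: int = 42) -> float:
    """Return the share of simulated scenarios that end in profit."""
    rng = random.Random(seed)  # seeded so repeated calls give the same answer
    profitable = 0
    for _ in range(runs):
        # Hypothetical demand model: mean falls with price, with Gaussian noise.
        demand = max(0.0, rng.gauss(1000 - 40 * price, 100))
        profit = demand * (price - unit_cost) - ad_spend
        if profit > 0:
            profitable += 1
    return profitable / runs


# Baseline scenario versus "raise price by 10% and lower ad spend".
baseline = simulate_profit(price=10.0, unit_cost=6.0, ad_spend=1500.0)
what_if = simulate_profit(price=11.0, unit_cost=6.0, ad_spend=1200.0)
print(f"baseline P(profit) = {baseline:.3f}")
print(f"what-if  P(profit) = {what_if:.3f}")
```

The output is a probability of success per scenario rather than a single point estimate, which is what the sliders compare.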

5. 📈 Time Series & Visuals That Don't Miss

  • Holt-Winters Forecasting: Predicting the future with confidence intervals.
  • Decomposition: Splitting data into Trend, Seasonality, and Noise.
  • Cohort Comparison: Volcano Plots visualizing "Effect Size" vs "Significance." We bring the science.

💻 The Integrated IDE (Code is Power)

For those who speak the language of the gods (Python/SQL), we built a React-based Code Editor right inside the UI.

  • Embedded Ace Editor: Syntax highlighting, line numbers, and active line focus. Feels like VS Code, lives in your browser.
  • Intelligent Auto-Completions: Context-aware suggestions for pl, np, math. Type col and get col("name"). It knows your schema.
  • Sandboxed Custom Scripts:
    • AST-Validated Security: We parse your code before execution.
    • Blocked: import os, private attributes, system calls.
    • Allowed: numpy, scipy, sklearn. Pure math and logic only.
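
A rough sketch of what AST-based validation can look like, using Python's ast module. It is an illustration of the technique, not PyQuery's validator: validate_script and ALLOWED_MODULES are hypothetical names, including math in the allow-list is an assumption, and this sketch does not cover the "system calls" rule.

```python
import ast

# Assumption: an allow-list approach. The README names numpy, scipy, sklearn;
# math is added here only because the auto-completion section mentions it.
ALLOWED_MODULES = {"numpy", "scipy", "sklearn", "math"}


def validate_script(source: str) -> list[str]:
    """Parse the script without running it and list every rule violation."""
    problems = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]  # relative imports have no module name
        else:
            names = []
        for name in names:
            if name.split(".")[0] not in ALLOWED_MODULES:
                problems.append(f"import of '{name}' is not allowed")
        # Private and dunder attributes are a common sandbox escape route.
        if isinstance(node, ast.Attribute) and node.attr.startswith("_"):
            problems.append(f"private attribute '{node.attr}' is not allowed")
    return problems


print(validate_script("import numpy as np\nx = np.mean([1, 2, 3])"))
print(validate_script("import os\nos.system('ls')"))
print(validate_script("x = ().__class__"))
```

Because the script is only parsed, a rejected script never gets the chance to execute anything.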

🧪 SQL Lab: The Codex (God Mode)

For when the GUI is too easy and you want to flex raw SQL. This isn't SQLite. This is High-Performance Lazy SQL.

  • Zero-Lag Querying: Run SELECT * on a 50GB file? It pulls a preview instantly. The engine effectively cheats physics.
  • Cross-Dataset Joins: Join sales.csv with targets.xlsx using standard SQL.
  • Materialize: Execute complex queries, then save as a new dataset.

🧹 The Forge (Ruthless ETL)

Backend I/O that actually understands real-world data. Real data is cursed. We planned for that.

  • 🧬 Advanced Auto-Encoding Healer:
    • Scans the first bytes of every CSV to detect its encoding, so a UnicodeDecodeError is fixed automatically instead of killing the load.
    • Stream-Based Healing: Processes multi-GB files in 4MB chunks. Memory usage stays flat.
    • Sanitization: Strips Null Bytes, normalizes newlines, and replaces garbage.
  • 🧩 Mixed-Encoding Folder Handling:
    • If a folder contains files with different encodings, PyQuery detects it and switches strategy automatically.
    • We isolate. We adapt. We continue.
  • 📂 Recursive Folder Globbing (Upgraded):
    • Patterns like data/**/*.csv work even when schemas differ slightly or headers are misaligned.
  • 🏗️ Staging Ground (Infrastructure Rizz):
    • Control your intermediate storage. If your %TEMP% partition is small, tell PyQuery where the real space is using the PYQUERY_STAGING_DIR environment variable.
    # Linux/Mac Power Move
    export PYQUERY_STAGING_DIR="/mnt/fast_ssd/pyquery_cache"
    pyquery run ...
    
  • 🔍 Advanced File Filtering (Precision Strikes):
    • Multiple Filter Types: Glob, Regex, Contains, Not Contains, Exact, Is Not.
    • Stackable Logic: Must contain sales + Must NOT contain backup + Must match regex \d{4}.
    • This is surgical file selection. No more loading junk and cleaning later.
  • 📊 Excel Handling That Respects Your Sanity:
    • Multi-Sheet Selection: Load one sheet, many sheets, or only the ones that matter.
    • Template-Based Mapping: Pick a base file, preview its sheets, and apply that selection across all matching files.
    • Sheet Name Filtering: Regex-powered selection like Q[1-4]_Data.
  • ✨ Source Awareness & Cleanliness:
    • Metadata Injection: Automatically add __source_path__ and __source_name__.
    • Auto Type Inference: Samples data, infers dtypes, and instantly appends a Clean & Cast step.
  • ✨ Auto-Typecast: One click scans rows and forcibly converts Strings to Int, Float, or Date.
  • 🎭 PII Incinerator: Detects and obfuscates credit cards and SSNs. Secrets remain secret.
  • 🩹 Smart Impute: Fill the voids. Forward fill, backward fill, median, or specific value injection. No null survives.
  • 💥 Explode & Coalesce: Flatten lists and merge columns like a boss.
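
The Auto-Encoding Healer idea (sniff the first bytes, then decode in fixed-size chunks while sanitizing) can be sketched with the standard library. This is a guess at the shape of the technique, not PyQuery's code: sniff_encoding and heal_stream are hypothetical names, and falling back to cp1252 when the head is not valid UTF-8 is an assumption.

```python
import codecs
import io
from typing import BinaryIO, TextIO

CHUNK_SIZE = 4 * 1024 * 1024  # 4MB, the chunk size the README mentions

# Byte-order marks; the UTF-32 marks come first because they start with
# the same bytes as the UTF-16 ones.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_encoding(head: bytes) -> str:
    """Guess the encoding from the first bytes of a file."""
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    try:
        # final=False tolerates a multi-byte character cut off at the end.
        codecs.getincrementaldecoder("utf-8")().decode(head, final=False)
        return "utf-8"
    except UnicodeDecodeError:
        return "cp1252"  # assumed fallback for legacy Windows exports


def heal_stream(src: BinaryIO, dst: TextIO, chunk_size: int = CHUNK_SIZE) -> str:
    """Decode src chunk by chunk into dst; memory use is bounded by chunk_size."""
    chunk = src.read(chunk_size)
    encoding = sniff_encoding(chunk)
    decoder = codecs.getincrementaldecoder(encoding)(errors="replace")
    carry = ""
    while True:
        text = carry + decoder.decode(chunk, final=not chunk)
        carry = ""
        if chunk and text.endswith("\r"):
            # Hold back a trailing \r so a \r\n split across chunks stays intact.
            text, carry = text[:-1], "\r"
        # Strip null bytes and normalize newlines.
        dst.write(text.replace("\x00", "").replace("\r\n", "\n").replace("\r", "\n"))
        if not chunk:
            return encoding
        chunk = src.read(chunk_size)


raw = io.BytesIO(b"name,city\r\nJos\xe9,K\x00\xf6ln\r\n")
healed = io.StringIO()
detected = heal_stream(raw, healed)
print(detected, repr(healed.getvalue()))
```

One limit of sniffing only the head: a file that is clean ASCII at the start and switches encoding later will be misdetected, which is where a per-file fallback strategy earns its keep.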

🧠 The Tech Stack (Forbidden Knowledge) 🐐

This isn't just a library. It's a weapon system.

1. 🌊 The "Infinite Stream" Glitch (Lazy Execution)

The Old Gods (Pandas) are Eager. They try to swallow the ocean (RAM) whole. They choke. PyQuery is Lazy. It waits. It plans.

  • Scan: "It's a 100GB file. Interesting."
  • Plan: Filters, joins, math. Nothing executes until the final blow.
  • Stream: Data flows in chunks. Process. Write. Destroy.
  • Result: Processing 100GB on a MacBook Air. The laws of physics are optional.
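
The Scan, Plan, Stream loop can be imitated with plain Python generators. PyQuery delegates this to Polars, which also optimizes the plan; the sketch below only shows the control flow. LazyPipeline and its methods are hypothetical names, not the SDK's API.

```python
import csv
import io
from typing import Callable, Iterable, Iterator


class LazyPipeline:
    """Records steps; no row is read until export() is called."""

    def __init__(self, source: Iterable[dict]):
        self._source = source
        self._steps: list[Callable[[Iterator[dict]], Iterator[dict]]] = []

    def filter(self, predicate: Callable[[dict], bool]) -> "LazyPipeline":
        # Plan only: wrap the step in a generator factory.
        self._steps.append(lambda rows: (r for r in rows if predicate(r)))
        return self

    def with_column(self, name: str, fn: Callable[[dict], object]) -> "LazyPipeline":
        self._steps.append(lambda rows: ({**r, name: fn(r)} for r in rows))
        return self

    def export(self, out) -> int:
        """Stream rows through every step and write them out one at a time."""
        rows = iter(self._source)
        for step in self._steps:
            rows = step(rows)
        writer, written = None, 0
        for row in rows:
            if writer is None:
                writer = csv.DictWriter(out, fieldnames=list(row))
                writer.writeheader()
            writer.writerow(row)
            written += 1
        return written


# csv.DictReader is itself lazy, so only one row is in memory at a time.
source = csv.DictReader(io.StringIO("region,amount\nEU,10\nUS,25\nEU,40\n"))
pipeline = (
    LazyPipeline(source)
    .filter(lambda r: r["region"] == "EU")
    .with_column("amount_x2", lambda r: int(r["amount"]) * 2)
)
out = io.StringIO()
written = pipeline.export(out)
print(written)
print(out.getvalue())
```

Building the pipeline costs nothing; all the work happens inside export(), row by row.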

2. ⚙️ File-Level Execution Control

Most engines think in datasets. PyQuery thinks in files.

  • Individual File Processing: Forces the engine to load files one-by-one instead of bulk scanning.
  • Why it matters: One corrupted CSV no longer nukes the entire pipeline. We fix schemas and clean data before concatenation. This is how PyQuery survives enterprise-grade mess.
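
A minimal sketch of per-file isolation: each file is loaded on its own, and a failure is recorded instead of raised. load_one and load_folder are hypothetical helpers, and unlike PyQuery this sketch only isolates bad files; it does not try to correct them.

```python
import csv
from pathlib import Path


def load_one(path: Path, columns: list[str]) -> list[dict]:
    """Load one CSV and project it onto the expected columns."""
    with path.open(newline="", encoding="utf-8") as fh:
        reader = csv.DictReader(fh)
        missing = set(columns) - set(reader.fieldnames or [])
        if missing:
            raise ValueError(f"missing columns: {sorted(missing)}")
        return [{c: row[c] for c in columns} for row in reader]


def load_folder(paths: list[Path], columns: list[str]) -> tuple[list[dict], dict]:
    """Load files one by one; a bad file is reported, not fatal."""
    combined: list[dict] = []
    failures: dict[str, str] = {}
    for path in paths:
        try:
            combined.extend(load_one(path, columns))
        except (OSError, ValueError, csv.Error) as exc:
            # UnicodeDecodeError is a ValueError, so bad encodings land here too.
            failures[path.name] = str(exc)
    return combined, failures
```

Calling load_folder(sorted(Path("data").glob("*.csv")), ["id", "amount"]) would return the rows from every readable file plus a dict naming the files that were skipped and why.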

3. 🚀 Streaming I/O Architecture

We rewired the backend for scale.

  • True Streaming Discovery: Uses generators and lazy iteration. Point at 100k files without crashing.
  • Partial Globbing: Simple text filters convert to filesystem-level globs. Python never even sees irrelevant files.
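
Generator-based discovery with stackable filters might look like the sketch below. walk_files, select, and preview are hypothetical names. Note one difference from the description above: this sketch applies the filters in Python, whereas PyQuery reportedly pushes simple text filters down into filesystem-level globs.

```python
import os
import re
from itertools import islice
from typing import Iterable, Iterator, Optional


def walk_files(root: str) -> Iterator[str]:
    """Yield file paths one at a time; no full listing is held in memory."""
    with os.scandir(root) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                yield from walk_files(entry.path)
            elif entry.is_file():
                yield entry.path


def select(paths: Iterable[str], contains: Optional[str] = None,
           not_contains: Optional[str] = None,
           regex: Optional[str] = None) -> Iterator[str]:
    """Stack filters on the file name: all given conditions must hold."""
    compiled = re.compile(regex) if regex else None
    for path in paths:
        name = os.path.basename(path)
        if contains and contains not in name:
            continue
        if not_contains and not_contains in name:
            continue
        if compiled and not compiled.search(name):
            continue
        yield path


def preview(paths: Iterable[str], limit: int = 10) -> list[str]:
    """Take only the first few matches, leaving the rest of the tree unvisited."""
    return list(islice(paths, limit))
```

For example, preview(select(walk_files("data"), contains="sales", not_contains="backup", regex=r"\d{4}")) mirrors the stacked filter from The Forge section and stops scanning after ten hits.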

4. 🛡️ Type Safety (Absolute Order)

Python is dynamic (chaotic). PyQuery imposes Order.

  • Every step is backed by a Pydantic Model.
  • If a String tries to infiltrate a Float column, it is terminated before execution.
  • No runtime surprises. Only calculated victories.
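
The validate-before-execute idea can be shown with a standard-library dataclass standing in for a Pydantic model. ClipParams and its fields are hypothetical; PyQuery's real parameter models are not shown in this README.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClipParams:
    """Hypothetical parameters for a 'clip' step (stdlib stand-in for Pydantic)."""

    column: str
    lower: float
    upper: float

    def __post_init__(self) -> None:
        # Validation runs at construction time, before any data is touched.
        if not isinstance(self.column, str) or not self.column:
            raise TypeError("column must be a non-empty string")
        for name in ("lower", "upper"):
            value = getattr(self, name)
            # bool is a subclass of int, so it is rejected explicitly.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                raise TypeError(f"{name} must be a number, got {type(value).__name__}")
        if self.lower > self.upper:
            raise ValueError("lower must not exceed upper")


ok = ClipParams(column="price", lower=0, upper=100)
print(ok)

try:
    ClipParams(column="price", lower="0", upper=100)  # a string sneaking in
except TypeError as exc:
    error_message = str(exc)
    print("rejected:", error_message)
```

A real Pydantic model would add coercion and richer error reports, but the principle is the same: a bad parameter fails at construction, not halfway through a run.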

🧾 The Receipts (Benchmarks)

We don't post without proof. We mog the competition.

Metric | 🐼 Pandas (Legacy) | ⚡ PyQuery (Polars) | The Diff
Load 10GB CSV | MemoryError (Crash) 💥 | 0.2s (Lazy Scan) ⚡ | Infinite
Filter Rows | 15.4s (Slow) | 0.5s (Parallel) | 30x Faster
Group By | 45s (Painful) | 2.1s (Instant) | 20x Faster
RAM Usage | 12GB+ (Bloated) | 500MB (Lean) | 95% Less

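Benchmarks run on a standard dev laptop; the exact hardware and dataset are not listed here, so treat the numbers as indicative. The 0.2s load figure is a lazy scan, which plans the read instead of pulling the whole file into memory. Results may vary but the vibe remains consistent.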


🎮 Choose Your Fighter (4 Paths to Power)

We don't limit you. Dominate however you choose.

📦 Installation

pip install pyquery-polars

1. 🌊 The GUI (God Mode)

For when you want to click things, see pretty charts, and feel like a data scientist in a sci-fi movie.

  • Visual Recipe Builder: Nodes and edges of pure logic.
  • Native File Picker: Access the local filesystem directly.

pyquery ui
# Launches the Web App on localhost:8501 🚀

2. 🤖 The API (Headless Beast)

Building a machine? Run PyQuery as the engine.

  • Swagger Docs: Auto-generated at /docs.
  • Async: Fire-and-forget jobs via POST /recipes/run.

pyquery api
# Serving high-performance ETL over HTTP at localhost:8000 📡
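
A hedged sketch of calling the endpoint above from Python with only the standard library. The route and port come from this README; the JSON body is a placeholder, since the real request schema is not documented here (check the Swagger page at /docs). build_run_request is a hypothetical helper.

```python
import json
import urllib.request


def build_run_request(recipe: dict,
                      base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Prepare a POST to /recipes/run without sending it."""
    body = json.dumps(recipe).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/recipes/run",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder payload: the real schema is an unknown, see /docs on a running server.
request = build_run_request({"steps": []})
print(request.get_method(), request.full_url)

# To actually submit the job (requires `pyquery api` to be running):
# with urllib.request.urlopen(request) as response:
#     print(response.status, response.read())
```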

3. ⚡ The Batch Runner (Speedrun)

For automation. No interface. Just speed.

pyquery run -s input.csv -r recipe.json -o output.parquet
# Task complete. ⚡
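
If the batch runner is scheduled from Python rather than a shell, one way to wrap it is shown below. The flags are the ones in the command above; build_command and run_batch are hypothetical helpers, and treating a non-zero exit code as failure is an assumption about the CLI.

```python
import shutil
import subprocess


def build_command(source: str, recipe: str, output: str) -> list[str]:
    """Assemble the same command line as the shell example above."""
    return ["pyquery", "run", "-s", source, "-r", recipe, "-o", output]


def run_batch(source: str, recipe: str, output: str) -> int:
    """Run the pipeline headlessly and return the process exit code."""
    if shutil.which("pyquery") is None:
        raise FileNotFoundError(
            "pyquery is not on PATH; install it with: pip install pyquery-polars"
        )
    # Assumption: a non-zero return code signals a failed run.
    completed = subprocess.run(build_command(source, recipe, output), check=False)
    return completed.returncode


print(build_command("input.csv", "recipe.json", "output.parquet"))
```

Passing the arguments as a list avoids shell quoting problems when file names contain spaces.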

4. 🧙‍♂️ The Sorcerer (Python SDK)

For the developers who want to weave PyQuery into their own code.

from pyquery_polars.backend.engine import PyQueryEngine
# Full programmatic control over the recipe engine.
# You are the architect now.

🧰 The Loadout (Arsenal)

Packed with every tool needed to clear the map.

Category | The Tools | Why it slaps
Cleaning | Fill Nulls, Mask PII, Smart Extract, Regex | Turns garbage data into gold. ✨
Analytics | Rolling Agg, Time Bin, Rank, Diff, Z-Score | High-frequency trading vibes. 📈
Combining | Smart Join, Concat, Pivot, Unpivot | Merge datasets without the headache. 🤝
Math | Log, Exp, Clip, Date Offset | For the scientific girlies. 👩‍🔬
Text | Slice, Case, Replace, One-Hot | String manipulation on steroids. 💪
I/O | CSV, Parquet, Excel, JSON, IPC | Speaks every language. 🗣️
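
As an example of what a tool like Mask PII has to do, here is a small regex-based sketch for SSNs and credit card numbers. It is not PyQuery's detector: mask_pii, the patterns, the mask format, and the Luhn checksum filter for card numbers are all choices made for this illustration.

```python
import re

SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_ok(digits: str) -> bool:
    """Luhn checksum: filters out digit runs that cannot be card numbers."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def mask_pii(text: str) -> str:
    """Mask SSNs fully and card numbers down to their last four digits."""

    def mask_card(match: re.Match) -> str:
        digits = re.sub(r"\D", "", match.group())
        if not luhn_ok(digits):
            return match.group()  # probably an order id, leave it alone
        return "*" * (len(digits) - 4) + digits[-4:]

    text = SSN_RE.sub("***-**-****", text)
    return CARD_RE.sub(mask_card, text)


masked = mask_pii("SSN 123-45-6789, card 4111 1111 1111 1111")
print(masked)
```

The checksum step matters on real data: long numeric ids are common, and masking them all would destroy join keys.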

🗺️ The Roadmap (Manifesting Destiny) 🔮

We aren't stopping here. We are aiming for the moon. 🚀

  • Phase 1: Native App Supremacy (Rust + Tauri): The browser has limits. The Native App will have none. GPU-accelerated plotting (10M points at 144Hz) and OLED black themes.
  • Phase 2: Big Data Devourer: Cloud connectors (S3, GCS, Azure). We drink their milkshakes.

🧑‍💻 Join the Cult (Developer Guide)

You want to contribute? Good. We need strong allies.

The Blooding (Adding a Transform) 🖐️

1. Backend Implementation:

  • Define Params: Create a Pydantic model (src/pyquery_polars/core/params.py).
  • Backend Logic: Write a pure polars function (src/pyquery_polars/backend/transforms/).
  • Register: Add step to register_all_steps() in registry.py.

2. Frontend Implementation:

  • Create a Renderer Function (src/pyquery_polars/frontend/steps/).
  • Register: Add step to register_frontend() in registry_init.py.

It appears in the CLI, API, and UI automatically. 🤯
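
The params, logic, register flow can be pictured with a toy registry. Everything below is a simplified stand-in: the real project uses Pydantic models, pure Polars functions, and register_all_steps(), while STEP_REGISTRY, register_step, run_step, and the uppercase step here are hypothetical names operating on plain dicts.

```python
from dataclasses import dataclass
from typing import Callable

# Maps a step name to its params class and its implementation.
STEP_REGISTRY: dict[str, tuple[type, Callable]] = {}


def register_step(name: str, params_cls: type) -> Callable:
    """Decorator that records a transform under a step name."""

    def decorator(fn: Callable) -> Callable:
        STEP_REGISTRY[name] = (params_cls, fn)
        return fn

    return decorator


@dataclass
class UppercaseParams:
    """1. Define Params (a Pydantic model in the real codebase)."""

    column: str


@register_step("uppercase", UppercaseParams)  # 3. Register
def uppercase(rows: list[dict], params: UppercaseParams) -> list[dict]:
    """2. Backend Logic (a pure Polars function in the real codebase)."""
    return [{**r, params.column: str(r[params.column]).upper()} for r in rows]


def run_step(name: str, rows: list[dict], **raw_params) -> list[dict]:
    """Look the step up by name, build its params, and apply it."""
    params_cls, fn = STEP_REGISTRY[name]
    return fn(rows, params_cls(**raw_params))


result = run_step("uppercase", [{"city": "pune"}], column="city")
print(result)
```

A single lookup table like this is what lets one registration surface in several front ends, since the CLI, API, and UI can all dispatch by step name.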

# Only certified ballers contribute code.
# Are you up for it?

📜 License

GPL-3.0. Open source forever. 💖


Made with ☕, 🦀 (Rust), and 💖 by Sudharshan TK
