Enterprise-grade Headless ETL Engine with Interactive UI
⚡ PyQuery: The Main Character of Data Stacks 💫
ETL. EDA. ML. SQL. IDE.
🚩 Stop letting Pandas hold you back.
The single-threaded era is over.
PyQuery is a local-first data operating system that auto-heals broken CSVs, includes a native Code Editor, and processes 100GB+ files without breaking a sweat. ⚡
🎮 The Ecosystem (Choose Your Path)
We built a suite of tools so perfect it hurts.
| Path | Vibe | Description | Link |
|---|---|---|---|
| CLI | 🏎️ Speedrun | The Headless Beast. Run data pipelines in your sleep. | CLI Manual |
| UI | 🎨 Creative | The Visual Studio. Drag, drop, analyze, visualize. | UI Guide |
| API | 📡 Backbone | The Server. Build your own apps on our engine. | API Docs |
| SDK | 🐍 Sorcery | The Python Library. For the code wizards. | SDK Guide |
🧠 TL;DR (For the goldfish attention spans)
✨ New Drop: Headless Ghost Mode 👻 PyQuery now supports total Headless Automation. Run massive pipelines in CI/CD, schedule tasks, and bypass the UI entirely with the re-architected `run` command.
- Install it: `pip install pyquery-polars` (Don't be basic).
- Run it: `pyquery ui` (Visuals) or `pyquery run` (Speedrun/Headless).
- The Flex: It's a local-first, privacy-focused engine that eats Excel sheets and CSVs for breakfast using Rust.
⛩️ The Awakening (Lore)
Long ago, the Data World was mid. Analysts lived in fear of the MemoryError. They bowed before the single-threaded tyranny of the Old Gods (Pandas). They accepted their fate of freezing screens, crashing kernels, and waiting 4 hours for a simple groupby.
But I refused.
From the depths of the Rusty abyss, PyQuery has awakened. I am not just an ETL tool anymore. I am the entire war room. I am here to obliterate your bottlenecks and ratio your old benchmarks.
The Core Philosophy (Our Ninja Way) 🥷
- Lazy Execution: Nothing computes until you say "Export". This optimizes memory and speed so your hardware doesn't scream.
- Zero-Copy: Data is processed efficiently without redundant copies. We don't waste bits.
- Strict & Clean: Enforces strict typing and argument validation. No ambiguous magic, just pure logic.
- Automation First: While the UI is gorgeous, PyQuery is built to run alone in the dark.
Welcome to your Villain Arc. 👹
🧾 PyQuery vs. Power Query: The Roast
We don't usually punch down, but you handed us the gloves.
| Feature | ⚡ PyQuery (The Chad) | 🐢 Power Query (The Virgin) |
|---|---|---|
| Speed | Rust-Powered. Processes millions of rows before you blink. | Single-Threaded. Spends 20 mins saying "Loading Data..." just to crash. |
| Language | Python/SQL/Polars. The languages of gods. | M-Code. A language invented to punish humanity. |
| AI/ML | Built-in. Random Forests, Clustering, & Monte Carlo Sims. | Non-existent. You need a generic "AI Plugin" that costs extra. |
| Vibe | Dark Mode CLI & Streamlit. Cyberpunk aesthetic. | Corporate Grey. It sucks the soul out of your body. |
| Price | Free & Open Source. | Requires an Office 365 License (Subscription L). |
| Boot XP | Cinematic CLI with Themes & Logs | Static Spinner of Doom |
| Broken CSVs | Auto-healed at ingest | Crashes silently |
| One Bad File | Isolated & corrected | Pipeline dead |
| Headless | Full CLI Automation. Designed for CI/CD pipelines. | UI Dependent. Good luck automating that in a Linux shell. |
🖥️ The Main Character CLI (The Experience)
This is not a command line. This is a startup ritual.
Every time PyQuery boots, it behaves like a data OS coming online.
⚡ Adaptive Theme Engine
The CLI dynamically switches color gradients, borders, and mood based on your selected boot mode. Each theme announces itself during startup. You feel it before you run anything.
- Cyberpunk: (Default) Neon main-character energy.
- Rustacean: Pure Polars lore.
- Matrix: Hacker-core, green text supremacy.
- Villain Arc: Purple & gold. No mercy.
👻 Headless Revamp: The run Command
The CLI has been completely re-architected for Automation Supremacy. The run command is your primary entry point for headless operations.
# Basic Speedrun
pyquery run --source data.csv --output results.parquet
# Project Mode (Load the whole squad)
pyquery run --project daily_report.pyquery --output dist/
🛠️ Execution Modes:
- Source Mode (--source): Quick ad-hoc processing of single files, SQL queries, or APIs.
- Project Mode (--project): Load a predefined .pyquery project file containing multiple datasets and recipes.
Note: These flags are mutually exclusive. Choose your path.
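The either/or rule can be pictured with a tiny `argparse` sketch. This is not PyQuery's actual parser: only the `--source`, `--project`, and `--output` flag names come from the examples above, while `build_parser`, the `run-sketch` program name, and the choice to make the flags required are assumptions for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical stand-in for a run-style parser (not PyQuery's real one)."""
    parser = argparse.ArgumentParser(prog="run-sketch")
    # Exactly one of the two modes must be chosen (assumed to be required).
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("--source", help="single input for ad-hoc processing")
    mode.add_argument("--project", help="project file with datasets and recipes")
    parser.add_argument("--output", required=True, help="output file or directory")
    return parser


args = build_parser().parse_args(["--source", "data.csv", "--output", "results.parquet"])
print(args.source, args.project, args.output)
```

Passing both `--source` and `--project` to this sketch makes `argparse` print a usage error and exit, which is the behaviour the note describes.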
📟 Sequential Boot Logs
Real-time kernel-style logs with cinematic pacing. It doesn’t say "loading"... It declares intent.
- Timestamped steps.
- Module icons (
⚡ Engine,💾 IO,🧠 Planner). - Your terminal doesn’t just start PyQuery. It witnesses it.
🧩 Focused UI (Modal Upgrade)
Sidebars are for tourists. PyQuery loads data through dedicated modal dialogs—because loading data is a moment, not a side quest.
- Blazing-Fast & Optimistic: The dialog opens instantly.
- Lazy Preview: We scan 100k+ files without freezing the UI.
- Recent Paths: We remember so you don't have to.
- Preview Before Commit: See matched files and sheets before you import. You don't guess anymore; you confirm with intent.
💪 The Flex (Capabilities)
We built an empire so you can rule yours. This isn't just software; it's a lifestyle.
🎯 EDA: The Crystal Ball (Expanded)
"Most tools describe the past. PyQuery predicts the future."
EDA is no longer just "looking at data". It's hunting.
1. 🧬 Dataset DNA & Health Check
We scan your data's soul.
- Missing Cells: We don't just count nulls; we judge them. (<1% is excellence, >10% is sloppy).
- Cardinality Checks: Instantly know if a column is categorical or continuous.
- Duplicate Detection: We find the clones and eliminate them.
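The three checks above can be sketched in plain Python. This is an illustration of the idea, not PyQuery's implementation: `health_report` is a hypothetical helper, rows are plain dicts rather than a Polars frame, and the "acceptable" label for the 1% to 10% band is an assumption because the text only names the two extremes.

```python
from collections import Counter


def health_report(rows: list[dict]) -> dict:
    """Summarise missing cells, per-column cardinality, and duplicate rows."""
    columns = list(rows[0].keys())
    total_cells = len(rows) * len(columns)
    missing = sum(1 for row in rows for col in columns if row.get(col) is None)
    missing_pct = 100.0 * missing / total_cells

    # Thresholds from the text: <1% is excellence, >10% is sloppy.
    # The middle label is an assumption; the text does not name it.
    if missing_pct < 1:
        grade = "excellent"
    elif missing_pct > 10:
        grade = "sloppy"
    else:
        grade = "acceptable"

    # Distinct non-null values per column: low counts hint at categorical data.
    cardinality = {
        col: len({row.get(col) for row in rows if row.get(col) is not None})
        for col in columns
    }
    # A row counts as a duplicate each time it repeats an earlier row.
    counts = Counter(tuple(row.get(col) for col in columns) for row in rows)
    duplicates = sum(n - 1 for n in counts.values())

    return {
        "missing_pct": missing_pct,
        "grade": grade,
        "cardinality": cardinality,
        "duplicate_rows": duplicates,
    }


rows = [
    {"city": "Pune", "spend": 120.0},
    {"city": "Pune", "spend": 120.0},
    {"city": "Oslo", "spend": None},
    {"city": "Lima", "spend": 75.5},
]
print(health_report(rows))
```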
2. 🚀 The Action Engine (ML Strategist)
- Strategic Brief: A "Top 3 Insights" card that ranks every signal in your data. It whispers: "The money is here."
- Automated Drivers: It finds the hidden variables controlling your target.
- "Why is Churn high? It's not Price. It's Customer Support Wait Time > 5m." -> Boom. Solved.
- Correlation Matrix: Pearson, Cramér’s V, and F-Tests calculated automatically. We know the relationships better than you know your own situationship.
3. 🧪 ML Laboratory (The Brain)
- Auto-Pilot Mode: Trains an army of models (Random Forest, Lasso, Ridge) to find the best fit. You sit back and look busy.
- Clustering (Unsupervised Rizz): Elbow Plot & Silhouette Score optimization. We even name the segments for you ("Cluster 1 = High Spend, Low Age").
- Explainable Anomalies: Uses Isolation Forests to catch the weirdos and fraudsters instantly, with a Contextual Profiler to tell you why they are weird.
4. 🎮 Decision Simulator (The Time Machine)
- "What-If" Sliders: Change variables in real-time. "If I raise Price by 10% and lower ad spend, do I still profit?"
- Monte Carlo Sims: Run 1,000+ simulations. We don't guess; we calculate the probability of your success.
- Waterfall Analysis: The Model breaks down exactly why the prediction changed.
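A what-if question like the one above boils down to re-running a model many times with noise and counting the wins. The sketch below is a toy under stated assumptions: the demand formula, the unit cost, and the function name `profit_probability` are all invented for illustration and are not PyQuery's simulator.

```python
import random


def profit_probability(price: float, ad_spend: float, runs: int = 1000, seed: int = 0) -> float:
    """Estimate P(profit > 0) under a toy demand model (illustrative numbers only)."""
    rng = random.Random(seed)  # seeded so repeated calls give the same estimate
    unit_cost = 6.0
    wins = 0
    for _ in range(runs):
        # Toy assumption: demand falls with price, rises with ad spend, plus noise.
        demand = max(0.0, rng.gauss(500 - 20 * price + 0.05 * ad_spend, 40))
        profit = demand * (price - unit_cost) - ad_spend
        wins += profit > 0
    return wins / runs


baseline = profit_probability(price=10.0, ad_spend=1000.0)
what_if = profit_probability(price=11.0, ad_spend=800.0)  # +10% price, less ad spend
print(f"baseline: {baseline:.2f}, what-if: {what_if:.2f}")
```

The output is a probability rather than a single forecast, which is the point of running the scenario a thousand times instead of once.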
5. 📈 Time Series & Visuals That Don't Miss
- Holt-Winters Forecasting: Predicting the future with confidence intervals.
- Decomposition: Splitting data into Trend, Seasonality, and Noise.
- Cohort Comparison: Volcano Plots visualizing "Effect Size" vs "Significance." We bring the science.
💻 The Integrated IDE (Code is Power)
For those who speak the language of the gods (Python/SQL), we built a React-based Code Editor right inside the UI.
- Embedded Ace Editor: Syntax highlighting, line numbers, and active line focus. Feels like VS Code, lives in your browser.
- Intelligent Auto-Completions: Context-aware suggestions for `pl`, `np`, `math`. Type `col`, get `col("name")`. It knows your schema.
- Sandboxed Custom Scripts:
  - AST-Validated Security: We parse your code before execution.
  - Blocked: `import os`, private attributes, system calls.
  - Allowed: `numpy`, `scipy`, `sklearn`. Pure math and logic only.
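For a feel of what parsing before execution can look like, here is a minimal sketch built on Python's standard `ast` module. It is not PyQuery's validator: `validate_script` is a hypothetical name, the allowed modules are just the three listed above, and the `BLOCKED_CALLS` set is an assumption standing in for "system calls".

```python
import ast

ALLOWED_MODULES = {"numpy", "scipy", "sklearn"}   # from the list above
BLOCKED_CALLS = {"eval", "exec", "open", "compile", "__import__"}  # assumed


def validate_script(source: str) -> list[str]:
    """Return a list of violations; an empty list means the script passed."""
    problems = []
    for node in ast.walk(ast.parse(source)):
        # Collect the module names this node imports, if any.
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            names = []
        for name in names:
            if name.split(".")[0] not in ALLOWED_MODULES:
                problems.append(f"import of '{name}' is not allowed")
        # Private or dunder attribute access such as obj._secret or obj.__class__.
        if isinstance(node, ast.Attribute) and node.attr.startswith("_"):
            problems.append(f"private attribute '{node.attr}' is not allowed")
        # Direct calls to dangerous builtins.
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in BLOCKED_CALLS:
                problems.append(f"call to '{node.func.id}' is not allowed")
    return problems


print(validate_script("import numpy as np\nresult = np.sqrt(16)"))
print(validate_script("import os\nos.system('whoami')"))
```

A static check like this is a first gate, not a complete sandbox on its own; it rejects the obvious escapes before any code runs.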
🧪 SQL Lab: The Codex (God Mode)
For when the GUI is too easy and you want to flex raw SQL. This isn't SQLite. This is High-Performance Lazy SQL.
- Zero-Lag Querying: Run `SELECT *` on a 50GB file? It pulls a preview instantly. The engine effectively cheats physics.
- Cross-Dataset Joins: Join `sales.csv` with `targets.xlsx` using standard SQL.
- Materialize: Execute complex queries, then save as a new dataset.
🧹 The Forge (Ruthless ETL)
Backend I/O that actually understands real-world data. Real data is cursed. We planned for that.
- 🧬 Advanced Auto-Encoding Healer:
  - Scans the first bytes of every CSV to automatically fix `UnicodeDecodeError`.
  - Stream-Based Healing: Processes multi-GB files in 4MB chunks. Memory usage stays flat.
  - Sanitization: Strips `Null Bytes`, normalizes newlines, and replaces garbage.
- 🧩 Mixed-Encoding Folder Handling:
  - If a folder contains files with different encodings, PyQuery detects it and switches strategy automatically.
  - We isolate. We adapt. We continue.
- 📂 Recursive Folder Globbing (Upgraded):
  - Patterns like `data/**/*.csv` work even when schemas differ slightly or headers are misaligned.
- 🏗️ Staging Ground (Infrastructure Rizz):
  - Control your intermediate storage. If your `%TEMP%` partition is small, tell PyQuery where the real space is using the `PYQUERY_STAGING_DIR` environment variable.

        # Linux/Mac Power Move
        export PYQUERY_STAGING_DIR="/mnt/fast_ssd/pyquery_cache"
        pyquery run ...
- 🔍 Advanced File Filtering (Precision Strikes):
  - Multiple Filter Types: `Glob`, `Regex`, `Contains`, `Not Contains`, `Exact`, `Is Not`.
  - Stackable Logic: Must contain `sales` + Must NOT contain `backup` + Must match regex `\d{4}`.
  - This is surgical file selection. No more loading junk and cleaning later.
- 📊 Excel Handling That Respects Your Sanity:
  - Multi-Sheet Selection: Load one sheet, many sheets, or only the ones that matter.
  - Template-Based Mapping: Pick a base file, preview its sheets, and apply that selection across all matching files.
  - Sheet Name Filtering: Regex-powered selection like `Q[1-4]_Data`.
- ✨ Source Awareness & Cleanliness:
  - Metadata Injection: Automatically add `__source_path__` and `__source_name__`.
  - Auto Type Inference: Samples data, infers dtypes, and instantly appends a Clean & Cast step.
- ✨ Auto-Typecast: One click scans rows and forcibly converts `Strings` to `Int`, `Float`, or `Date`.
- 🎭 PII Incinerator: Detects and obfuscates credit cards and SSNs. Secrets remain secret.
- 🩹 Smart Impute: Fill the voids. Forward fill, backward fill, median, or specific value injection. No null survives.
- 💥 Explode & Coalesce: Flatten lists and merge columns like a boss.
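The encoding healer described at the top of this list can be approximated with the standard library. The sketch below is an assumption-heavy illustration, not PyQuery's code: `sniff_encoding` and `heal` are hypothetical names, the detection heuristic (BOM check, then trial UTF-8 decode, then a `cp1252` fallback) is a guess at one reasonable strategy, and only the 4MB chunk size comes from the text.

```python
import codecs
import io
from typing import BinaryIO, TextIO

CHUNK_SIZE = 4 * 1024 * 1024  # 4MB, the chunk size quoted above

# Byte-order marks checked against the first bytes of the stream.
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_encoding(head: bytes) -> str:
    """Guess an encoding from a BOM, then by trial decoding (assumed heuristic)."""
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    try:
        # Incremental decoding tolerates a multi-byte character cut off at the end.
        codecs.getincrementaldecoder("utf-8")().decode(head)
        return "utf-8"
    except UnicodeDecodeError:
        return "cp1252"  # assumed fallback for legacy exports


def clean(text: str) -> str:
    """Strip null characters and normalise every newline style to LF."""
    return text.replace("\x00", "").replace("\r\n", "\n").replace("\r", "\n")


def heal(src: BinaryIO, dst: TextIO) -> str:
    """Stream src to dst as clean text; returns the encoding that was used."""
    head = src.read(4096)
    encoding = sniff_encoding(head)
    # errors="replace" swaps undecodable bytes for U+FFFD instead of raising.
    decoder = codecs.getincrementaldecoder(encoding)(errors="replace")
    carry = ""
    chunk = head
    while chunk:
        text = carry + decoder.decode(chunk)
        # Hold back a trailing "\r" so a "\r\n" split across chunks stays one newline.
        carry = "\r" if text.endswith("\r") else ""
        dst.write(clean(text[: len(text) - len(carry)]))
        chunk = src.read(CHUNK_SIZE)
    dst.write(clean(carry + decoder.decode(b"", final=True)))
    return encoding


raw = "name,city\r\nJos\xe9,Le\x00\xf3n\r\n".encode("cp1252")
out = io.StringIO()
print(heal(io.BytesIO(raw), out), repr(out.getvalue()))
```

Because only one chunk is held at a time, memory use depends on `CHUNK_SIZE` rather than on the size of the file.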
🧠 The Tech Stack (Forbidden Knowledge) 🐐
This isn't just a library. It's a weapon system.
1. 🌊 The "Infinite Stream" Glitch (Lazy Execution)
The Old Gods (Pandas) are Eager. They try to swallow the ocean (RAM) whole. They choke. PyQuery is Lazy. It waits. It plans.
- Scan: "It's a 100GB file. Interesting."
- Plan: Filters, joins, math. Nothing executes until the final blow.
- Stream: Data flows in chunks. Process. Write. Destroy.
- Result: Processing 100GB on a MacBook Air. The laws of physics are optional.
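The Scan, Plan, Stream loop can be shown in miniature with plain Python generators. This is a conceptual toy, not how PyQuery or Polars is implemented: `LazyPlan` and its methods are hypothetical names, and rows are dicts instead of columnar data.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator


class LazyPlan:
    """Toy lazy pipeline: steps are recorded, and nothing runs until export()."""

    def __init__(self, source: Callable[[], Iterable[dict]]):
        self._source = source  # called only at export time
        self._steps: list[Callable[[Iterator[dict]], Iterator[dict]]] = []

    def filter(self, predicate: Callable[[dict], bool]) -> "LazyPlan":
        self._steps.append(lambda rows: (r for r in rows if predicate(r)))
        return self

    def with_column(self, name: str, fn: Callable[[dict], object]) -> "LazyPlan":
        self._steps.append(lambda rows: ({**r, name: fn(r)} for r in rows))
        return self

    def export(self, sink: Callable[[list[dict]], None], chunk_size: int = 2) -> None:
        rows: Iterator[dict] = iter(self._source())
        for step in self._steps:
            rows = step(rows)
        # Stream: pull one chunk, hand it to the sink, then let it go.
        while chunk := list(islice(rows, chunk_size)):
            sink(chunk)


def scan():
    print("scan started")  # shows the source is untouched until export()
    for i in range(1, 7):
        yield {"id": i, "amount": i * 10}


plan = LazyPlan(scan).filter(lambda r: r["amount"] > 20).with_column("double", lambda r: r["amount"] * 2)
print("plan built, nothing executed yet")
plan.export(lambda chunk: print("chunk:", chunk))
```

Building the plan prints nothing from `scan`; the source is only read once `export` starts pulling chunks, and at most `chunk_size` rows are held at a time.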
2. ⚙️ File-Level Execution Control
Most engines think in datasets. PyQuery thinks in files.
- Individual File Processing: Forces the engine to load files one-by-one instead of bulk scanning.
- Why it matters: One corrupted CSV no longer nukes the entire pipeline. We fix schemas and clean data before concatenation. This is how PyQuery survives enterprise-grade mess.
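The one-by-one idea can be sketched as a loop that parses each file on its own, fixes its header, and quarantines failures instead of aborting. Everything here is illustrative: `load_individually` is a hypothetical helper, files are in-memory strings to keep the example self-contained, and the header fix (trim and lowercase) is an assumed stand-in for real schema repair. Only the `__source_name__` column name comes from the text above.

```python
import csv
import io


def load_individually(files: dict[str, str], expected: list[str]):
    """Parse each CSV on its own; quarantine failures instead of aborting."""
    good_rows, quarantined = [], {}
    for name, text in files.items():
        try:
            reader = csv.DictReader(io.StringIO(text))
            header = [h.strip().lower() for h in (reader.fieldnames or [])]
            if sorted(header) != sorted(expected):
                raise ValueError(f"unexpected columns: {header}")
            reader.fieldnames = header  # schema fix happens before concatenation
            file_rows = [{**row, "__source_name__": name} for row in reader]
        except (csv.Error, ValueError) as exc:
            quarantined[name] = str(exc)  # one bad file is recorded, not fatal
            continue
        good_rows.extend(file_rows)
    return good_rows, quarantined


files = {
    "jan.csv": "ID,Amount\n1,10\n2,20\n",
    "feb.csv": " id ,amount\n3,30\n",
    "broken.csv": "totally,wrong,header\nx,y,z\n",
}
rows, bad = load_individually(files, expected=["id", "amount"])
print(len(rows), "rows loaded;", "quarantined:", list(bad))
```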
3. 🚀 Streaming I/O Architecture
We rewired the backend for scale.
- True Streaming Discovery: Uses generators and lazy iteration. Point at 100k files without crashing.
- Partial Globbing: Simple text filters convert to filesystem-level globs. Python never even sees irrelevant files.
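A rough picture of both ideas, using only `pathlib`: a plain "contains" filter is turned into a glob so the directory walk does the first cut, and the remaining filters run lazily on whatever comes back. `contains_to_glob` and `find_files` are hypothetical names, and the filter values reuse the `sales` / `backup` / `\d{4}` example from the filtering section; none of this is PyQuery's actual code.

```python
import glob
import re
import tempfile
from pathlib import Path
from typing import Iterator


def contains_to_glob(text: str) -> str:
    """Turn a plain 'Contains' filter into a glob the filesystem walk can use."""
    return f"*{glob.escape(text)}*"


def find_files(root: str, contains: str, not_contains: str, regex: str) -> Iterator[Path]:
    """Lazily yield files whose names pass all three stacked filters."""
    pattern = re.compile(regex)
    # rglob returns a generator, so matches are produced one at a time.
    for path in Path(root).rglob(contains_to_glob(contains)):
        if not path.is_file():
            continue
        if not_contains in path.name:
            continue
        if pattern.search(path.name):
            yield path


with tempfile.TemporaryDirectory() as root:
    for rel in ["sales_2023.csv", "sales_backup_2023.csv", "sales_final.csv",
                "targets_2024.csv", "archive/sales_2024.csv"]:
        target = Path(root, rel)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text("id\n1\n")
    print(sorted(p.name for p in find_files(root, "sales", "backup", r"\d{4}")))
```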
4. 🛡️ Type Safety (Absolute Order)
Python is dynamic (chaotic). PyQuery imposes Order.
- Every step is backed by a Pydantic Model.
- If a `String` tries to infiltrate a `Float` column, it is terminated before execution.
- No runtime surprises. Only calculated victories.
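To show the validate-before-execute idea without extra dependencies, the sketch below swaps Pydantic for a standard-library dataclass with manual checks. `ClipParams`, `check_step`, and the dtype names in `NUMERIC` are hypothetical; the real project's models and rules may look quite different.

```python
from dataclasses import dataclass, fields

NUMERIC = {"Int", "Float"}  # assumed dtype labels for this sketch


@dataclass(frozen=True)
class ClipParams:
    """Hypothetical parameters for a clip step; checked when constructed."""
    column: str
    lower: float
    upper: float

    def __post_init__(self) -> None:
        for field in fields(self):
            value = getattr(self, field.name)
            # Strict: no silent coercion, so "0" or 0 is rejected for a float field.
            if type(value) is not field.type:
                raise TypeError(
                    f"{field.name} must be {field.type.__name__}, got {type(value).__name__}"
                )
        if self.lower > self.upper:
            raise ValueError("lower must not exceed upper")


def check_step(params: ClipParams, schema: dict[str, str]) -> None:
    """Reject the step before any data is touched if the column is not numeric."""
    dtype = schema.get(params.column)
    if dtype not in NUMERIC:
        raise TypeError(f"clip needs a numeric column, but {params.column!r} is {dtype}")


schema = {"price": "Float", "city": "String"}
check_step(ClipParams("price", 0.0, 100.0), schema)  # passes silently
try:
    check_step(ClipParams("city", 0.0, 1.0), schema)
except TypeError as exc:
    print("rejected before execution:", exc)
```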
🧾 The Receipts (Benchmarks)
We don't post without proof. We mog the competition.
| Metric | 🐼 Pandas (Legacy) | ⚡ PyQuery (Polars) | The Diff |
|---|---|---|---|
| Load 10GB CSV | MemoryError (Crash) 💥 | 0.2s (Lazy Scan) ⚡ | Infinite |
| Filter Rows | 15.4s (Slow) | 0.5s (Parallel) | 30x Faster |
| Group By | 45s (Painful) | 2.1s (Instant) | 20x Faster |
| RAM Usage | 12GB+ (Bloated) | 500MB (Lean) | 95% Less |
Benchmarks run on a standard dev laptop. Results may vary but the vibe remains consistent.
🎮 Choose Your Fighter (4 Paths to Power)
We don't limit you. Dominate however you choose.
📦 Installation
pip install pyquery-polars
1. 🌊 The GUI (God Mode)
For when you want to click things, see pretty charts, and feel like a data scientist in a sci-fi movie.
- Visual Recipe Builder: Nodes and edges of pure logic.
- Native File Picker: Access local filesystem directly.
pyquery ui
# Launches the Web App on localhost:8501 🚀
2. 🤖 The API (Headless Beast)
Building a machine? Run PyQuery as the engine.
- Swagger Docs: Auto-generated at `/docs`.
- Async: Fire and forget jobs via `POST /recipes/run`.
pyquery api
# Serving high-performance ETL over HTTP at localhost:8000 📡
3. ⚡ The Batch Runner (Speedrun)
For automation. No interface. Just speed.
pyquery run -s input.csv -r recipe.json -o output.parquet
# Task complete. ⚡
4. 🧙‍♂️ The Sorcerer (Python SDK)
For the developers who want to weave PyQuery into their own code.
from pyquery_polars.backend.engine import PyQueryEngine
# Full programmatic control over the recipe engine.
# You are the architect now.
🧰 The Loadout (Arsenal)
Packed with every tool needed to clear the map.
| Category | The Tools | Why it slaps |
|---|---|---|
| Cleaning | Fill Nulls, Mask PII, Smart Extract, Regex | Turns garbage data into gold. ✨ |
| Analytics | Rolling Agg, Time Bin, Rank, Diff, Z-Score | High-frequency trading vibes. 📈 |
| Combining | Smart Join, Concat, Pivot, Unpivot | Merge datasets without the headache. 🤝 |
| Math | Log, Exp, Clip, Date Offset | For the scientific girlies. 👩‍🔬 |
| Text | Slice, Case, Replace, One-Hot | String manipulation on steroids. 💪 |
| I/O | CSV, Parquet, Excel, JSON, IPC | Speaks every language. 🗣️ |
🗺️ The Roadmap (Manifesting Destiny) 🔮
We aren't stopping here. We are aiming for the moon. 🚀
- Phase 1: Native App Supremacy (Rust + Tauri): The browser has limits. The Native App will have none. GPU-accelerated plotting (10M points at 144Hz) and OLED black themes.
- Phase 2: Big Data Devourer: Cloud connectors (S3, GCS, Azure). We drink their milkshakes.
🧑‍💻 Join the Cult (Developer Guide)
You want to contribute? Good. We need strong allies.
The Blooding (Adding a Transform) 🖐️
1. Backend Implementation:
- Define Params: Create a Pydantic model (`src/pyquery_polars/core/params.py`).
- Backend Logic: Write a pure polars function (`src/pyquery_polars/backend/transforms/`).
- Register: Add step to `register_all_steps()` in `registry.py`.
2. Frontend Implementation:
- Create a Renderer Function (`src/pyquery_polars/frontend/steps/`).
- Register: Add step to `register_frontend()` in `registry_init.py`.
It appears in the CLI, API, and UI automatically. 🤯
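The params, function, and register trio is a classic registry pattern. The sketch below shows the shape of it in plain Python so it runs anywhere; it is not the project's code. `STEP_REGISTRY`, `register_step`, `UppercaseParams`, and `run_step` are hypothetical names, a dataclass stands in for the Pydantic model, and rows are dicts rather than a Polars frame.

```python
from dataclasses import dataclass
from typing import Callable

# name -> (params class, transform function); hypothetical registry layout
STEP_REGISTRY: dict[str, tuple[type, Callable]] = {}


def register_step(name: str, params_cls: type):
    """Decorator that records a transform and its params class under a name."""
    def decorator(fn: Callable) -> Callable:
        STEP_REGISTRY[name] = (params_cls, fn)
        return fn
    return decorator


@dataclass
class UppercaseParams:
    column: str


@register_step("uppercase", UppercaseParams)
def uppercase(rows: list[dict], params: UppercaseParams) -> list[dict]:
    """Pure function: returns new rows and leaves the input untouched."""
    return [{**r, params.column: str(r[params.column]).upper()} for r in rows]


def run_step(name: str, raw_params: dict, rows: list[dict]) -> list[dict]:
    """What any front-end (CLI, API, UI) could call: look up by name, then run."""
    params_cls, fn = STEP_REGISTRY[name]
    return fn(rows, params_cls(**raw_params))


print(run_step("uppercase", {"column": "city"}, [{"city": "oslo"}]))
```

Once every front-end resolves steps through the same lookup, registering a step in one place is enough for all of them to see it.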
# Only certified ballers contribute code.
# Are you up for it?
📜 License
GPL-3.0. Open source forever. 💖
Made with ☕, 🦀 (Rust), and 💖 by Sudharshan TK
File details
Details for the file pyquery_polars-4.1.2.tar.gz.
File metadata
- Download URL: pyquery_polars-4.1.2.tar.gz
- Upload date:
- Size: 607.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.8.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 93f42e9c1b8c605ae9989a5d8191d4ff84c3b4ee95d9ec33a90a8ab381c9218f |
| MD5 | 5ef1b8eace7968a1378ffc889cf63ca0 |
| BLAKE2b-256 | 62af15e653b3af50ffc1e6ddb6223ab3bb846aa05d5e8d9d40595b1ae22f9d15 |
File details
Details for the file pyquery_polars-4.1.2-py3-none-any.whl.
File metadata
- Download URL: pyquery_polars-4.1.2-py3-none-any.whl
- Upload date:
- Size: 670.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.8.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | de88604c391533a7c276c3f7af16ba867cb8cbeaf0069f78dc114bccf5e58f47 |
| MD5 | 564b3f345aea38ce97ff423c17c8a733 |
| BLAKE2b-256 | 3dafa0f3b9a737d9703c9f9cdb89769856159e1dfe791c0b131a70843292d55c |