A lightweight SQL query engine for data exploration with lazy evaluation and intelligent optimizations
Project description
A lightweight, pure-Python SQL query engine for CSV, Parquet, JSON, JSONL, HTML, Markdown, and XML files with lazy evaluation and intelligent optimizations.
📖 Full Documentation | 🚀 Quick Start | 💬 Discussions
Quick Example
# Query a CSV file (explicit source)
$ sqlstream query data.csv "SELECT * FROM data WHERE age > 25"
# Query with inline file path (source inferred from SQL)
$ sqlstream query "SELECT * FROM 'data.csv' WHERE age > 25"
# Query JSON with nested paths
$ sqlstream query "users.json#json:data.users" "SELECT name, email FROM users"
# Query JSONL (JSON Lines) files
$ sqlstream query logs.jsonl "SELECT timestamp, level, message FROM logs WHERE level = 'ERROR'"
# Query HTML tables
$ sqlstream query "report.html#html:0" "SELECT * FROM report WHERE revenue > 1000000"
# Query Markdown tables
$ sqlstream query "README.md#markdown:1" "SELECT column1, column2 FROM readme"
# Query XML files
$ sqlstream query "data.xml#xml:record" "SELECT name, age FROM data WHERE age > 25"
# Query S3 files with partitions
$ sqlstream query "SELECT * FROM 's3://my-bucket/data/year=2024/' WHERE date > '2024-01-01'"
# Join multiple formats (CSV + Parquet + JSON)
$ sqlstream query "SELECT c.name, o.total, u.email
FROM 'customers.csv' c
JOIN 'orders.parquet' o ON c.id = o.customer_id
JOIN 'users.json#json:users' u ON c.user_id = u.id"
# Interactive shell with full TUI
$ sqlstream shell
Features
- 🚀 Pure Python - No database installation required
- 📊 Multiple Formats - CSV, Parquet, JSON, JSONL, HTML, Markdown, XML with nested path syntax, HTTP URLs, S3 buckets
- ⚡ 100x+ Faster - DuckDB backend for complex SQL, Pandas backend for simple queries
- 🔗 JOIN Support - INNER, LEFT, RIGHT, FULL OUTER joins across different file formats
- 📈 Aggregations - GROUP BY with COUNT, SUM, AVG, MIN, MAX, DISTINCT
- 🔢 Rich Type System - 10 data types (INTEGER, FLOAT, DECIMAL, STRING, JSON, BOOLEAN, DATE, TIME, DATETIME, NULL) with automatic inference
- ☁️ S3 Support - Query files directly from Amazon S3 with partition support
- 🎨 Beautiful Output - Rich tables, JSON, CSV, Parquet, Markdown formatting
- 🖥️ Advanced Interactive Shell - Multiple tabs, sidebars, layout cycling, backend toggle, state persistence, file browser
- 🔍 Smart Optimizations - Column pruning, predicate pushdown, limit pushdown, lazy evaluation
- 🌐 REST API Ready - Query HTTP endpoints and APIs (coming soon)
- 📦 Lightweight - Minimal dependencies, works everywhere
Installation
Using uv (recommended)
Basic (CSV only):
uv tool install sqlstream
All features:
uv tool install "sqlstream[all]"
Multiple Sub Dependencies:
uv tool install "sqlstream[interactive,pandas,s3,http,html,duckdb]"
Optional Dependencies:
pandas- Pandas backend for 10-100x speedupduckdb- DuckDB backend for 100x+ speedup and advanced SQLparquet- Parquet file supports3- Amazon S3 file accesshttp- HTTP/HTTPS data sourceshtml- HTML table extraction (requires pandas, html5lib, beautifulsoup4)interactive- Interactive shell with rich TUIall- All features combined
Using pip
Basic (CSV only):
pip install sqlstream
All features:
pip install "sqlstream[all]"
See Installation Guide for more options.
Quick Start
CLI Usage
# Simple query
$ sqlstream query data.csv "SELECT name, salary FROM data WHERE salary > 80000"
# With pandas backend for performance
$ sqlstream query data.csv "SELECT * FROM data" --backend pandas
# JSON output
$ sqlstream query data.csv "SELECT * FROM data" --format json
# Interactive shell with TUI
$ sqlstream shell data.csv
Interactive Shell
$ sqlstream shell
Enhanced TUI Features:
Query Management:
- Multiple Tabs (
Ctrl+T/Ctrl+W) - Work on multiple queries simultaneously - State Persistence - Auto-save tabs, queries, and layout between sessions
- Query History (
Ctrl+Up/Down) - Navigate previous queries with multiline support - Auto-completion - Schema-aware suggestions for tables and columns
- Syntax Highlighting - SQL syntax with Dracula theme
Sidebars & Layout:
- Dynamic Sidebars (
F2/F3) - Schema browser, File explorer, Filter, Export, Config - Layout Cycling (
Ctrl+L) - Resize query editor: 50%, 60%, 70%, 80%, 100% - File Browser (
Ctrl+O) - Tree view with directory navigation - Schema Browser - Real-time schema and type information
Execution & Performance:
- Backend Toggle (
F5orCtrl+B) - Cycle: Auto → DuckDB → Pandas → Python - Execution Plan (
F4) - View query optimization steps - Async Execution - Responsive UI during long queries
- Cancel Queries (
Ctrl+C) - Stop running queries
Results Management:
- Advanced Filtering (
Ctrl+F) - Column-specific or global search - Smart Export (
Ctrl+X) - CSV, JSON, or Parquet with format selection - Pagination - 100 rows per page, configurable
- Column Sorting - Click headers to sort
- Live Stats - Row counts and filter status
Keyboard Shortcuts:
Ctrl+Delete/Ctrl+Backspace- Word-aware deletionCtrl+QorCtrl+D- Exit with auto-save
Python API
from sqlstream import query
# Execute query with explicit source
results = query("data.csv").sql("SELECT * FROM data WHERE age > 25")
# Execute query with inline source (extracted from SQL)
results = query().sql("SELECT * FROM 'data.csv' WHERE age > 25")
# Iterate over results
for row in results:
print(row)
# Or convert to list
results_list = query().sql("SELECT * FROM 'data.csv'").to_list()
Documentation
Full documentation: https://subhayu99.github.io/sqlstream
Key sections:
- Quick Start Guide - Get started in 5 minutes
- SQL Reference - Supported SQL syntax
- CLI Reference - Command-line interface
- Python API - Programmatic usage
- Examples - Real-world examples
- Troubleshooting - Common issues and solutions
- Architecture - How it works
Development Status
Current Phase: 9 (Enhanced Interactive Shell - Complete!)
- ✅ Phase 0-2: Core query engine with Volcano model
- ✅ Phase 3: Parquet support
- ✅ Phase 4: Aggregations & GROUP BY
- ✅ Phase 5: JOIN operations (INNER, LEFT, RIGHT)
- ✅ Phase 5.5: Pandas backend (10-100x speedup)
- ✅ Phase 6: HTTP data sources
- ✅ Phase 7: CLI with beautiful output
- ✅ Phase 7.5: Interactive mode with Textual
- ✅ Phase 7.6: Inline file path support
- ✅ Phase 8: Type system & schema inference
- ✅ Phase 9: Enhanced interactive shell (multiple tabs, state persistence, file browser, query plan)
- ✅ Phase 10: HTML & Markdown readers with table extraction
- ✅ Phase 11: Enhanced type system (Decimal, DateTime, Date, Time, JSON)
- 🚧 Phase 12: Comprehensive testing & documentation (15% coverage → 80% target)
Test Coverage: 560 tests, 15% coverage (actively improving)
Performance
SQLStream offers three execution backends:
| Backend | Speed | Use Case |
|---|---|---|
| Python | Baseline | Learning, small files (<100K rows) |
| Pandas | 10-100x faster | Basic queries, large files (>100K rows) |
| DuckDB | 100x+ faster | Complex SQL, analytics, huge files (10M+ rows) |
Benchmark (1M rows):
- Python backend: 52s
- Pandas backend: 0.8s ⚡ 65x faster
Architecture
SQLStream uses the Volcano iterator model for query execution:
SQL Query → Parser → AST → Planner → Optimizer → Executor → Results
↓
(Column Pruning, Predicate Pushdown,
Lazy Evaluation)
Key concepts:
- Lazy Evaluation: Rows are processed on-demand
- Column Pruning: Only read columns that are used
- Predicate Pushdown: Apply filters early to reduce data scanned
- Three Backends: Pure Python (learning), Pandas (performance), and DuckDB (full SQL)
See Architecture Guide for details.
Contributing
Contributions are welcome! See Contributing Guide for details.
Development setup:
# Clone repository
git clone https://github.com/subhayu99/sqlstream.git
cd sqlstream
# Install development dependencies
pip install -e ".[dev]"
# Run tests
pytest
# Format code
ruff format .
ruff check .
License
MIT License - see LICENSE for details.
Built with ❤️ by the SQLStream Team
📖 Documentation • 🐛 Issues • 💬 Discussions
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file sqlstream-0.6.0.tar.gz.
File metadata
- Download URL: sqlstream-0.6.0.tar.gz
- Upload date:
- Size: 2.0 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.14
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
978e2861f18cbdba13a87bc8492d130fe5be16edcdf7f4af25858b7fd03c8bab
|
|
| MD5 |
6453fb6861bd2f2ed3ce73e477eda0fc
|
|
| BLAKE2b-256 |
3de1bbcb9a4752dd3ee6bef40601353e3b15c837a5e59eae0f7f7fbdcbc58da7
|
File details
Details for the file sqlstream-0.6.0-py3-none-any.whl.
File metadata
- Download URL: sqlstream-0.6.0-py3-none-any.whl
- Upload date:
- Size: 124.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.14
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
4232f5b7c55ec236b8f94f3f88fd495407d26c760288c7790978869bc0bf796f
|
|
| MD5 |
d257657a74011424c12e38fa21aabef2
|
|
| BLAKE2b-256 |
431b4eed8506e3dc8e19ca7ea7d370f98898b1ab4af33da3c0bb7d1ec56ab0f5
|