A lightweight SQL query engine for data exploration with lazy evaluation and intelligent optimizations

SQLStream

A lightweight, pure-Python SQL query engine for CSV and Parquet files with lazy evaluation and intelligent optimizations.


Quick Example

# Query a CSV file
$ sqlstream query "SELECT * FROM 'data.csv' WHERE age > 25"

# Join multiple files
$ sqlstream query "SELECT c.name, o.total FROM 'customers.csv' c JOIN 'orders.csv' o ON c.id = o.customer_id"

# Interactive mode for wide tables
$ sqlstream query data.csv "SELECT * FROM data" --interactive

Features

  • 🚀 Pure Python - No database installation required
  • 📊 Multiple Formats - CSV, Parquet files, HTTP URLs
  • 10-100x Faster - Optional pandas backend for performance
  • 🔗 JOIN Support - INNER, LEFT, RIGHT joins
  • 📈 Aggregations - GROUP BY with COUNT, SUM, AVG, MIN, MAX
  • 🔢 Type System - Automatic schema inference with type checking
  • 🎨 Beautiful Output - Rich tables, JSON, CSV formatting
  • 🖥️ Interactive Mode - Scrollable table viewer with Textual
  • 🔍 Smart Optimizations - Column pruning, predicate pushdown, lazy evaluation
  • 📦 Lightweight - Minimal dependencies, works everywhere
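
As an illustration of what automatic schema inference over CSV text can look like, the sketch below guesses one type per column by trying progressively looser parsers. It is not SQLStream's actual implementation: `infer_type` and `infer_schema` are hypothetical names, and the rule order (int, then float, then str) is an assumption.

```python
import csv
import io


def infer_type(values):
    """Return the narrowest of int, float, str that parses every non-empty value."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return str
    for candidate in (int, float):
        try:
            for v in non_empty:
                candidate(v)
            return candidate
        except ValueError:
            continue
    return str


def infer_schema(csv_text, sample_rows=100):
    """Infer a {column: type} mapping from the first `sample_rows` rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [row for _, row in zip(range(sample_rows), reader)]
    return {name: infer_type([r[name] for r in rows]) for name in reader.fieldnames}


# Made-up sample data for illustration.
sample = "name,age,salary\nAda,36,91000.5\nLinus,29,78000\n"
schema = infer_schema(sample)
print({col: t.__name__ for col, t in schema.items()})
```

Sampling only the first rows keeps inference cheap, at the cost of misjudging a column whose later values do not fit the guessed type.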

Installation

Basic (CSV only):

pip install sqlstream

All features (recommended):

pip install "sqlstream[all]"

See Installation Guide for more options.

Quick Start

CLI Usage

# Simple query
$ sqlstream query data.csv "SELECT name, salary FROM data WHERE salary > 80000"

# With pandas backend for performance
$ sqlstream query data.csv "SELECT * FROM data" --backend pandas

# JSON output
$ sqlstream query data.csv "SELECT * FROM data" --format json

# Interactive mode
$ sqlstream query data.csv "SELECT * FROM data" --interactive

Python API

from sqlstream import query

# Execute query (lazy evaluation)
results = query("data.csv").sql("SELECT * FROM data WHERE age > 25")

# Iterate over results
for row in results:
    print(row)

# Or convert to list
results_list = query("data.csv").sql("SELECT * FROM data").to_list()
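
To try the examples above without real data, a small `data.csv` can be generated with the standard library. The column names (`name`, `age`, `salary`) come from the example queries; the rows are made up, and `write_sample_csv` is a hypothetical helper, not part of SQLStream.

```python
import csv
from pathlib import Path

# Made-up rows; only the column names come from the example queries.
SAMPLE_ROWS = [
    {"name": "Ada", "age": 36, "salary": 91000},
    {"name": "Linus", "age": 24, "salary": 78000},
    {"name": "Grace", "age": 45, "salary": 120000},
]


def write_sample_csv(path):
    """Write SAMPLE_ROWS to `path` with a header row and return the path."""
    path = Path(path)
    with path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "age", "salary"])
        writer.writeheader()
        writer.writerows(SAMPLE_ROWS)
    return path


if __name__ == "__main__":
    print(write_sample_csv("data.csv").read_text())
```

With these rows, a filter such as `age > 25` should match Ada and Grace.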

Documentation

Full documentation: https://subhayu99.github.io/sqlstream

Development Status

Current Phase: 8 (Type System & Schema Inference)

  • Phase 0-2: Core query engine with Volcano model
  • Phase 3: Parquet support
  • Phase 4: Aggregations & GROUP BY
  • Phase 5: JOIN operations (INNER, LEFT, RIGHT)
  • Phase 5.5: Pandas backend (10-100x speedup)
  • Phase 6: HTTP data sources
  • Phase 7: CLI with beautiful output
  • Phase 7.5: Interactive mode with Textual
  • Phase 7.6: Inline file path support
  • Phase 8: Type system & schema inference
  • 🚧 Phase 9: Error handling & user feedback
  • 🚧 Phase 10: Testing & documentation

Test Coverage: 358 tests, 53% coverage

Performance

SQLStream offers two execution backends:

Backend   Speed            Use Case
Python    Baseline         Learning, small files (<100K rows)
Pandas    10-100x faster   Production, large files (>100K rows)

Benchmark (1M rows):

  • Python backend: 52s
  • Pandas backend: 0.8s ⚡ 65x faster

Architecture

SQLStream uses the Volcano iterator model for query execution:

SQL Query → Parser → AST → Planner → Optimizer → Executor → Results
                                          ↓
                            (Column Pruning, Predicate Pushdown,
                             Lazy Evaluation)
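
The Volcano model can be sketched with a few operators that each pull rows from their child on demand. This is a generic illustration of the technique, not SQLStream's internal classes: `Scan`, `Filter`, and `Project` are hypothetical names, and Python iteration stands in for the classic open/next/close interface.

```python
class Scan:
    """Leaf operator: yields rows (dicts) from an in-memory source."""

    def __init__(self, rows):
        self.rows = rows

    def __iter__(self):
        for row in self.rows:
            yield row


class Filter:
    """Passes through only the rows for which `predicate(row)` is true."""

    def __init__(self, child, predicate):
        self.child = child
        self.predicate = predicate

    def __iter__(self):
        for row in self.child:
            if self.predicate(row):
                yield row


class Project:
    """Keeps only the named columns of each row."""

    def __init__(self, child, columns):
        self.child = child
        self.columns = columns

    def __iter__(self):
        for row in self.child:
            yield {c: row[c] for c in self.columns}


# Hand-built plan for: SELECT name FROM data WHERE age > 25
data = [
    {"name": "Ada", "age": 36},
    {"name": "Linus", "age": 24},
    {"name": "Grace", "age": 45},
]
plan = Project(Filter(Scan(data), lambda r: r["age"] > 25), ["name"])

# Nothing runs until the plan is iterated; rows are pulled one at a time.
result = list(plan)
print(result)
```

Because every operator exposes the same iteration interface, operators compose freely and no intermediate result is materialized.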

Key concepts:

  • Lazy Evaluation: Rows are processed on-demand
  • Column Pruning: Only read columns that are used
  • Predicate Pushdown: Apply filters early to reduce data scanned
  • Two Backends: Pure Python (learning) and Pandas (performance)
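
A minimal sketch of how the first three ideas can combine when reading a CSV: the scan is a generator (lazy), it applies the filter while reading rather than afterwards (pushdown), and it emits only the requested columns (pruning). `scan_csv`, `counting_lines`, and their parameters are hypothetical and only illustrate the concepts; they are not SQLStream's API.

```python
import csv
import io
from itertools import islice


def scan_csv(lines, columns, predicate=None):
    """Lazily yield pruned, filtered rows from an iterable of CSV lines."""
    for row in csv.DictReader(lines):
        # Predicate pushdown: drop non-matching rows at the scan itself.
        if predicate is not None and not predicate(row):
            continue
        # Column pruning: emit only the columns the query uses.
        yield {c: row[c] for c in columns}


stats = {"lines_read": 0}


def counting_lines(text):
    """Yield the lines of `text`, counting how many were actually read."""
    for line in io.StringIO(text):
        stats["lines_read"] += 1
        yield line


# Made-up data: one header line plus four rows.
text = (
    "name,age,city\n"
    "Ada,36,London\n"
    "Linus,24,Helsinki\n"
    "Grace,45,New York\n"
    "Alan,41,London\n"
)

rows = scan_csv(counting_lines(text), ["name"], lambda r: int(r["age"]) > 25)

# Lazy evaluation: asking for two rows stops reading once two rows match.
first_two = list(islice(rows, 2))
print(first_two)
print(stats["lines_read"], "of 5 lines read")
```

A CSV scan still has to parse every field of each line it reads, so pruning here mainly saves work downstream; a columnar format such as Parquet lets a reader skip unused columns entirely.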

See Architecture Guide for details.

Contributing

Contributions are welcome! See Contributing Guide for details.

Development setup:

# Clone repository
git clone https://github.com/subhayu99/sqlstream.git
cd sqlstream

# Install development dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Format code
ruff format .
ruff check .

License

MIT License - see LICENSE for details.


Built with ❤️ by the SQLStream Team
