
DataSpoc Lens

Requires Python 3.10+. Apache-2.0 licensed.

SQL over cloud Parquet. Query your data lake from the terminal.

Why Lens?

Data teams store Parquet in S3, GCS, or Azure but still spin up heavy warehouses just to run SQL. DataSpoc Lens mounts cloud buckets as DuckDB views and gives you an interactive shell, notebooks, AI-powered queries, and local caching -- all from a single CLI. No servers, no infrastructure, no data copying.

Installation

pip install dataspoc-lens

Cloud and feature extras:

pip install dataspoc-lens[s3]       # AWS S3
pip install dataspoc-lens[gcs]      # Google Cloud Storage
pip install dataspoc-lens[azure]    # Azure Blob Storage
pip install dataspoc-lens[jupyter]  # JupyterLab integration
pip install dataspoc-lens[ai]       # AI natural language queries
pip install dataspoc-lens[all]      # Everything

Quick Start

1. Initialize and register a bucket

dataspoc-lens init
dataspoc-lens add-bucket s3://my-data-lake

Lens discovers tables automatically -- first from Pipe's .dataspoc/manifest.json, then by scanning for *.parquet files.
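
The manifest-then-scan fallback can be pictured as a small function. The manifest shape below is a guess for illustration (the real .dataspoc/manifest.json schema may differ):

```python
import json
import tempfile
from pathlib import Path

def discover_tables(root: Path) -> dict[str, str]:
    """Map table names to Parquet paths: manifest first, then a scan."""
    manifest = root / ".dataspoc" / "manifest.json"
    if manifest.exists():
        # Assumed manifest shape: {"tables": [{"name": ..., "path": ...}]}
        data = json.loads(manifest.read_text())
        return {t["name"]: t["path"] for t in data.get("tables", [])}
    # Fallback: every *.parquet file becomes a table named after its stem.
    return {p.stem: str(p) for p in root.rglob("*.parquet")}

# Demo on a throwaway directory
tmp = Path(tempfile.mkdtemp())
(tmp / "orders.parquet").touch()
print(discover_tables(tmp))  # {'orders': '/tmp/.../orders.parquet'}
```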

2. Explore the catalog

dataspoc-lens catalog
dataspoc-lens catalog --detail orders

3. Query with SQL

dataspoc-lens query "SELECT * FROM orders LIMIT 10"
dataspoc-lens query "SELECT status, COUNT(*) FROM orders GROUP BY status"

4. Launch the interactive shell

dataspoc-lens shell
lens> SELECT customer_id, SUM(total) FROM orders GROUP BY 1 ORDER BY 2 DESC LIMIT 10;
lens> .tables
lens> .schema orders
lens> .export csv /tmp/orders.csv
lens> .quit

5. Configure AI and ask questions

Before using ask, configure an LLM provider:

Option A -- Local AI (free, no API key):

dataspoc-lens setup-ai

Option B -- Cloud provider:

# Anthropic (default)
export DATASPOC_LLM_API_KEY=sk-ant-...

# OpenAI
export DATASPOC_LLM_PROVIDER=openai
export DATASPOC_LLM_API_KEY=sk-...

Then ask questions in natural language:

dataspoc-lens ask "how many orders were placed yesterday?"
dataspoc-lens ask "top 10 customers by revenue this month"
dataspoc-lens ask --debug "average order value by month"

Lens sends your table schemas and sample data to the LLM, receives SQL, executes it, and prints the results. Use --debug to see the full prompt sent to the LLM.
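
The schema-plus-samples context described above can be sketched as a plain prompt builder. This is only an illustration of the shape of such a prompt -- the actual prompt Lens sends (which --debug reveals) will differ:

```python
def build_prompt(schemas: dict[str, list[str]],
                 samples: dict[str, list[tuple]],
                 question: str) -> str:
    """Assemble the kind of context an LLM needs to write SQL (illustrative)."""
    parts = ["You write DuckDB SQL. Tables:"]
    for table, cols in schemas.items():
        parts.append(f"  {table}({', '.join(cols)})")
        for row in samples.get(table, [])[:3]:  # cap the sample rows
            parts.append(f"    sample: {row}")
    parts.append(f"Question: {question}")
    parts.append("Answer with a single SQL statement.")
    return "\n".join(parts)

print(build_prompt(
    {"orders": ["id", "status", "total"]},
    {"orders": [(1, "shipped", 19.99)]},
    "how many orders were placed yesterday?",
))
```

Note that sample rows leave your machine, so point the provider at non-sensitive data or use the local-AI option if that matters.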

6. Export results

Add --export to any query or ask command. Format is detected from the file extension:

dataspoc-lens query "SELECT * FROM orders" --export orders.csv
dataspoc-lens query "SELECT * FROM users" --export users.parquet
dataspoc-lens ask "monthly revenue" --export revenue.json
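
Extension-based detection likely reduces to a small mapping onto DuckDB's COPY formats; a sketch of that idea (the mapping and error behavior are assumptions, not Lens's code):

```python
from pathlib import Path

# Extension -> DuckDB COPY format, mirroring the detection described above.
FORMATS = {".csv": "csv", ".parquet": "parquet", ".json": "json"}

def export_clause(path: str) -> str:
    """Build the COPY statement skeleton for a given output path."""
    ext = Path(path).suffix.lower()
    if ext not in FORMATS:
        raise ValueError(f"unsupported export extension: {ext}")
    return f"COPY (<query>) TO '{path}' (FORMAT {FORMATS[ext]})"

print(export_clause("orders.csv"))
# COPY (<query>) TO 'orders.csv' (FORMAT csv)
```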

Features

Interactive Shell

SQL REPL with syntax highlighting, autocomplete, and history. Dot commands: .tables, .schema <table>, .buckets, .cache <table>, .export <format> <path>, .help, .quit.

Notebook

Launch JupyterLab or Marimo with all tables pre-mounted:

pip install dataspoc-lens[jupyter]
dataspoc-lens notebook

pip install dataspoc-lens[marimo]
dataspoc-lens notebook --marimo

SQL Transforms

Numbered .sql files in ~/.dataspoc-lens/transforms/ run in ascending filename order:

dataspoc-lens transform list
dataspoc-lens transform run

Cache

Copy tables locally for offline work and reduced egress costs:

dataspoc-lens cache orders              # Cache a table
dataspoc-lens cache --list              # Check status (fresh/stale)
dataspoc-lens cache orders --refresh    # Re-download
dataspoc-lens cache --clear             # Clear all

Freshness: compares your cache timestamp against the manifest's last_extraction.
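
That comparison boils down to one predicate: the cache is fresh iff it was taken at or after the last extraction. A sketch, assuming last_extraction is an ISO-8601 timestamp and the cache side is the file's mtime (both assumptions, not confirmed by the docs):

```python
from datetime import datetime

def is_fresh(cache_mtime: float, last_extraction_iso: str) -> bool:
    """Fresh iff the local cache copy postdates the manifest's last extraction."""
    extracted = datetime.fromisoformat(last_extraction_iso).timestamp()
    return cache_mtime >= extracted

# 1_700_000_000 is 2023-11-14T22:13:20+00:00, so a cache taken
# 100 seconds later is fresh:
print(is_fresh(1_700_000_100.0, "2023-11-14T22:13:20+00:00"))  # True
```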

Commands

dataspoc-lens init                          # Initialize configuration
dataspoc-lens add-bucket <uri>              # Register a bucket
dataspoc-lens catalog                       # List all tables
dataspoc-lens catalog --detail <table>      # Show table schema
dataspoc-lens query "<sql>"                 # Execute SQL query
dataspoc-lens query "<sql>" --export f.csv  # Execute and export
dataspoc-lens shell                         # Interactive SQL shell
dataspoc-lens ask "<question>"              # Natural language query
dataspoc-lens ask "<question>" --debug      # Show LLM prompt
dataspoc-lens setup-ai                      # Install local AI (Ollama)
dataspoc-lens notebook                      # Launch JupyterLab
dataspoc-lens notebook --marimo             # Launch Marimo
dataspoc-lens transform list                # List transform files
dataspoc-lens transform run                 # Run all transforms
dataspoc-lens cache <table>                 # Cache a table locally
dataspoc-lens cache --list                  # List cached tables
dataspoc-lens cache --clear                 # Clear cache
dataspoc-lens ml activate [key]             # Activate DataSpoc ML
dataspoc-lens ml train --target col --from tbl  # Train a model
dataspoc-lens ml predict --model m --from tbl   # Generate predictions
dataspoc-lens ml models                     # List trained models
dataspoc-lens --version                     # Show version

Part of the DataSpoc Platform

Product               | Role
DataSpoc Pipe         | Ingestion: Singer taps to Parquet in cloud buckets
DataSpoc Lens (this)  | Virtual warehouse: SQL + Jupyter + AI over your data lake
DataSpoc ML           | AutoML: train and deploy models from your lake

Pipe writes. Lens reads. ML learns.

License

Apache-2.0 -- free to use, modify, and distribute.
