Virtual warehouse — SQL over cloud Parquet via DuckDB

DataSpoc Lens

SQL over cloud Parquet. Query your data lake from the terminal.

Docs  |  Tutorial  |  Discord

Why Lens?

Data teams store Parquet in S3, GCS, or Azure but still spin up heavy warehouses just to run SQL. DataSpoc Lens mounts cloud buckets as DuckDB views and gives you an interactive shell, notebooks, AI-powered queries, and local caching -- all from a single CLI. No servers, no infrastructure, no data copying.

Highlights

  • Zero infrastructure -- DuckDB runs in-process, no server or daemon
  • Multi-cloud -- S3, GCS, and Azure Blob Storage via fsspec + httpfs
  • Interactive shell -- SQL REPL with syntax highlighting and autocomplete
  • AI Ask -- natural language to SQL using Ollama, Anthropic, or OpenAI
  • Local cache -- download once, query offline
  • Transforms -- numbered .sql files for repeatable data pipelines
  • Notebook ready -- launch JupyterLab or Marimo with tables pre-mounted
  • Export anywhere -- CSV, JSON, or Parquet with a single flag

Installation

pip install dataspoc-lens[s3]

Other install options:
# Google Cloud Storage
pip install dataspoc-lens[gcs]

# Azure Blob Storage
pip install dataspoc-lens[azure]

# JupyterLab integration
pip install dataspoc-lens[jupyter]

# AI natural language queries
pip install dataspoc-lens[ai]

# Everything
pip install dataspoc-lens[all]

Quick Start

# Initialize configuration
dataspoc-lens init

# Register a cloud bucket
dataspoc-lens add-bucket s3://my-data-lake

# Run a SQL query
dataspoc-lens query "SELECT * FROM orders LIMIT 10"

# Launch the interactive shell
dataspoc-lens shell

# Ask a question in plain English
dataspoc-lens ask "how many orders were placed yesterday?"

Features Overview

Shell

An interactive SQL REPL powered by prompt_toolkit with syntax highlighting, autocomplete for table and column names, and dot commands (.tables, .schema, .export, .help).

$ dataspoc-lens shell
lens> SELECT customer_id, count(*) FROM orders GROUP BY 1 ORDER BY 2 DESC LIMIT 5;
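The split between dot commands and SQL can be sketched as a small dispatch loop. This is an illustrative stand-in, not Lens's actual internals (the real shell is built on prompt_toolkit); the function and handler names here are hypothetical.

```python
# Minimal sketch of a REPL dispatcher: lines starting with "." are
# routed to dot-command handlers, everything else is treated as SQL.
def dispatch(line, commands, run_sql):
    line = line.strip()
    if line.startswith("."):
        name, *args = line[1:].split()
        handler = commands.get(name)
        if handler is None:
            return f"unknown command: .{name}"
        return handler(*args)
    return run_sql(line)

# Hypothetical handlers; a real shell would query DuckDB's catalog.
commands = {
    "tables": lambda: "orders, customers",
    "help": lambda: ".tables .schema .export .help",
}
print(dispatch(".tables", commands, lambda q: f"ran: {q}"))   # orders, customers
print(dispatch("SELECT 1", commands, lambda q: f"ran: {q}"))  # ran: SELECT 1
```

The same table drives autocomplete: the completer only needs the keys of `commands` plus the catalog's table and column names.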

Notebook

Launch JupyterLab or Marimo with all discovered tables pre-mounted as DuckDB views -- ready to query from the first cell.

dataspoc-lens notebook

AI Ask

Turn plain English into SQL. Lens sends your schema and a data sample to the LLM, gets back SQL, and executes it.

# Default: Ollama (local, no API key needed)
dataspoc-lens ask "top 10 customers by revenue this month"

# Or use a cloud provider
export DATASPOC_LLM_PROVIDER=anthropic
export DATASPOC_LLM_API_KEY=sk-...
dataspoc-lens ask "which products have declining sales?"
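Conceptually, the ask flow assembles a prompt from the schema and a few sample rows before calling the provider. A rough stdlib-only sketch, with an invented prompt format (Lens's real prompt and internals are not shown here):

```python
# Illustrative prompt assembly for natural-language-to-SQL.
# The wording and layout are assumptions, not Lens's actual prompt.
def build_prompt(question, schema, sample_rows):
    lines = ["You translate questions into DuckDB SQL.", "Schema:"]
    for table, cols in schema.items():
        lines.append(f"  {table}({', '.join(cols)})")
    lines.append("Sample rows: " + repr(sample_rows[:3]))
    lines.append(f"Question: {question}")
    lines.append("Answer with SQL only.")
    return "\n".join(lines)

prompt = build_prompt(
    "how many orders were placed yesterday?",
    {"orders": ["order_id", "customer_id", "placed_at"]},
    [(1, 42, "2024-05-01")],
)
print(prompt)
```

Sending the schema and a small sample (rather than the data itself) keeps the request cheap and limits what leaves your machine when a cloud provider is configured.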

Cache

Download tables to ~/.dataspoc-lens/cache/ for offline work. Cache freshness is tracked automatically and views switch between remote and local transparently.

# Inside the shell
lens> .cache orders
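One common way to track cache freshness is by file modification time; this sketch shows the idea under that assumption (Lens's actual freshness logic is not documented here, and `is_fresh` is a hypothetical name):

```python
import os
import tempfile
import time

# A cached file counts as fresh if it is younger than max_age_s.
def is_fresh(path, max_age_s):
    if not os.path.exists(path):
        return False
    return (time.time() - os.path.getmtime(path)) < max_age_s

with tempfile.NamedTemporaryFile(delete=False) as f:
    cached = f.name

print(is_fresh(cached, 3600))  # just written -> True
os.utime(cached, (0, 0))       # backdate mtime to the epoch
print(is_fresh(cached, 3600))  # far too old -> False
os.unlink(cached)
```

A check like this lets a view transparently fall back to the remote Parquet when the local copy goes stale.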

Transforms

Numbered .sql files in ~/.dataspoc-lens/transforms/ that run in order, writing results to the /curated/ prefix.

transforms/
  001_clean_users.sql
  002_aggregate_orders.sql
  003_build_summary.sql

dataspoc-lens transform run
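The ordering convention can be sketched with a numeric sort on the filename prefix. This is a stdlib illustration of the idea, not Lens's implementation (the `ordered_transforms` helper is hypothetical); note that parsing the prefix as an integer keeps `010_` after `002_` even without consistent zero-padding:

```python
import re
import tempfile
from pathlib import Path

# Discover numbered .sql files and return them in run order.
def ordered_transforms(folder):
    files = [p for p in Path(folder).glob("*.sql") if re.match(r"\d+_", p.name)]
    return sorted(files, key=lambda p: int(p.name.split("_")[0]))

with tempfile.TemporaryDirectory() as d:
    for name in ["002_aggregate_orders.sql", "010_build_summary.sql",
                 "001_clean_users.sql"]:
        (Path(d) / name).write_text("-- sql here")
    print([p.name for p in ordered_transforms(d)])
    # ['001_clean_users.sql', '002_aggregate_orders.sql', '010_build_summary.sql']
```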

Export

Add --export to any query or ask command to save results as CSV, JSON, or Parquet.

dataspoc-lens query "SELECT * FROM orders" --export csv -o orders.csv
dataspoc-lens ask "monthly revenue" --export parquet -o revenue.parquet
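A format flag like `--export` typically maps to per-format writers. A minimal stdlib sketch of that dispatch for CSV and JSON (the `export_rows` name is illustrative; Lens's Parquet path would go through DuckDB rather than the standard library):

```python
import csv
import io
import json

# Serialize query results (rows + column names) in the requested format.
def export_rows(rows, columns, fmt):
    if fmt == "csv":
        buf = io.StringIO()
        writer = csv.writer(buf)
        writer.writerow(columns)
        writer.writerows(rows)
        return buf.getvalue()
    if fmt == "json":
        return json.dumps([dict(zip(columns, r)) for r in rows])
    raise ValueError(f"unsupported format: {fmt}")

print(export_rows([(1, "a")], ["id", "name"], "json"))
# [{"id": 1, "name": "a"}]
```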

Access Control

DataSpoc delegates all access control to your cloud provider's IAM. The recommended pattern is one bucket per permission level:

Bucket                   Audience
s3://company-public      All employees
s3://company-finance     Finance team
s3://company-executive   C-level only

Each user's cloud credentials determine which buckets they can see. If a user lacks IAM permission, add-bucket fails with "Access Denied" and no data is exposed. Lens needs only read access.

Part of the DataSpoc Platform

Layer           Role
DataSpoc Pipe   Ingest data from APIs into cloud storage as Parquet
DataSpoc Lens   Query cloud Parquet with SQL, shell, notebooks, and AI
DataSpoc ML     Machine learning on your data lake (commercial)

Pipe writes. Lens reads. ML learns.

License

Apache-2.0

