
s3explore

Query S3 files with SQL — no database, no pipeline, no infrastructure.

s3explore wraps chDB (ClickHouse's embedded Python engine) and boto3 to let you run SQL directly against files sitting in S3. Drop s3explore.py next to your notebook or run it from the terminal.

Works as a CLI and as a Jupyter notebook library, and produces structured JSON output for piping into LLMs like Claude Code.


Prerequisites

  • Python 3.9+
  • AWS CLI with SSO configured (aws configure sso) — or any credentials boto3 can resolve
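
Anything boto3's default credential chain can resolve (environment variables, shared config, SSO, instance roles) will work. A quick sanity check that your profile resolves, using plain boto3 (the profile name here is a placeholder):

import boto3

# Verify that boto3 can resolve credentials for the profile s3explore will use.
session = boto3.Session(profile_name="my-profile")
creds = session.get_credentials()
print("credentials resolved:", creds is not None)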

Installation

pip install s3explore

Or directly from GitHub:

pip install git+https://github.com/PatrickRoyMac/s3_data_explorer.git

Try it now — no AWS account needed

Query a public dataset (Amazon product reviews, ~150M rows of Parquet on S3) straight away:

# Schema — what columns are in these files?
s3explore schema "s3://datasets-documentation/amazon_reviews/*.parquet"

# Count rows per file
s3explore count "s3://datasets-documentation/amazon_reviews/*.parquet"

# Sample 5 rows
s3explore sample --rows 5 "s3://datasets-documentation/amazon_reviews/*.parquet"

# Run your own SQL
s3explore query "s3://datasets-documentation/amazon_reviews/*.parquet" \
  --sql "SELECT product_category, avg(star_rating) AS avg_stars, count() AS reviews
         FROM {table}
         GROUP BY product_category
         ORDER BY reviews DESC
         LIMIT 10"

No --profile flag needed — public buckets are accessed anonymously.


Quickstart

# See what's in a bucket
s3explore --profile my-profile ls s3://my-bucket/events/year=2025/

# Understand the schema
s3explore --profile my-profile schema s3://my-bucket/events/year=2025/*.parquet

# Count rows across files
s3explore --profile my-profile count s3://my-bucket/events/year=2025/*.parquet

# Sample 10 rows
s3explore --profile my-profile sample s3://my-bucket/events/year=2025/*.parquet

# Run your own SQL
s3explore --profile my-profile query s3://my-bucket/events/year=2025/*.parquet \
  --sql "SELECT event_type, count() AS n FROM {table} GROUP BY event_type ORDER BY n DESC"

Commands

s3explore [--profile PROFILE] [--format table|json|csv] COMMAND S3_PATH [OPTIONS]
Command   What it does                                Key options
ls        List files at an S3 prefix (boto3)
schema    Show column names and types                 --fmt
sample    Show N sample rows                          --rows N, --fmt
count     Count rows per file                         --fmt
query     Run custom SQL (use {table} placeholder)    --sql, --fmt

Output formats

Flag            Output                  Use for
--format table  Pretty table            Human reading (default)
--format json   One JSON object/line    LLMs, pipes, scripts
--format csv    CSV with headers        Export, downstream tooling

Notebook usage

Open notebook_template.ipynb, fill in the config cell, and run all cells, or call the library directly:

import s3explore

creds = s3explore.get_credentials(profile="my-profile")

# Schema
print(s3explore.get_schema("s3://my-bucket/data/*.parquet", creds))

# Sample rows
print(s3explore.sample_rows("s3://my-bucket/data/*.parquet", creds, n=10))

# Custom query
print(s3explore.run_user_query(
    "SELECT event_type, count() AS n FROM {table} GROUP BY event_type",
    "s3://my-bucket/data/*.parquet",
    creds,
))

The {table} placeholder in your SQL is replaced with the full s3(...) call at runtime — you never need to handle credentials in your SQL strings.
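
Under the hood this is ordinary string substitution. A minimal sketch of the idea, with illustrative names and placeholder credentials (the real helper lives inside s3explore and may differ):

# Illustrative sketch of the {table} substitution; not s3explore's actual code.
def render_sql(user_sql: str, s3_url: str, access_key: str, secret_key: str,
               fmt: str = "Parquet") -> str:
    """Replace {table} with a ClickHouse s3() table function call."""
    table = f"s3('{s3_url}', '{access_key}', '{secret_key}', '{fmt}')"
    return user_sql.replace("{table}", table)

print(render_sql("SELECT count() FROM {table}",
                 "https://my-bucket.s3.amazonaws.com/data/*.parquet",
                 "AKIA...", "SECRET..."))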


Supported file formats

Auto-detected from the file extension in the S3 path:

Extension        Format
.parquet         Parquet
.json / .jsonl   JSONEachRow
.json.gz         JSONEachRow (auto-decompressed)
.csv             CSVWithNames
.tsv             TabSeparatedWithNames
.gz (bare)       JSONEachRow (best-effort)

Override with --fmt:

s3explore schema s3://bucket/data/*.gz --fmt JSONEachRow
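
If you need the same mapping in your own code, the table above reduces to a few suffix checks. A rough sketch (s3explore's internal detection may differ):

# Rough sketch of extension-based format detection; s3explore's internals may differ.
def detect_format(path: str) -> str:
    p = path.lower()
    if p.endswith(".parquet"):
        return "Parquet"
    if p.endswith((".json", ".jsonl", ".json.gz")):
        return "JSONEachRow"
    if p.endswith(".csv"):
        return "CSVWithNames"
    if p.endswith(".tsv"):
        return "TabSeparatedWithNames"
    if p.endswith(".gz"):  # bare .gz: best-effort guess
        return "JSONEachRow"
    raise ValueError(f"cannot detect format for {path}; pass --fmt explicitly")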

LLM / Claude Code usage

s3explore is designed to be consumed by command-line LLM agents. Use --format json to get structured output:

# Let Claude Code explore your data
s3explore --profile my-profile --format json schema s3://bucket/data/*.parquet
s3explore --profile my-profile --format json sample s3://bucket/data/*.parquet
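
The same output is easy to consume from a script. A small sketch, assuming one JSON object per line as described under "Output formats":

import json
import subprocess

# Run s3explore and parse its JSON-lines output (one object per line).
proc = subprocess.run(
    ["s3explore", "--profile", "my-profile", "--format", "json",
     "schema", "s3://bucket/data/*.parquet"],
    capture_output=True, text=True, check=True,
)
for line in proc.stdout.splitlines():
    if line.strip():
        print(json.loads(line))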

See CLAUDE.md for a full tool description including the recommended exploration workflow.


Troubleshooting

Credentials expired (SSO)

aws sso login --profile my-profile

Format not detected
Add --fmt with the explicit format name: Parquet, JSONEachRow, CSVWithNames.

Bare .gz files (e.g. Kinesis Firehose output)
These carry no inner extension hint, so s3explore defaults to JSONEachRow with a warning. Override with --fmt JSONEachRow.


How it works

  1. boto3 resolves AWS SSO credentials from your named profile
  2. boto3 lists files via list_objects_v2 for the ls command
  3. chDB builds and executes a SELECT ... FROM s3('path', creds, 'Format') query in-process — no network call to any database, no cluster, no cost beyond S3 GET requests
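
Conceptually, step 3 is equivalent to calling chdb directly with ClickHouse's s3() table function. A hand-written sketch with a placeholder URL and credentials:

import chdb

# What s3explore assembles for you, written out by hand (illustrative values).
result = chdb.query(
    "SELECT count() "
    "FROM s3('https://my-bucket.s3.amazonaws.com/data/*.parquet', "
    "'AKIA...', 'SECRET...', 'Parquet')",
    "JSON",  # chdb output format
)
print(result)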

Dependencies

chdb>=2.0.2      # ClickHouse embedded engine
boto3>=1.34.0    # AWS credential resolution + S3 listing
click>=8.1.0     # CLI
pandas>=2.0.0    # CSV export in the notebook template
