s3explore

Query S3 files with SQL — no database, no pipeline, no infrastructure.
s3explore wraps chDB (ClickHouse's embedded Python engine) and boto3 to let you run SQL directly against files sitting in S3. Drop s3explore.py next to your notebook or run it from the terminal.
It works as a CLI and as a Jupyter notebook library, and it produces structured JSON output for piping into LLMs like Claude Code.
Prerequisites
- Python 3.9+
- AWS CLI with SSO configured (aws configure sso) — or any credentials boto3 can resolve
Installation
pip install s3explore
Or directly from GitHub:
pip install git+https://github.com/PatrickRoyMac/s3_data_explorer.git
Try it now — no AWS account needed
Query a public dataset (Amazon product reviews, ~150M rows of Parquet on S3) straight away:
# Schema — what columns are in these files?
s3explore schema "s3://datasets-documentation/amazon_reviews/*.parquet"
# Count rows per file
s3explore count "s3://datasets-documentation/amazon_reviews/*.parquet"
# Sample 5 rows
s3explore sample --rows 5 "s3://datasets-documentation/amazon_reviews/*.parquet"
# Run your own SQL
s3explore query "s3://datasets-documentation/amazon_reviews/*.parquet" \
--sql "SELECT product_category, avg(star_rating) AS avg_stars, count() AS reviews
FROM {table}
GROUP BY product_category
ORDER BY reviews DESC
LIMIT 10"
No --profile flag needed — public buckets are accessed anonymously.
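Anonymous access presumably maps to ClickHouse's NOSIGN keyword in the s3() table function. A minimal sketch of the same kind of query run through chdb directly (the https URL and its region are assumptions written out for illustration; s3explore derives all of this from the s3:// path for you):

```python
# A sketch of an anonymous public-bucket query via chdb, not s3explore
# internals. The bucket URL and region are assumptions for illustration.
import chdb

print(chdb.query(
    "SELECT count() FROM s3("
    "'https://datasets-documentation.s3.eu-west-3.amazonaws.com/amazon_reviews/*.parquet', "
    "NOSIGN, 'Parquet')",
    "Pretty",
))
```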
Quickstart
# See what's in a bucket
s3explore --profile my-profile ls s3://my-bucket/events/year=2025/
# Understand the schema
s3explore --profile my-profile schema s3://my-bucket/events/year=2025/*.parquet
# Count rows across files
s3explore --profile my-profile count s3://my-bucket/events/year=2025/*.parquet
# Sample 10 rows
s3explore --profile my-profile sample s3://my-bucket/events/year=2025/*.parquet
# Run your own SQL
s3explore --profile my-profile query s3://my-bucket/events/year=2025/*.parquet \
--sql "SELECT event_type, count() AS n FROM {table} GROUP BY event_type ORDER BY n DESC"
Commands
s3explore [--profile PROFILE] [--format table|json|csv] COMMAND S3_PATH [OPTIONS]
| Command | What it does | Key options |
|---|---|---|
| ls | List files at an S3 prefix (boto3) | |
| schema | Show column names and types | --fmt |
| sample | Show N sample rows | --rows N, --fmt |
| count | Count rows per file | --fmt |
| query | Run custom SQL (use {table} placeholder) | --sql, --fmt |
Output formats
| Flag | Output | Use for |
|---|---|---|
| --format table | Pretty table | Human reading (default) |
| --format json | One JSON object per line | LLMs, pipes, scripts |
| --format csv | CSV with headers | Export, downstream tooling |
Notebook usage
Open notebook_template.ipynb, fill in the config cell, and run all cells. Or call the library directly:
import s3explore
creds = s3explore.get_credentials(profile="my-profile")
# Schema
print(s3explore.get_schema("s3://my-bucket/data/*.parquet", creds))
# Sample rows
print(s3explore.sample_rows("s3://my-bucket/data/*.parquet", creds, n=10))
# Custom query
print(s3explore.run_user_query(
    "SELECT event_type, count() AS n FROM {table} GROUP BY event_type",
    "s3://my-bucket/data/*.parquet",
    creds,
))
The {table} placeholder in your SQL is replaced with the full s3(...) call at runtime — you never need to handle credentials in your SQL strings.
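For illustration, the substitution looks roughly like this; the exact s3(...) expression s3explore builds is internal, but the shape below follows ClickHouse's s3 table function:

```python
# Illustrative only: roughly what {table} expands to at runtime.
# Credential values are placeholders that s3explore fills in from boto3.
sql = "SELECT event_type, count() AS n FROM {table} GROUP BY event_type"
table_expr = (
    "s3('https://my-bucket.s3.amazonaws.com/data/*.parquet', "
    "'<access_key_id>', '<secret_access_key>', 'Parquet')"
)
print(sql.format(table=table_expr))
```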
Supported file formats
Auto-detected from the file extension in the S3 path:
| Extension | Format |
|---|---|
| .parquet | Parquet |
| .json / .jsonl | JSONEachRow |
| .json.gz | JSONEachRow (auto-decompressed) |
| .csv | CSVWithNames |
| .tsv | TabSeparatedWithNames |
| .gz (bare) | JSONEachRow (best-effort) |
Override with --fmt:
s3explore schema s3://bucket/data/*.gz --fmt JSONEachRow
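The detection described above amounts to a suffix check along these lines (a hypothetical sketch of the mapping, not s3explore's actual code):

```python
# Hypothetical sketch of the extension-to-format mapping in the table above.
def detect_format(path: str) -> str:
    if path.endswith(".parquet"):
        return "Parquet"
    if path.endswith((".json", ".jsonl", ".json.gz")):
        return "JSONEachRow"  # .json.gz is matched before bare .gz
    if path.endswith(".csv"):
        return "CSVWithNames"
    if path.endswith(".tsv"):
        return "TabSeparatedWithNames"
    if path.endswith(".gz"):
        return "JSONEachRow"  # bare .gz: best-effort default, with a warning
    raise ValueError(f"cannot detect format for {path!r}; pass --fmt explicitly")
```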
LLM / Claude Code usage
s3explore is designed to be consumed by command-line LLM agents. Use --format json to get structured output:
# Let Claude Code explore your data
s3explore --profile my-profile --format json schema s3://bucket/data/*.parquet
s3explore --profile my-profile --format json sample s3://bucket/data/*.parquet
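Because the JSON output is one object per line, it is also easy to consume from a script. A sketch, assuming a non-zero exit code on failure:

```python
# Parse --format json output (one JSON object per line) from a script.
import json
import subprocess

proc = subprocess.run(
    ["s3explore", "--profile", "my-profile", "--format", "json",
     "schema", "s3://bucket/data/*.parquet"],
    capture_output=True, text=True, check=True,
)
columns = [json.loads(line) for line in proc.stdout.splitlines() if line.strip()]
print(columns)
```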
See CLAUDE.md for a full tool description including the recommended exploration workflow.
Troubleshooting
Credentials expired (SSO)
aws sso login --profile my-profile
Format not detected
Add --fmt with the explicit format name: Parquet, JSONEachRow, CSVWithNames.
Bare .gz files (e.g. Kinesis Firehose output)
These have no inner extension hint, so s3explore defaults to JSONEachRow and prints a warning. Silence the warning, or pick a different format, by passing --fmt explicitly (e.g. --fmt JSONEachRow).
How it works
- boto3 resolves AWS SSO credentials from your named profile
- boto3 lists files via list_objects_v2 for the ls command
- chDB builds and executes a SELECT ... FROM s3('path', creds, 'Format') query in-process — no network call to any database, no cluster, no cost beyond S3 GET requests (see the sketch below)
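Condensed into code, the flow looks roughly like this. A sketch with placeholder bucket names, not s3explore internals; an SSO session token would go in as a fourth s3() argument:

```python
# The two moving parts sketched by hand: boto3 for credentials and
# listing, chdb for in-process execution. Bucket/prefix are placeholders.
import boto3
import chdb

session = boto3.Session(profile_name="my-profile")
creds = session.get_credentials().get_frozen_credentials()

# 1. Listing, as the ls command does:
s3 = session.client("s3")
resp = s3.list_objects_v2(Bucket="my-bucket", Prefix="events/year=2025/")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])

# 2. Query execution, entirely inside the Python process:
sql = (
    "SELECT count() FROM s3("
    "'https://my-bucket.s3.amazonaws.com/events/year=2025/*.parquet', "
    f"'{creds.access_key}', '{creds.secret_key}', 'Parquet')"
)
print(chdb.query(sql, "Pretty"))
```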
Dependencies
chdb>=2.0.2 # ClickHouse embedded engine
boto3>=1.34.0 # AWS credential resolution + S3 listing
click>=8.1.0 # CLI
pandas>=2.0.0 # CSV export in the notebook template