owlbear

Feathers and claws for your data lake.
Owlbear is a Python client that bridges Athena and Trino to Polars DataFrames via PyArrow. A wise chimera — part Owl (Athena, goddess of wisdom), part Bear (Polars, the bear constellation). Query your data lake with SQL, get back fast, typed DataFrames — no serialization or ODBC overhead.
Features
- Two backends: AthenaClient (AWS Athena via boto3) and TrinoClient (direct Trino connection)
- Shared Presto-family type conversion — both backends produce identically typed Polars DataFrames
- Parameterized queries for safe value binding
- Athena result reuse for repeated queries
- Pagination support for large result sets (Athena) and row limits (both)
- Query cancellation and execution monitoring (Athena)
- Built-in retry logic with exponential backoff (Athena)
- MCP server for AI assistant integration (schema discovery + query execution)
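The retry bullet above can be pictured as a plain exponential-backoff loop. The sketch below is illustrative and self-contained; the function name, delay constants, and exception filter are ours, not owlbear's internals:

```python
import time

def retry_with_backoff(fn, max_attempts=4, base_delay=0.5, retryable=(Exception,)):
    """Call fn(), retrying transient failures with exponential backoff.

    The delay doubles each attempt: base_delay, 2*base_delay, 4*base_delay, ...
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except retryable:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            time.sleep(base_delay * (2 ** attempt))

# Example: a flaky operation that succeeds on the third call
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("throttled")
    return "ok"

print(retry_with_backoff(flaky, base_delay=0.01))  # → ok
```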
Installation
```bash
# With Athena backend
pip install "owlbear[athena]"

# With Trino backend
pip install "owlbear[trino]"

# Both backends
pip install "owlbear[all]"

# MCP server (includes both backends)
pip install "owlbear[mcp]"
```
For Development
```bash
git clone https://github.com/jdonaldson/owlbear.git
cd owlbear
pip install -e ".[dev]"
```
Prerequisites
- Python 3.10+
- Athena: AWS credentials configured (via AWS CLI, environment variables, or IAM roles) and an S3 bucket for query results
- Trino: A running Trino cluster with network access
Quick Start
Athena
```python
from owlbear import AthenaClient

client = AthenaClient(
    database="my_database",
    output_location="s3://my-bucket/athena-results/",
    region="us-east-1",
)

execution_id = client.query("SELECT * FROM orders LIMIT 5")
df = client.results(execution_id)
print(df)
```

```
shape: (5, 4)
┌─────────────┬────────────┬──────────────┬────────────┐
│ customer_id ┆ order_date ┆ order_amount ┆ status     │
│ ---         ┆ ---        ┆ ---          ┆ ---        │
│ i64         ┆ date       ┆ f64          ┆ str        │
╞═════════════╪════════════╪══════════════╪════════════╡
│ 1001        ┆ 2024-03-15 ┆ 249.99       ┆ shipped    │
│ 1002        ┆ 2024-03-15 ┆ 89.50        ┆ delivered  │
│ 1003        ┆ 2024-03-16 ┆ 1024.00      ┆ processing │
│ 1001        ┆ 2024-03-17 ┆ 54.25        ┆ shipped    │
│ 1004        ┆ 2024-03-17 ┆ 399.99       ┆ delivered  │
└─────────────┴────────────┴──────────────┴────────────┘
```
Trino
```python
from owlbear import TrinoClient

client = TrinoClient(
    host="trino.example.com",
    port=443,
    user="analyst",
    catalog="hive",
    schema="default",
)

df = client.query("SELECT * FROM orders LIMIT 5")
print(df)
```

```
shape: (5, 4)
┌─────────────┬────────────┬──────────────┬────────────┐
│ customer_id ┆ order_date ┆ order_amount ┆ status     │
│ ---         ┆ ---        ┆ ---          ┆ ---        │
│ i64         ┆ date       ┆ f64          ┆ str        │
╞═════════════╪════════════╪══════════════╪════════════╡
│ 1001        ┆ 2024-03-15 ┆ 249.99       ┆ shipped    │
│ 1002        ┆ 2024-03-15 ┆ 89.50        ┆ delivered  │
│ 1003        ┆ 2024-03-16 ┆ 1024.00      ┆ processing │
│ 1001        ┆ 2024-03-17 ┆ 54.25       ┆ shipped    │
│ 1004        ┆ 2024-03-17 ┆ 399.99       ┆ delivered  │
└─────────────┴────────────┴──────────────┴────────────┘
```
Usage Examples
Parameterized Queries
```python
# Athena — parameters are passed as strings
execution_id = client.query(
    "SELECT * FROM orders WHERE customer_id = ?",
    parameters=["1001"],
)
df = client.results(execution_id)

# Trino — parameters are passed as native Python values
df = trino_client.query(
    "SELECT * FROM orders WHERE customer_id = ?",
    parameters=[1001],
)
```
Result Reuse (Athena)
```python
# Re-use cached results for up to 60 minutes
execution_id = client.query(
    "SELECT COUNT(*) FROM orders",
    result_reuse_max_age=60,
)
```
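In boto3 terms, result reuse corresponds to the ResultReuseConfiguration block of Athena's start_query_execution call. The helper below is a hedged sketch of how such a request might be assembled; build_start_query_request is a hypothetical name for illustration, not owlbear's API:

```python
def build_start_query_request(sql, database, output_location, reuse_max_age_minutes=None):
    """Assemble kwargs for boto3's athena start_query_execution (illustrative)."""
    request = {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }
    if reuse_max_age_minutes is not None:
        # Athena serves a cached result if one exists within this age window
        request["ResultReuseConfiguration"] = {
            "ResultReuseByAgeConfiguration": {
                "Enabled": True,
                "MaxAgeInMinutes": reuse_max_age_minutes,
            }
        }
    return request

req = build_start_query_request(
    "SELECT COUNT(*) FROM orders",
    "my_database",
    "s3://my-bucket/athena-results/",
    reuse_max_age_minutes=60,
)
print(req["ResultReuseConfiguration"])
```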
Asynchronous Query Execution
```python
# Start the query without waiting
execution_id = client.query(
    "SELECT * FROM large_table",
    wait_for_completion=False,
)

# Check query status
query_info = client.get_query_info(execution_id)
print(f"Query status: {query_info['Status']['State']}")

# Wait for completion and get results when ready
client._wait_for_completion(execution_id)
df = client.results(execution_id)
```
Using Work Groups
```python
execution_id = client.query(
    query="SELECT COUNT(*) FROM my_table",
    work_group="my-workgroup",
)
df = client.results(execution_id)
```
Using with Existing boto3 Session
```python
import boto3
from owlbear import AthenaClient

session = boto3.Session(profile_name="my-profile")
client = AthenaClient.from_session(
    session=session,
    database="my_db",
    output_location="s3://my-bucket/results/",
)
```
Query Management
```python
# List available work groups
work_groups = client.list_work_groups()

# Cancel a running query
client.cancel_query(execution_id)

# Get detailed query information
query_info = client.get_query_info(execution_id)
print(f"Execution time: {query_info['Statistics']['TotalExecutionTimeInMillis']}ms")
print(f"Data processed: {query_info['Statistics']['DataProcessedInBytes']} bytes")
```
MCP Server
Owlbear includes an MCP server so AI assistants can query your data lake directly.
Install
```bash
pip install "owlbear[mcp]"
```
Tools
| Tool | Description |
|---|---|
| `execute_query(sql, max_rows=500)` | Run arbitrary SQL, return JSON rows (`max_rows` capped at 10,000) |
| `explain_query(sql)` | Run EXPLAIN on a query and return the query plan |
| `list_databases(limit=0, offset=0)` | List all available databases (paginated) |
| `list_tables(database?, limit=0, offset=0)` | List tables in a database (paginated, defaults to the configured database) |
| `search_tables(pattern, database?, limit=0, offset=0)` | Search for tables matching a SQL LIKE pattern (`%` wildcard) |
| `describe_table(table)` | Show columns and types for a table |
| `get_schema_context(tables)` | Batch-describe multiple comma-separated tables at once |
| `profile_table(table, sample_size=100, stats_sample_pct=0)` | One-call profiling: schema, row count, column stats, sample rows. Use `stats_sample_pct` (1-100) to sample stats on large tables |
| `generate_snippet(table, operation)` | Generate backend-aware owlbear + Polars Python code (load, filter, aggregate, join) |
| `show_partitions(table, limit=0, offset=0)` | Show partition keys and values for a partitioned table |
Prompts
| Prompt | Description |
|---|---|
| `explore_table(table)` | Guided exploration: describe schema, sample rows, suggest queries, summarize |
| `build_pipeline(table, goal)` | Generate a data pipeline with owlbear boilerplate for a stated goal |
Resource
| URI | Description |
|---|---|
| `owlbear://config` | Current backend configuration (type, database, region/host) |
Example: Profile a Table
Ask your AI assistant:
Profile the orders table

The assistant calls `profile_table("orders")` and gets back:
```json
{
  "table": "orders",
  "row_count": 156398,
  "columns": [
    {"column": "order_id", "type": "bigint", "null_count": 0, "distinct_count": 156398, "min": "1", "max": "156398"},
    {"column": "customer_id", "type": "bigint", "null_count": 0, "distinct_count": 8423, "min": "1001", "max": "9999"},
    {"column": "order_date", "type": "date", "null_count": 0, "distinct_count": 731, "min": "2023-01-01", "max": "2024-12-31"},
    {"column": "amount", "type": "double", "null_count": 12, "distinct_count": 45210, "min": "0.99", "max": "9999.99"},
    {"column": "status", "type": "varchar", "null_count": 0, "distinct_count": 4, "min": "cancelled", "max": "shipped"},
    {"column": "items", "type": "array<varchar>", "null_count": 85}
  ],
  "sample_rows": [{"order_id": 1, "customer_id": 1001, "order_date": "2023-01-01", "amount": 249.99, "status": "shipped", "items": ["widget-a", "widget-b"]}],
  "stats_sampled": false,
  "stats_sample_pct": 0
}
```
Scalar columns (numbers, strings, dates) get distinct_count, min, and max. Complex columns (arrays, maps, structs) get only null_count.
For large tables, use stats_sample_pct to avoid full table scans on the stats query:
Profile the orders table with 10% sampling
The assistant calls profile_table("orders", stats_sample_pct=10) — the stats query uses TABLESAMPLE BERNOULLI(10) while row count remains exact via COUNT(*).
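The sampled stats query can be pictured as a simple SQL rewrite. The helper below is a hypothetical sketch for illustration, not the server's actual code:

```python
def build_stats_query(table, column, sample_pct=0):
    """Build a column-stats query, optionally sampling via TABLESAMPLE BERNOULLI."""
    source = table
    if sample_pct:  # 1-100: sample rows instead of scanning the full table
        source = f"{table} TABLESAMPLE BERNOULLI({sample_pct})"
    return (
        f"SELECT COUNT(DISTINCT {column}) AS distinct_count, "
        f"MIN({column}) AS min, MAX({column}) AS max FROM {source}"
    )

print(build_stats_query("orders", "amount", sample_pct=10))
```

The exact row count stays accurate because COUNT(*) runs against the unsampled table.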
Example: Batch Schema Lookup
What columns do the orders and customers tables have?
The assistant calls get_schema_context("mydb.orders, mydb.customers") and gets schemas for both tables in one response — with per-table error isolation if one fails.
Example: Generate Starter Code
Generate an aggregation snippet for the orders table
The assistant calls generate_snippet("orders", "aggregate") and returns backend-aware Python code using real column names. With OWLBEAR_BACKEND=athena:
```python
from owlbear import AthenaClient
import polars as pl

client = AthenaClient(
    database="your_database",
    output_location="s3://your-bucket/results/",
)

eid = client.query("SELECT * FROM orders LIMIT 10000")
df = client.results(eid)

result = df.group_by("status").agg(pl.col("amount").mean())
print(result.head())
```
With OWLBEAR_BACKEND=trino, the snippet uses TrinoClient and the simpler client.query() → DataFrame API (no execution_id / client.results()).
Example: Search for Tables
Find all tables with "order" in the name
The assistant calls search_tables("%order%") and gets back a paginated list of matching tables — useful for discovering tables in large data lakes without knowing exact names.
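Client-side, the same matching can be mirrored by translating a SQL LIKE pattern into a regular expression. like_to_regex below is a hypothetical helper for illustration, not part of owlbear:

```python
import re

def like_to_regex(pattern):
    """Translate a SQL LIKE pattern (% = any run, _ = any one char) to a regex."""
    out = []
    for ch in pattern:
        if ch == "%":
            out.append(".*")
        elif ch == "_":
            out.append(".")
        else:
            out.append(re.escape(ch))
    return re.compile("^" + "".join(out) + "$", re.IGNORECASE)

tables = ["orders", "order_items", "customers", "reorder_log"]
rx = like_to_regex("%order%")
print([t for t in tables if rx.match(t)])  # → ['orders', 'order_items', 'reorder_log']
```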
Example: Show Partitions
What partitions does the events table have?
The assistant calls show_partitions("analytics.events") and gets back partition key/value pairs. On Athena this runs SHOW PARTITIONS, on Trino it queries the $partitions metadata table. Knowing the partition structure helps write cost-efficient queries that avoid full table scans.
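The backend dispatch can be sketched as straightforward query construction. partitions_query is a hypothetical helper for illustration only; note that in real Trino usage the `$partitions` suffix is quoted together with the table name inside its schema:

```python
def partitions_query(table, backend):
    """Build the partition-listing statement for each backend (illustrative)."""
    if backend == "athena":
        return f"SHOW PARTITIONS {table}"  # Athena uses Hive-style syntax
    elif backend == "trino":
        # Trino exposes partitions via a hidden metadata table
        return f'SELECT * FROM "{table}$partitions"'
    raise ValueError(f"unknown backend: {backend}")

print(partitions_query("events", "athena"))
print(partitions_query("events", "trino"))
```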
Environment Variables
| Variable | Backend | Description | Default |
|---|---|---|---|
| `OWLBEAR_BACKEND` | both | `athena` or `trino` | `athena` |
| `OWLBEAR_DATABASE` | both | Default database/schema | `default` |
| `OWLBEAR_S3_OUTPUT_LOCATION` | athena | S3 path for query results | (required) |
| `AWS_REGION` | athena | AWS region | `us-east-1` |
| `AWS_PROFILE` | athena | AWS profile (via boto3) | — |
| `OWLBEAR_TRINO_HOST` | trino | Trino hostname | (required) |
| `OWLBEAR_TRINO_PORT` | trino | Trino port | `443` |
| `OWLBEAR_TRINO_USER` | trino | Trino user | — |
| `OWLBEAR_TRINO_CATALOG` | trino | Trino catalog | — |
Example .mcp.json (Athena)
```json
{
  "mcpServers": {
    "owlbear": {
      "command": "owlbear-mcp",
      "env": {
        "OWLBEAR_DATABASE": "my_database",
        "OWLBEAR_S3_OUTPUT_LOCATION": "s3://my-bucket/athena-results/",
        "AWS_REGION": "us-east-1",
        "AWS_PROFILE": "my-profile"
      }
    }
  }
}
```
Example .mcp.json (Trino)
```json
{
  "mcpServers": {
    "owlbear": {
      "command": "owlbear-mcp",
      "env": {
        "OWLBEAR_BACKEND": "trino",
        "OWLBEAR_TRINO_HOST": "trino.example.com",
        "OWLBEAR_TRINO_PORT": "443",
        "OWLBEAR_TRINO_USER": "analyst",
        "OWLBEAR_TRINO_CATALOG": "hive",
        "OWLBEAR_DATABASE": "default"
      }
    }
  }
}
```
Type Mapping
Owlbear automatically converts Presto/Trino/Athena SQL types to PyArrow (and then to Polars):
| SQL Type | PyArrow Type |
|---|---|
| `boolean` | `bool_()` |
| `tinyint` | `int8()` |
| `smallint` | `int16()` |
| `integer` | `int32()` |
| `bigint` | `int64()` |
| `real` / `float` | `float32()` |
| `double` | `float64()` |
| `decimal(p,s)` | `decimal128(p, s)` |
| `varchar` / `char` / `string` | `string()` |
| `varbinary` / `binary` | `binary()` |
| `date` | `date32()` |
| `timestamp` | `timestamp("us")` |
| `timestamp with time zone` | `timestamp("us", tz="UTC")` |
| `time` | `time64("us")` |
| `interval day to second` | `duration("us")` |
| `interval year to month` | `month_day_nano_interval()` |
| `array<T>` | `list_(T)` |
| `map<K,V>` | `map_(K, V)` |
Nested types like array<array<integer>> and map<varchar,array<bigint>> are fully supported.
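Supporting arbitrary nesting implies a small recursive parser over the type grammar. The standalone sketch below (not owlbear's actual parser) splits a type string into (name, parameters) tuples; parenthesized types like `decimal(10,2)` are left opaque for brevity:

```python
def parse_type(s):
    """Parse a Presto/Trino type string into nested (name, [params]) tuples.

    'array<array<integer>>' -> ('array', [('array', [('integer', [])])])
    """
    s = s.strip()
    if "<" not in s:
        return (s, [])  # scalar type: no parameters
    name, inner = s.split("<", 1)
    inner = inner.rsplit(">", 1)[0]  # drop the matching closing bracket
    params, depth, start = [], 0, 0
    for i, ch in enumerate(inner):  # split on top-level commas only
        if ch == "<":
            depth += 1
        elif ch == ">":
            depth -= 1
        elif ch == "," and depth == 0:
            params.append(parse_type(inner[start:i]))
            start = i + 1
    params.append(parse_type(inner[start:]))
    return (name, params)

print(parse_type("map<varchar,array<bigint>>"))
# → ('map', [('varchar', []), ('array', [('bigint', [])])])
```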
Configuration
Environment Variables
```bash
export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key
export AWS_DEFAULT_REGION=us-east-1
```
IAM Permissions
Your AWS credentials need the following permissions:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "athena:StartQueryExecution",
        "athena:GetQueryExecution",
        "athena:GetQueryResults",
        "athena:StopQueryExecution",
        "athena:ListWorkGroups"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::your-athena-results-bucket/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "glue:GetDatabase",
        "glue:GetTable",
        "glue:GetPartitions"
      ],
      "Resource": "*"
    }
  ]
}
```
Testing
```bash
pytest tests/ -v
```
Development
```bash
git clone https://github.com/jdonaldson/owlbear.git
cd owlbear
pip install -e ".[dev]"

black . # format
ruff check . # lint
mypy src/ # type check
```
License
MIT License - see LICENSE file for details.
Contributing
- Fork the repository on GitHub
- Create a feature branch
- Make your changes with tests
- Ensure all tests pass and code is formatted
- Submit a pull request
Project details
Download files
Source Distribution
Built Distribution
File details
Details for the file owlbear-0.6.0.tar.gz.
File metadata
- Download URL: owlbear-0.6.0.tar.gz
- Upload date:
- Size: 1.9 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `40092ab168a263c0185fb2ea7a36b7dd4cb5605a0960218a8d2b4a085748ac8a` |
| MD5 | `b278f4aeb432346da6acdf13f1a53753` |
| BLAKE2b-256 | `05e9c44af923d8c4cac6b445e455816884bc00c1c60120fc8c8e102672e280f6` |
File details
Details for the file owlbear-0.6.0-py3-none-any.whl.
File metadata
- Download URL: owlbear-0.6.0-py3-none-any.whl
- Upload date:
- Size: 19.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `b35a6f7503f76bd4896e394b4bfbf8d9c02bd3369668462b1e977aa034c11c36` |
| MD5 | `207845f649ee4f48f7610afda0c4f80d` |
| BLAKE2b-256 | `9b3b596b3fd9083cc547f3468d57e39b9e5ef6ecb5e6dcdb53d6e7bb18aea407` |