Scuba on DuckDB

Project description

ScubaDuck

ScubaDuck is a reimplementation of the frontend query interface for Scuba (Meta's internal real-time database system) with DuckDB as the backing database implementation. This implementation misses most of the main value add of Scuba (a distributed, real-time database that supports fast queries), but I also think Scuba's UI for doing queries is great and I have found myself wishing that I have access to it even for "small" databases, e.g., I have a sqlite dataset I want to explore. ScubaDuck is this interface.

This application was entirely vibe coded using OpenAI Codex. It has no third-party JS dependencies; everything was coded from scratch.

Design Philosophy

Time series by default. In the dedicated "time series" view, there are many features specifically oriented towards working towards tables that represent events that occurred over time: the start, end, compare, aggregate and granularity fields all specially privilege the timestamp field. In fact, you can't log events to Scuba's backing data store without a timestamp, they always come with one. ScubaDuck is a little more flexible here (since you might be ingesting arbitrary database tables that weren't specifically designed for realtime databases), but it really shines when you have time series data. This is in contrast to typical software which tries to work with arbitrary data first, with time series being added on later.
It's all about exploration. Scuba is predicated on the idea that you don't know what you're looking for, that you are going to spend time tweaking queries and changing filters/grouping as part of an investigation to figure out why a system behaves the way it is. So the filters/comparisons/groupings you want to edit are always visible on the left sidebar, with the expectation that you're going to tweak the query to look at something else. Similarly, all the parameters of your query get saved into your URL, so your browser history can double up as a query history / you can easily share a query with someone else. This is contrast to typical software which is often oriented to making pretty dashboards and reports. (This function is important too, but it's not what I want in exploration mode!)
You can fix data problems in the query editor. It's pretty common to have messed up and ended up with a database that doesn't have exactly the columns you need, or some columns that are corrupted in some way. Scuba has pretty robust support for defining custom columns with arbitrary SQL functions, grouping over them as if they were native functions, and doing so with minimal runtime cost (Scuba aims to turn around your query in milliseconds!) Having to go and run a huge data pipeline to fix your data is a big impediment to exploration; quick and easy custom columns means you can patch over problems when you're investigating and fix them for real later.

Development

uv sync --frozen
SCUBADUCK_DB=/path/to/foo.sqlite flask --app scubaduck.server run --debug

DuckDB databases and Parquet files work too. Omit to get a simple test dataset, or SCUBADUCK_DB=TEST for a more complicated test dataset.

How to use it

If you don't have a dataset handy, NYC TLC Yellow Taxi Trip Records has convenient PARQUET files that are time series. You could for example run ScubaDuck on January 2025 Yellow Taxi Trip Records (which is where the screenshot above was taken from).

After you've loaded a database and navigate to the application, the home page will pop up the first table in your database and display a hundred samples from it. When you're working with a table for the first time, it's a good idea to look at some samples by hand and get a feel for what's stored in it. (Better is to have designed the table with ScubaDuck visualization in mind.)

Samples lets you look at entries one-by-one, but typically you'll have too many things to look at one by one. Table gives you simple aggregation capabilities: you can use Group by to get some high level aggregate statistics at various breakdowns.

The real juice of Scuba is time series visualization. Assuming your table has a time column, you can plot any numeric column on a graph. With group by, you can split into multiple series, aggregating over each value of the group by. You can easily drill down / drill back up, looking for patterns that only show up on certain splits.

Whenever you Dive, ScubaDuck will update your URL with all of the information for your query. If you save these URLs, you have a durable record of the query you made (assuming your ScubaDuck server is still running!)

Happy diving!

Roadmap

Scuba works best with denormalized tables that don't require JOINs. Scuba at Meta in fact does not support JOINs. DuckDB does support JOINs, so we need to come up with a good UI for exposing them.
There is some extra metadata you might want to persistently save about a database that will help ScubaDuck show it to you in the future. For example, ScubaDuck will always abbreviate integers with SI prefixes, but if an integer is actually an ID of some sort, this is counterproductive and you probably prefer to render the actual ID. Similarly, ScubaDuck tries to infer what the timestamp column is, but if it gets it wrong you'd like to durably save the correct choice. Unfortunately, ScubaDuck as it currently exists is entirely stateless (state lives only in the URL parameters you get when you Dive.) I'm thinking of adding a sidecar file that ScubaDuck can write to next to your database that contains some metadata and then expand the UI to allow for more complicated persistent settings.
Codex is not very smart and mostly implemented things exactly as I asked for it. This means there are many places where there are bugs or lacking polish because I hadn't realized I needed to prompt for it. Bug reports are appreciated, I will feed them to Codex to get them fixed.

Project details

Release history Release notifications | RSS feed

This version

0.1.0

May 23, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

scubaduck-0.1.0.tar.gz (1.0 MB view details)

Uploaded May 23, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

scubaduck-0.1.0-py3-none-any.whl (32.8 kB view details)

Uploaded May 23, 2025 Python 3

File details

Details for the file scubaduck-0.1.0.tar.gz.

File metadata

Download URL: scubaduck-0.1.0.tar.gz
Upload date: May 23, 2025
Size: 1.0 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.7.2-alpha.1

File hashes

Hashes for scubaduck-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`2f0f05df714b8f7ef6517143bc2df887a4f9b7b96dfbf896451574d397fbc73d`
MD5	`e9feb6b6bb68b4d310b86685223a5c2a`
BLAKE2b-256	`913a11bc2c78c972a6623cd0ead2f38c20b431404b65e207ef5a527357e61313`

See more details on using hashes here.

File details

Details for the file scubaduck-0.1.0-py3-none-any.whl.

File metadata

Download URL: scubaduck-0.1.0-py3-none-any.whl
Upload date: May 23, 2025
Size: 32.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.7.2-alpha.1

File hashes

Hashes for scubaduck-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`725fdcda393b826683155e8bde147274d1c9050ee31abb05cdee9125927ed3ed`
MD5	`3347834b5052cca903fddcfd5f7205d4`
BLAKE2b-256	`4d3014863ec609c4b14e0a2b2d23d2a2e6243bad225824f3ea96a6382b169f2a`

See more details on using hashes here.

scubaduck 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Meta

Project description

ScubaDuck

Design Philosophy

Development

How to use it

Roadmap

Project details

Verified details

Maintainers

Unverified details

Meta

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes