Skip to main content

Read Parquet files in Datasette

Project description

datasette-parquet

PyPI Changelog Tests License

Support Parquet, CSV and JSON Lines files in Datasette. Depends on DuckDB.

Installation

Install this plugin in the same environment as Datasette.

datasette install datasette-parquet

Usage

Say you have a directory of your favourite CSVs, newline-delimited JSON and parquet files that looks like this:

/mnt/files/census.csv
/mnt/files/books.tsv
/mnt/files/tweets.jsonl
/mnt/files/geonames.parquet
/mnt/files/sales/january.parquet
/mnt/files/sales/february.parquet

You can expose these in a Datasette database called trove by something like this in your metadata.json:

{
  "plugins": {
    "datasette-parquet": {
      "trove": {
        "directory": "/mnt/files"
      }
    }
  }
}

Then launch Datasette via datasette --metadata metadata.json

You will have 5 views in the trove database: census, books, tweets, geonames and sales. The sales view will be the union of all the files in that directory -- this works for all of the file types, not just Parquet.

Caveats

Note

You will likely want to disable facet suggestions from the CLI, or install datasette-ui-extras, which disables facet suggestions. See the note on No timeouts for more information.

Warning

You know that old canard, that if it walks like a duck and quacks like a duck, it's probably a duck? This plugin tries to teach DuckDB to walk like SQLite and talk like SQLite. That turns out to be ducking hard! If you come across broken features, let me know and I'll try to fix them up.

  • No timeouts: A core feature of Datasette is that it's safe to let the unwashed masses run arbitrary queries. This is because the data is immutable, and there are timeouts to prevent runaway CPU usage. DuckDB does not currently support timeouts. Think carefully about letting anonymous users use a Datasette instance with this plugin.
  • Joining with existing data: This plugin uses DuckDB, not SQLite. This means that you cannot join against your existing SQLite tables.
  • Read-only: the data in the files can only be queried, not changed.
  • Performance: the files are queried in-place. Performance will be limited by the file type -- parquet files have a zippy binary format, but large CSV and JSONL files might be slow.
  • Facets: DuckDB supports a different set of syntax than SQLite. This means some Datasette features are incompatible, and will be disabled for DuckDB-backed files.

Technical notes

This plugin has a mix of accidental complexity and essential complexity. The essential complexity comes from things like "DuckDB supports a different dialect of SQL". The accidental complexity comes from things like "it's called the Law of Demeter, Colin, not the Strongly Held Opinion of Demeter".

This is a loose journal of things I ran into:

  • DuckDB's Python API is similar to the sqlite3 module's interface, but not the same. Datasette expects to talk to an interface that conforms to sqlite3, so this plugin crufts up some proxy objects to give a "convincing" facade. I mostly YOLOd this part. I wouldn't trust it for write queries, or for reading sensitive data.

    • DuckDB doesn't have the concept of a separate cursor class.
    • sqlite3's cursor is an iterable
    • Datasette uses sqlite3.Row objects, which support indexing by name
    • sqlite3 supports parameterized queries like execute('SELECT :p', {'p': 123}). These need to be rewritten to use numbered parameters and a list.
  • SQLite supports slightly different syntax than DuckDB. We use sqlglot to transpile queries into DuckDB's dialect.

  • Unfortunately, using sqlglot brings its own challenges: it doesn't recognize the GLOB operator, see https://github.com/tobymao/sqlglot/issues/1066

  • Datasette passes extraneous parameters to the sqlite3 connection. A writable canned query will post a csrftoken for security purposes, which ends up as part of the query parameters. DuckDB is strict on the parameters matching the SQL query, so it fails.

  • Datasette expects some SQLite internals to be around, like certain PRAGMA ... functions, or the shape of the EXPLAIN output. We work around this by detecting those queries and telling bald-faced lies to Datasette.

  • Datasette expects json_type(...) to throw a sqlite3.OperationalError on invalid JSON, but DuckDB will (of course) throw its own type: duckdb.InvalidInputException

  • DuckDB is missing some functions from SQLite: json_each(...), date(...)

  • rowid columns in SQLite are stable identifiers. This is not true in DuckDB.

  • SQLite's Python interface supports interrupting long-running queries. DuckDB's C API supports this, too, but it has not yet been exposed to the Python API. See https://github.com/duckdb/duckdb/issues/5938 and https://github.com/duckdb/duckdb/pull/3749

Development

To set up this plugin locally, first checkout the code. Then create a new virtual environment:

cd datasette-parquet
python3 -m venv venv
source venv/bin/activate

Now install the dependencies and test dependencies:

pip install -e '.[test]'

To run the tests:

pytest

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

datasette-parquet-0.1.tar.gz (11.6 kB view hashes)

Uploaded Source

Built Distribution

datasette_parquet-0.1-py3-none-any.whl (12.6 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page