Experimental

This is experimental and unstable.

Pyodide + DuckDB

This is a proof of concept of executing duckdb_wasm from a Pyodide kernel. This unlocks a few paths for using DuckDB, such as PyScript and JupyterLite.

Note: the project should probably be called Pyoduckwasm or something like that. It started with JupyterLite as the end goal.

Demonstration:

  • Static PyScript Example

  • PyScript REPL

  • pyodide console

    # Paste into the Pyodide console, which supports top-level await.
    import micropip
    await micropip.install('pandas')
    await micropip.install('jupylite-duckdb')
    import jupylite_duckdb as jd
    # Open a connection, then run each query against it.
    conn = await jd.connect()
    r1 = await jd.query("pragma version", conn)
    # Load a remote Parquet file into a table named xyz.
    r2 = await jd.query("create or replace table xyz as select * from 'https://raw.githubusercontent.com/Teradata/kylo/master/samples/sample-data/parquet/userdata2.parquet'", conn)
    r3 = await jd.query("select gender, count(*) as c from xyz group by gender", conn)
    print(r1)
    print(r2)
    print(r3)
    
  • JupyterLite: Open a JupyterLite site, and use the examples from notebooks

  • JupyterLite Code Console REPL

Note: reloading the page seems somewhat unreliable with Pyodide. A hard refresh (Ctrl+F5) works more reliably.

Limitations:

  • API: only connect() and query() are wrapped, the counterparts of duckdb.connect() and duckdb.query().
  • DataFrames are not (yet) registered in the DuckDB database.
  • Data is copied from the duckdb_wasm Arrow result to a Python list[dict], and then to a DataFrame. PyArrow is not available (yet) in Pyodide.
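
The copy path in the last limitation can be pictured with plain Python. This is a minimal sketch, not the package's actual code: `records_to_columns` is a hypothetical helper, and the sample rows are made up to resemble the result of the `gender` query above. It pivots a list[dict] into a column-oriented dict, one of the shapes `pandas.DataFrame(...)` accepts; `pandas.DataFrame(records)` would also take the row list directly.

```python
from typing import Any


def records_to_columns(records: list[dict[str, Any]]) -> dict[str, list[Any]]:
    """Pivot row-oriented records into a column-oriented dict (hypothetical helper)."""
    columns: dict[str, list[Any]] = {}
    for index, row in enumerate(records):
        for key, value in row.items():
            # A key first seen on a later row is back-filled with None.
            columns.setdefault(key, [None] * index).append(value)
        # Pad every column that this row did not provide a value for.
        for values in columns.values():
            if len(values) == index:
                values.append(None)
    return columns


# Made-up rows standing in for a query result copied out of duckdb_wasm.
records = [{"gender": "Female", "c": 3}, {"gender": "Male", "c": 2}]
columns = records_to_columns(records)
print(columns)
```

Every value is touched once per copy, which is why zero-copy exchange is attractive once PyArrow becomes available.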

Observations:

  • It takes about a minute to run the JupyterLite examples. Most of that time is spent before any DuckDB code runs. A custom Pyodide build could shave some of it off, but PyScript is much faster.
  • JupyterLite was unreliable with page reloads; I ended up having to clear the cache a lot.
  • Not thrilled with PyScript removing top-level await; will probably just auto-wrap it (like IPython's %autoawait).
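
One way the auto-wrapping idea could look, using only the standard library. This is a sketch under assumptions, not what this package or PyScript does: `run_cell` is a hypothetical name, and `asyncio.run` is used only because the sketch runs outside a browser. Inside Pyodide an event loop is already running, so the coroutine would need to be awaited or scheduled on that loop instead.

```python
import ast
import asyncio
import inspect


def run_cell(source: str, namespace: dict) -> None:
    """Execute source that may contain top-level await (hypothetical helper)."""
    # This flag lets compile() accept `await` outside an async function.
    code = compile(source, "<cell>", "exec", flags=ast.PyCF_ALLOW_TOP_LEVEL_AWAIT)
    result = eval(code, namespace)
    # If the source actually used await, eval() returns a coroutine to drive.
    if inspect.iscoroutine(result):
        asyncio.run(result)


namespace: dict = {}
run_cell("import asyncio\nawait asyncio.sleep(0)\nanswer = 6 * 7", namespace)
print(namespace["answer"])
```

Source without any `await` runs synchronously through the same function, so callers do not need to know which kind of cell they have.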

jupyterlite_duckdb_wasm

Python wrapper to run DuckDB_WASM within JupyterLite with a Pyodide kernel. See notebooks for an example of running this within JupyterLite.

Cell Magic %%dql

Following the example of magic_duckdb, there's an initial proof of concept of a DuckDB cell magic for JupyterLite. See Magic Example.

Various Issues, Todos and Ideas

  • Move examples into our hosted jupyterlite
  • Implement a proof of concept version of dataframe registration
  • Evaluate startup time reduction. Probably will never do this, given PyScript.
  • Handling errors: detect and display errors in Jupyter. Too much stuff is buried in the console, such as CORS errors.
  • Invalidate the pip browser cache (as/if needed); annoying for development purposes.
  • Think through the async/await/transform_cell approach and whether there's a better solution.
  • Zero-copy data exchange (js/duckdb arrow -> python/dataframe and python/df -> js/duckdb): blocked by PyArrow support.
  • If you're adding local .py files, use importlib.invalidate_caches(). Even then, it was flaky to import.
  • Careful with caching... %pip install will pull from browser cache. I had to clear frequently within dev tools
  • To clear local storage, which is annoyingly persistent, see https://superuser.com/questions/519628/clear-html5-local-storage-on-a-specific-page
  • %autoawait, which is enabled by default, is part of why this works in notebooks. The %%dql cell magic patches transform_cell to push an await into the cell transformation: https://ipython.readthedocs.io/en/stable/interactive/autoawait.html
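
The `importlib.invalidate_caches()` tip can be sketched as follows. The module name `local_helper_demo` and the temporary directory are made up for illustration. In JupyterLite the file would live in the notebook's working directory instead, and as noted above the import may still be flaky there.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Stand-in for the directory where a local .py file gets added at runtime.
module_dir = tempfile.mkdtemp()
sys.path.insert(0, module_dir)

# Hypothetical module written after the interpreter has already started.
Path(module_dir, "local_helper_demo.py").write_text("GREETING = 'hello'\n")

# The import system caches directory listings; clear them so the new file is seen.
importlib.invalidate_caches()
local_helper_demo = importlib.import_module("local_helper_demo")
print(local_helper_demo.GREETING)
```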


Built Distribution

jupylite_duckdb-0.0.18a4-py3-none-any.whl (9.3 kB)

Uploaded Python 3
