Skip to main content

Package to make working with duckdb easier

Project description

ezduckdb

Tools to make working with duckdb easier for codenym.

Not intended for general use, but feel free to steal code or ideas.

If you'd be super excited about this being made for general use, reach out.

Acknowledgements

There's a ton code and ideas in here from a dagster blog post

Installation

pip install ezduckdb

Usage

There are 3 classes in this library:

  • S3AwarePath: pathlib.Path + s3 paths
  • SQL: Work with sql files programatically via templating.
  • DuckDB: Connection and Query manager

S3AwarePath

S3AwarePath adds functionality to the pathlib.Path class.

  • is_s3: Is path an s3 path (ie s3://....)
  • get_s3_bucket and get_s3_prefix: Break path for use with boto3
  • Retain s3:// when cast to string (ie in f strings)
  • get_table_name: Get db table name from file name based on codenym convention
    • <schema>_<table>.<extension>
from ezduckdb import S3AwarePath

s3_path = S3AwarePath("s3://bucket/curated/s1chema_table1.csv")
assert inp.get_s3_bucket() == "bucket"
assert inp.get_s3_prefix() == "curated/s1chema_table1.csv"
assert str(inp) == "s3://bucket/curated/s1chema_table1.csv"
assert inp.is_s3()
assert inp.get_table_name() == ("s1chema", "table1")

SQL

SQL enable type based templating for programatical sql query generation for duckdb.

Non-exhaustive list of replacements:

  • pd.DataFrame is converted to df_<id> in the query to enable pandas querying
  • Str are replaced with the string value enclosed in single quotes
  • Int are replaced with the value without quotes
  • SQL replaces recusively for nested querying

Basic

from ezduckdb import SQL

example = SQL("SELECT * FROM $table WHERE id = $id", table="foo", id=1)
assert inp.to_string() == "SELECT * FROM 'foo' WHERE id = 1"

Pandas

from ezduckdb import SQL
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3]})
inp = SQL("SELECT * FROM $table", table=df)
assert inp.to_string() == "SELECT * FROM df_" + str(id(df))

Nested

from ezduckdb import SQL

example = SQL("SELECT * FROM $table", table=SQL("SELECT * FROM $table", table="foo"))
assert inp.to_string() == "SELECT * FROM (SELECT * FROM 'foo')"

DuckDB

DuckDB is a connection manager for duckdb that has some convenience methods for querying.

  • If s3_storage_used=True then query method will:
    • Load httpfs and aws duckdb extensions
    • call load_aws_credentials passing the aws_profile.
  • query method will:
    • Do all sql templating for SQL object.
    • Return a pd.DataFrame of the results if applicable
  • Provide a context manager for pure sql querying with strings

Templated Querying (Querying with SQL objects)

Basic Querying
from ezduckdb import DuckDB
import pandas as pd

db = DuckDB(s3_storage_used=False)

assert db.query(SQL("select 1")).values == pd.DataFrame([(1,)]).values
Pandas Querying
from ezduckdb import DuckDB
import pandas as pd

db = DuckDB(s3_storage_used=False)
df = pd.DataFrame({"id": [1, 2, 3]})

actual = db.query(SQL("SELECT * FROM $table", table=df))
expected = pd.DataFrame([(1,), (2,), (3,)])
assert (actual.values == expected.values).all()
S3 querying
from ezduckdb import DuckDB
import pandas as pd

db = DuckDB(s3_storage_used=True)
s3_path = "s3://codenym-automated-testing/ezduckdb/parquet/schema1_table1.parquet"

actual = db.query(SQL("SELECT * FROM read_parquet($s3_path)", s3_path=s3_path))
expected = pd.DataFrame([[1, 4], [2, 5], [3, 6]])
assert (actual.values == expected.values).all()

Context Manager (Querying with Strings)

from ezduckdb import DuckDB
import pandas as pd

with DuckDB(s3_storage_used=False) as conn:
    assert conn.query("select 1").df().values == pd.DataFrame([(1,)]).values

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ezduckdb-0.0.8.tar.gz (10.6 kB view hashes)

Uploaded Source

Built Distribution

ezduckdb-0.0.8-py3-none-any.whl (9.1 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page