Skip to main content

Package to make working with duckdb easier

Project description

ezduckdb

Tools to make working with duckdb easier for codenym.

Not intended for general use, but feel free to steal code or ideas.

If you'd be super excited about this being made for general use, reach out.

Acknowledgements

There's a ton code and ideas in here from a dagster blog post

Installation

pip install ezduckdb

Usage

There are 3 classes in this library:

  • S3AwarePath: pathlib.Path + s3 paths
  • SQL: Work with sql files programatically via templating.
  • DuckDB: Connection and Query manager

S3AwarePath

S3AwarePath adds functionality to the pathlib.Path class.

  • is_s3: Is path an s3 path (ie s3://....)
  • get_s3_bucket and get_s3_prefix: Break path for use with boto3
  • Retain s3:// when cast to string (ie in f strings)
  • get_table_name: Get db table name from file name based on codenym convention
    • <schema>_<table>.<extension>
from ezduckdb import S3AwarePath

s3_path = S3AwarePath("s3://bucket/curated/s1chema_table1.csv")
assert inp.get_s3_bucket() == "bucket"
assert inp.get_s3_prefix() == "curated/s1chema_table1.csv"
assert str(inp) == "s3://bucket/curated/s1chema_table1.csv"
assert inp.is_s3()
assert inp.get_table_name() == ("s1chema", "table1")

SQL

SQL enable type based templating for programatical sql query generation for duckdb.

Non-exhaustive list of replacements:

  • pd.DataFrame is converted to df_<id> in the query to enable pandas querying
  • Str are replaced with the string value enclosed in single quotes
  • Int are replaced with the value without quotes
  • SQL replaces recusively for nested querying

Basic

from ezduckdb import SQL

example = SQL("SELECT * FROM $table WHERE id = $id", table="foo", id=1)
assert inp.to_string() == "SELECT * FROM 'foo' WHERE id = 1"

Pandas

from ezduckdb import SQL
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3]})
inp = SQL("SELECT * FROM $table", table=df)
assert inp.to_string() == "SELECT * FROM df_" + str(id(df))

Nested

from ezduckdb import SQL

example = SQL("SELECT * FROM $table", table=SQL("SELECT * FROM $table", table="foo"))
assert inp.to_string() == "SELECT * FROM (SELECT * FROM 'foo')"

DuckDB

DuckDB is a connection manager for duckdb that has some convenience methods for querying.

  • If s3_storage_used=True then query method will:
    • Load httpfs and aws duckdb extensions
    • call load_aws_credentials passing the aws_profile.
  • query method will:
    • Do all sql templating for SQL object.
    • Return a pd.DataFrame of the results if applicable
  • Provide a context manager for pure sql querying with strings

Templated Querying (Querying with SQL objects)

Basic Querying
from ezduckdb import DuckDB
import pandas as pd

db = DuckDB(s3_storage_used=False)

assert db.query(SQL("select 1")).values == pd.DataFrame([(1,)]).values
Pandas Querying
from ezduckdb import DuckDB
import pandas as pd

db = DuckDB(s3_storage_used=False)
df = pd.DataFrame({"id": [1, 2, 3]})

actual = db.query(SQL("SELECT * FROM $table", table=df))
expected = pd.DataFrame([(1,), (2,), (3,)])
assert (actual.values == expected.values).all()
S3 querying
from ezduckdb import DuckDB
import pandas as pd

db = DuckDB(s3_storage_used=True)
s3_path = "s3://codenym-automated-testing/ezduckdb/parquet/schema1_table1.parquet"

actual = db.query(SQL("SELECT * FROM read_parquet($s3_path)", s3_path=s3_path))
expected = pd.DataFrame([[1, 4], [2, 5], [3, 6]])
assert (actual.values == expected.values).all()

Context Manager (Querying with Strings)

from ezduckdb import DuckDB
import pandas as pd

with DuckDB(s3_storage_used=False) as conn:
    assert conn.query("select 1").df().values == pd.DataFrame([(1,)]).values

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ezduckdb-0.0.8.tar.gz (10.6 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

ezduckdb-0.0.8-py3-none-any.whl (9.1 kB view details)

Uploaded Python 3

File details

Details for the file ezduckdb-0.0.8.tar.gz.

File metadata

  • Download URL: ezduckdb-0.0.8.tar.gz
  • Upload date:
  • Size: 10.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.6

File hashes

Hashes for ezduckdb-0.0.8.tar.gz
Algorithm Hash digest
SHA256 37500810eb9f98b89118ea1ad7d858a514e6f9d21570cc2fa21bf214af47583b
MD5 389b403bce809f9cc762b2c9635cda08
BLAKE2b-256 3b86f78d8417c6d9c6794364f9401e4de2e315febd3536c3d426dd1cedbbeab7

See more details on using hashes here.

File details

Details for the file ezduckdb-0.0.8-py3-none-any.whl.

File metadata

  • Download URL: ezduckdb-0.0.8-py3-none-any.whl
  • Upload date:
  • Size: 9.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.6

File hashes

Hashes for ezduckdb-0.0.8-py3-none-any.whl
Algorithm Hash digest
SHA256 8ce3debef3b3f484cca994b52d216a8ca053efbcc792fb45d5d700b688c35f44
MD5 d6d72c982d3f3c2ffdc0de3c44efffd8
BLAKE2b-256 be46e22b44e7d65d395292733a6da3e6e1444138e1ea3b01fd28342734dc805b

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page