A library that helps you build Delta Live Tables (dlt) pipelines by sidestepping the dlt library, so the same pipeline code can also run interactively in a notebook
Installation
Install the package with pip in your Databricks notebook:
%pip install dlt_sidestep
Example Usage
Note: You must define a pipeline_id variable as spark.conf.get("pipelines.id", None)
Note: You must define a g variable as globals()
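Passing globals() matters because a function is only callable from later notebook cells if it is defined in the notebook's own namespace. The snippet below is a plain-Python illustration of that mechanism, assuming SideStep uses g in this way; define_in and hello are hypothetical names and are not part of dlt_sidestep.

```python
def define_in(namespace, source):
    """Execute source so that the names it defines land in namespace."""
    exec(source, namespace)


g = globals()

# Hypothetical step: any function defined in the string becomes a global.
define_in(g, "def hello():\n    return 'hi'")

# The function is now callable like any other function in this namespace.
print(hello())
```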
```python
from pyspark.sql.functions import *
from pyspark.sql.types import *
from dlt_sidestep import SideStep

# pipelines.id is set when running inside a pipeline; otherwise the default None is returned.
pipeline_id = spark.conf.get("pipelines.id", None)
g = globals()

# The dlt module is imported only when running inside a pipeline.
if pipeline_id:
    import dlt

json_path = "/databricks-datasets/wikipedia-datasets/data-001/clickstream/raw-uncompressed-json/2015_2_clickstream.json"

# Each pipeline step is written as a string and handed to SideStep.
step = """
@dlt.create_table(
    comment="The raw wikipedia click stream dataset, ingested from /databricks-datasets.",
    table_properties={
        "quality": "bronze"
    }
)
def clickstream_raw():
    return (
        spark.read.option("inferSchema", "true").json(json_path)
    )
"""
SideStep(step, pipeline_id, g)

# Call the step like a regular function to inspect its output.
df = clickstream_raw()
df.display()

step = """
@dlt.create_table(
    comment="Wikipedia clickstream dataset with cleaned-up datatypes / column names and quality expectations.",
    table_properties={
        "quality": "silver"
    }
)
@dlt.expect("valid_current_page", "current_page_id IS NOT NULL AND current_page_title IS NOT NULL")
@dlt.expect_or_fail("valid_count", "click_count > 0")
def clickstream_clean():
    return (
        dlt.read("clickstream_raw")
        .withColumn("current_page_id", expr("CAST(curr_id AS INT)"))
        .withColumn("click_count", expr("CAST(n AS INT)"))
        .withColumn("previous_page_id", expr("CAST(prev_id AS INT)"))
        .withColumnRenamed("curr_title", "current_page_title")
        .withColumnRenamed("prev_title", "previous_page_title")
        .select("current_page_id", "current_page_title", "click_count", "previous_page_id", "previous_page_title")
    )
"""
SideStep(step, pipeline_id, g)

df = clickstream_clean()
df.display()
```
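In the example above, dlt is imported only when pipeline_id is set, yet clickstream_clean() can still be called interactively even though its body uses dlt.read. The package's internals are not shown here, so the following is only a hypothetical sketch of one way a helper could achieve this: parse the step, drop the @dlt decorators, rewrite dlt.read("name") as name(), and execute the result in the supplied namespace. The names run_step_sketch and _DltStripper are invented for illustration, and the toy steps use plain lists instead of Spark DataFrames so the sketch runs anywhere.

```python
import ast


class _DltStripper(ast.NodeTransformer):
    """Drop @dlt.* decorators and turn dlt.read("name") into name()."""

    @staticmethod
    def _rooted_at_dlt(node):
        # Walk down Call / Attribute chains to the leftmost name.
        while isinstance(node, (ast.Call, ast.Attribute)):
            node = node.func if isinstance(node, ast.Call) else node.value
        return isinstance(node, ast.Name) and node.id == "dlt"

    def visit_FunctionDef(self, node):
        self.generic_visit(node)
        node.decorator_list = [
            d for d in node.decorator_list if not self._rooted_at_dlt(d)
        ]
        return node

    def visit_Call(self, node):
        self.generic_visit(node)
        func = node.func
        if (
            isinstance(func, ast.Attribute)
            and func.attr == "read"
            and isinstance(func.value, ast.Name)
            and func.value.id == "dlt"
            and len(node.args) == 1
            and isinstance(node.args[0], ast.Constant)
            and isinstance(node.args[0].value, str)
        ):
            # dlt.read("clickstream_raw") becomes clickstream_raw()
            new_call = ast.Call(
                func=ast.Name(id=node.args[0].value, ctx=ast.Load()),
                args=[],
                keywords=[],
            )
            return ast.copy_location(new_call, node)
        return node


def run_step_sketch(step, pipeline_id, namespace):
    """Hypothetical stand-in for SideStep; NOT the library's actual code."""
    tree = ast.parse(step)
    if not pipeline_id:
        # Interactive run: remove every reference to dlt before executing.
        tree = _DltStripper().visit(tree)
        ast.fix_missing_locations(tree)
    exec(compile(tree, "<step>", "exec"), namespace)


# Toy steps that mirror the shape of the example, without Spark.
ns = {}
run_step_sketch('''
@dlt.create_table(comment="toy raw table")
def numbers_raw():
    return [1, 2, 3]
''', None, ns)

run_step_sketch('''
@dlt.create_table(comment="toy clean table")
@dlt.expect("positive", "n > 0")
def numbers_clean():
    return [n * 2 for n in dlt.read("numbers_raw")]
''', None, ns)

print(ns["numbers_clean"]())  # [2, 4, 6]
```

Working on the syntax tree rather than on raw text keeps the rewrite independent of how the decorators are formatted or spread across lines. Whether dlt_sidestep itself takes this approach is an assumption.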
Download files
Source Distribution
dlt_sidestep-0.0.8.tar.gz (4.2 kB)
Built Distribution
dlt_sidestep-0.0.8-py3-none-any.whl (3.6 kB)
File details
Details for the file dlt_sidestep-0.0.8.tar.gz.
File metadata
- Download URL: dlt_sidestep-0.0.8.tar.gz
- Upload date:
- Size: 4.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.0 CPython/3.8.8
File hashes
Algorithm | Hash digest
---|---
SHA256 | 0d4ef726fbe11ed38ec1ab3021e62aee83a6eba86922f495b72bdfe5d6395121
MD5 | 5f8677cc5080c67972e2169dc7805082
BLAKE2b-256 | 0b26f715b7c2d0575626b9a10a61e5479bfb4b1408713f1adde8994346a2c3c2
File details
Details for the file dlt_sidestep-0.0.8-py3-none-any.whl.
File metadata
- Download URL: dlt_sidestep-0.0.8-py3-none-any.whl
- Upload date:
- Size: 3.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.0 CPython/3.8.8
File hashes
Algorithm | Hash digest
---|---
SHA256 | f268b13eb581e1dd2ac2376c8e543c1c3be6e8f903bea27b3e47a7a316915e54
MD5 | 0a200bd9c1f738185be441153cd803ec
BLAKE2b-256 | d33164972a55cad3bdf16d6bb988ff76e5a3f061ddcde0ee04ad6737f9b7d4c8