
A library that helps you build Delta Live Tables (dlt) pipelines by sidestepping the dlt library, so that the same pipeline code can also be run interactively in a notebook.


Installation

Run pip install in your Databricks notebook:

%pip install dlt_sidestep

Example Usage

Note: Before calling SideStep you must define two variables: pipeline_id, set to spark.conf.get("pipelines.id", None), and g, set to globals(). The default of None applies when the notebook runs interactively, outside a pipeline.

```python
from pyspark.sql.functions import *
from pyspark.sql.types import *
from dlt_sidestep import SideStep

# pipeline_id is None when the notebook runs interactively, outside a pipeline.
pipeline_id = spark.conf.get("pipelines.id", None)
g = globals()

# Import dlt only when running inside a pipeline.
if pipeline_id:
  import dlt

json_path = "/databricks-datasets/wikipedia-datasets/data-001/clickstream/raw-uncompressed-json/2015_2_clickstream.json"
```



```python
# Bronze step: the table definition is written as a string and handed to SideStep.
step = """
@dlt.create_table(
  comment="The raw wikipedia click stream dataset, ingested from /databricks-datasets.",
  table_properties={
    "quality": "bronze"
  }
)
def clickstream_raw():
  return (
    spark.read.option("inferSchema", "true").json(json_path)
  )
"""
SideStep(step, pipeline_id, g)

# Call the step function directly to inspect its output.
df = clickstream_raw()
df.display()
```





```python
# Silver step: cast and rename columns, and attach data quality expectations.
step = """
@dlt.create_table(
  comment="Wikipedia clickstream dataset with cleaned-up datatypes / column names and quality expectations.",
  table_properties={
    "quality": "silver"
  }
)
@dlt.expect("valid_current_page", "current_page_id IS NOT NULL AND current_page_title IS NOT NULL")
@dlt.expect_or_fail("valid_count", "click_count > 0")
def clickstream_clean():
  return (
    dlt.read("clickstream_raw")
      .withColumn("current_page_id", expr("CAST(curr_id AS INT)"))
      .withColumn("click_count", expr("CAST(n AS INT)"))
      .withColumn("previous_page_id", expr("CAST(prev_id AS INT)"))
      .withColumnRenamed("curr_title", "current_page_title")
      .withColumnRenamed("prev_title", "previous_page_title")
      .select("current_page_id", "current_page_title", "click_count", "previous_page_id", "previous_page_title")
  )
"""
SideStep(step, pipeline_id, g)

# Call the step function directly to inspect its output.
df = clickstream_clean()
df.display()
```
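The usage above suggests that a step string is handled differently depending on whether pipeline_id is set, and that g is needed so the step's function becomes callable in the notebook. The sketch below shows one way such a helper could work in plain Python. It is an illustration only, not dlt_sidestep's actual implementation: the names run_step, _StripDlt and _root_name are hypothetical, and the assumed behaviour (dropping dlt decorators and turning dlt.read("name") into a call to name() when no pipeline is active) is a guess based on the example.

```python
import ast


def _root_name(node):
    """Return the leftmost identifier of an expression such as dlt.expect(...)."""
    while isinstance(node, (ast.Call, ast.Attribute)):
        node = node.func if isinstance(node, ast.Call) else node.value
    return node.id if isinstance(node, ast.Name) else None


class _StripDlt(ast.NodeTransformer):
    """Hypothetical transformer that removes dlt usage from a step."""

    def visit_FunctionDef(self, node):
        self.generic_visit(node)
        # Drop every decorator that is rooted at the name "dlt".
        node.decorator_list = [
            d for d in node.decorator_list if _root_name(d) != "dlt"
        ]
        return node

    def visit_Call(self, node):
        self.generic_visit(node)
        f = node.func
        is_dlt_read = (
            isinstance(f, ast.Attribute) and f.attr == "read"
            and isinstance(f.value, ast.Name) and f.value.id == "dlt"
            and len(node.args) == 1
            and isinstance(node.args[0], ast.Constant)
            and isinstance(node.args[0].value, str)
        )
        if is_dlt_read:
            # Assumption: dlt.read("name") becomes a direct call to name().
            target = ast.Name(id=node.args[0].value, ctx=ast.Load())
            return ast.copy_location(
                ast.Call(func=target, args=[], keywords=[]), node
            )
        return node


def run_step(step_source, pipeline_id, namespace):
    """Execute a step string in namespace, sidestepping dlt if no pipeline is active."""
    tree = ast.parse(step_source)
    if not pipeline_id:
        tree = ast.fix_missing_locations(_StripDlt().visit(tree))
    exec(compile(tree, "<step>", "exec"), namespace)


# Usage with plain lists instead of DataFrames, so it runs without Spark or dlt.
step_raw = '''
@dlt.create_table(comment="raw values")
def raw():
    return [1, 2, 3]
'''
step_clean = '''
@dlt.create_table(comment="doubled values")
@dlt.expect("positive", "v > 0")
def clean():
    return [v * 2 for v in dlt.read("raw")]
'''
ns = {}
run_step(step_raw, None, ns)
run_step(step_clean, None, ns)
doubled = ns["clean"]()
```

Passing the notebook's globals() as the namespace, as the example does with g, is what would make a function such as clickstream_raw available to later cells.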


Download files


Source distribution: dlt_sidestep-0.0.8.tar.gz (4.2 kB)

Built distribution: dlt_sidestep-0.0.8-py3-none-any.whl (3.6 kB, Python 3)
