
DLT is an open-source, Python-native, scalable data loading framework that requires no DevOps effort to run.


Quickstart Guide: Data Load Tool (DLT)

TL;DR: This guide shows you how to load a JSON document into Google BigQuery using DLT.

If there is something you can improve about this quickstart, please open a pull request in the example repository.

Grab the demo

Clone the example repository:

git clone https://github.com/scale-vector/dlt-quickstart-example.git

Enter the directory:

cd dlt-quickstart-example

Open the files in your favorite IDE / text editor:

  • data.json (the JSON document you will load)
  • credentials.json (the credentials for our demo Google BigQuery warehouse)
  • quickstart.py (the script that uses DLT)

Set up a virtual environment

Ensure you are using either Python 3.8 or 3.9:

python3 --version

Create a new virtual environment:

python3 -m venv ./env

Activate the virtual environment:

source ./env/bin/activate

Install DLT and support for the target data warehouse

Install DLT using pip:

pip3 install -U python-dlt

Install support for Google BigQuery:

pip3 install -U "python-dlt[gcp]"

Understanding the code

  1. Configure DLT

  2. Create a DLT pipeline

  3. Load the data from the JSON document

  4. Pass the data to the DLT pipeline

  5. Use DLT to load the data
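
For orientation, here is a minimal sketch of how those five steps typically look in quickstart.py. It follows the Pipeline API of the 0.1.x release (Pipeline, GCPPipelineCredentials, create_pipeline, extract, unpack, load); treat the exact class and method names as assumptions and check them against the script in the cloned repository.

import json

from dlt.common.utils import uniq_id
from dlt.pipeline import Pipeline, GCPPipelineCredentials

# Sketch only: names follow the 0.1.x quickstart example and may differ in your version.

# 1. Configure DLT: schema prefix, schema name, target table, and the demo credentials
schema_prefix = "demo_" + uniq_id()[:4]
schema_name = "example"
parent_table = "json_doc"
with open("credentials.json", encoding="utf-8") as f:
    gcp_credentials = json.load(f)

# 2. Create a DLT pipeline bound to the BigQuery credentials
credentials = GCPPipelineCredentials.from_services_dict(gcp_credentials, schema_prefix)
pipeline = Pipeline(schema_name)
pipeline.create_pipeline(credentials)

# 3. Load the data from the JSON document
with open("data.json", encoding="utf-8") as f:
    data = json.load(f)

# 4. Pass the data to the DLT pipeline
pipeline.extract(iter(data), table_name=parent_table)

# 5. Use DLT to load the data: normalize nested documents into tables, then load to BigQuery
pipeline.unpack()
pipeline.load()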

Running the code

Run the script:

python3 quickstart.py

Inspect the generated schema.yml:

vim schema.yml
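
If you prefer to look at the inferred tables and column types programmatically, here is a small sketch using PyYAML (an extra dependency, not installed by this guide; add it with pip3 install pyyaml). The exact layout of the file depends on the DLT version that generated it.

# Sketch only: assumes PyYAML is available (pip3 install pyyaml).
import yaml

with open("schema.yml", encoding="utf-8") as f:
    schema = yaml.safe_load(f)

# Print the top-level sections to see where tables and column types are defined;
# drill into the section that lists json_doc and json_doc__children.
print(list(schema.keys()))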

See the results of querying the Google BigQuery tables:

json_doc table

SELECT * FROM `{schema_prefix}_example.json_doc`
{  "name": "Ana",  "age": "30",  "id": "456",  "_dlt_load_id": "1654787700.406905",  "_dlt_id": "5b018c1ba3364279a0ca1a231fbd8d90"}
{  "name": "Bob",  "age": "30",  "id": "455",  "_dlt_load_id": "1654787700.406905",  "_dlt_id": "afc8506472a14a529bf3e6ebba3e0a9e"}

json_doc__children table

SELECT * FROM `{schema_prefix}_example.json_doc__children` LIMIT 1000
    # {"name": "Bill", "id": "625", "_dlt_parent_id": "5b018c1ba3364279a0ca1a231fbd8d90", "_dlt_list_idx": "0", "_dlt_root_id": "5b018c1ba3364279a0ca1a231fbd8d90",
    #   "_dlt_id": "7993452627a98814cc7091f2c51faf5c"}
    # {"name": "Bill", "id": "625", "_dlt_parent_id": "afc8506472a14a529bf3e6ebba3e0a9e", "_dlt_list_idx": "0", "_dlt_root_id": "afc8506472a14a529bf3e6ebba3e0a9e",
    #   "_dlt_id": "9a2fd144227e70e3aa09467e2358f934"}
    # {"name": "Dave", "id": "621", "_dlt_parent_id": "afc8506472a14a529bf3e6ebba3e0a9e", "_dlt_list_idx": "1", "_dlt_root_id": "afc8506472a14a529bf3e6ebba3e0a9e",
    #   "_dlt_id": "28002ed6792470ea8caf2d6b6393b4f9"}
    # {"name": "Elli", "id": "591", "_dlt_parent_id": "5b018c1ba3364279a0ca1a231fbd8d90", "_dlt_list_idx": "1", "_dlt_root_id": "5b018c1ba3364279a0ca1a231fbd8d90",
    #   "_dlt_id": "d18172353fba1a492c739a7789a786cf"}

Joining the two tables above on the autogenerated keys (i.e. p._dlt_id = c._dlt_parent_id):

select p.name, p.age, p.id as parent_id,
       c.name as child_name, c.id as child_id, c._dlt_list_idx as child_order_in_list
from `{schema_prefix}_example.json_doc` as p
left join `{schema_prefix}_example.json_doc__children` as c
    on p._dlt_id = c._dlt_parent_id

{"name": "Ana", "age": "30", "parent_id": "456", "child_name": "Bill", "child_id": "625", "child_order_in_list": "0"}
{"name": "Ana", "age": "30", "parent_id": "456", "child_name": "Elli", "child_id": "591", "child_order_in_list": "1"}
{"name": "Bob", "age": "30", "parent_id": "455", "child_name": "Bill", "child_id": "625", "child_order_in_list": "0"}
{"name": "Bob", "age": "30", "parent_id": "455", "child_name": "Dave", "child_id": "621", "child_order_in_list": "1"}

Next steps

  1. Replace data.json with data you want to explore

  2. Check that the inferred types are correct in schema.yml

  3. Set up your own Google BigQuery warehouse (and replace the credentials)

  4. Use this new clean staging layer as the starting point for a semantic layer / analytical model (e.g. using dbt)
