DLT is an open-source, Python-native, scalable data loading framework that requires no DevOps effort to run.

Project description

Quickstart Guide: Data Load Tool (DLT)

TL;DR: This guide shows you how to load a JSON document into Google BigQuery using DLT.

If there is something you can improve about this quickstart, please open a pull request.

Grab the demo

Clone the example repository:

git clone https://github.com/scale-vector/dlt-quickstart-example.git

Enter the directory:

cd dlt-quickstart-example

Open the files in your favorite IDE / text editor:

  • data.json (the JSON document you will load)
  • credentials.json (credentials for our demo Google BigQuery warehouse)
  • quickstart.py (the script that uses DLT)
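
For reference, data.json is a small nested document of roughly the following shape. This is reconstructed from the query results shown later in this guide; the file in the repository is authoritative and the field types may differ:

[
  {
    "name": "Ana",
    "age": 30,
    "id": 456,
    "children": [
      { "name": "Bill", "id": 625 },
      { "name": "Elli", "id": 591 }
    ]
  },
  {
    "name": "Bob",
    "age": 30,
    "id": 455,
    "children": [
      { "name": "Bill", "id": 625 },
      { "name": "Dave", "id": 621 }
    ]
  }
]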

Set up a virtual environment

Ensure you are using either Python 3.8 or 3.9:

python3 --version

Create a new virtual environment:

python3 -m venv ./env

Activate the virtual environment:

source ./env/bin/activate

Install DLT and support for the target data warehouse

Install DLT using pip:

pip3 install -U python-dlt

Install support for Google BigQuery:

pip3 install -U "python-dlt[gcp]"
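
To confirm what was installed, you can ask pip for the package details (the version shown will match the release you installed):

pip3 show python-dlt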

Understanding the code

quickstart.py walks through five steps (a condensed sketch follows the list):

  1. Configure DLT

  2. Create a DLT pipeline

  3. Load the data from the JSON document

  4. Pass the data to the DLT pipeline

  5. Use DLT to load the data
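
As a rough orientation, the steps above map onto quickstart.py along the following lines. This is a condensed sketch, not a copy of the script: the class and method names are assumptions based on this version of the example, so treat quickstart.py in your clone as the authoritative reference.

import json

from dlt.common.utils import uniq_id
from dlt.pipeline import Pipeline, GCPPipelineCredentials

# 1. Configure DLT: choose schema and table names and load the demo BigQuery credentials.
schema_prefix = "demo_" + uniq_id()[:4]
schema_name = "example"
parent_table = "json_doc"
with open("credentials.json", "r", encoding="utf-8") as f:
    gcp_credentials_json = json.load(f)
credentials = GCPPipelineCredentials.from_services_dict(gcp_credentials_json, schema_prefix)

# 2. Create a DLT pipeline pointed at the BigQuery warehouse.
pipeline = Pipeline(schema_name)
pipeline.create_pipeline(credentials)

# 3. Load the data from the JSON document.
with open("data.json", "r", encoding="utf-8") as f:
    data = json.load(f)

# 4. Pass the data to the DLT pipeline.
pipeline.extract(iter(data), table_name=parent_table)

# 5. Use DLT to normalize the nested document into tables and load them into BigQuery.
#    Depending on the release, the normalization step may be named unpack() or normalize().
pipeline.unpack()
pipeline.load()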

Running the code

Run the quickstart script from the repository root:

python3 quickstart.py

Inspect the inferred schema, which the script prints and also writes to schema.yml. Check the tables (json_doc and json_doc__children) and the column types DLT inferred from your document:

vim schema.yml

See results of querying the Google BigQuery table:

json_doc table

SELECT * FROM `{schema_prefix}_example.json_doc`
{  "name": "Ana",  "age": "30",  "id": "456",  "_dlt_load_id": "1654787700.406905",  "_dlt_id": "5b018c1ba3364279a0ca1a231fbd8d90"}
{  "name": "Bob",  "age": "30",  "id": "455",  "_dlt_load_id": "1654787700.406905",  "_dlt_id": "afc8506472a14a529bf3e6ebba3e0a9e"}

json_doc__children table

SELECT * FROM `{schema_prefix}_example.json_doc__children` LIMIT 1000
{  "name": "Bill",  "id": "625",  "_dlt_parent_id": "5b018c1ba3364279a0ca1a231fbd8d90",  "_dlt_list_idx": "0",  "_dlt_root_id": "5b018c1ba3364279a0ca1a231fbd8d90",  "_dlt_id": "7993452627a98814cc7091f2c51faf5c"}
{  "name": "Bill",  "id": "625",  "_dlt_parent_id": "afc8506472a14a529bf3e6ebba3e0a9e",  "_dlt_list_idx": "0",  "_dlt_root_id": "afc8506472a14a529bf3e6ebba3e0a9e",  "_dlt_id": "9a2fd144227e70e3aa09467e2358f934"}
{  "name": "Dave",  "id": "621",  "_dlt_parent_id": "afc8506472a14a529bf3e6ebba3e0a9e",  "_dlt_list_idx": "1",  "_dlt_root_id": "afc8506472a14a529bf3e6ebba3e0a9e",  "_dlt_id": "28002ed6792470ea8caf2d6b6393b4f9"}
{  "name": "Elli",  "id": "591",  "_dlt_parent_id": "5b018c1ba3364279a0ca1a231fbd8d90",  "_dlt_list_idx": "1",  "_dlt_root_id": "5b018c1ba3364279a0ca1a231fbd8d90",  "_dlt_id": "d18172353fba1a492c739a7789a786cf"}

Join the two tables above on the autogenerated keys DLT adds: each child row carries the _dlt_id of its parent row as _dlt_parent_id (p._dlt_id = c._dlt_parent_id).

select p.name, p.age, p.id as parent_id,
       c.name as child_name, c.id as child_id, c._dlt_list_idx as child_order_in_list
from `{schema_prefix}_example.json_doc` as p
left join `{schema_prefix}_example.json_doc__children` as c
    on p._dlt_id = c._dlt_parent_id

{  "name": "Ana",  "age": "30",  "parent_id": "456",  "child_name": "Bill",  "child_id": "625",  "child_order_in_list": "0"}
{  "name": "Ana",  "age": "30",  "parent_id": "456",  "child_name": "Elli",  "child_id": "591",  "child_order_in_list": "1"}
{  "name": "Bob",  "age": "30",  "parent_id": "455",  "child_name": "Bill",  "child_id": "625",  "child_order_in_list": "0"}
{  "name": "Bob",  "age": "30",  "parent_id": "455",  "child_name": "Dave",  "child_id": "621",  "child_order_in_list": "1"}

Next steps

  1. Replace data.json with data you want to explore

  2. Check that the inferred types are correct in schema.yml

  3. Set up your own Google BigQuery warehouse and replace the credentials (an example of the expected key layout follows this list)

  4. Use this new clean staging layer as the starting point for a semantic layer / analytical model (e.g. using dbt)
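
For step 3: credentials.json in the demo holds a Google Cloud service-account key for the demo warehouse. To load into your own BigQuery project, replace it with a service-account key from your own project. Such a key is a JSON file containing, among others, the standard fields shown below; every value here is a placeholder:

{
  "type": "service_account",
  "project_id": "your-gcp-project",
  "private_key_id": "...",
  "private_key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n",
  "client_email": "your-service-account@your-gcp-project.iam.gserviceaccount.com",
  "client_id": "...",
  "token_uri": "https://oauth2.googleapis.com/token"
}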

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

python_dlt-0.2.0a10.tar.gz (486.8 kB, source)

Built Distribution

python_dlt-0.2.0a10-py3-none-any.whl (571.5 kB, Python 3)

File details

Details for the file python_dlt-0.2.0a10.tar.gz.

File metadata

  • Download URL: python_dlt-0.2.0a10.tar.gz
  • Size: 486.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.2.2 CPython/3.8.11 Linux/4.19.128-microsoft-standard

File hashes

Hashes for python_dlt-0.2.0a10.tar.gz

  • SHA256: f02d4c5fb636b32ecc5c23424516b559c4dfcde92909fb5f83b1610ff3d67813
  • MD5: e7ad8ab141468e331a1cdcb9b599799a
  • BLAKE2b-256: 5ffcda387143ca9d711dbf5d7a6adfc24ce0b03b47907e57cd6ffa044c250b2d


File details

Details for the file python_dlt-0.2.0a10-py3-none-any.whl.

File metadata

  • Download URL: python_dlt-0.2.0a10-py3-none-any.whl
  • Size: 571.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.2.2 CPython/3.8.11 Linux/4.19.128-microsoft-standard

File hashes

Hashes for python_dlt-0.2.0a10-py3-none-any.whl

  • SHA256: ea9c037854146da45f024cbb37971d1d3a53f1572972b9069f2a308a9e498690
  • MD5: a8fecf1e3dba8cdbf94404acb7c49d80
  • BLAKE2b-256: f7e4c35e157c2f8dd2397b503d93f58ada3ee9f055a79af3afe950affac45ff3

