DLT is an open-source, Python-native, scalable data loading framework that requires no DevOps effort to run.
Quickstart Guide: Data Load Tool (DLT)
TL;DR: This guide shows you how to load a JSON document into Google BigQuery using DLT.
Please open a pull request here if there is something you can improve about this quickstart.
Grab the demo
Clone the example repository:
git clone https://github.com/scale-vector/dlt-quickstart-example.git
Enter the directory:
cd dlt-quickstart-example
Open the files in your favorite IDE / text editor:
- data.json (i.e. the JSON document you will load)
- credentials.json (i.e. contains the credentials to our demo Google BigQuery warehouse)
- quickstart.py (i.e. the script that uses DLT)
Set up a virtual environment
Ensure you are using either Python 3.8 or 3.9:
python3 --version
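The same check can be done from inside Python. This is a minimal sketch, assuming only the two versions named above are acceptable; the helper name `is_supported` is hypothetical and is not part of DLT.

```python
import sys

# Versions named in this guide (major, minor).
SUPPORTED_VERSIONS = {(3, 8), (3, 9)}


def is_supported(version_info=sys.version_info):
    """Return True when the given version tuple is Python 3.8 or 3.9."""
    return (version_info[0], version_info[1]) in SUPPORTED_VERSIONS


status = "supported" if is_supported() else "not covered by this guide"
print(f"Python {sys.version_info[0]}.{sys.version_info[1]}: {status}")
```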
Create a new virtual environment:
python3 -m venv ./env
Activate the virtual environment:
source ./env/bin/activate
Install DLT and support for the target data warehouse
Install DLT using pip:
pip3 install -U python-dlt
Install support for Google BigQuery:
pip3 install -U "python-dlt[gcp]"
Understanding the code
- Configure DLT
- Create a DLT pipeline
- Load the data from the JSON document
- Pass the data to the DLT pipeline
- Use DLT to load the data
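As a rough illustration of the "Load the data from the JSON document" step, the sketch below parses a JSON document with the standard library only. It does not call DLT. The sample document is an assumption reconstructed from the query results shown later in this guide (the real data.json may differ in shape and value types), and `load_rows` is a hypothetical helper name.

```python
import json

# Assumed stand-in for data.json, reconstructed from the query results
# in this guide; the real file may use different types or extra fields.
SAMPLE_JSON = """
[
  {"name": "Ana", "age": 30, "id": 456,
   "children": [{"name": "Bill", "id": 625}, {"name": "Elli", "id": 591}]},
  {"name": "Bob", "age": 30, "id": 455,
   "children": [{"name": "Bill", "id": 625}, {"name": "Dave", "id": 621}]}
]
"""


def load_rows(text):
    """Parse a JSON document and return a list of top-level records."""
    data = json.loads(text)
    # Accept a single object as well as a list of objects.
    return data if isinstance(data, list) else [data]


rows = load_rows(SAMPLE_JSON)
print(len(rows), [row["name"] for row in rows])
```

To read from the file instead of a string, `json.load(open("data.json"))` returns the same kind of structure.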
Running the code
Run the script:
python3 quickstart.py
Inspect schema.yml that has been generated:
vim schema.yml
See results of querying the Google BigQuery table:
json_doc table
SELECT * FROM `{schema_prefix}_example.json_doc`
# { "name": "Ana", "age": "30", "id": "456", "_dlt_load_id": "1654787700.406905", "_dlt_id": "5b018c1ba3364279a0ca1a231fbd8d90"}
# { "name": "Bob", "age": "30", "id": "455", "_dlt_load_id": "1654787700.406905", "_dlt_id": "afc8506472a14a529bf3e6ebba3e0a9e"}
json_doc__children table
SELECT * FROM `{schema_prefix}_example.json_doc__children` LIMIT 1000
# {"name": "Bill", "id": "625", "_dlt_parent_id": "5b018c1ba3364279a0ca1a231fbd8d90", "_dlt_list_idx": "0", "_dlt_root_id": "5b018c1ba3364279a0ca1a231fbd8d90",
# "_dlt_id": "7993452627a98814cc7091f2c51faf5c"}
# {"name": "Bill", "id": "625", "_dlt_parent_id": "afc8506472a14a529bf3e6ebba3e0a9e", "_dlt_list_idx": "0", "_dlt_root_id": "afc8506472a14a529bf3e6ebba3e0a9e",
# "_dlt_id": "9a2fd144227e70e3aa09467e2358f934"}
# {"name": "Dave", "id": "621", "_dlt_parent_id": "afc8506472a14a529bf3e6ebba3e0a9e", "_dlt_list_idx": "1", "_dlt_root_id": "afc8506472a14a529bf3e6ebba3e0a9e",
# "_dlt_id": "28002ed6792470ea8caf2d6b6393b4f9"}
# {"name": "Elli", "id": "591", "_dlt_parent_id": "5b018c1ba3364279a0ca1a231fbd8d90", "_dlt_list_idx": "1", "_dlt_root_id": "5b018c1ba3364279a0ca1a231fbd8d90",
# "_dlt_id": "d18172353fba1a492c739a7789a786cf"}
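The two result sets above suggest how a nested document is split into a parent table and a child table linked by generated keys. The sketch below imitates that layout in plain Python. It is not DLT's implementation: the random `uuid4` keys, the `flatten` helper, and the sample input are assumptions made for illustration, and it only handles one level of nesting where list items are objects.

```python
import uuid


def new_id():
    # Random hex key; an assumption, DLT's real key generation may differ.
    return uuid.uuid4().hex


def flatten(records, table="json_doc", load_id="demo-load"):
    """Split nested records into a parent table plus one child table per list field."""
    tables = {table: []}
    for record in records:
        row_id = new_id()
        row = {"_dlt_load_id": load_id, "_dlt_id": row_id}
        for key, value in record.items():
            if isinstance(value, list):
                # Each list item becomes a row in "<table>__<field>".
                child_table = f"{table}__{key}"
                for idx, item in enumerate(value):
                    tables.setdefault(child_table, []).append({
                        **item,
                        "_dlt_parent_id": row_id,  # link to the parent row
                        "_dlt_list_idx": idx,      # position in the original list
                        "_dlt_root_id": row_id,    # top-level row (same as parent here)
                        "_dlt_id": new_id(),
                    })
            else:
                row[key] = value
        tables[table].append(row)
    return tables


# Assumed input, reconstructed from the query results in this guide.
docs = [
    {"name": "Ana", "age": 30, "id": 456,
     "children": [{"name": "Bill", "id": 625}, {"name": "Elli", "id": 591}]},
    {"name": "Bob", "age": 30, "id": 455,
     "children": [{"name": "Bill", "id": 625}, {"name": "Dave", "id": 621}]},
]
tables = flatten(docs)
for name, rows in tables.items():
    print(name, len(rows))
```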
Joining the two tables above on autogenerated keys (i.e. p._dlt_id = c._dlt_parent_id)
select p.name, p.age, p.id as parent_id,
c.name as child_name, c.id as child_id, c._dlt_list_idx as child_order_in_list
from `{schema_prefix}_example.json_doc` as p
left join `{schema_prefix}_example.json_doc__children` as c
on p._dlt_id = c._dlt_parent_id
# { "name": "Ana", "age": "30", "parent_id": "456", "child_name": "Bill", "child_id": "625", "child_order_in_list": "0"}
# { "name": "Ana", "age": "30", "parent_id": "456", "child_name": "Elli", "child_id": "591", "child_order_in_list": "1"}
# { "name": "Bob", "age": "30", "parent_id": "455", "child_name": "Bill", "child_id": "625", "child_order_in_list": "0"}
# { "name": "Bob", "age": "30", "parent_id": "455", "child_name": "Dave", "child_id": "621", "child_order_in_list": "1"}
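To make the join logic concrete, here is a small Python sketch that performs the same left join in memory on the rows shown above. It runs against literal copies of those rows rather than BigQuery, and `left_join` is a hypothetical helper written for this illustration.

```python
# Rows copied from the query results above (only the columns the join needs).
parents = [
    {"name": "Ana", "age": "30", "id": "456", "_dlt_id": "5b018c1ba3364279a0ca1a231fbd8d90"},
    {"name": "Bob", "age": "30", "id": "455", "_dlt_id": "afc8506472a14a529bf3e6ebba3e0a9e"},
]
children = [
    {"name": "Bill", "id": "625", "_dlt_parent_id": "5b018c1ba3364279a0ca1a231fbd8d90", "_dlt_list_idx": "0"},
    {"name": "Bill", "id": "625", "_dlt_parent_id": "afc8506472a14a529bf3e6ebba3e0a9e", "_dlt_list_idx": "0"},
    {"name": "Dave", "id": "621", "_dlt_parent_id": "afc8506472a14a529bf3e6ebba3e0a9e", "_dlt_list_idx": "1"},
    {"name": "Elli", "id": "591", "_dlt_parent_id": "5b018c1ba3364279a0ca1a231fbd8d90", "_dlt_list_idx": "1"},
]


def left_join(parents, children):
    """Left join parents to children on _dlt_id = _dlt_parent_id."""
    by_parent = {}
    for child in children:
        by_parent.setdefault(child["_dlt_parent_id"], []).append(child)

    joined = []
    for parent in parents:
        # A parent without children still yields one row with empty child columns.
        for child in by_parent.get(parent["_dlt_id"]) or [None]:
            joined.append({
                "name": parent["name"],
                "age": parent["age"],
                "parent_id": parent["id"],
                "child_name": child["name"] if child else None,
                "child_id": child["id"] if child else None,
                "child_order_in_list": child["_dlt_list_idx"] if child else None,
            })
    return joined


result = left_join(parents, children)
for row in result:
    print(row)
```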
Next steps
- Replace data.json with data you want to explore
- Check that the inferred types are correct in schema.yml
- Set up your own Google BigQuery warehouse (and replace the credentials)
- Use this new clean staging layer as the starting point for a semantic layer / analytical model (e.g. using dbt)
File details
Details for the file python-dlt-0.1.0rc6.tar.gz.
File metadata
- Download URL: python-dlt-0.1.0rc6.tar.gz
- Upload date:
- Size: 399.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.1.12 CPython/3.8.11 Linux/4.19.128-microsoft-standard
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 58ad0f9d76b159b08a99c3c0581a9bc83bb86fbca2f13d7cfce1706a6accede7 |
| MD5 | 72cc407efe812e9f2ef1e8ca586183b2 |
| BLAKE2b-256 | 5909703f6aaf2b4a669254d3255419379ee5b1cfbba13d3de3f90dbb80841732 |
File details
Details for the file python_dlt-0.1.0rc6-py3-none-any.whl.
File metadata
- Download URL: python_dlt-0.1.0rc6-py3-none-any.whl
- Upload date:
- Size: 458.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.1.12 CPython/3.8.11 Linux/4.19.128-microsoft-standard
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6b4e82f582e194bea16f274c46c8484f7affd600d3a985dbed6234808d71d314 |
| MD5 | bbbbb6fe1675d1216b7e6de21707472b |
| BLAKE2b-256 | 73d7640a80f126754c89e870808d2092e4fad760f9bf3be75666c513a2e9c9b9 |