DLT is an open-source, Python-native, scalable data loading framework that requires no DevOps effort to run.
Quickstart Guide: Data Load Tool (DLT)
TL;DR: This guide shows you how to load a JSON document into Google BigQuery using DLT.
Please open a pull request here if there is something you can improve about this quickstart.
Grab the demo
Clone the example repository:
git clone https://github.com/scale-vector/dlt-quickstart-example.git
Enter the directory:
cd dlt-quickstart-example
Open the files in your favorite IDE / text editor:

- data.json (i.e. the JSON document you will load)
- credentials.json (i.e. contains the credentials to our demo Google BigQuery warehouse)
- quickstart.py (i.e. the script that uses DLT)
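For reference, the JSON document has roughly this shape (reconstructed here from the query results shown later in this guide; the exact file in the repository may differ):

```python
import json

# Two parent records, each with a nested "children" list. DLT loads the
# parents into the `json_doc` table and the lists into `json_doc__children`.
data = [
    {"name": "Ana", "age": 30, "id": 456,
     "children": [{"name": "Bill", "id": 625},
                  {"name": "Elli", "id": 591}]},
    {"name": "Bob", "age": 30, "id": 455,
     "children": [{"name": "Bill", "id": 625},
                  {"name": "Dave", "id": 621}]},
]

with open("data.json", "w") as f:
    json.dump(data, f, indent=2)
```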
Set up a virtual environment
Ensure you are using either Python 3.8 or 3.9:
python3 --version
Create a new virtual environment:
python3 -m venv ./env
Activate the virtual environment:
source ./env/bin/activate
Install DLT and support for the target data warehouse
Install DLT using pip:
pip3 install -U python-dlt
Install support for Google BigQuery:
pip3 install -U python-dlt[gcp]
Understanding the code

- Configure DLT
- Create a DLT pipeline
- Load the data from the JSON document
- Pass the data to the DLT pipeline
- Use DLT to load the data
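The interesting step is the normalization: DLT flattens each nested list into a child table and links it back to the parent row with generated keys. Here is a simplified illustration of that idea (not dlt's actual implementation; the hashing below is a made-up stand-in for dlt's row hashing):

```python
import hashlib

def row_id(row: dict) -> str:
    # Hypothetical stand-in for dlt's deterministic row hashing.
    return hashlib.md5(repr(sorted(row.items())).encode()).hexdigest()

def normalize(docs):
    """Split documents into a parent table and a child table,
    linked the way json_doc and json_doc__children are linked."""
    parents, children = [], []
    for doc in docs:
        parent = {k: v for k, v in doc.items() if k != "children"}
        parent["_dlt_id"] = row_id(parent)
        parents.append(parent)
        for idx, child in enumerate(doc.get("children", [])):
            children.append({**child,
                             "_dlt_parent_id": parent["_dlt_id"],
                             "_dlt_list_idx": idx})
    return parents, children

parents, children = normalize([
    {"name": "Ana", "age": 30, "id": 456,
     "children": [{"name": "Bill", "id": 625}, {"name": "Elli", "id": 591}]},
])
print(children[0]["_dlt_list_idx"], children[0]["_dlt_parent_id"] == parents[0]["_dlt_id"])
# → 0 True
```

This is why the child rows in the output below carry `_dlt_parent_id` and `_dlt_list_idx` columns: they preserve the link to the parent record and the original position in the list.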
Running the code
Run the quickstart script in the /examples folder:
python3 quickstart.py
Inspect schema.yml, which is printed by the script and also written to a file:
vim schema.yml
See the results of querying the json_doc table in Google BigQuery:

SELECT * FROM `{schema_prefix}_example.json_doc`

# { "name": "Ana", "age": "30", "id": "456", "_dlt_load_id": "1654787700.406905", "_dlt_id": "5b018c1ba3364279a0ca1a231fbd8d90"}
# { "name": "Bob", "age": "30", "id": "455", "_dlt_load_id": "1654787700.406905", "_dlt_id": "afc8506472a14a529bf3e6ebba3e0a9e"}
And the json_doc__children table:

SELECT * FROM `{schema_prefix}_example.json_doc__children` LIMIT 1000
# {"name": "Bill", "id": "625", "_dlt_parent_id": "5b018c1ba3364279a0ca1a231fbd8d90", "_dlt_list_idx": "0", "_dlt_root_id": "5b018c1ba3364279a0ca1a231fbd8d90",
# "_dlt_id": "7993452627a98814cc7091f2c51faf5c"}
# {"name": "Bill", "id": "625", "_dlt_parent_id": "afc8506472a14a529bf3e6ebba3e0a9e", "_dlt_list_idx": "0", "_dlt_root_id": "afc8506472a14a529bf3e6ebba3e0a9e",
# "_dlt_id": "9a2fd144227e70e3aa09467e2358f934"}
# {"name": "Dave", "id": "621", "_dlt_parent_id": "afc8506472a14a529bf3e6ebba3e0a9e", "_dlt_list_idx": "1", "_dlt_root_id": "afc8506472a14a529bf3e6ebba3e0a9e",
# "_dlt_id": "28002ed6792470ea8caf2d6b6393b4f9"}
# {"name": "Elli", "id": "591", "_dlt_parent_id": "5b018c1ba3364279a0ca1a231fbd8d90", "_dlt_list_idx": "1", "_dlt_root_id": "5b018c1ba3364279a0ca1a231fbd8d90",
# "_dlt_id": "d18172353fba1a492c739a7789a786cf"}
Join the two tables above on the autogenerated keys (i.e. p._dlt_id = c._dlt_parent_id):
select p.name, p.age, p.id as parent_id,
c.name as child_name, c.id as child_id, c._dlt_list_idx as child_order_in_list
from `{schema_prefix}_example.json_doc` as p
left join `{schema_prefix}_example.json_doc__children` as c
on p._dlt_id = c._dlt_parent_id
# { "name": "Ana", "age": "30", "parent_id": "456", "child_name": "Bill", "child_id": "625", "child_order_in_list": "0"}
# { "name": "Ana", "age": "30", "parent_id": "456", "child_name": "Elli", "child_id": "591", "child_order_in_list": "1"}
# { "name": "Bob", "age": "30", "parent_id": "455", "child_name": "Bill", "child_id": "625", "child_order_in_list": "0"}
# { "name": "Bob", "age": "30", "parent_id": "455", "child_name": "Dave", "child_id": "621", "child_order_in_list": "1"}
Next steps

- Replace data.json with data you want to explore
- Check that the inferred types are correct in schema.yml
- Set up your own Google BigQuery warehouse (and replace the credentials)
- Use this new clean staging layer as the starting point for a semantic layer / analytical model (e.g. using dbt)