
Project description


DLT

DLT enables simple, Python-native data pipelining for data professionals.

DLT is an open-source, Python-native, scalable data loading framework that requires no DevOps effort to run.

Quickstart guide

How does it work?

DLT aims to simplify data loading for everyone.

To achieve this, we take into account the progressive steps of data pipelining:

1. Data discovery, typing, schema, metadata

When we create a pipeline, we start by grabbing data from the source.

Usually, the source metadata is lacking, so we need to look at the actual data to understand what it is and how to ingest it.

To facilitate this, DLT includes several features:

  • Auto-unpack nested JSON if desired
  • Generate an inferred schema with data types and load the data as-is for inspection in your warehouse
  • Use an adjusted schema for follow-up loads, to better type and filter your data after visual inspection (this also solves the dynamic typing of Pandas DataFrames)
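The unpacking and inference steps above can be sketched in plain Python. This is a hypothetical illustration of the idea, not the python-dlt API; the function names `flatten` and `infer_schema` and the `__` column separator are assumptions for the example.

```python
# Hypothetical sketch (not the python-dlt API): unpack a nested JSON record
# into flat column names, then infer a simple type name per column.
def flatten(record, prefix=""):
    """Unpack nested dicts into flat column names joined with '__'."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}__{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat

def infer_schema(record):
    """Map each flattened field to an inferred type name."""
    return {name: type(value).__name__ for name, value in flatten(record).items()}

row = {"id": 1, "user": {"name": "ada", "active": True}}
print(infer_schema(row))  # {'id': 'int', 'user__name': 'str', 'user__active': 'bool'}
```

The inferred schema can then be exported, adjusted by hand, and applied on follow-up loads.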

2. Safe, scalable loading

When we load data, many things can interrupt the process, so we want to be able to retry safely without leaving artefacts in the data.

Additionally, the data size is often unknown in advance, making it a challenge to match the loading infrastructure to the data.

With good pipelining design, safe loading becomes a non-issue.

  • Idempotency: The data pipeline supports idempotent loads, so there is no risk of data duplication.
  • Atomicity: The data is either loaded or not. Partial loads accumulate in the s3/storage buffer, which is committed to the warehouse/catalogue in full once finished. If something fails, the buffer is never partially committed.
  • Data-size agnostic: By using generators (e.g. incremental downloading) and online storage as a buffer, it can incrementally process sources of any size without running into worker-machine size limitations.
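The three properties above can be sketched together in a toy loader. This is an illustrative sketch only; the names `source`, `load`, `load_id`, and the in-memory `warehouse` dict are assumptions standing in for a real storage buffer and catalogue, not the python-dlt API.

```python
# Illustrative sketch (not the python-dlt API): a generator source streamed
# through a buffer in fixed-size chunks, with a load_id so a retried load is
# detected and skipped (idempotency), and a single commit at the end (atomicity).
from itertools import islice

def source():
    """A generator source: total size need not be known in advance."""
    for i in range(10):
        yield {"id": i}

def load(rows, load_id, committed, chunk_size=4):
    """Buffer rows in chunks; commit atomically, at most once per load_id."""
    if load_id in committed:        # retry of an already-committed load: no-op
        return committed[load_id]
    rows = iter(rows)
    buffer = []
    while chunk := list(islice(rows, chunk_size)):
        buffer.extend(chunk)        # staged in the buffer, not yet visible
    committed[load_id] = buffer     # single atomic commit once fully staged
    return buffer

warehouse = {}
first = load(source(), "load-1", warehouse)
retry = load(source(), "load-1", warehouse)   # safe retry, no duplication
print(len(first), first is retry)  # 10 True
```

Because the source is a generator, only one chunk needs to fit on the worker at a time, regardless of total source size.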

3. Modelling and analysis

  • Instantiate a dbt package with the source schema, enabling you to skip the dbt setup and go straight to SQL modelling.

4. Data contracts

  • If using an explicit schema, you can validate the incoming data against it. This is particularly useful when ingesting untyped data such as Pandas DataFrames, JSON from APIs, or documents from NoSQL stores.
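A minimal data-contract check might look like the following. This is a hypothetical sketch, not the python-dlt API; the `SCHEMA` mapping and the `validate` helper are assumptions for the example.

```python
# Hypothetical sketch (not the python-dlt API): validate an untyped incoming
# record against an explicit schema before loading it.
SCHEMA = {"id": int, "email": str}

def validate(record, schema=SCHEMA):
    """Raise on missing fields or type mismatches; return the record if valid."""
    for field, expected in schema.items():
        if field not in record:
            raise ValueError(f"missing field: {field}")
        if not isinstance(record[field], expected):
            raise TypeError(f"{field}: expected {expected.__name__}")
    return record

validate({"id": 1, "email": "a@b.c"})        # passes
try:
    validate({"id": "1", "email": "a@b.c"})  # wrong type for id
except TypeError as exc:
    print(exc)  # id: expected int
```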

5. Maintenance & Updates

  • Auto schema migration: What do you do when a new field appears, or an existing one changes type? With auto schema migration you can default to ingesting this data, or throw a validation error.
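The two policies above can be sketched like this. Again a hypothetical illustration, not the python-dlt API; `migrate` and the `on_new_field` policy flag are assumed names for the example.

```python
# Hypothetical sketch (not the python-dlt API): when a record carries a field
# the schema does not know, either evolve the schema or reject the record.
def migrate(schema, record, on_new_field="evolve"):
    """Return an updated schema, or raise if new fields are not allowed."""
    for field, value in record.items():
        if field not in schema:
            if on_new_field == "evolve":
                schema = {**schema, field: type(value).__name__}  # ingest it
            else:
                raise ValueError(f"unexpected field: {field}")
    return schema

schema = {"id": "int"}
schema = migrate(schema, {"id": 1, "country": "DE"})  # schema evolves
print(schema)  # {'id': 'int', 'country': 'str'}
```

With `on_new_field` set to anything but `"evolve"`, the same call would raise, which is the "throw a validation error" default mentioned above.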

Why?

Data loading is at the base of the data work pyramid.

The current ecosystem of tools follows an old paradigm where the data pipeline creator is a software engineer, while the data pipeline user is an analyst.

In the current world, the data analyst needs to solve problems end to end, including loading.

Currently there are no simple frameworks to achieve this, only clunky applications that need engineering and DevOps expertise to install, run, manage, and scale. The reason for this is often an artificial monetisation wedge (open source, but pay to have it managed).

Additionally, these existing loaders only load data sources for which somebody developed an extractor, requiring a software developer once again.

DLT aims to bring loading into the hands of analysts with none of the unreasonable redundancy and waste of the modern data platform.

Additionally, the source schemas will be compatible across the community, creating the possibility to share reusable analysis and modelling back with the open-source community without creating tool-based vendor lock-in.

Project details



Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

python-dlt-0.1.0a0.tar.gz (63.7 kB)

Uploaded Source

Built Distribution

python_dlt-0.1.0a0-py3-none-any.whl (86.3 kB)

Uploaded Python 3

File details

Details for the file python-dlt-0.1.0a0.tar.gz.

File metadata

  • Download URL: python-dlt-0.1.0a0.tar.gz
  • Upload date:
  • Size: 63.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.1.12 CPython/3.8.11 Linux/4.19.128-microsoft-standard

File hashes

Hashes for python-dlt-0.1.0a0.tar.gz
  • SHA256: 456ae2d09d4126241439e90a94693f7e097b8985e0dcde0d2a5703e7f4aff5fb
  • MD5: 342b93e89a2f96ecc3f63fe6a63b9d3b
  • BLAKE2b-256: 0e9f7f1cb577f150c629bf942baf8ce46d2e70fee319fcb6fca4b11b7b5ad511


File details

Details for the file python_dlt-0.1.0a0-py3-none-any.whl.

File metadata

  • Download URL: python_dlt-0.1.0a0-py3-none-any.whl
  • Upload date:
  • Size: 86.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.1.12 CPython/3.8.11 Linux/4.19.128-microsoft-standard

File hashes

Hashes for python_dlt-0.1.0a0-py3-none-any.whl
  • SHA256: 5733b63519c306c94b4aaa2fd5c7f44b0290e3ff1826d768fb714d309c2e9a5e
  • MD5: 07f8b4b203b45a1ca0db4382f9f4c248
  • BLAKE2b-256: 6e57896122bd06eeb21b59a145ed9334339996c1ddfa4cb666fb3b13098cecb4

