
Tuberia... when data engineering meets software engineering





📚 Documentation: https://aidictive.github.io/tuberia

⌨️ Source Code: https://github.com/aidictive/tuberia


🤔 What is this?

Tuberia was born from the need to bring the worlds of data engineering and software engineering closer together. Here is a list of common problems in data projects:

  • Loooooong SQL queries impossible to understand/test.
  • A lot of duplicate code due to the difficulty of reusing it in SQL queries.
  • Lack of tests, sometimes because the framework in use does not make testing easy.
  • Lack of documentation.
  • Discrepancies between the existing documentation and the latest deployed code.
  • A set of notebooks deployed under the Databricks Shared folder.
  • A generic notebook with utility functions.
  • Use of drag-and-drop frameworks that limit the developer's creativity.
  • Months of intense work to migrate existing pipelines from one orchestrator to another (e.g. from Airflow to Prefect, from Databricks Jobs to Data Factory...).

Tuberia aims to solve all these problems and many others.

🤓 How does it work?

You can view Tuberia as if it were a compiler. Instead of compiling a programming language, it compiles the steps necessary for your data pipeline to run successfully.

Tuberia is not an orchestrator, but it lets you run the Python code you write on any existing orchestrator: Airflow, Prefect, Databricks Jobs, Data Factory...
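To make the compiler analogy more concrete, here is a minimal sketch in plain Python. It is not Tuberia's actual API: the names `Table`, `dependencies` and `compile_plan` are assumptions made for illustration only. The sketch assumes that each table declares its upstream tables as fields, so that a small "compiler" can walk those fields and return the steps in an order that any orchestrator could execute.

```python
from dataclasses import dataclass, field, fields


@dataclass(frozen=True)
class Table:
    """Hypothetical base class: one node in the pipeline graph."""

    def dependencies(self):
        # Any field that holds another Table is an upstream dependency.
        return [
            getattr(self, f.name)
            for f in fields(self)
            if isinstance(getattr(self, f.name), Table)
        ]


@dataclass(frozen=True)
class Range(Table):
    n: int = 10


@dataclass(frozen=True)
class DoubleRange(Table):
    range: Range = field(default_factory=Range)


def compile_plan(target):
    """Return every table needed to build `target`, dependencies first."""
    plan = []

    def visit(table):
        if table in plan:
            return  # Already scheduled by another branch of the graph.
        for upstream in table.dependencies():
            visit(upstream)
        plan.append(table)

    visit(target)
    return plan


plan = compile_plan(DoubleRange())
step_names = [type(step).__name__ for step in plan]
print(step_names)  # ['Range', 'DoubleRange']
```

The resulting list is only an ordered plan. Turning each step into an Airflow task, a Prefect task or a Databricks job would be a separate translation layer, which is what keeps the pipeline definition independent of the orchestrator.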

Tuberia abstracts away where the code is executed, but it defines precisely which steps are needed to execute it. For example, the following code builds a PySpark DataFrame with the range function and creates a Delta table from it.

import pyspark.sql.functions as F

from tuberia import PySparkTable, run


class Range(PySparkTable):
    """Table with numbers from 0 to `n - 1`.

    Attributes:
        n: Number of rows in the table (exclusive upper bound).

    """
    n: int = 10

    def df(self):
        return self.spark.range(self.n).withColumn("id", F.col(self.schema.id))


class DoubleRange(PySparkTable):
    range: Range = Range()

    def df(self):
        return self.range.read().withColumn("id", F.col("id") * 2)


run(DoubleRange())

!!! warning

The code above may not work yet and it may change. Please note that this
project is at an early stage of development.

All docstrings included in the code are used to generate documentation about your data pipeline. That information, together with the results of data expectations and data quality rules, helps you keep the documentation complete and up to date.
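As a rough illustration of how docstrings can become documentation, the sketch below collects the docstrings of a few classes with the standard library `inspect` module. The `render_docs` helper and the two placeholder classes are hypothetical and are not part of Tuberia; the real project may generate its documentation in a different way.

```python
import inspect


class Range:
    """Table with consecutive integers.

    Attributes:
        n: Number of rows in the table.
    """

    n: int = 10


class DoubleRange:
    """Table with every value of `Range` multiplied by two."""


def render_docs(tables):
    """Build a plain-text document from the docstrings of `tables`."""
    sections = []
    for table in tables:
        # inspect.getdoc removes the indentation that docstrings carry.
        doc = inspect.getdoc(table) or "(undocumented)"
        name = table.__name__
        sections.append(f"{name}\n{'-' * len(name)}\n{doc}")
    return "\n\n".join(sections)


docs = render_docs([Range, DoubleRange])
print(docs)
```

Because the text is read from the classes themselves, the documentation is regenerated from whatever code is deployed, which is one way to avoid the drift between documentation and code.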

Besides that, as you have seen, Tuberia is pure Python, so writing unit tests and data tests is easy. Programming gurus will enjoy data engineering again!
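The sketch below shows the kind of unit test that a pure Python design makes possible. It uses plain Python lists instead of Spark DataFrames and does not use Tuberia itself: `Range`, `DoubleRange` and `FakeRange` are simplified, hypothetical stand-ins. The idea is that a table receives its upstream table as an attribute, so a test can replace it with a small fake.

```python
class Range:
    """Hypothetical stand-in for a table; rows are plain integers."""

    def __init__(self, n=10):
        self.n = n

    def read(self):
        return list(range(self.n))


class DoubleRange:
    """Doubles every value read from its upstream table."""

    def __init__(self, source=None):
        self.source = source if source is not None else Range()

    def rows(self):
        return [value * 2 for value in self.source.read()]


class FakeRange:
    """Test double with fixed rows, so no real table is needed."""

    def read(self):
        return [1, 2, 3]


def test_double_range_doubles_every_value():
    assert DoubleRange(source=FakeRange()).rows() == [2, 4, 6]


def test_double_range_with_small_range():
    assert DoubleRange(source=Range(n=3)).rows() == [0, 2, 4]


# The tests are plain functions, so pytest can collect them,
# or they can be called directly as done here.
test_double_range_doubles_every_value()
test_double_range_with_small_range()
```

With real PySpark tables the same pattern would apply, with the fake returning a small DataFrame created in a local Spark session.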


