Your data catalog as code and one schema to rule them all.

hela: write your data catalog as code

You probably already have your data job scripts under version control, but what about your data catalog? The answer: write your data catalog as code! Storing your data catalog and data documentation as code makes your catalog searchable, referenceable, reliable and platform agnostic, and sets you up for easy collaboration and much more. This library is built to fit both small and large data landscapes, but is happiest when included from the start.

Hela (or Hel) is the collector of souls in Norse mythology, and "hela" is the Swedish word for "whole" or "all of it". Hela is designed to give everyone a chance to build a data catalog, with a low entry barrier: pure Python code.

Installing

Using pip:

pip install hela

Using poetry:

poetry add hela

Roadmap

These are upcoming features, in no particular order. Contributions towards these milestones are highly appreciated! To read more about contributing, check out CONTRIBUTING.md.

  • Search functionality in web app
  • More integrations (Snowflake, Redshift)
  • More feature-rich dataset classes
  • Data lineage functionality (both visualized in notebooks and web app)
  • Prettier docs page

(Mega) Quick start

If you want to read more, check out the docs page. If you do not have the patience for that, the following is all you need to get started.

First of all, build your own dataset class by inheriting from the BaseDataset class. This class will hold most of your project-specific functionality, such as read/write and authentication.

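# Import paths below are an assumption about the package layout; adjust if they differ
from hela import BaseDataset, Col
from hela.data_types import String


class MyDatasetClass(BaseDataset):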
    def __init__(
        self,
        name: str,  # Required
        description: str,  # Optional but recommended
        columns: list,  # Optional but recommended
        rich_description_path: str = None,  # Optional, used for web app
        partition_cols: list = None,  # Optional but recommended
        # folder: str = None, # Only do one of either folder or database
        database: str = None,  # Optional, can also be enriched via Catalog
    ) -> None:
        super().__init__(
            name,
            data_type='bigquery',
            folder=None,
            database=database,
            description=description,
            rich_description_path=rich_description_path,
            partition_cols=partition_cols,
            columns=columns
        )
        # Do more of your own init stuff

    def my_func(self) -> None:
        # Your own dataset function
        pass

# Now instantiate your dataset class with one example column
my_dataset = MyDatasetClass('my_dataset', 'An example dataset.', [
    Col('my_column', String(), 'An example column.')
])

Now that you have a dataset class and have instantiated your first dataset, you can start populating your data catalog.

from hela import Catalog

class MyCatalog(Catalog):
    my_dataset = my_dataset
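
To make the pattern concrete without depending on hela itself, here is a minimal standard-library sketch of the general idea behind a catalog class: datasets are declared as class attributes and then collected by introspection. The names `SketchColumn`, `SketchDataset`, `SketchCatalog` and `datasets()` are hypothetical stand-ins made up for this illustration; they are not hela's API and do not claim to mirror its implementation.

```python
from dataclasses import dataclass, field


@dataclass
class SketchColumn:
    # Hypothetical stand-in for a column definition (not hela's Col)
    name: str
    dtype: str
    description: str = ""


@dataclass
class SketchDataset:
    # Hypothetical stand-in for a dataset definition (not hela's BaseDataset)
    name: str
    description: str = ""
    columns: list = field(default_factory=list)


class SketchCatalog:
    """Base class whose subclasses declare datasets as class attributes."""

    @classmethod
    def datasets(cls) -> dict:
        # Collect every attribute declared directly on this class
        # that is a SketchDataset instance
        return {
            attr: value
            for attr, value in vars(cls).items()
            if isinstance(value, SketchDataset)
        }


class ExampleCatalog(SketchCatalog):
    my_dataset = SketchDataset(
        "my_dataset",
        "An example dataset.",
        [SketchColumn("my_column", "string", "An example column.")],
    )


found = ExampleCatalog.datasets()
print(sorted(found))  # → ['my_dataset']
```

Because the catalog is ordinary Python, it can be imported, searched and diffed like any other module, which is the property the "catalog as code" approach relies on.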

That's it! You now have a small catalog to keep building on. To view it as a web page, add the following code to a Python script; later you can run it from whichever CI/CD tool you use. This generates an index.html file that you can view in your browser or host on, for example, GitHub Pages.

from hela import generate_webpage

generate_webpage(MyCatalog, output_folder='.')
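
Once the page has been generated, a small standard-library helper can confirm that index.html exists and return a URL to open in a browser, which is also a handy check in a CI step. This is only a sketch: `page_url` is a hypothetical helper and not part of hela, and it assumes the output file is named index.html as described above.

```python
from pathlib import Path


def page_url(output_folder: str = ".") -> str:
    """Return a file:// URL for index.html inside output_folder.

    Raises FileNotFoundError if the page has not been generated.
    """
    index = Path(output_folder).resolve() / "index.html"
    if not index.is_file():
        raise FileNotFoundError(f"No index.html found in {index.parent}")
    return index.as_uri()
```

For local viewing, passing the result to `webbrowser.open` from the standard library, as in `webbrowser.open(page_url("."))`, opens the page in your default browser.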

To see what a bigger data catalog can look like, check out the showcase catalog.
