
Data Catalogs Made Easy


Build a custom data catalog in minutes



🔍️ 1. What is CatalogBuilder?

  • CatalogBuilder is a simple tool to generate & deploy a documentation website for your data assets.
  • It enables anyone at your company to quickly find the trusted data they are looking for.

💡 2. Why CatalogBuilder?

There are many open-source projects (Amundsen, OpenMetadata, DataHub, Metacat, Atlas) you could use to build such a catalog in-house. But because they offer a lot of advanced features, they are hard to deploy and manage if you're not a tech expert, and even harder to customize.

dbt docs is great for generating a documentation website on top of your dbt assets, but:

  • it focuses on dbt only (while you may also care about other sources and metadata)
  • it is very hard to customize (unless you're an Angular expert)
  • it can be slow.

👉 CatalogBuilder aims to offer a lightweight alternative for generating a documentation website on top of your data assets. It focuses on read-only data discovery and:

  1. ✔️ can be easily customized and deployed by less technical users
  2. ✔️ can therefore handle your company's very specific needs
  3. ✔️ is fast and lightweight
  4. ✔️ is built on top of the popular mkdocs-material Python library, which powers the documentation of many well-known projects (such as FastAPI).

💥 3. Getting Started with the catalog CLI

catalog is the CLI (command-line interface) of CatalogBuilder, used to generate, show & deploy the documentation.

3.1 Install catalog CLI 🛠️

pip install catalog-builder

3.2 Create your first documentation configuration 👨‍💻

catalog download dbt_gitlab_data_team

To get started, let's download an example catalog configuration from the GitHub repo and play with it. The above command downloads the catalogs/dbt_gitlab_data_team folder onto your laptop.

Inside the folder, you will find:

  • assets file: a file containing the list of assets you want to include in your documentation. It can be a parquet file named assets.parquet or a JSON Lines file named assets.jsonl (a minimal example record is shown after this list). Each asset in the file must have the following fields:
    • asset_type: for example table.
    • documentation_path: the path of the asset's page in the generated documentation. For example dataset_name/table_name.
    • data: a dict of attributes used to generate the documentation. For example {"name": "foo"}.
  • generate_assets_file.py: the Python script used to (re)generate the assets file.
  • requirements.txt: the Python requirements needed by generate_assets_file.py.
  • templates: a folder containing a Jinja markdown template for each asset_type. These templates are used to generate a markdown documentation file for each asset.
  • source_docs: a folder whose files are copied into the documentation as-is.
  • mkdocs.yml: the MkDocs configuration file used to build the documentation website from the generated markdown files.
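
To make the assets file format concrete, here is a minimal sketch that writes a one-record assets.jsonl. The field names (asset_type, documentation_path, data) are the ones listed above; the values and the attributes inside data are purely hypothetical.

```python
# Minimal sketch: write an assets.jsonl with one hypothetical asset record.
# The field names come from the documentation above; the values are made up.
import json

assets = [
    {
        "asset_type": "table",
        "documentation_path": "my_dataset/my_table",
        "data": {
            "name": "my_table",
            "description": "Daily snapshot of orders",
            "columns": [{"name": "order_id", "type": "string"}],
        },
    },
]

with open("assets.jsonl", "w") as f:
    for asset in assets:
        f.write(json.dumps(asset) + "\n")
```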

3.3 Build your catalog website 👾

catalog build dbt_gitlab_data_team

  1. For each asset in the assets file, the Jinja template matching its asset_type is rendered with the asset's data to produce a markdown file, which is written into catalogs/dbt_gitlab_data_team/docs/ at documentation_path (see the conceptual sketch after this list).
  2. All files in catalogs/dbt_gitlab_data_team/source_docs/ are copied into catalogs/dbt_gitlab_data_team/docs/.
  3. MkDocs then builds the documentation website from the markdown files into catalogs/dbt_gitlab_data_team/site (using the mkdocs.yml configuration file).
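
Conceptually, step 1 behaves like the sketch below. This is only an illustration of the rendering logic, not CatalogBuilder's actual code, and it assumes templates are named after the asset_type (e.g. table.md).

```python
# Conceptual sketch of the build step -- not CatalogBuilder's actual implementation.
import json
import pathlib
import jinja2

catalog_dir = pathlib.Path("catalogs/dbt_gitlab_data_team")
env = jinja2.Environment(loader=jinja2.FileSystemLoader(str(catalog_dir / "templates")))

with open(catalog_dir / "assets.jsonl") as f:
    for line in f:
        asset = json.loads(line)
        # Assumption: one template per asset_type, named e.g. table.md.
        template = env.get_template(f"{asset['asset_type']}.md")
        markdown = template.render(**asset["data"])
        destination = catalog_dir / "docs" / f"{asset['documentation_path']}.md"
        destination.parent.mkdir(parents=True, exist_ok=True)
        destination.write_text(markdown)
```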

3.4 Run your catalog website locally ⚡

catalog serve dbt_gitlab_data_team

You can now see the generated documentation website at http://localhost:8000.

3.5 Deploy the documentation website! 🚀

A. To deploy on GitHub pages:

catalog deploy github-pages dbt_gitlab_data_team

MkDocs will deploy the site to GitHub Pages (this only works if your catalog lives in a GitHub repository).

B. To deploy to a Google Cloud Storage bucket:

catalog deploy gcs dbt_gitlab_data_team

All files in catalogs/dbt_gitlab_data_team/site will be copied to the bucket defined by the site_url value in catalogs/dbt_gitlab_data_team/mkdocs.yml. For instance, if the site URL is http://catalogs.unytics.io/dbt_gitlab_data_team/, all files under catalogs/dbt_gitlab_data_team/site will be copied to gs://catalogs.unytics.io/dbt_gitlab_data_team/.
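
If you want to reproduce that copy yourself (for example from a CI job), a rough equivalent using the google-cloud-storage client looks like the sketch below. The bucket name and prefix are the ones from the example above; this is not what catalog deploy gcs actually runs.

```python
# Rough equivalent of the GCS deploy step -- a sketch, not the actual `catalog deploy gcs` code.
# Assumes google-cloud-storage is installed and credentials are configured.
import pathlib
from google.cloud import storage

site_dir = pathlib.Path("catalogs/dbt_gitlab_data_team/site")
bucket_name = "catalogs.unytics.io"   # derived from the site_url host in mkdocs.yml
prefix = "dbt_gitlab_data_team"       # derived from the site_url path

client = storage.Client()
bucket = client.bucket(bucket_name)

for path in site_dir.rglob("*"):
    if path.is_file():
        blob = bucket.blob(f"{prefix}/{path.relative_to(site_dir).as_posix()}")
        blob.upload_from_filename(str(path))
```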

C. To deploy elsewhere:

You can follow the deployment instructions in the MkDocs documentation.


💎 4. Generate your dbt documentation

To generate a documentation website for your own dbt project, do the following:

  1. Change directory to your dbt project directory.
  2. Download the catalogs/dbt documentation example by running catalog download dbt.
  3. Run dbt docs generate to compute target/manifest.json and target/catalog.json.
  4. Generate the assets file by running python catalogs/dbt/generate_assets_file.py. The script parses target/manifest.json and target/catalog.json to generate the assets file in the expected format (a rough sketch of this step is shown after this list).
  5. Run catalog serve dbt to build the website and view it locally.
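
For reference, the kind of parsing performed in step 4 looks roughly like the sketch below. It assumes the standard dbt manifest.json layout (a nodes dict keyed by unique id); the real catalogs/dbt/generate_assets_file.py may organize things differently.

```python
# Illustrative sketch of turning dbt artifacts into asset records --
# the real catalogs/dbt/generate_assets_file.py may differ.
import json

with open("target/manifest.json") as f:
    manifest = json.load(f)

assets = []
for node in manifest["nodes"].values():
    if node["resource_type"] != "model":
        continue
    assets.append({
        "asset_type": "model",  # assumption: one template per dbt resource type
        "documentation_path": f"{node['schema']}/{node['name']}",
        "data": node,  # the whole node is made available to the template
    })

with open("catalogs/dbt/assets.jsonl", "w") as f:
    for asset in assets:
        f.write(json.dumps(asset) + "\n")
```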

Keep in touch 🧑‍💻

Join our Slack to ask questions, get help getting started, report a bug, suggest improvements, or simply have a chat 🙂.


👋 Contribute

Any contribution is more than welcome 🤗!

  • Add a ⭐ on the repo to show your support
  • Join our Slack and talk with us
  • Open an issue to report a bug or suggest improvements
  • Open a PR!
