Data Catalogs Made Easy

Build a custom data-catalog in minutes



🔍️ 1. What is CatalogBuilder?

  • CatalogBuilder is a simple tool to generate & deploy a documentation website for your data assets.
  • It enables anyone at your company to quickly find the trusted data they are looking for.

💡 2. Why CatalogBuilder?

There are many open-source projects (Amundsen, OpenMetadata, DataHub, Metacat, Atlas) for building such a catalog in-house. But because they offer many advanced features, they are hard to deploy and manage if you're not a tech expert. They can be even harder to customize.

dbt docs is great for generating a documentation website on top of your dbt assets, but:

  • it focuses on dbt only (while you are interested in other sources + metadata)
  • it is very hard to customize (unless you're an Angular expert)
  • it can be slow.

👉 CatalogBuilder aims to offer a lightweight alternative for generating a documentation website on top of your data assets. It focuses on read-only data discovery and:

  1. ✔️ can be easily customized and deployed by non-technical people
  2. ✔️ can therefore handle the very specific needs of your company
  3. ✔️ is fast and lightweight
  4. ✔️ is built on top of mkdocs-material, the popular Python library that many projects (such as FastAPI) use to publish their documentation.

💥 3. Getting Started with catalog CLI

catalog is the command-line interface (CLI) of CatalogBuilder. It generates, serves & deploys the documentation.

3.1 Install catalog CLI 🛠️

pip install catalog-builder

3.2 Create your first documentation configuration 👨‍💻

catalog download dbt_gitlab_data_team

To get started, let's download a catalog configuration example from the GitHub repo and play with it. The above command downloads the catalogs/dbt_gitlab_data_team folder to your machine.

You will find in the folder:

  • assets file: a file listing the assets you want to put in your documentation. It can be a Parquet file named assets.parquet or a JSON Lines file named assets.jsonl. Each asset in the file must have the following fields:
    • asset_type: for example: table.
    • documentation_path: the path of the asset page in the generated documentation. For example dataset_name/table_name.
    • data: a dict of attributes used to generate the documentation. For example {"name": "foo"}
  • generate_assets_file.py: the python script used to (re)generate the assets file.
  • requirements.txt: the python requirements needed by generate_assets_file.py.
  • templates: a folder containing one Jinja markdown template per asset_type. These templates are used to generate a markdown documentation file for each asset.
  • source_docs: a folder which includes files to include as-is in the documentation.
  • mkdocs.yml: the mkdocs configuration file used by mkdocs to build the documentation website from the generated markdown files.
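
To make the assets file format concrete, here is a minimal sketch that writes an assets.jsonl file with the three required fields. The helper name write_assets_jsonl and the sample asset are hypothetical, chosen only for illustration; the real generate_assets_file.py may work differently.

```python
import json
from pathlib import Path

# The three fields every asset must have, as listed above.
REQUIRED_FIELDS = ("asset_type", "documentation_path", "data")


def write_assets_jsonl(assets, path):
    """Write one JSON object per line after checking the required fields.

    Hypothetical helper, not part of CatalogBuilder.
    """
    for asset in assets:
        missing = [field for field in REQUIRED_FIELDS if field not in asset]
        if missing:
            raise ValueError(f"asset is missing fields: {missing}")
    lines = [json.dumps(asset) for asset in assets]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")


# Sample asset built from the example values given in the field list.
assets = [
    {
        "asset_type": "table",
        "documentation_path": "dataset_name/table_name",
        "data": {"name": "foo"},
    },
]
```

Calling write_assets_jsonl(assets, "assets.jsonl") would then produce a one-line JSON Lines file describing a single table asset.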

3.3 Build your catalog website 👾

catalog build dbt_gitlab_data_team
  1. For each asset in the assets file, the jinja template matching its asset_type is rendered with the asset's data to generate a markdown file, which is written into catalogs/dbt_gitlab_data_team/docs/ at documentation_path.
  2. All files in catalogs/dbt_gitlab_data_team/source_docs/ are copied into catalogs/dbt_gitlab_data_team/docs/.
  3. Mkdocs then builds the documentation website from the markdown files into catalogs/dbt_gitlab_data_team/site (using the mkdocs.yml configuration file).
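
The first two build steps can be sketched roughly as follows. This is an illustration under several assumptions, not CatalogBuilder's actual code: it uses Python's standard string.Template as a stand-in for jinja (so placeholders look like $name), it assumes templates are named <asset_type>.md, and it assumes a .md suffix is appended to documentation_path. The function name build_docs is hypothetical. The third step (the mkdocs build itself) is not covered.

```python
import shutil
from pathlib import Path
from string import Template  # stand-in for jinja, to keep the sketch stdlib-only


def build_docs(catalog_dir, assets):
    """Render one markdown file per asset, then copy source_docs as-is.

    Hypothetical helper illustrating steps 1 and 2 of the build.
    """
    catalog_dir = Path(catalog_dir)
    docs_dir = catalog_dir / "docs"

    # Step 1: pick the template by asset_type and render it with the asset data.
    for asset in assets:
        # Assumption: one template file per asset type, named <asset_type>.md.
        template_path = catalog_dir / "templates" / f"{asset['asset_type']}.md"
        template = Template(template_path.read_text(encoding="utf-8"))
        markdown = template.safe_substitute(asset["data"])

        # Assumption: the output file is documentation_path plus ".md".
        out_path = docs_dir / f"{asset['documentation_path']}.md"
        out_path.parent.mkdir(parents=True, exist_ok=True)
        out_path.write_text(markdown, encoding="utf-8")

    # Step 2: copy everything from source_docs into docs unchanged.
    source_docs = catalog_dir / "source_docs"
    if source_docs.is_dir():
        shutil.copytree(source_docs, docs_dir, dirs_exist_ok=True)
```

Keeping rendering and copying as two separate passes mirrors the order described above: generated pages land in docs/ first, and hand-written files from source_docs/ are layered on top.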

3.4 Run your catalog website locally ⚡

catalog serve dbt_gitlab_data_team

You can now see the generated documentation website at http://localhost:8000.

3.5 Deploy the documentation website! 🚀

A. To deploy on GitHub pages:

catalog deploy github-pages dbt_gitlab_data_team

Mkdocs will deploy the site to GitHub Pages (this only works if you are in a GitHub repository).

B. To deploy on Google Cloud Storage Bucket:

catalog deploy gcs dbt_gitlab_data_team

Mkdocs will copy all the files in catalogs/dbt_gitlab_data_team/site to the bucket defined by the site_url value in catalogs/dbt_gitlab_data_team/mkdocs.yml. For instance, if site_url is http://catalogs.unytics.io/dbt_gitlab_data_team/, all files under catalogs/dbt_gitlab_data_team/site are copied to gs://catalogs.unytics.io/dbt_gitlab_data_team/.
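
The mapping from site_url to the destination bucket path can be sketched as below. The function name gcs_destination is hypothetical, and the rule (host becomes the bucket name, URL path becomes the object prefix) is inferred from the example above rather than taken from CatalogBuilder's source.

```python
from urllib.parse import urlparse


def gcs_destination(site_url):
    """Turn a site_url into the gs:// location the site would be copied to.

    Hypothetical helper: the host is used as the bucket name and the
    URL path as the prefix inside the bucket.
    """
    parsed = urlparse(site_url)
    if not parsed.netloc:
        raise ValueError(f"site_url has no host: {site_url!r}")
    return f"gs://{parsed.netloc}{parsed.path}"


print(gcs_destination("http://catalogs.unytics.io/dbt_gitlab_data_team/"))
# prints gs://catalogs.unytics.io/dbt_gitlab_data_team/
```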

C. To deploy elsewhere:

You can follow these instructions from mkdocs.


💎 4. Generate your dbt documentation

To generate a documentation website for your own dbt project, do the following:

  1. Change directory to your dbt project directory
  2. Download the catalogs/dbt documentation example by running catalog download dbt.
  3. Run dbt docs generate to compute target/manifest.json and target/catalog.json.
  4. Generate the assets file by running python catalogs/dbt/generate_assets_file.py. The script will parse target/manifest.json and target/catalog.json to generate the assets file in the expected format.
  5. Run catalog serve dbt to build the website and show it locally.
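
To give an idea of what step 4 involves, here is a minimal sketch that turns already-loaded manifest.json and catalog.json contents into assets in the expected format. It is not the actual generate_assets_file.py: the function name dbt_assets, the choice to keep only models, the schema/name documentation path, and the fields placed in data are all assumptions made for illustration.

```python
def dbt_assets(manifest, catalog):
    """Build a list of assets from parsed dbt artifacts.

    Hypothetical helper. `manifest` and `catalog` are the dicts obtained by
    loading target/manifest.json and target/catalog.json with json.load.
    """
    assets = []
    for unique_id, node in manifest.get("nodes", {}).items():
        # Assumption: only dbt models are documented.
        if node.get("resource_type") != "model":
            continue

        # Column types come from catalog.json, keyed by the same unique_id.
        catalog_node = catalog.get("nodes", {}).get(unique_id, {})
        columns = [
            {"name": column["name"], "type": column.get("type")}
            for column in catalog_node.get("columns", {}).values()
        ]

        assets.append({
            "asset_type": "table",
            # Assumption: pages are organised as schema/model_name.
            "documentation_path": f"{node['schema']}/{node['name']}",
            "data": {
                "name": node["name"],
                "description": node.get("description", ""),
                "columns": columns,
            },
        })
    return assets
```

Each returned dict carries the asset_type, documentation_path and data fields required by the assets file, so the list could be written out as assets.jsonl one JSON object per line.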

Keep in touch 🧑‍💻

Join our Slack to ask questions, get help getting started, report a bug, suggest improvements, or simply have a chat 🙂.


👋 Contribute

Any contribution is more than welcome 🤗!

  • Add a ⭐ on the repo to show your support
  • Join our Slack and talk with us
  • Open an issue to report a bug or suggest improvements
  • Open a PR!

