
data_dictionary_builder

A Python library that automates database documentation — extract live schema metadata, generate dbt-compatible YAML, compare schemas across environments, and deliver PDF reports, all in a single import.



Installation

# Core library (SQLite works out of the box)
pip install data-dictionary-builder

# With the connectors you need
pip install "data-dictionary-builder[postgres]"
pip install "data-dictionary-builder[mysql]"
pip install "data-dictionary-builder[clickhouse]"
pip install "data-dictionary-builder[spanner]"
pip install "data-dictionary-builder[oracle]"
pip install "data-dictionary-builder[sqlserver]"

# Everything at once
pip install "data-dictionary-builder[all]"

Or use the CLI to install connectors after the fact:

ddgen install postgres
ddgen install clickhouse
ddgen install oracle
ddgen install sqlserver
ddgen install all

Supported Databases

Database               | Extra         | Driver
SQLite                 | (built-in)    | sqlite3 (stdlib)
PostgreSQL             | [postgres]    | psycopg2-binary
MySQL / MariaDB        | [mysql]       | PyMySQL
ClickHouse             | [clickhouse]  | clickhouse-connect (HTTP/HTTPS) · clickhouse-driver (native TCP, optional)
Oracle Database        | [oracle]      | oracledb (thin mode — no Oracle Client needed)
SQL Server / Azure SQL | [sqlserver]   | pymssql
Google Cloud Spanner   | [spanner]     | google-cloud-spanner

Quick Start

from data_dictionary_builder import MetadataExtractor, YAMLGenerator, DDHelper, ExecutionTimer

timer  = ExecutionTimer()
helper = DDHelper(".")      # creates models/, reports/json/, reports/pdf/

with timer.task("Extract"):
    with MetadataExtractor(
        db_type="postgres", host="localhost", port=5432,
        database="mydb", user="readonly", password="secret",
    ) as ext:
        db_meta = ext.extract_all_schemas(
            schema_filter=["public", "analytics"],
            parallel_workers=8,
        )

with timer.task("Generate YAML"):
    YAMLGenerator(output_dir=str(helper.models_dir)).generate_yaml_files(db_meta)

timer.summary()
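
The generated files follow the standard dbt `schema.yml` layout. An illustrative (not verbatim) example of what one generated model entry might look like — the exact keys emitted by YAMLGenerator may differ:

```yaml
version: 2

models:
  - name: orders
    description: ""            # fill in by hand; re-runs preserve it (see "Smart YAML merge")
    columns:
      - name: order_id
        description: ""
        data_type: integer
      - name: created_at
        description: ""
        data_type: timestamp
```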

CLI

# Show all commands and supported databases
ddgen --help

# Full module and API reference
ddgen features

# Check which connectors are installed
ddgen connectors

# Install a connector
ddgen install postgres
ddgen install clickhouse
ddgen install all

# Extract metadata and generate YAML in one step
ddgen extract --db-type postgres --host prod.db.io --database mydb --user readonly

# Compare two environments
ddgen compare --source-host prod.db.io --dest-host staging.db.io --source-database mydb

# Show library version and connector summary
ddgen info

# Show version number
ddgen --version

Schema Comparison

from data_dictionary_builder import SchemaComparator, DDHelper

helper = DDHelper(".")
report = SchemaComparator(
    source_config={"db_type": "postgres", "host": "prod-db", ...},
    destination_config={"db_type": "postgres", "host": "staging-db", ...},
).compare_and_generate_report("public", include_yaml_gaps=True)

json_path = helper.save_report(report)
pdf_path  = helper.compile_pdf(source_json=json_path)
helper.send_report_email(report=report, pdf_path=pdf_path, email_to="team@example.com")
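
Comparison normalises type aliases before diffing, so `int4` on Postgres and `integer` elsewhere do not register as drift. A minimal stdlib sketch of that idea — the alias table and function names here are illustrative, not the library's actual API:

```python
# Map vendor-specific type spellings onto canonical names before comparing.
TYPE_ALIASES = {
    "character varying": "varchar",
    "int4": "integer",
    "int": "integer",
    "float8": "double",
    "double precision": "double",
}

def normalize(db_type: str) -> str:
    """Lower-case, strip length qualifiers like (255), and resolve known aliases."""
    base = db_type.lower().split("(")[0].strip()
    return TYPE_ALIASES.get(base, base)

def diff_columns(source: dict, dest: dict) -> list:
    """Return (column, source_type, dest_type) tuples that differ after normalisation."""
    mismatches = []
    for col, src_type in source.items():
        dst_type = dest.get(col)
        if dst_type is not None and normalize(src_type) != normalize(dst_type):
            mismatches.append((col, src_type, dst_type))
    return mismatches
```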

Airflow Integration

DatabaseMetadata serialises to/from plain dicts for XCom:

@task
def extract():
    with MetadataExtractor(**config) as ext:
        return ext.extract_all_schemas(parallel_workers=8).to_dict()

@task
def generate_yaml(db_meta_dict):
    from data_dictionary_builder import DatabaseMetadata, YAMLGenerator
    YAMLGenerator("./models").generate_yaml_files(DatabaseMetadata.from_dict(db_meta_dict))

See tests/airflow_dag_example.py for a complete DAG.
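
The XCom contract above only requires that metadata survive a plain-dict round trip. A hedged sketch of that pattern using dataclasses — DatabaseMetadata's real fields and constructors will differ:

```python
from dataclasses import dataclass, field, asdict

@dataclass
class TableMeta:
    name: str
    columns: list = field(default_factory=list)

@dataclass
class SchemaMeta:
    tables: dict = field(default_factory=dict)

    def to_dict(self) -> dict:
        # asdict recurses into nested dataclasses, yielding JSON-safe plain dicts
        # that Airflow can pickle/serialise as XCom values.
        return asdict(self)

    @classmethod
    def from_dict(cls, data: dict) -> "SchemaMeta":
        return cls(tables={k: TableMeta(**v) for k, v in data["tables"].items()})

meta = SchemaMeta(tables={"orders": TableMeta("orders", ["id", "total"])})
roundtrip = SchemaMeta.from_dict(meta.to_dict())
```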


Key Features

  • Parallel extraction — ThreadPoolExecutor with configurable workers; ClickHouse uses 2 queries and PostgreSQL uses 5 queries per schema regardless of table count
  • Dual ClickHouse transport — HTTP/HTTPS via clickhouse-connect (default) or native TCP via clickhouse-driver; auto-detected, with dynamic port defaults based on transport and TLS
  • Schema filtering — exact, glob, prefix, suffix, contains, regex — mix freely
  • Smart YAML merge — re-running never overwrites descriptions you've written by hand
  • YAML-aware gap detection — documentation coverage checks read from your existing YAML files, so descriptions you've added are always recognised
  • Cross-database comparison — compare any two database types; type aliases normalised before diffing
  • PDF reports — paginated, no row limits, table of contents (requires reportlab)
  • Email delivery — SMTP with env-var credential fallback; PDF attached automatically
  • ExecutionTimer — named task timing with a formatted summary table
  • Server mode — omit database to scan all databases on a MySQL, ClickHouse, or PostgreSQL server
  • Rich CLI — ddgen extract, ddgen compare, ddgen features (full API reference), ddgen connectors, ddgen install
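
The schema-filter modes listed above can be mixed in one call. Conceptually, a matcher like this decides whether a schema passes — a stdlib approximation in which prefix, suffix, and contains fold into glob patterns ("foo*", "*foo", "*foo*"); it is not the library's real implementation:

```python
import fnmatch
import re

def matches(schema: str, pattern: str) -> bool:
    """Approximate the documented filter modes: regex, glob, and exact match."""
    if pattern.startswith("re:"):               # explicit regex, e.g. "re:^analytics_\\d+$"
        return re.fullmatch(pattern[3:], schema) is not None
    if any(ch in pattern for ch in "*?["):      # glob covers prefix/suffix/contains
        return fnmatch.fnmatch(schema, pattern)
    return schema == pattern                    # exact name

def filter_schemas(schemas, patterns):
    """Keep every schema that matches at least one pattern."""
    return [s for s in schemas if any(matches(s, p) for p in patterns)]
```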

Environment Variables

Set these in a .env file (see tests/.env.example) or in your shell:

# SMTP — used by DDHelper.send_report_email() when no credentials are passed explicitly
SMTP_HOST=smtp.gmail.com
SMTP_PORT=587
SMTP_USER=you@gmail.com
SMTP_PASSWORD=xxxx xxxx xxxx xxxx
EMAIL_TO=recipient@example.com
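
A hedged sketch of the env-var fallback pattern described above — the real DDHelper internals may differ, but the precedence is the usual one: explicit arguments win, otherwise values come from the environment:

```python
import os

def resolve_smtp_config(host=None, port=None, user=None, password=None) -> dict:
    """Explicit arguments take precedence; unset values fall back to SMTP_* env vars."""
    return {
        "host": host or os.getenv("SMTP_HOST"),
        "port": int(port or os.getenv("SMTP_PORT", "587")),
        "user": user or os.getenv("SMTP_USER"),
        "password": password or os.getenv("SMTP_PASSWORD"),
    }
```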

Documentation

Full user guide, API reference, and troubleshooting: DOCUMENTATION.md


License

MIT — free to use, modify, and distribute in personal and commercial projects.
