
data_dictionary_builder

A Python library that automates database documentation — extract live schema metadata, generate dbt-compatible YAML, compare schemas across environments, and deliver PDF reports, all in a single import.


What's new in v0.1.5

  • Custom output paths — DDHelper now supports explicit models_dir and reports_dir parameters. You can store your models and reports in completely separate locations, or redirect them to a temporary folder during CI/CD (see the sketch after this list).
  • MongoDB Support — extract metadata from MongoDB collections. Fields and types are automatically inferred by sampling documents. Supports _id as primary key and standard auth mechanisms. Install with pip install "data-dictionary-builder[mongodb]".
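
A minimal sketch of the new path parameters, assuming a CI job that should leave no artefacts behind (the tempfile usage is illustrative, not part of the library):

import tempfile
from data_dictionary_builder import DDHelper

# Redirect generated artefacts to a throwaway directory during CI/CD
scratch = tempfile.mkdtemp()
helper = DDHelper(
    models_dir=f"{scratch}/models",    # dbt-compatible YAML lands here
    reports_dir=f"{scratch}/reports",  # JSON/PDF comparison reports land here
)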

What's new in v0.1.4

  • Multi-recipient email delivery — email_to now accepts a list of addresses (e.g. ["alice@example.com", "bob@example.com"]) in addition to a single string. The EMAIL_TO environment variable supports comma-separated addresses (EMAIL_TO=alice@example.com,bob@example.com). All recipients receive the PDF attachment in one send.
  • Multi-target Slack delivery — slack_target now accepts a list of targets (e.g. ["#data-alerts", "U012AB3CD"]). The SLACK_NOTIFY_TARGET environment variable supports comma-separated values. Each target receives the full Block Kit report and optional PDF upload independently.
  • send_notification updated — both email_to and slack_target parameters accept strings or lists; the method routes correctly for either form with no API change.

What's new in v0.1.3

  • Slack notifications — deliver schema comparison reports directly to any Slack channel or DM alongside the existing email delivery. Use notification_type="slack" or "both" in send_notification(). Requires a Bot User OAuth Token (xoxb-…) and the slack extra: pip install "data-dictionary-builder[slack]".
  • Optimised metadata generation — parallel extraction is faster with improved thread scheduling and reduced connection overhead across all supported databases.
  • Smarter schema comparison — type normalisation has been expanded to cover more cross-database equivalences, reducing false-positive mismatches in mixed-engine pipelines.
  • Leaner exports — JSON metadata exports are more compact, and the to_dict() / from_dict() round-trip is validated automatically to ensure safe use in Airflow XCom and downstream catalog APIs.
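
A short round-trip sketch of that export path, assuming MetadataExtractor accepts the same connection keys shown in Quick Start (the SQLite file name is illustrative):

import json
from data_dictionary_builder import MetadataExtractor, DatabaseMetadata

with MetadataExtractor(db_type="sqlite", database="app.db") as ext:
    meta = ext.extract_all_schemas()

payload = meta.to_dict()                           # plain dict, safe for JSON / Airflow XCom
assert json.loads(json.dumps(payload)) == payload  # survives JSON serialisation intact
restored = DatabaseMetadata.from_dict(payload)     # reconstruct the full object downstream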

Installation

pip

# Core library (SQLite works out of the box)
pip install data-dictionary-builder

# With the connectors you need
pip install "data-dictionary-builder[postgres]"
pip install "data-dictionary-builder[mysql]"
pip install "data-dictionary-builder[clickhouse]"         # ClickHouse HTTP/HTTPS
pip install "data-dictionary-builder[clickhouse-native]"  # ClickHouse native TCP
pip install "data-dictionary-builder[spanner]"
pip install "data-dictionary-builder[oracle]"
pip install "data-dictionary-builder[sqlserver]"
pip install "data-dictionary-builder[mongodb]"

# Everything at once
pip install "data-dictionary-builder[all]"

uv (recommended — faster resolver, built-in virtual environments)

# Install uv
pip install uv
# or on macOS/Linux: curl -LsSf https://astral.sh/uv/install.sh | sh

# Add to your project
uv add data-dictionary-builder

# With specific connectors
uv add "data-dictionary-builder[postgres]"
uv add "data-dictionary-builder[mysql]"
uv add "data-dictionary-builder[clickhouse]"         # ClickHouse HTTP/HTTPS
uv add "data-dictionary-builder[clickhouse-native]"  # ClickHouse native TCP
uv add "data-dictionary-builder[oracle]"
uv add "data-dictionary-builder[sqlserver]"
uv add "data-dictionary-builder[spanner]"
uv add "data-dictionary-builder[mongodb]"

# Everything at once
uv add "data-dictionary-builder[all]"

Or use the CLI to install connectors after the fact:

ddgen install postgres
ddgen install clickhouse
ddgen install oracle
ddgen install sqlserver
ddgen install mongodb
ddgen install all

Supported Databases

| Database | Extra | Driver |
| --- | --- | --- |
| SQLite | (built-in) | sqlite3 (stdlib) |
| PostgreSQL | [postgres] | psycopg2-binary |
| MySQL / MariaDB | [mysql] | PyMySQL |
| ClickHouse | [clickhouse] | clickhouse-connect (HTTP/HTTPS) · clickhouse-driver (native TCP, optional) |
| Oracle Database | [oracle] | oracledb (thin mode — no Oracle Client needed) |
| SQL Server / Azure SQL | [sqlserver] | pymssql |
| Google Cloud Spanner | [spanner] | google-cloud-spanner |
| MongoDB | [mongodb] | pymongo |

🚀 Quick Start

1. Installation

pip install data-dictionary-builder

# Or with specific connectors
pip install "data-dictionary-builder[postgres,mongodb]"

2. Basic Usage (Python)

from data_dictionary_builder import DDHelper

# Initialize with custom paths
helper = DDHelper(
    models_dir="my_dbt_project/models", 
    reports_dir="logs/schema_reports"
)

# Extract and generate YAML
config = {
    "db_type": "postgres",
    "host": "localhost",
    "database": "my_db",
    "user": "admin",
    "password": "password"
}
helper.generate_yaml(config, schema_filter=["public"])

3. MongoDB Support

Extract metadata from MongoDB collections with automatic schema inference (via sampling):

from data_dictionary_builder import DDHelper

helper = DDHelper()

mongo_config = {
    "db_type": "mongodb",
    "host": "localhost",
    "port": 27017,
    "database": "my_app_db"
}

# Extracts all collections in 'my_app_db'
# Fields and types are inferred by sampling 100 docs per collection
helper.generate_yaml(mongo_config)

CLI

# Show all commands and supported databases
ddgen --help

# Full module and API reference
ddgen features

# Check which connectors are installed
ddgen connectors

# Install a connector
ddgen install postgres
ddgen install clickhouse
ddgen install all

# Extract metadata and generate YAML in one step
ddgen extract --db-type postgres --host prod.db.io --database mydb --user readonly

# Compare two environments
ddgen compare --source-host prod.db.io --dest-host staging.db.io --source-database mydb

# Show library version and connector summary
ddgen info

# Show version number
ddgen --version

Schema Comparison

from data_dictionary_builder import SchemaComparator, DDHelper

helper = DDHelper(".")
report = SchemaComparator(
    source_config={"db_type": "postgres", "host": "prod-db", ...},
    destination_config={"db_type": "postgres", "host": "staging-db", ...},
).compare_and_generate_report("public", include_yaml_gaps=True)

json_path = helper.save_report(report)
pdf_path  = helper.compile_pdf(source_json=json_path)

# Send via email, Slack, or both — credentials fall back to env vars
# email_to and slack_target accept a string or a list of recipients
helper.send_notification(
    notification_type="both",    # "email" | "slack" | "both"
    report=report,
    pdf_path=pdf_path,
    email_to=["alice@example.com", "bob@example.com"],
    slack_target=["#data-alerts", "#data-eng"],
)

Airflow Integration

DatabaseMetadata serialises to/from plain dicts for XCom:

from airflow.decorators import task
from data_dictionary_builder import MetadataExtractor

@task
def extract():
    # config is the same connection dict shown in Quick Start
    with MetadataExtractor(**config) as ext:
        return ext.extract_all_schemas(parallel_workers=8).to_dict()

@task
def generate_yaml(db_meta_dict):
    from data_dictionary_builder import DatabaseMetadata, YAMLGenerator
    YAMLGenerator("./models").generate_yaml_files(DatabaseMetadata.from_dict(db_meta_dict))

See tests/airflow_dag_example.py for a complete DAG.
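
For orientation, a minimal TaskFlow wiring of the two tasks above (the dag id and schedule here are illustrative, not taken from the example DAG):

from datetime import datetime
from airflow.decorators import dag

@dag(dag_id="schema_docs", schedule="@daily",
     start_date=datetime(2024, 1, 1), catchup=False)
def schema_docs():
    # XCom carries the plain dict between tasks
    generate_yaml(extract())

schema_docs()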


Key Features

  • Parallel extraction — ThreadPoolExecutor with configurable workers; ClickHouse uses 2 queries and PostgreSQL uses 5 queries per schema regardless of table count
  • Dual ClickHouse transport — HTTP/HTTPS via clickhouse-connect (default) or native TCP via clickhouse-driver; auto-detected, with dynamic port defaults based on transport and TLS
  • Schema filtering — exact, glob, prefix, suffix, contains, regex — mix freely (see the sketch after this list)
  • Smart YAML merge — re-running never overwrites descriptions you've written by hand
  • YAML-aware gap detection — documentation coverage checks read from your existing YAML files, so descriptions you've added are always recognised
  • Cross-database comparison — compare any two database types; type aliases normalised before diffing
  • PDF reports — paginated, no row limits, table of contents (requires reportlab)
  • Unified notifications — send_notification(notification_type="email"|"slack"|"both", ...) delivers PDF reports via SMTP email, Slack (channel or DM), or both simultaneously; all credentials fall back to environment variables
  • Email delivery — SMTP with env-var credential fallback; PDF attached automatically
  • Slack delivery — Block Kit–formatted comparison summaries; optional PDF file upload; supports #channel, @user, channel IDs, and user IDs
  • ExecutionTimer — named task timing with a formatted summary table
  • Server mode — omit database to scan all databases on a MySQL, ClickHouse, PostgreSQL, or MongoDB server
  • Rich CLI — ddgen extract, ddgen compare, ddgen features (full API reference), ddgen connectors, ddgen install
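
Since the filter styles mix freely, a hypothetical sketch passing a glob alongside an exact name through the schema_filter parameter from Quick Start; the pattern syntax shown is assumed, not confirmed (ddgen features documents the authoritative forms):

from data_dictionary_builder import DDHelper

helper = DDHelper()
config = {"db_type": "postgres", "host": "localhost",
          "database": "my_db", "user": "admin", "password": "password"}

# Mix an exact schema name with a glob pattern (glob syntax assumed)
helper.generate_yaml(config, schema_filter=["public", "analytics_*"])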

Environment Variables

Set these in a .env file (see tests/.env.example) or in your shell:

# ── Notification channel ───────────────────────────────────────────────
# "email" | "slack" | "both"  (default: email)
NOTIFICATION_TYPE=email

# ── SMTP — DDHelper.send_notification() / send_report_email() ──────────
SMTP_HOST=smtp.gmail.com
SMTP_PORT=587
SMTP_USER=you@gmail.com
SMTP_PASSWORD=xxxx xxxx xxxx xxxx
EMAIL_TO=alice@example.com,bob@example.com   # comma-separated for multiple recipients

# ── Slack — DDHelper.send_notification() ───────────────────────────────
# Bot Token Scopes required: chat:write, files:write, channels:read,
#                            users:read, im:write
SLACK_BOT_TOKEN=xoxb-your-token-here
SLACK_NOTIFY_TARGET=#data-alerts,#data-eng   # comma-separated for multiple targets
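
With these set, notification calls can omit recipients and credentials entirely. A minimal sketch, assuming python-dotenv loads the .env file (the library may read it directly, in which case load_dotenv() is redundant):

from dotenv import load_dotenv
from data_dictionary_builder import DDHelper, SchemaComparator

load_dotenv()  # pulls SMTP_*, EMAIL_TO, SLACK_*, NOTIFICATION_TYPE into the environment

report = SchemaComparator(
    source_config={"db_type": "postgres", "host": "prod-db",
                   "database": "mydb", "user": "readonly", "password": "..."},
    destination_config={"db_type": "postgres", "host": "staging-db",
                        "database": "mydb", "user": "readonly", "password": "..."},
).compare_and_generate_report("public")

helper = DDHelper()
pdf_path = helper.compile_pdf(source_json=helper.save_report(report))

# No email_to / slack_target passed: recipients, tokens, and channel all
# fall back to the environment variables documented above
helper.send_notification(report=report, pdf_path=pdf_path)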

Documentation

Full user guide, API reference, and troubleshooting: DOCUMENTATION.md


License

MIT — free to use, modify, and distribute in personal and commercial projects.
