
data_dictionary_builder

A Python library that automates database documentation — extract live schema metadata, generate dbt-compatible YAML, compare schemas across environments, and deliver PDF reports, all in a single import.


What's new in v0.1.5

  • Custom output paths — DDHelper now supports explicit models_dir and reports_dir parameters. You can store your models and reports in completely separate locations, or redirect them to a temporary folder during CI/CD (see the sketch after this list).
  • MongoDB Support — extract metadata from MongoDB collections. Fields and types are automatically inferred by sampling documents. Supports _id as primary key and standard auth mechanisms. Install with pip install "data-dictionary-builder[mongodb]".
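
A minimal sketch of the new path parameters, assuming a CI job that should leave no artefacts behind (the tempfile usage is illustrative, not part of the library):

import tempfile
from data_dictionary_builder import DDHelper

# Redirect generated artefacts to a throwaway directory during CI/CD
scratch = tempfile.mkdtemp()
helper = DDHelper(
    models_dir=f"{scratch}/models",    # dbt-compatible YAML lands here
    reports_dir=f"{scratch}/reports",  # JSON/PDF comparison reports land here
)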

What's new in v0.1.4

  • Multi-recipient email delivery — email_to now accepts a list of addresses (e.g. ["alice@example.com", "bob@example.com"]) in addition to a single string. The EMAIL_TO environment variable supports comma-separated addresses (EMAIL_TO=alice@example.com,bob@example.com). All recipients receive the PDF attachment in one send.
  • Multi-target Slack delivery — slack_target now accepts a list of targets (e.g. ["#data-alerts", "U012AB3CD"]). The SLACK_NOTIFY_TARGET environment variable supports comma-separated values. Each target receives the full Block Kit report and optional PDF upload independently.
  • send_notification updated — both email_to and slack_target parameters accept strings or lists; the method routes correctly for either form with no API change.

What's new in v0.1.3

  • Slack notifications — deliver schema comparison reports directly to any Slack channel or DM alongside the existing email delivery. Use notification_type="slack" or "both" in send_notification(). Requires a Bot User OAuth Token (xoxb-…) and the slack extra: pip install "data-dictionary-builder[slack]".
  • Optimised metadata generation — parallel extraction is faster with improved thread scheduling and reduced connection overhead across all supported databases.
  • Smarter schema comparison — type normalisation has been expanded to cover more cross-database equivalences, reducing false-positive mismatches in mixed-engine pipelines.
  • Leaner exports — JSON metadata exports are more compact, and the to_dict() / from_dict() round-trip is validated automatically to ensure safe use in Airflow XCom and downstream catalog APIs.
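
A short round-trip sketch of that export path, assuming MetadataExtractor accepts the same connection keys shown in Quick Start (the SQLite file name is illustrative):

import json
from data_dictionary_builder import MetadataExtractor, DatabaseMetadata

with MetadataExtractor(db_type="sqlite", database="app.db") as ext:
    meta = ext.extract_all_schemas()

payload = meta.to_dict()                           # plain dict, safe for JSON / Airflow XCom
assert json.loads(json.dumps(payload)) == payload  # survives JSON serialisation intact
restored = DatabaseMetadata.from_dict(payload)     # reconstruct the full object downstream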

Installation

pip

# Core library (SQLite works out of the box)
pip install data-dictionary-builder

# With the connectors you need
pip install "data-dictionary-builder[postgres]"
pip install "data-dictionary-builder[mysql]"
pip install "data-dictionary-builder[clickhouse]"         # ClickHouse HTTP/HTTPS
pip install "data-dictionary-builder[clickhouse-native]"  # ClickHouse native TCP
pip install "data-dictionary-builder[spanner]"
pip install "data-dictionary-builder[oracle]"
pip install "data-dictionary-builder[sqlserver]"
pip install "data-dictionary-builder[mongodb]"

# Everything at once
pip install "data-dictionary-builder[all]"

uv (recommended — faster resolver, built-in virtual environments)

# Install uv
pip install uv
# or on macOS/Linux: curl -LsSf https://astral.sh/uv/install.sh | sh

# Add to your project
uv add data-dictionary-builder

# With specific connectors
uv add "data-dictionary-builder[postgres]"
uv add "data-dictionary-builder[mysql]"
uv add "data-dictionary-builder[clickhouse]"         # ClickHouse HTTP/HTTPS
uv add "data-dictionary-builder[clickhouse-native]"  # ClickHouse native TCP
uv add "data-dictionary-builder[oracle]"
uv add "data-dictionary-builder[sqlserver]"
uv add "data-dictionary-builder[spanner]"
uv add "data-dictionary-builder[mongodb]"

# Everything at once
uv add "data-dictionary-builder[all]"

Or use the CLI to install connectors after the fact:

ddgen install postgres
ddgen install clickhouse
ddgen install oracle
ddgen install sqlserver
ddgen install mongodb
ddgen install all

Supported Databases

| Database | Extra | Driver |
| --- | --- | --- |
| SQLite | (built-in) | sqlite3 (stdlib) |
| PostgreSQL | [postgres] | psycopg2-binary |
| MySQL / MariaDB | [mysql] | PyMySQL |
| ClickHouse | [clickhouse] | clickhouse-connect (HTTP/HTTPS) · clickhouse-driver (native TCP, optional) |
| Oracle Database | [oracle] | oracledb (thin mode — no Oracle Client needed) |
| SQL Server / Azure SQL | [sqlserver] | pymssql |
| Google Cloud Spanner | [spanner] | google-cloud-spanner |
| MongoDB | [mongodb] | pymongo |

🚀 Quick Start

1. Installation

pip install data-dictionary-builder

# Or with specific connectors
pip install "data-dictionary-builder[postgres,mongodb]"

2. Basic Usage (Python)

from data_dictionary_builder import DDHelper

# Initialize with custom paths
helper = DDHelper(
    models_dir="my_dbt_project/models", 
    reports_dir="logs/schema_reports"
)

# Extract and generate YAML
config = {
    "db_type": "postgres",
    "host": "localhost",
    "database": "my_db",
    "user": "admin",
    "password": "password"
}
helper.generate_yaml(config, schema_filter=["public"])

3. MongoDB Support

Extract metadata from MongoDB collections with automatic schema inference (via sampling):

from data_dictionary_builder import DDHelper

helper = DDHelper()

mongo_config = {
    "db_type": "mongodb",
    "host": "localhost",
    "port": 27017,
    "database": "my_app_db"
}

# Extracts all collections in 'my_app_db'
# Fields and types are inferred by sampling 100 docs per collection
helper.generate_yaml(mongo_config)

CLI

# Show all commands and supported databases
ddgen --help

# Full module and API reference
ddgen features

# Check which connectors are installed
ddgen connectors

# Install a connector
ddgen install postgres
ddgen install clickhouse
ddgen install all

# Extract metadata and generate YAML in one step
ddgen extract --db-type postgres --host prod.db.io --database mydb --user readonly

# Compare two environments
ddgen compare --source-host prod.db.io --dest-host staging.db.io --source-database mydb

# Show library version and connector summary
ddgen info

# Show version number
ddgen --version

Schema Comparison

from data_dictionary_builder import SchemaComparator, DDHelper

helper = DDHelper(".")
report = SchemaComparator(
    source_config={"db_type": "postgres", "host": "prod-db", ...},
    destination_config={"db_type": "postgres", "host": "staging-db", ...},
).compare_and_generate_report("public", include_yaml_gaps=True)

json_path = helper.save_report(report)
pdf_path  = helper.compile_pdf(source_json=json_path)

# Send via email, Slack, or both — credentials fall back to env vars
# email_to and slack_target accept a string or a list of recipients
helper.send_notification(
    notification_type="both",    # "email" | "slack" | "both"
    report=report,
    pdf_path=pdf_path,
    email_to=["alice@example.com", "bob@example.com"],
    slack_target=["#data-alerts", "#data-eng"],
)

Airflow Integration

DatabaseMetadata serialises to/from plain dicts for XCom:

from airflow.decorators import task
from data_dictionary_builder import MetadataExtractor

@task
def extract():
    # config is the same connection dict shown in Quick Start
    with MetadataExtractor(**config) as ext:
        return ext.extract_all_schemas(parallel_workers=8).to_dict()

@task
def generate_yaml(db_meta_dict):
    from data_dictionary_builder import DatabaseMetadata, YAMLGenerator
    YAMLGenerator("./models").generate_yaml_files(DatabaseMetadata.from_dict(db_meta_dict))

See tests/airflow_dag_example.py for a complete DAG.
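
For orientation, a minimal TaskFlow wiring of the two tasks above (the dag id and schedule here are illustrative, not taken from the example DAG):

from datetime import datetime
from airflow.decorators import dag

@dag(dag_id="schema_docs", schedule="@daily",
     start_date=datetime(2024, 1, 1), catchup=False)
def schema_docs():
    # XCom carries the plain dict between tasks
    generate_yaml(extract())

schema_docs()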


Key Features

  • Parallel extraction — ThreadPoolExecutor with configurable workers; ClickHouse uses 2 queries and PostgreSQL uses 5 queries per schema regardless of table count
  • Dual ClickHouse transport — HTTP/HTTPS via clickhouse-connect (default) or native TCP via clickhouse-driver; auto-detected, with dynamic port defaults based on transport and TLS
  • Schema filtering — exact, glob, prefix, suffix, contains, regex — mix freely (see the sketch after this list)
  • Smart YAML merge — re-running never overwrites descriptions you've written by hand
  • YAML-aware gap detection — documentation coverage checks read from your existing YAML files, so descriptions you've added are always recognised
  • Cross-database comparison — compare any two database types; type aliases normalised before diffing
  • PDF reports — paginated, no row limits, table of contents (requires reportlab)
  • Unified notifications — send_notification(notification_type="email"|"slack"|"both", ...) delivers PDF reports via SMTP email, Slack (channel or DM), or both simultaneously; all credentials fall back to environment variables
  • Email delivery — SMTP with env-var credential fallback; PDF attached automatically
  • Slack delivery — Block Kit–formatted comparison summaries; optional PDF file upload; supports #channel, @user, channel IDs, and user IDs
  • ExecutionTimer — named task timing with a formatted summary table
  • Server mode — omit database to scan all databases on a MySQL, ClickHouse, PostgreSQL, or MongoDB server
  • Rich CLI — ddgen extract, ddgen compare, ddgen features (full API reference), ddgen connectors, ddgen install
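
Since the filter styles mix freely, a hypothetical sketch passing a glob alongside an exact name through the schema_filter parameter from Quick Start; the pattern syntax shown is assumed, not confirmed (ddgen features documents the authoritative forms):

from data_dictionary_builder import DDHelper

helper = DDHelper()
config = {"db_type": "postgres", "host": "localhost",
          "database": "my_db", "user": "admin", "password": "password"}

# Mix an exact schema name with a glob pattern (glob syntax assumed)
helper.generate_yaml(config, schema_filter=["public", "analytics_*"])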

Environment Variables

Set these in a .env file (see tests/.env.example) or in your shell:

# ── Notification channel ───────────────────────────────────────────────
# "email" | "slack" | "both"  (default: email)
NOTIFICATION_TYPE=email

# ── SMTP — DDHelper.send_notification() / send_report_email() ──────────
SMTP_HOST=smtp.gmail.com
SMTP_PORT=587
SMTP_USER=you@gmail.com
SMTP_PASSWORD=xxxx xxxx xxxx xxxx
EMAIL_TO=alice@example.com,bob@example.com   # comma-separated for multiple recipients

# ── Slack — DDHelper.send_notification() ───────────────────────────────
# Bot Token Scopes required: chat:write, files:write, channels:read,
#                            users:read, im:write
SLACK_BOT_TOKEN=xoxb-your-token-here
SLACK_NOTIFY_TARGET=#data-alerts,#data-eng   # comma-separated for multiple targets
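
With these set, notification calls can omit recipients and credentials entirely. A minimal sketch, assuming python-dotenv loads the .env file (the library may read it directly, in which case load_dotenv() is redundant):

from dotenv import load_dotenv
from data_dictionary_builder import DDHelper, SchemaComparator

load_dotenv()  # pulls SMTP_*, EMAIL_TO, SLACK_*, NOTIFICATION_TYPE into the environment

report = SchemaComparator(
    source_config={"db_type": "postgres", "host": "prod-db",
                   "database": "mydb", "user": "readonly", "password": "..."},
    destination_config={"db_type": "postgres", "host": "staging-db",
                        "database": "mydb", "user": "readonly", "password": "..."},
).compare_and_generate_report("public")

helper = DDHelper()
pdf_path = helper.compile_pdf(source_json=helper.save_report(report))

# No email_to / slack_target passed: recipients, tokens, and channel all
# fall back to the environment variables documented above
helper.send_notification(report=report, pdf_path=pdf_path)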

Documentation

Full user guide, API reference, and troubleshooting: DOCUMENTATION.md


License

MIT — free to use, modify, and distribute in personal and commercial projects.
