# data_dictionary_builder

A Python library that automates database documentation — extract live schema metadata, generate dbt-compatible YAML, compare schemas across environments, and deliver PDF reports, all in a single import.
## Installation

```bash
# Core library (SQLite works out of the box)
pip install data-dictionary-builder

# With the connectors you need
pip install "data-dictionary-builder[postgres]"
pip install "data-dictionary-builder[mysql]"
pip install "data-dictionary-builder[clickhouse]"
pip install "data-dictionary-builder[spanner]"
pip install "data-dictionary-builder[oracle]"
pip install "data-dictionary-builder[sqlserver]"

# Everything at once
pip install "data-dictionary-builder[all]"
```

Or use the CLI to install connectors after the fact:

```bash
ddgen install postgres
ddgen install clickhouse
ddgen install oracle
ddgen install sqlserver
ddgen install all
```
## Supported Databases

| Database | Extra | Driver |
|---|---|---|
| SQLite | (built-in) | `sqlite3` (stdlib) |
| PostgreSQL | `[postgres]` | `psycopg2-binary` |
| MySQL / MariaDB | `[mysql]` | `PyMySQL` |
| ClickHouse | `[clickhouse]` | `clickhouse-connect` (HTTP/HTTPS) · `clickhouse-driver` (native TCP, optional) |
| Oracle Database | `[oracle]` | `oracledb` (thin mode, no Oracle Client needed) |
| SQL Server / Azure SQL | `[sqlserver]` | `pymssql` |
| Google Cloud Spanner | `[spanner]` | `google-cloud-spanner` |
## Quick Start

```python
from data_dictionary_builder import MetadataExtractor, YAMLGenerator, DDHelper, ExecutionTimer

timer = ExecutionTimer()
helper = DDHelper(".")  # creates models/, reports/json/, reports/pdf/

with timer.task("Extract"):
    with MetadataExtractor(
        db_type="postgres", host="localhost", port=5432,
        database="mydb", user="readonly", password="secret",
    ) as ext:
        db_meta = ext.extract_all_schemas(
            schema_filter=["public", "analytics"],
            parallel_workers=8,
        )

with timer.task("Generate YAML"):
    YAMLGenerator(output_dir=str(helper.models_dir)).generate_yaml_files(db_meta)

timer.summary()
```
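The `timer.task(...)` pattern above can be approximated with the standard library. A minimal sketch: the `task()`/`summary()` interface mirrors the Quick Start, but the class name `SimpleTimer` and its internals are assumptions, not the library's implementation:

```python
import time
from contextlib import contextmanager

class SimpleTimer:
    """Hypothetical stand-in for ExecutionTimer: records named task durations."""

    def __init__(self):
        self.durations = {}

    @contextmanager
    def task(self, name):
        # Time the enclosed block and record it under `name`
        start = time.perf_counter()
        try:
            yield
        finally:
            self.durations[name] = time.perf_counter() - start

    def summary(self):
        # Print one line per recorded task
        for name, secs in self.durations.items():
            print(f"{name:<20} {secs:8.3f}s")

timer = SimpleTimer()
with timer.task("Extract"):
    time.sleep(0.01)  # stand-in for real extraction work
timer.summary()
```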
## CLI

```bash
# Show all commands and supported databases
ddgen --help

# Full module and API reference
ddgen features

# Check which connectors are installed
ddgen connectors

# Install a connector
ddgen install postgres
ddgen install clickhouse
ddgen install all

# Extract metadata and generate YAML in one step
ddgen extract --db-type postgres --host prod.db.io --database mydb --user readonly

# Compare two environments
ddgen compare --source-host prod.db.io --dest-host staging.db.io --source-database mydb

# Show library version and connector summary
ddgen info

# Show version number
ddgen --version
```
## Schema Comparison

```python
from data_dictionary_builder import SchemaComparator, DDHelper

helper = DDHelper(".")
report = SchemaComparator(
    source_config={"db_type": "postgres", "host": "prod-db", ...},
    destination_config={"db_type": "postgres", "host": "staging-db", ...},
).compare_and_generate_report("public", include_yaml_gaps=True)

json_path = helper.save_report(report)
pdf_path = helper.compile_pdf(source_json=json_path)
helper.send_report_email(report=report, pdf_path=pdf_path, email_to="team@example.com")
```
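Conceptually, a cross-environment comparison reduces to diffing column metadata between two databases. A plain-dict sketch of that idea — `diff_columns` and its report keys are hypothetical and do not reflect the library's actual report format:

```python
def diff_columns(source, dest):
    """Compare two {table: {column: type}} mappings and report the gaps.
    Illustrative only; not the library's data model."""
    report = {"missing_tables": [], "missing_columns": {}, "type_mismatches": {}}
    for table, cols in source.items():
        if table not in dest:
            report["missing_tables"].append(table)
            continue
        for col, col_type in cols.items():
            if col not in dest[table]:
                # Column exists in source but not in destination
                report["missing_columns"].setdefault(table, []).append(col)
            elif dest[table][col] != col_type:
                # Same column, different declared type
                report["type_mismatches"][(table, col)] = (col_type, dest[table][col])
    return report

prod = {"users": {"id": "bigint", "email": "text"}}
staging = {"users": {"id": "integer"}}
report = diff_columns(prod, staging)
```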
## Airflow Integration

`DatabaseMetadata` serialises to/from plain dicts for XCom:

```python
@task
def extract():
    with MetadataExtractor(**config) as ext:
        return ext.extract_all_schemas(parallel_workers=8).to_dict()

@task
def generate_yaml(db_meta_dict):
    from data_dictionary_builder import DatabaseMetadata, YAMLGenerator
    YAMLGenerator("./models").generate_yaml_files(DatabaseMetadata.from_dict(db_meta_dict))
```

See `tests/airflow_dag_example.py` for a complete DAG.
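The `to_dict()`/`from_dict()` round trip works for XCom because the payload is plain, JSON-serialisable data. A stdlib sketch of the pattern, with hypothetical `DbMeta`/`TableMeta` dataclasses standing in for `DatabaseMetadata`:

```python
import json
from dataclasses import dataclass, asdict, field

@dataclass
class TableMeta:
    name: str
    columns: list = field(default_factory=list)

@dataclass
class DbMeta:
    """Hypothetical stand-in for DatabaseMetadata's dict round trip."""
    database: str
    tables: list = field(default_factory=list)

    def to_dict(self):
        # asdict() recurses into nested dataclasses → plain dicts/lists
        return asdict(self)

    @classmethod
    def from_dict(cls, d):
        return cls(database=d["database"],
                   tables=[TableMeta(**t) for t in d["tables"]])

meta = DbMeta("mydb", [TableMeta("users", ["id", "email"])])
payload = json.dumps(meta.to_dict())             # what XCom would store
restored = DbMeta.from_dict(json.loads(payload))  # equal to the original
```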
## Key Features

- Parallel extraction — `ThreadPoolExecutor` with configurable workers; ClickHouse uses 2 queries and PostgreSQL uses 5 queries per schema regardless of table count
- Dual ClickHouse transport — HTTP/HTTPS via `clickhouse-connect` (default) or native TCP via `clickhouse-driver`; auto-detected, with dynamic port defaults based on transport and TLS
- Schema filtering — exact, glob, prefix, suffix, contains, regex — mix freely
- Smart YAML merge — re-running never overwrites descriptions you've written by hand
- YAML-aware gap detection — documentation coverage checks read from your existing YAML files, so descriptions you've added are always recognised
- Cross-database comparison — compare any two database types; type aliases normalised before diffing
- PDF reports — paginated, no row limits, table of contents (requires `reportlab`)
- Email delivery — SMTP with env-var credential fallback; PDF attached automatically
- ExecutionTimer — named task timing with a formatted summary table
- Server mode — omit `database` to scan all databases on a MySQL, ClickHouse, or PostgreSQL server
- Rich CLI — `ddgen extract`, `ddgen compare`, `ddgen features` (full API reference), `ddgen connectors`, `ddgen install`
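Mixed schema-filter styles like those above can be implemented with `fnmatch` and `re` from the standard library. A sketch — the `re:` prefix and the glob-detection heuristic are assumptions for illustration, not the library's actual pattern-classification rules:

```python
import fnmatch
import re

def matches(schema, patterns):
    """Return True if `schema` matches any pattern in the mixed-style list."""
    for p in patterns:
        if p.startswith("re:"):
            # Explicit regex, e.g. "re:^tmp_\\d+$"
            if re.search(p[3:], schema):
                return True
        elif any(ch in p for ch in "*?["):
            # Glob, e.g. "analytics_*"
            if fnmatch.fnmatch(schema, p):
                return True
        elif schema == p:
            # Exact match
            return True
    return False

schemas = ["public", "analytics_2024", "tmp_01", "scratch"]
keep = [s for s in schemas if matches(s, ["public", "analytics_*", r"re:^tmp_\d+$"])]
```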
## Environment Variables

Set these in a `.env` file (see `tests/.env.example`) or in your shell:

```bash
# SMTP — used by DDHelper.send_report_email() when no credentials are passed explicitly
SMTP_HOST=smtp.gmail.com
SMTP_PORT=587
SMTP_USER=you@gmail.com
SMTP_PASSWORD=xxxx xxxx xxxx xxxx
EMAIL_TO=recipient@example.com
```
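The env-var credential fallback can be sketched with `os.getenv`: explicit arguments win, and the environment fills the gaps. `smtp_settings` here is a hypothetical helper (the variable names match the list above, the defaults are assumptions):

```python
import os

def smtp_settings(user=None, password=None):
    """Resolve SMTP settings: explicit arguments take precedence,
    environment variables are the fallback."""
    return {
        "host": os.getenv("SMTP_HOST", "localhost"),
        "user": user or os.getenv("SMTP_USER"),
        "password": password or os.getenv("SMTP_PASSWORD"),
    }

os.environ["SMTP_USER"] = "you@gmail.com"
settings = smtp_settings()                        # falls back to the environment
explicit = smtp_settings(user="ci@example.com")   # explicit argument wins
```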
## Documentation

Full user guide, API reference, and troubleshooting: `DOCUMENTATION.md`

## License

MIT — free to use, modify, and distribute in personal and commercial projects.