Skip to main content

PostgreSQL schema inventory, SQL extraction, and documentation generation tool

Project description

pgcarter

pgcarter

CI Docs PyPI Python 3.12+ License: Apache 2.0 Ruff

A production-quality PostgreSQL schema inventory, SQL extraction, documentation generation, and database-shape analysis tool. pgcarter is both a schema extraction tool and a lightweight database discovery and profiling platform.

๐Ÿ“– Full documentation: https://blu3pr1n7.github.io/PGCarter/

Quick start

uv pip install -e .          # or: pip install -e .

# 1. Extract a schema inventory (SQL + JSON + docs)
pgcarter index --database mydb --output-dir ./inventory

# 2. Profile it offline (no database needed)
pgcarter analyze --input ./inventory/json --output ./analysis

pgcarter connects to a PostgreSQL database and produces three independent outputs:

  1. Executable SQL โ€” deterministic, ordered PostgreSQL DDL files that can be run to recreate database structures (plus a single ordered apply.sql).
  2. Structured metadata โ€” JSON files describing every extracted asset.
  3. Documentation โ€” Markdown rendered entirely from user-provided Jinja2 templates.

A fourth capability, the analyze subcommand, performs automated database shape analysis โ€” table sizes, column characteristics, data-quality indicators, relationships, and schema-design patterns. See Database analysis & profiling.

It never exports table data. This is a schema and metadata extraction utility only โ€” no INSERT, no COPY, no row contents. The analyzer is likewise read-only: every profiling query is a validated SELECT.

Design boundaries

The two output layers are strictly separated:

  • The SQL generation layer depends on metadata models only โ€” never on templates.
  • The documentation layer is entirely template-driven (Jinja2) and generates no SQL. No Markdown is embedded in Python source.
PostgreSQL โ”€โ”€โ–ถ Extractors โ”€โ”€โ–ถ Metadata Models โ”€โ”ฌโ”€โ–ถ SQL Generator โ”€โ”€โ–ถ sql/
                                                โ””โ”€โ–ถ Jinja Renderer โ”€โ”€โ–ถ docs/
                                                                  โ””โ”€โ”€โ–ถ json/

Installation

Requires Python 3.12+. Uses psycopg (psycopg3) and jinja2. No SQLAlchemy.

# with uv (recommended)
uv venv --python 3.12
uv pip install -e ".[dev]"

# or with pip
python3.12 -m venv .venv && . .venv/bin/activate
pip install -e ".[dev]"

Usage

The CLI (built with Typer) has two subcommands: index (schema extraction) and analyze (shape analysis & profiling).

pgcarter index \
  --host localhost \
  --port 5432 \
  --database mydb \
  --user postgres \
  --password secret \
  --output-dir ./inventory \
  --templates-dir ./templates
Argument Default Notes
--host localhost
--port 5432
--database (required)
--user postgres
--password โ€” falls back to PGPASSWORD
--output-dir the database name created if missing; must be writable
--templates-dir ./templates must exist for docs to be generated
--schema public repeatable; architecture supports multiple schemas
--log-level INFO DEBUG/INFO/WARNING/ERROR (or LOG_LEVEL)
--pretty / --no-pretty JSON colourised console logs for local dev (or LOG_PRETTY)

The same --log-level / --pretty options apply to every subcommand.

The connection never writes: it issues SET default_transaction_read_only = on.

Output structure

inventory/
โ”œโ”€โ”€ sql/
โ”‚   โ”œโ”€โ”€ apply.sql                 # full schema, dependency-ordered, one run
โ”‚   โ”œโ”€โ”€ database.sql
โ”‚   โ”œโ”€โ”€ extensions.sql
โ”‚   โ”œโ”€โ”€ roles.sql
โ”‚   โ””โ”€โ”€ schemas/
โ”‚       โ”œโ”€โ”€ public.sql
โ”‚       โ””โ”€โ”€ public/
โ”‚           โ”œโ”€โ”€ tables/        indexes/        functions/
โ”‚           โ”œโ”€โ”€ views/         triggers/       sequences/
โ”‚           โ””โ”€โ”€ permissions/
โ”œโ”€โ”€ json/
โ”‚   โ”œโ”€โ”€ database.json   schemas.json   tables.json   indexes.json
โ”‚   โ”œโ”€โ”€ views.json      functions.json triggers.json sequences.json
โ”‚   โ”œโ”€โ”€ extensions.json roles.json     permissions.json
โ”‚   โ”œโ”€โ”€ relationships.json
โ”‚   โ”œโ”€โ”€ relationships.dot         # Graphviz relationship graph
โ”‚   โ”œโ”€โ”€ schemas/<schema>.json
โ”‚   โ””โ”€โ”€ tables/<table>.json
โ”œโ”€โ”€ docs/
โ”‚   โ”œโ”€โ”€ index.md   database.md   roles.md   permissions.md
โ”‚   โ””โ”€โ”€ schemas/<schema>/{schema.md, tables/, views/, functions/, triggers/, indexes/}
โ””โ”€โ”€ report.json                   # extracted / skipped / warnings / errors

Every SQL file starts with a provenance header:

-- Generated by pgcarter
-- Database: <database>
-- Object: <object>
-- Generated: <timestamp>

Every generated document starts with the equivalent HTML-comment header.

apply.sql โ€” guaranteed executable

Individual object files are executable when applied in dependency order. For convenience sql/apply.sql is a single, dependency-ordered script (extensions โ†’ roles โ†’ schemas โ†’ sequences โ†’ tables (FK-sorted) โ†’ sequence ownership โ†’ indexes โ†’ views (dependency-sorted) โ†’ functions โ†’ triggers โ†’ grants) that recreates the whole schema in one run. Role creation is guarded so the script is re-runnable. This is exercised by the integration suite, which replays apply.sql into a fresh database.

Extraction scope

Reads from PostgreSQL system catalogs (pg_catalog) and information_schema. Extracts: database, schemas, tables (columns, types, defaults, nullability, identity and generated columns, comments), constraints (PK/FK/unique/check/ exclusion), indexes (incl. partial & expression), views & materialized views, functions & procedures (signature, parameters, return type, language, volatility, security mode, owner), triggers, sequences, extensions, roles (attributes, memberships), privileges (database/schema/table/column/sequence/ function), and a relationship graph (foreign keys, view/trigger/sequence dependencies).

Default scope is the public schema; pass --schema (repeatable) for others.

Documentation templates

Documentation is completely template-driven. Templates are discovered dynamically โ€” if a template is missing, a warning is logged and processing continues. Provide a templates/ directory:

templates/
โ”œโ”€โ”€ index.md.j2        # home page (database overview)
โ”œโ”€โ”€ database.md.j2     # database detail
โ”œโ”€โ”€ schema.md.j2       # per schema
โ”œโ”€โ”€ table.md.j2        # per table
โ”œโ”€โ”€ indexes.md.j2      # per-schema index overview
โ”œโ”€โ”€ function.md.j2     # per function/procedure
โ”œโ”€โ”€ view.md.j2         # per view / materialized view
โ”œโ”€โ”€ trigger.md.j2      # per trigger
โ”œโ”€โ”€ permissions.md.j2  # privilege matrix
โ””โ”€โ”€ roles.md.j2        # roles

Note: the spec lists index.md.j2 twice. Here index.md.j2 is the home page and indexes.md.j2 renders the per-schema index overview, avoiding the filename collision.

Templates receive the full structured context (database, schemas, tables, indexes, views, functions, triggers, sequences, extensions, roles, permissions, relationships, generated_at) plus object-specific variables (e.g. table, function, schema_tables).

Error handling

Connection failures, insufficient privileges, and unsupported objects are handled gracefully: a failure in one extractor is captured and the run continues. Everything is summarised in report.json:

{
  "database": "mydb",
  "summary": { "extracted": {"tables": 120}, "skipped_count": 1,
               "warning_count": 0, "error_count": 0, "generated_file_count": 412 },
  "skipped": [], "warnings": [], "errors": [], "generated_files": []
}

Exit codes: 0 success, 2 completed with recorded errors, 1 fatal error.

Database analysis & profiling

The analyze subcommand turns the inventory into an understanding of the database's shape: what kinds of datasets exist, how big tables are, column characteristics, data-quality signals, relationships, and likely design issues. It is not a data dump โ€” it profiles structure and statistics.

The existing JSON inventory is the analyzer's first input source, so it runs in two modes.

Offline mode (structure only, no database)

Analyze an existing JSON inventory with no connection. Identifies possible checks from table structure, column names, data types, constraints, relationships, and indexes โ€” and records the exact read-only SQL each check would run online.

pgcarter analyze --input ./inventory/json

Online mode (connect and profile)

Connects to PostgreSQL and enriches the structural analysis with row counts, null statistics, cardinality estimates, value distributions, freshness checks, and size metrics.

pgcarter analyze \
  --database mydb \
  --schema public \
  --output ./analysis \
  --sample-size 10000
Argument Default Notes
--input โ€” JSON inventory directory โ†’ offline mode
--database โ€” connect & profile โ†’ online mode
--host/--port/--user/--password localhost/5432/postgres/PGPASSWORD online connection
--schema public repeatable
--output ./analysis output directory
--templates-dir ./templates Jinja2 templates for analysis docs
--config โ€” analysis YAML (enabled checks, thresholds)
--sample-size โ€” row cap for expensive per-column scans
--statement-timeout 0 per-query timeout (ms); an overrun is logged and skipped

Provide --input (offline), --database (online), or both (use the JSON as the inventory base while connecting for statistics).

Output

analysis/
โ”œโ”€โ”€ report.json          # full analysis (tables, metrics, checks, warnings)
โ”œโ”€โ”€ report.md            # human-readable summary (rendered from a template)
โ”œโ”€โ”€ warnings.json        # every non-informational finding
โ”œโ”€โ”€ tables/
โ”‚   โ”œโ”€โ”€ users.json       # per-table metrics, columns, checks
โ”‚   โ””โ”€โ”€ orders.json
โ”œโ”€โ”€ run-report.json      # run summary (extracted counts / errors)
โ””โ”€โ”€ docs/analysis/
    โ”œโ”€โ”€ index.md         # rendered overview
    โ”œโ”€โ”€ warnings.md
    โ””โ”€โ”€ tables/<table>.md

A worked example lives in examples/analysis/.

Checks

Checks are plugin-style: each is a class registered with @register, and adding a new check requires nothing more than dropping a new class into the relevant pgcarter/analyzer/checks/ module. Categories:

  • Tables โ€” table_size, row_count (empty / extremely large), and growth_indicators (freshness window from created_at/updated_at/โ€ฆ), plus structural table_structure (missing primary key, very wide tables).
  • Columns โ€” null_analysis, cardinality (unique identifiers, low cardinality, text enums), distribution (min/max/avg), string_profiling (avg/min/max length), and name heuristics: identifier_detection, timestamp_detection, email_detection, status_columns, suspicious_columns.
  • Indexes โ€” duplicate_indexes, missing_fk_indexes, unused_indexes.
  • Relationships โ€” heavily_referenced (fan-in importance), relationship_depth, orphan_relationships (rows with a missing parent).
  • Quality โ€” duplicate_primary_keys, duplicate_unique_values, unused_tables.

Each finding has a severity (info/warning/critical); warnings and criticals are collected into warnings.json.

Resilience & progress

Online runs degrade gracefully. A table the connecting role cannot read (permission denied), a missing object, or a query that exceeds --statement-timeout is logged as a single warning and skipped โ€” recorded under skipped in run-report.json, never escalated to a run error (the exit code stays 0). The denied relation is remembered, so its remaining columns are skipped without further round trips, and it still receives structural analysis. Per-table progress is logged ([n/N] analyzing โ€ฆ) so long runs are observable.

For large databases, prefer --sample-size (bounds per-column scans) and --statement-timeout (caps any single query); together they keep a run bounded and visible.

Query safety & performance

Every generated query is mechanically validated to be a single read-only SELECT/WITH with no INSERT/UPDATE/DELETE/DROP/TRUNCATE/ALTER/โ€ฆ (see pgcarter/analyzer/queries.py::assert_safe). Identifiers are quoted and schema-qualified. Row-count and size checks prefer PostgreSQL's own statistics (pg_class.reltuples, pg_total_relation_size) over scanning, and --sample-size N bounds expensive per-column aggregates with a LIMIT subquery. Identical queries from different checks share one round trip.

Configuration

analysis:
  enabled_checks:        # omit to run every check
    - null_analysis
    - cardinality
    - table_size
  thresholds:
    high_null_percentage: 80
    low_cardinality_limit: 10
    large_table_rows: 10000000
    unique_ratio: 0.99
    long_text_length: 10000
  # sample_size: 10000   # --sample-size overrides
pgcarter analyze --input ./inventory/json --config analysis.yml

A ready-to-edit example is in analysis.yml.

Analysis templates

Like the rest of the project, analysis documentation is completely template-driven โ€” no Markdown is embedded in Python. Add to templates/:

templates/
โ”œโ”€โ”€ analysis.md.j2          # overview (report.md + docs/analysis/index.md)
โ”œโ”€โ”€ table_analysis.md.j2    # per table
โ”œโ”€โ”€ column_analysis.md.j2   # per-column fragment (included by table_analysis)
โ””โ”€โ”€ warnings.md.j2          # warnings list

Logging

Logging is built on structlog. The default is structured JSON on stdout, ready for log aggregators (Datadog, ELK/ OpenSearch, CloudWatch, Loki, โ€ฆ):

{"event": "database_connecting", "host": "localhost", "database": "mydb",
 "level": "info", "logger": "pgcarter.extractor.connection",
 "timestamp": "2026-06-23T10:30:00.123456Z"}

For local development, enable colourised console output:

pgcarter index --database mydb --pretty
# or, for any process:
LOG_PRETTY=true LOG_LEVEL=DEBUG pgcarter analyze --input ./inventory/json
2026-06-23T10:30:00Z [info] database_connecting host=localhost database=mydb

Configuration:

Source Production (default) Local dev
CLI flag --no-pretty --pretty
Env var LOG_PRETTY=false LOG_PRETTY=true
Level --log-level / LOG_LEVEL (default INFO)

Using it in code

Setup happens once, in pgcarter/logging_config.py:

from pgcarter.logging_config import configure_logging, get_logger

configure_logging(pretty_logs=False, level="INFO")  # JSON to stdout
log = get_logger(__name__)
log.info("user_created", user_id=123, plan="enterprise")   # structured
log.exception("payment_failed", payment_id=123)            # structured traceback

Both stdlib logging.getLogger(...) and structlog get_logger(...) render through the same pipeline, so existing log.info("Extracting %s", x) calls keep working alongside structured events.

Attach request/job correlation (or any global metadata) with contextvars โ€” it is added to every subsequent event automatically:

import structlog
structlog.contextvars.bind_contextvars(service="pgcarter", request_id="abc123")

Never log secrets, tokens, or sensitive row data โ€” logging stays advisory and metadata-only, like the rest of the tool.

Documentation site

The project documentation (this README's content plus installation, usage, CLI, architecture, and API reference) is published with MkDocs Material to GitHub Pages: https://blu3pr1n7.github.io/PGCarter/.

Build or preview it locally:

uv pip install -e ".[docs]"   # or: make docs
make docs-serve     # live preview at http://127.0.0.1:8000
make docs-build     # strict build (fails on warnings/broken links)

The .github/workflows/docs.yml workflow builds the site with mkdocs build --strict and deploys it on every push to master.

One-time manual step: GitHub Pages must be set to publish from GitHub Actions before the first deploy. In the repository, go to Settings โ†’ Pages โ†’ Build and deployment and set Source to โ€œGitHub Actionsโ€. (This cannot be enabled from the workflow itself; it is a repository setting.) Once set, the docs workflow publishes automatically.

Development

make dev          # install with dev dependencies
make test         # unit tests (no database needed)
make lint         # ruff
make typecheck    # mypy
make coverage     # unit tests with coverage

Integration / e2e tests (dockerised database)

A dedicated, disposable PostgreSQL 16 instance is provided via docker-compose.yml (host port 55432, isolated from any other local Postgres). The seed schema in tests/fixtures/seed.sql exercises every asset category.

make db-up             # start the test DB and wait until healthy
make test-integration  # run integration tests against it
make e2e               # run the index CLI end-to-end into ./build/e2e
make e2e-analyze       # run an online analysis into ./build/analysis
make test-all          # unit + integration
make db-down           # tear down (removes the data)

The integration suite asserts that no row data is ever emitted and that the generated apply.sql replays cleanly into a fresh database.

Project layout

pgcarter/
โ”œโ”€โ”€ cli.py            # Typer entry point (index + analyze subcommands)
โ”œโ”€โ”€ config.py         # configuration + defaults
โ”œโ”€โ”€ main.py           # orchestration
โ”œโ”€โ”€ report.py         # run report
โ”œโ”€โ”€ logging_config.py # structured logging
โ”œโ”€โ”€ models/           # dataclass metadata models (single source of truth)
โ”œโ”€โ”€ extractor/        # one module per asset category + orchestrator
โ”œโ”€โ”€ sql/              # deterministic DDL generation (template-free)
โ”œโ”€โ”€ docs/             # Jinja2 renderer (no SQL)
โ”œโ”€โ”€ writers/          # SQL / JSON / DOT / apply.sql writers
โ””โ”€โ”€ analyzer/         # database shape analysis & profiling
    โ”œโ”€โ”€ runner.py     #   analyze orchestration (offline/online)
    โ”œโ”€โ”€ config.py     #   analysis config (enabled checks, thresholds)
    โ”œโ”€โ”€ models.py     #   analysis result models
    โ”œโ”€โ”€ heuristics.py #   column name/type semantics
    โ”œโ”€โ”€ queries.py    #   read-only SQL builders + safety guard
    โ”œโ”€โ”€ queries/      #   *.sql templates (table_stats, column_stats, relationships)
    โ”œโ”€โ”€ rules.py      #   Check base, registry, AnalysisContext
    โ”œโ”€โ”€ checks/       #   plugin checks: tables/columns/indexes/relationships/quality/statistics
    โ”œโ”€โ”€ engine.py     #   runs checks, assembles the report
    โ”œโ”€โ”€ loader.py     #   reconstruct an Inventory from JSON (offline input)
    โ””โ”€โ”€ writer.py     #   analysis JSON + Jinja2 docs

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pgcarter-0.1.0.tar.gz (223.9 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

pgcarter-0.1.0-py3-none-any.whl (82.8 kB view details)

Uploaded Python 3

File details

Details for the file pgcarter-0.1.0.tar.gz.

File metadata

  • Download URL: pgcarter-0.1.0.tar.gz
  • Upload date:
  • Size: 223.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.11.21 {"installer":{"name":"uv","version":"0.11.21","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"CachyOS Linux","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for pgcarter-0.1.0.tar.gz
Algorithm Hash digest
SHA256 a0a5e3f7201ca5cfd851948598a7188ccf6c2e7629758e0e3735eac3e29e19ce
MD5 0e7d3949ac587a2603e7c1e98ad9e161
BLAKE2b-256 77aed4894ec0b17208bd26acfe8484064418c0f5d9af20e3eb19325c49d0c066

See more details on using hashes here.

File details

Details for the file pgcarter-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: pgcarter-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 82.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.11.21 {"installer":{"name":"uv","version":"0.11.21","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"CachyOS Linux","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for pgcarter-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 6d7b4a62ba9bf4a5772af39ba625b249196087114f05fae3fdfee7ed6f576e99
MD5 463917021703fea267441af815db2c0e
BLAKE2b-256 61b96d88e623869f6a16d0d7a8b6d33f351a6b2d8e0b7970bcb1e96839cabc7a

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page