Skip to main content

A Python library for validating and upserting data into PostgreSQL with automated QA checks.

Project description

pg-upsert

ci/cd codecov Documentation Status PyPI Latest Release PyPI Downloads Python Version Support

pg-upsert is a Python package for validating and upserting data from staging tables into base tables in PostgreSQL. It runs automated QA checks, reports errors with rich formatted output, and performs dependency-aware upserts.

pg-upsert terminal output with interactive compare table

Why Use pg-upsert?

  • 7 Automated QA Checks – Validates NOT NULL, PRIMARY KEY, UNIQUE, FOREIGN KEY, CHECK CONSTRAINT, column existence, and column type compatibility before any modifications occur.
  • Interactive Confirmation – Two UI backends: Textual TUI (terminal) and Tkinter (desktop). Auto-detected or choose with --ui auto|textual|tkinter. The compare-tables dialog includes a Highlight Diffs toggle that tints matching/changed rows and flags the exact cells that differ, skipping any columns excluded from the upsert.
  • Structured Resultsrun() returns an UpsertResult with per-table stats, QA errors, and JSON serialization (--output=json for CI/CD pipelines).
  • Exportable Fix Sheets--export-failures <dir> writes an actionable report of failing rows: one row per unique violating staging row with an _issues column listing every problem (NULL in 'genre', duplicate PK, FK violation, etc.) so users can open it in Excel and fix the data. Supports CSV (file per table), JSON (nested), and XLSX (sheets per table) via --export-format.
  • Schema Validation--check-schema flag validates column existence and type compatibility without running data checks or upserts.
  • Flexible Upsert Strategies – Supports upsert, update, and insert methods.
  • Dependency-Aware Ordering – Tables are processed in FK dependency order automatically.
  • Rich Output – Colored pass/fail indicators, formatted tables, and dual console+logfile output.

Usage

Python API

from pg_upsert import PgUpsert

result = PgUpsert(
    uri="postgresql://user@localhost:5432/mydb",
    tables=("genres", "publishers", "books", "authors", "book_authors"),
    staging_schema="staging",
    base_schema="public",
    do_commit=True,
    upsert_method="upsert",
    exclude_cols=("rev_user", "rev_time"),
    exclude_null_check_cols=("book_alias",),
).run()

# UpsertResult provides structured access to results
print(result.qa_passed)       # True if all QA checks passed
print(result.committed)       # True if changes were committed
print(result.total_updated)   # Total rows updated across all tables
print(result.total_inserted)  # Total rows inserted across all tables
print(result.to_json())       # JSON serialization for CI/CD

Using an existing connection:

import psycopg2
from pg_upsert import PgUpsert

conn = psycopg2.connect(host="localhost", port=5432, dbname="mydb", user="user", password="pass")

ups = PgUpsert(
    conn=conn,
    tables=("genres", "publishers", "books"),
    staging_schema="staging",
    base_schema="public",
    do_commit=True,
)
result = ups.run()

# Drop all ups_* temp objects (connection stays open)
ups.cleanup()

QA-only mode (no upsert):

from pg_upsert import PgUpsert

ups = PgUpsert(
    uri="postgresql://user@localhost:5432/mydb",
    tables=("genres", "books"),
    staging_schema="staging",
    base_schema="public",
).qa_all()

# qa_passed is False only when ERROR-level findings exist.
# qa_errors returns ERROR findings only (block the upsert).
# qa_warnings returns WARNING findings only (informational, do not block).
# qa_findings returns all findings combined.
# qa_passed may be True while qa_warnings is non-empty.
if not ups.qa_passed:
    for err in ups.qa_errors:
        print(f"{err.table}: {err.check_type.value}{err.details}")

Schema compatibility check (column existence and type mismatches only):

from pg_upsert import PgUpsert

ups = PgUpsert(
    uri="postgresql://user@localhost:5432/mydb",
    tables=("genres", "books"),
    staging_schema="staging",
    base_schema="public",
).qa_column_existence().qa_type_mismatch()

if ups.qa_errors:
    for err in ups.qa_errors:
        print(f"{err.table}: {err.check_type.value}{err.details}")
else:
    print("Schemas are compatible")

Pipeline callbacks (per-table progress):

from pg_upsert import PgUpsert, CallbackEvent

def on_event(event):
    if event.event == CallbackEvent.QA_TABLE_COMPLETE:
        print(f"QA {'passed' if event.qa_passed else 'failed'} for {event.table}")
    elif event.event == CallbackEvent.UPSERT_TABLE_COMPLETE:
        print(f"{event.table}: {event.rows_inserted} inserted, {event.rows_updated} updated")

result = PgUpsert(
    uri="postgresql://user@localhost:5432/mydb",
    tables=("genres", "books"),
    staging_schema="staging",
    base_schema="public",
    do_commit=True,
    callback=on_event,
).run()

CLI

pg-upsert -h localhost -p 5432 -d mydb -u user \
  -s staging -b public \
  -t genres -t publishers -t books -t authors -t book_authors \
  -x rev_user -x rev_time \
  --commit
Option Description
-h, --host Database host
-p, --port Database port (default: 5432)
-d, --database Database name
-u, --user Database user (see Authentication)
-s, --staging-schema Staging schema name (default: staging)
-b, --base-schema Base schema name (default: public)
-e, --encoding Database connection encoding (default: utf-8)
-t, --table Table name to process (repeatable)
-x, --exclude-columns Columns to exclude from upsert (repeatable)
-n, --null-columns Columns to skip during NOT NULL checks (repeatable)
-m, --upsert-method upsert, update, or insert (default: upsert)
-c, --commit Commit changes (default: roll back)
-i, --interactive Prompt for confirmation at each step
-l, --logfile Write log to file (appends, does not overwrite)
-o, --output Output format: text (default) or json
--check-schema Validate column existence and types only, then exit
--compact Use compact grid format for QA summary
--ui Interactive UI: auto (default), textual, or tkinter
--export-failures Directory to write a QA failure fix sheet into
--export-format Fix sheet format: csv (default), json, or xlsx
--export-max-rows Max rows to capture per check per table (default 1000)
--strict-columns Treat all missing staging columns as errors
-f, --config-file Path to YAML configuration file
-g, --generate-config Generate a template config file
-v, --version Show version and exit
--docs Open documentation in browser
--debug Enable debug output

[!NOTE] CLI arguments take precedence over configuration file values. Explicit CLI flags are never overridden by the config file.

Configuration File

Create a YAML config file (see pg-upsert.example.yaml):

debug: false
commit: false
interactive: false
upsert_method: "upsert"  # Options: "upsert", "insert", "update"
logfile: "pg_upsert.log"
host: "localhost"
port: 5432
user: "docker"
database: "dev"
staging_schema: "staging"
base_schema: "public"
encoding: "utf-8"
tables:
  - "authors"
  - "publishers"
  - "books"
  - "book_authors"
  - "genres"
exclude_columns:
  - "rev_time"
  - "rev_user"
null_columns:
  - "book_alias"
output: "text"  # Options: "text", "json"
check_schema: false
compact: false
ui_mode: "auto"  # Options: "auto", "textual", "tkinter"
export_failures: null  # Directory to write QA failure fix sheet; null to disable
export_format: "csv"  # Fix sheet format: "csv", "json", or "xlsx"
export_max_rows: 1000  # Max rows captured per check per table for the fix sheet
strict_columns: false  # Treat all missing staging columns as errors

Run with: pg-upsert -f config.yaml

Docker

docker run -it --rm \
  -v $(pwd):/app \
  ghcr.io/geocoug/pg-upsert:latest \
  -h host.docker.internal -p 5432 -d dev -u docker \
  -s staging -b public -t genres --commit

QA Checks

pg-upsert runs 7 types of QA checks on staging data before upserting:

Check What it validates
Column Existence PK and NOT NULL (no default) columns must exist in staging (error); other missing columns produce warnings. Use --strict-columns for strict mode.
Column Type No hard type incompatibilities between staging and base (uses PostgreSQL's pg_cast catalog)
NOT NULL Non-nullable base columns have no NULL values in staging
Primary Key No duplicate values in PK columns
Unique No duplicate values in UNIQUE-constrained columns (NULLs allowed per PostgreSQL semantics)
Foreign Key All FK references point to existing rows in the referenced table
Check Constraint All CHECK constraint expressions evaluate to true

[!NOTE] pg-upsert is constraint-driven. Data checks pass vacuously when the base table has no constraints of that type, and tables without a primary key are skipped during the upsert step (a warning is printed). To upsert against a table, make sure the base table has a PK. See Running Without Constraints for details.

See the QA Checks Reference for detailed documentation.

Authentication

pg-upsert resolves the database password in this order:

  1. Password in URI (Python API only) — postgresql://user:pass@host/db
  2. PGPASSWORD environment variable — standard PostgreSQL convention, works with both CLI and API
  3. Interactive prompt — if neither of the above is set

For CI/CD pipelines, use PGPASSWORD to avoid interactive prompts:

PGPASSWORD=secret pg-upsert -h host -d db -u user \
  -s staging -b public -t books \
  --output json --commit

pg-upsert also supports PostgreSQL's .pgpass file via psycopg2.

Exit Codes

Code Meaning
0 QA passed and upsert completed (or user cancelled)
1 QA failed, schema check failed, or error

Contributing

See CONTRIBUTING.md for development setup, available recipes, testing, and release process.

git clone https://github.com/geocoug/pg-upsert
cd pg-upsert
just sync
just test

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pg_upsert-1.22.0.tar.gz (1.6 MB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

pg_upsert-1.22.0-py3-none-any.whl (98.6 kB view details)

Uploaded Python 3

File details

Details for the file pg_upsert-1.22.0.tar.gz.

File metadata

  • Download URL: pg_upsert-1.22.0.tar.gz
  • Upload date:
  • Size: 1.6 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for pg_upsert-1.22.0.tar.gz
Algorithm Hash digest
SHA256 6ba63d3990689685464695c897be081e9d774c375228cc0bf1ee44340e6a6a6f
MD5 f12f1b44616a544e520222a5ce8bf9c0
BLAKE2b-256 1af868b0b02d2e76561cfa580bee320eaf1884666d0d8faa2f2873650e907eef

See more details on using hashes here.

Provenance

The following attestation bundles were made for pg_upsert-1.22.0.tar.gz:

Publisher: ci-cd.yml on geocoug/pg-upsert

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file pg_upsert-1.22.0-py3-none-any.whl.

File metadata

  • Download URL: pg_upsert-1.22.0-py3-none-any.whl
  • Upload date:
  • Size: 98.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for pg_upsert-1.22.0-py3-none-any.whl
Algorithm Hash digest
SHA256 c823af17c996ba5cb3ed8a2393882ea4a830a14d04965c090b73164b806f2309
MD5 0c333420eb74b40fd56a29c5db50dfc2
BLAKE2b-256 c3f3965c57af3bc89103d2f04435277c2c5e7e9b2bb9ce0653f434c55a570e4a

See more details on using hashes here.

Provenance

The following attestation bundles were made for pg_upsert-1.22.0-py3-none-any.whl:

Publisher: ci-cd.yml on geocoug/pg-upsert

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page