datatk — Data Toolkit

A CLI toolkit for comparing, analyzing, and exporting data across databases and file formats.

Supports MSSQL, PostgreSQL, Databricks, and Parquet.

Installation

pip install datatk
# or
uv tool install datatk

From Source

Requires Python 3.13+ and uv.

git clone https://github.com/nathanthorell/datatk.git
cd datatk
uv sync --extra dev

Quick Start

datatk --help
datatk data-compare
datatk object-compare
datatk schema-size
datatk db-diagram
datatk export-parquet
datatk proc-tester
datatk view-tester
datatk data-cleanup

Configuration

Environment Variables

Copy .env.example to .env and update the connection strings for your databases:

cp .env.example .env

Connection string formats:

  • MSSQL: Server=host,port;Database=db;UID=user;PWD=pass
  • PostgreSQL: postgresql://user:pass@host:port/database
  • Databricks: databricks://token:ACCESS_TOKEN@host/catalog?http_path=/sql/1.0/warehouses/ID
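
To make the three formats concrete, here is a minimal sketch that splits each one into its parts using only the Python standard library. The function names `parse_mssql` and `parse_url`, the host names, and the credentials are hypothetical and chosen for illustration; this is not how datatk itself reads connection strings.

```python
from urllib.parse import parse_qs, urlsplit


def parse_mssql(conn_str: str) -> dict[str, str]:
    """Split a 'Key=value;Key=value' string into a dict with lower-cased keys."""
    pairs = (item.split("=", 1) for item in conn_str.split(";") if "=" in item)
    return {key.strip().lower(): value.strip() for key, value in pairs}


def parse_url(conn_str: str) -> dict[str, object]:
    """Break a URL-style connection string into its components."""
    url = urlsplit(conn_str)
    return {
        "scheme": url.scheme,
        "user": url.username,
        "password": url.password,
        "host": url.hostname,
        "port": url.port,
        "path": url.path.lstrip("/"),  # database or catalog name
        "query": {k: v[0] for k, v in parse_qs(url.query).items()},
    }


# Placeholder values only; substitute your own servers and credentials.
mssql = parse_mssql("Server=db01,1433;Database=sales;UID=app;PWD=secret")
pg = parse_url("postgresql://app:secret@db02:5432/sales")
dbx = parse_url(
    "databricks://token:dapiEXAMPLE@example.cloud.databricks.com/main"
    "?http_path=/sql/1.0/warehouses/abc123"
)

print(mssql["server"], pg["port"], dbx["query"]["http_path"])
```

Note that the MSSQL format carries host and port together in `Server` (comma-separated), while the URL formats expose them as separate fields.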

Tool Configuration

Copy config-example.toml to config.toml and configure the tools you want to use:

cp config-example.toml config.toml

The [datatk] section sets global defaults (e.g. logging_level) that apply to all tools unless overridden in a tool-specific section.

Tools

data-compare

Compare data across different database platforms.

datatk data-compare
  • Supports MSSQL, PostgreSQL, and Databricks
  • Compare data using inline SQL or query files
  • Output options: left_only, right_only, common, differences, or all
  • Reports differences and execution time per source
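
The output options map naturally onto set operations over row keys. The sketch below is one possible reading, assuming each result set is keyed by a primary key, that "common" means rows identical on both sides, and that "differences" means keys present on both sides with different values. The `compare` function and the sample rows are hypothetical, not datatk's implementation.

```python
def compare(left: dict, right: dict) -> dict[str, list]:
    """Classify row keys from two result sets keyed by primary key."""
    left_keys, right_keys = set(left), set(right)
    shared = left_keys & right_keys
    return {
        "left_only": sorted(left_keys - right_keys),
        "right_only": sorted(right_keys - left_keys),
        "common": sorted(k for k in shared if left[k] == right[k]),
        "differences": sorted(k for k in shared if left[k] != right[k]),
    }


# Sample rows: primary key -> remaining column values.
left_rows = {1: ("Ada", "NY"), 2: ("Bo", "LA"), 3: ("Cy", "SF")}
right_rows = {2: ("Bo", "LA"), 3: ("Cy", "TX"), 4: ("Di", "DC")}

result = compare(left_rows, right_rows)
for category, keys in result.items():
    print(f"{category}: {keys}")
```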

object-compare

Compare database object definitions across environments (DEV, QA, TEST, PROD).

datatk object-compare
  • Supports MSSQL and PostgreSQL
  • Object types: stored procedures, views, functions, tables, triggers, sequences, indexes, types, extensions (PostgreSQL), external tables (MSSQL), and foreign keys
  • Detects objects that exist in only some environments
  • Uses MD5 checksums for efficient definition comparison
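
As an illustration of checksum-based comparison, the sketch below hashes each object definition with `hashlib.md5` and classifies objects as matching, different, or missing from at least one environment. Collapsing whitespace before hashing is an assumption, as are the function names and sample definitions; the README does not say how datatk normalizes definitions.

```python
import hashlib


def definition_checksum(definition: str) -> str:
    """MD5 of a definition after collapsing runs of whitespace (an assumption)."""
    normalized = " ".join(definition.split())
    return hashlib.md5(normalized.encode("utf-8"), usedforsecurity=False).hexdigest()


def compare_environments(definitions: dict[str, dict[str, str]]) -> dict[str, str]:
    """Map each object name to 'match', 'different', or 'missing'."""
    all_objects = set().union(*(objs.keys() for objs in definitions.values()))
    status = {}
    for name in sorted(all_objects):
        checksums = [
            definition_checksum(objs[name])
            for objs in definitions.values()
            if name in objs
        ]
        if len(checksums) < len(definitions):
            status[name] = "missing"  # absent from at least one environment
        elif len(set(checksums)) == 1:
            status[name] = "match"
        else:
            status[name] = "different"
    return status


# Hypothetical definitions per environment.
envs = {
    "DEV": {
        "dbo.get_user": "SELECT id\n  FROM users",
        "dbo.v_orders": "SELECT * FROM orders",
        "dbo.new_proc": "SELECT 1",
    },
    "PROD": {
        "dbo.get_user": "SELECT id FROM users",
        "dbo.v_orders": "SELECT id FROM orders",
    },
}

status = compare_environments(envs)
print(status)
```

Comparing fixed-length digests instead of full definitions keeps the per-object comparison cheap, which is the efficiency the bullet above refers to.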

schema-size

Analyze storage across databases by measuring schema sizes.

datatk schema-size
  • Connects to multiple servers and calculates data and index space in megabytes
  • Summary and detail modes
  • Comparative reports across servers and databases

db-diagram

Generate entity-relationship diagrams (ERDs) from database metadata.

datatk db-diagram
  • Output formats: DBML (default), Mermaid, PlantUML
  • Column display modes: all columns, keys only, or table names only
  • Hierarchical mode: focus on relationships around a specific base table with directional traversal (up, down, or both)
  • Detects relationships from foreign key constraints
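
The hierarchical mode can be pictured as a graph walk over foreign keys. The following sketch assumes that "up" means following foreign keys toward parent tables and "down" means toward child tables; it then emits a Mermaid `erDiagram` for the tables reached. The function names, the `has` relationship label, and the sample schema are hypothetical.

```python
from collections import deque

# Each foreign key is a (child_table, parent_table) pair: the child holds the key.
ForeignKey = tuple[str, str]


def related_tables(fks: list[ForeignKey], base: str, direction: str = "both") -> set[str]:
    """Breadth-first walk from `base` along foreign keys in the given direction."""
    seen = {base}
    queue = deque([base])
    while queue:
        table = queue.popleft()
        for child, parent in fks:
            neighbours = []
            if direction in ("up", "both") and child == table:
                neighbours.append(parent)
            if direction in ("down", "both") and parent == table:
                neighbours.append(child)
            for neighbour in neighbours:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
    return seen


def to_mermaid(fks: list[ForeignKey], tables: set[str]) -> str:
    """Render one-to-many relationships between the selected tables."""
    lines = ["erDiagram"]
    for child, parent in fks:
        if child in tables and parent in tables:
            lines.append(f"    {parent} ||--o{{ {child} : has")
    return "\n".join(lines)


fks = [
    ("orders", "customers"),
    ("order_items", "orders"),
    ("order_items", "products"),
    ("audit_log", "users"),
]

up = related_tables(fks, "orders", "up")
down = related_tables(fks, "orders", "down")
both = related_tables(fks, "orders", "both")
diagram = to_mermaid(fks, up)
print(diagram)
```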

export-parquet

Export database objects to Parquet files.

datatk export-parquet
  • Connects to MSSQL databases and exports tables or query results
  • Configurable batch size
  • Tracks export timing per object
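
To show what a configurable batch size means in practice, this sketch reads a result set in fixed-size chunks with the DB-API `fetchmany` call and times the whole read. It uses an in-memory SQLite table as a stand-in for an MSSQL source and stops short of writing Parquet files; the `fetch_batches` helper and the `events` table are hypothetical.

```python
import sqlite3
import time
from collections.abc import Iterator


def fetch_batches(cursor: sqlite3.Cursor, batch_size: int) -> Iterator[list[tuple]]:
    """Yield lists of at most `batch_size` rows until the cursor is exhausted."""
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            return
        yield rows


# Stand-in source table with ten rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO events (name) VALUES (?)",
    [(f"event-{i}",) for i in range(10)],
)

started = time.perf_counter()
cursor = conn.execute("SELECT id, name FROM events ORDER BY id")
# A real export would hand each batch to a Parquet writer here.
batch_sizes = [len(batch) for batch in fetch_batches(cursor, batch_size=4)]
elapsed = time.perf_counter() - started

print(f"batches: {batch_sizes}, elapsed: {elapsed:.4f}s")
```

Smaller batches keep memory use bounded for large tables, at the cost of more round trips to the server.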

proc-tester

Batch test stored procedures with configurable default parameters.

datatk proc-tester
  • Executes all stored procedures in a configured schema
  • Applies default values for common parameter types
  • Reports execution status and timing
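
One way to picture default parameter values is a lookup from SQL type to a placeholder value. The mapping below, the chosen defaults, and the `build_arguments` helper are all hypothetical; the actual defaults datatk applies are set in its configuration.

```python
import datetime as dt

# Hypothetical defaults per SQL base type.
DEFAULTS_BY_TYPE = {
    "int": 1,
    "bigint": 1,
    "bit": 0,
    "varchar": "",
    "nvarchar": "",
    "date": dt.date(2000, 1, 1),
    "datetime": dt.datetime(2000, 1, 1),
}


def build_arguments(parameters: list[tuple[str, str]]) -> dict[str, object]:
    """Pick a default for each (name, sql_type) pair; unknown types get None (NULL)."""
    arguments = {}
    for name, sql_type in parameters:
        # Strip any length or precision suffix, e.g. NVARCHAR(50) -> nvarchar.
        base_type = sql_type.split("(")[0].strip().lower()
        arguments[name] = DEFAULTS_BY_TYPE.get(base_type)
    return arguments


args = build_arguments([("@user_id", "INT"), ("@name", "NVARCHAR(50)"), ("@payload", "xml")])
print(args)
```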

view-tester

Batch test database views.

datatk view-tester
  • Runs a SELECT TOP 1 * against each view in a configured schema
  • Reports execution status and timing

data-cleanup

Delete data using foreign key hierarchy traversal to handle dependencies automatically.

datatk data-cleanup
  • Traverses foreign key relationships to determine deletion order
  • Summary and execute modes (run summary first to preview)
  • Configurable batch size and threshold
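
Determining a deletion order from foreign keys amounts to a topological sort: rows in a child table must go before the rows they reference. The sketch below uses the standard-library `graphlib.TopologicalSorter`; the `deletion_order` helper and the sample schema are hypothetical and not taken from datatk's source.

```python
from graphlib import TopologicalSorter


def deletion_order(fks: list[tuple[str, str]]) -> list[str]:
    """Order tables so that every child table precedes the parent it references.

    Each foreign key is a (child_table, parent_table) pair.
    """
    sorter = TopologicalSorter()
    for child, parent in fks:
        # The child is registered as a predecessor, so it is emitted before the parent.
        sorter.add(parent, child)
    return list(sorter.static_order())


fks = [
    ("orders", "customers"),
    ("order_items", "orders"),
    ("order_items", "products"),
]

order = deletion_order(fks)
print(order)
```

`static_order()` raises `graphlib.CycleError` when the foreign keys form a cycle, so circular references would need separate handling.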

Development

Linting and Formatting

uv run ruff check src/        # Run ruff linter
uv run ruff check src/ --fix  # Run ruff with auto-fix
uv run mypy src/              # Run mypy type checker
uv run ruff format src/       # Format code with ruff

Or use the Makefile:

make lint    # Run ruff and mypy linters
make format  # Format code with ruff
make clean   # Remove temporary files and virtual environment
