datatk — Data Toolkit
A CLI toolkit for comparing, analyzing, and exporting data across databases and file formats.
Supports MSSQL, PostgreSQL, Databricks, and Parquet.
Installation
pip install datatk
# or
uv tool install datatk
From Source
Requires Python 3.13+ and uv.
git clone https://github.com/nathanthorell/datatk.git
cd datatk
uv sync --extra dev
Quick Start
datatk --help
datatk data-compare
datatk object-compare
datatk schema-size
datatk db-diagram
datatk export-parquet
datatk proc-tester
datatk view-tester
datatk data-cleanup
Configuration
Environment Variables
Copy .env.example to .env and update the connection strings for your databases:
cp .env.example .env
Connection string formats:
- MSSQL: Server=host,port;Database=db;UID=user;PWD=pass
- MSSQL (Azure AD interactive): Server=host,port;Database=db;Authentication=ActiveDirectoryInteractive (opens browser for Entra ID login; UID is optional as a login hint)
- PostgreSQL: postgresql://user:pass@host:port/database
- Databricks: databricks://token:ACCESS_TOKEN@host/catalog?http_path=/sql/1.0/warehouses/ID
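To make the two families of formats concrete, here is a minimal Python sketch that splits a semicolon-delimited MSSQL string and a URL-style PostgreSQL or Databricks string into their parts. The helper names parse_mssql and parse_url are hypothetical and use only the standard library; this is not how datatk itself reads connection strings.

```python
from urllib.parse import parse_qs, urlsplit


def parse_mssql(conn_str: str) -> dict[str, str]:
    """Split a 'Key=value;Key=value' string into a dict (keys kept as written)."""
    parts = (item.split("=", 1) for item in conn_str.split(";") if item.strip())
    return {key.strip(): value.strip() for key, value in parts}


def parse_url(conn_str: str) -> dict[str, object]:
    """Break a URL-style connection string into its components."""
    url = urlsplit(conn_str)
    return {
        "scheme": url.scheme,
        "user": url.username,
        "password": url.password,
        "host": url.hostname,
        "port": url.port,
        "path": url.path.lstrip("/"),
        # Keep the first value of each query parameter, e.g. http_path.
        "params": {k: v[0] for k, v in parse_qs(url.query).items()},
    }


# Placeholder values only; the port numbers are examples.
mssql = parse_mssql("Server=host,1433;Database=db;UID=user;PWD=pass")
pg = parse_url("postgresql://user:pass@host:5432/database")
dbx = parse_url(
    "databricks://token:ACCESS_TOKEN@host/catalog?http_path=/sql/1.0/warehouses/ID"
)
print(mssql["Server"], pg["port"], dbx["params"]["http_path"])
```

Note that the MSSQL format separates host and port with a comma, while the URL formats use a colon.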
Tool Configuration
Copy config-example.toml to config.toml and configure the tools you want to use:
cp config-example.toml config.toml
The [datatk] section sets global defaults (e.g. logging_level) that apply to all tools unless
overridden in a tool-specific section.
Tools
data-compare
Compare data across different database platforms.
datatk data-compare
- Supports MSSQL, PostgreSQL, Databricks, and local files (Parquet, CSV, JSON)
- Compare data using inline SQL, query files, or file paths (db_type = "file", query set to the file path)
- Output options: left_only, right_only, common, differences, or all
- Optional case-insensitive string comparison (case_insensitive), configurable globally or per comparison
- Reports differences and execution time per source (suppress with show_performance = false)
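As a rough illustration of the left_only, right_only, and common outputs, the following sketch classifies rows with plain set operations and optionally lower-cases strings first. It assumes whole-row matching on in-memory tuples; the compare and normalize functions are hypothetical, and the differences and all outputs are left out because their exact semantics are not described here.

```python
Row = tuple[object, ...]


def normalize(row: Row, case_insensitive: bool) -> Row:
    """Lower-case string values when case-insensitive comparison is requested."""
    if not case_insensitive:
        return row
    return tuple(v.lower() if isinstance(v, str) else v for v in row)


def compare(
    left: list[Row], right: list[Row], case_insensitive: bool = False
) -> dict[str, set[Row]]:
    """Classify rows as left-only, right-only, or common using set operations."""
    left_set = {normalize(r, case_insensitive) for r in left}
    right_set = {normalize(r, case_insensitive) for r in right}
    return {
        "left_only": left_set - right_set,
        "right_only": right_set - left_set,
        "common": left_set & right_set,
    }


# Sample rows standing in for two query results.
left_rows = [(1, "Alice"), (2, "Bob")]
right_rows = [(1, "alice"), (3, "Cara")]

strict = compare(left_rows, right_rows)
loose = compare(left_rows, right_rows, case_insensitive=True)
print(strict["common"], loose["common"])
```

With case-sensitive matching the two sources share no rows; with case_insensitive enabled, the first row on each side is treated as the same.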
object-compare
Compare database object definitions across environments (DEV, QA, TEST, PROD).
datatk object-compare
- Supports MSSQL and PostgreSQL
- Object types: stored procedures, views, functions, tables, triggers, sequences, indexes, types, extensions (PostgreSQL), external tables (MSSQL), and foreign keys
- Detects objects that exist in only some environments
- Uses MD5 checksums for efficient definition comparison
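The checksum approach can be sketched as follows: hash each definition, then compare digests instead of full text. This is a simplified illustration with hypothetical names (checksum, compare_objects) and sample definitions; any normalization datatk applies before hashing is not shown.

```python
import hashlib


def checksum(definition: str) -> str:
    """Return the MD5 hex digest of an object definition."""
    # usedforsecurity=False marks this as a fingerprint, not a security hash.
    return hashlib.md5(definition.encode("utf-8"), usedforsecurity=False).hexdigest()


def compare_objects(envs: dict[str, dict[str, str]]) -> dict[str, str]:
    """Label each object as 'match', 'different', or 'missing in <envs>'."""
    all_names = sorted(set().union(*(objs.keys() for objs in envs.values())))
    report = {}
    for name in all_names:
        missing = [env for env, objs in envs.items() if name not in objs]
        if missing:
            report[name] = "missing in " + ", ".join(missing)
            continue
        digests = {checksum(objs[name]) for objs in envs.values()}
        report[name] = "match" if len(digests) == 1 else "different"
    return report


# Sample definitions keyed by environment, then by object name.
environments = {
    "DEV": {"dbo.get_user": "SELECT 1", "dbo.v_orders": "SELECT * FROM orders"},
    "PROD": {
        "dbo.get_user": "SELECT 2",
        "dbo.v_orders": "SELECT * FROM orders",
        "dbo.legacy": "SELECT 0",
    },
}
report = compare_objects(environments)
print(report)
```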
schema-size
Analyze storage across databases by measuring schema sizes.
datatk schema-size
- Connects to multiple servers and calculates data and index space in megabytes
- Summary and detail modes
- Comparative reports across servers and databases
db-diagram
Generate ERD diagrams from database metadata.
datatk db-diagram
- Output formats: DBML (default), Mermaid, PlantUML
- Column display modes: all columns, keys only, or table names only
- Hierarchical mode: focus on relationships around a specific base table with directional traversal (up, down, or both)
- Detects relationships from foreign key constraints
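Hierarchical mode can be thought of as a graph walk over foreign keys. The sketch below assumes that "up" means following a table's foreign keys to the tables it references and "down" means visiting the tables that reference it; that reading, the sample FOREIGN_KEYS list, and both function names are assumptions for illustration. The output uses Mermaid's erDiagram relationship syntax.

```python
from collections import deque

# (child_table, parent_table): the child holds the foreign key. Sample data only.
FOREIGN_KEYS = [
    ("orders", "customers"),
    ("order_items", "orders"),
    ("order_items", "products"),
    ("customers", "regions"),
]


def related_tables(base: str, direction: str = "both") -> set[str]:
    """Breadth-first walk of the foreign key graph starting at `base`."""
    seen = {base}
    queue = deque([base])
    while queue:
        table = queue.popleft()
        neighbours = []
        for child, parent in FOREIGN_KEYS:
            if direction in ("up", "both") and child == table:
                neighbours.append(parent)
            if direction in ("down", "both") and parent == table:
                neighbours.append(child)
        for name in neighbours:
            if name not in seen:
                seen.add(name)
                queue.append(name)
    return seen


def to_mermaid(tables: set[str]) -> str:
    """Render the foreign keys among `tables` as a Mermaid erDiagram."""
    lines = ["erDiagram"]
    for child, parent in FOREIGN_KEYS:
        if child in tables and parent in tables:
            lines.append(f"    {parent} ||--o{{ {child} : has")
    return "\n".join(lines)


print(to_mermaid(related_tables("orders", direction="up")))
```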
export-parquet
Export database objects to Parquet files.
datatk export-parquet
- Connects to MSSQL databases and exports tables or query results
- Configurable batch size
- Tracks export timing per object
proc-tester
Batch test stored procedures with configurable default parameters.
datatk proc-tester
- Executes all stored procedures in a configured schema
- Applies default values for common parameter types
- Reports execution status and timing
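One way to picture the default-parameter step is a lookup table keyed by SQL type. The mapping and helper names below are assumptions chosen for illustration; the defaults datatk actually applies, and how they are configured, may differ.

```python
# Assumed defaults keyed by base SQL type; datatk's real defaults may differ.
TYPE_DEFAULTS: dict[str, object] = {
    "int": 0,
    "bigint": 0,
    "bit": 0,
    "decimal": 0,
    "varchar": "",
    "nvarchar": "",
    "date": "1900-01-01",
    "datetime": "1900-01-01 00:00:00",
}


def base_type(sql_type: str) -> str:
    """Reduce 'NVARCHAR(50)' to 'nvarchar' so it can be looked up."""
    return sql_type.split("(", 1)[0].strip().lower()


def default_arguments(parameters: dict[str, str]) -> dict[str, object]:
    """Pick a default for each parameter; unknown types fall back to None (NULL)."""
    return {
        name: TYPE_DEFAULTS.get(base_type(sql_type))
        for name, sql_type in parameters.items()
    }


# Hypothetical procedure signature: parameter name -> declared SQL type.
args = default_arguments({"@user_id": "int", "@name": "NVARCHAR(50)", "@payload": "xml"})
print(args)
```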
view-tester
Batch test database views.
datatk view-tester
- Runs a SELECT TOP 1 * against each view in a configured schema
- Reports execution status and timing
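The check loop can be sketched as: build one SELECT TOP 1 * statement per view, run it, and record status and elapsed time. In this sketch the execute callable stands in for a real database cursor, and check_views and fake_execute are hypothetical names, so it illustrates the pattern rather than datatk's implementation.

```python
import time
from collections.abc import Callable


def check_views(
    schema: str, views: list[str], execute: Callable[[str], object]
) -> list[dict[str, object]]:
    """Run a one-row query against each view and record status and timing."""
    results = []
    for view in views:
        sql = f"SELECT TOP 1 * FROM [{schema}].[{view}]"
        start = time.perf_counter()
        try:
            execute(sql)
            status = "ok"
        except Exception as exc:  # a failing view should not stop the batch
            status = f"failed: {exc}"
        elapsed = time.perf_counter() - start
        results.append({"view": view, "sql": sql, "status": status, "seconds": elapsed})
    return results


def fake_execute(sql: str) -> None:
    """Stand-in for a database call: fails for any view named like 'broken'."""
    if "broken" in sql:
        raise RuntimeError("invalid object name")


results = check_views("dbo", ["v_orders", "v_broken"], fake_execute)
for row in results:
    print(row["view"], row["status"])
```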
data-cleanup
Delete data using foreign key hierarchy traversal to handle dependencies automatically.
datatk data-cleanup
- Traverses foreign key relationships to determine deletion order
- Summary and execute modes (run summary first to preview)
- Configurable batch size and threshold
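Determining a safe deletion order from foreign keys is a topological sort: rows in referencing (child) tables must go before rows in the tables they reference. The sketch below uses Python's standard graphlib module on a made-up set of foreign keys; deletion_order is a hypothetical helper, and datatk's own traversal may work differently.

```python
from graphlib import TopologicalSorter

# (child_table, parent_table): the child holds the foreign key. Sample data only.
FOREIGN_KEYS = [
    ("orders", "customers"),
    ("order_items", "orders"),
    ("order_items", "products"),
]


def deletion_order(foreign_keys: list[tuple[str, str]]) -> list[str]:
    """Order tables so every child is emptied before the parent it references."""
    sorter: TopologicalSorter[str] = TopologicalSorter()
    for child, parent in foreign_keys:
        # A parent can only be cleaned once the child has been cleaned.
        sorter.add(parent, child)
    return list(sorter.static_order())


order = deletion_order(FOREIGN_KEYS)
print(order)
```

Circular foreign key references make static_order raise graphlib.CycleError, so a real tool needs a separate strategy for that case.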
Development
Linting and Formatting
uv run ruff check src/ # Run ruff linter
uv run ruff check src/ --fix # Run ruff with auto-fix
uv run mypy src/ # Run mypy type checker
uv run ruff format src/ # Format code with ruff
Or use the Makefile:
make lint # Run ruff and mypy linters
make format # Format code with ruff
make clean # Remove temporary files and virtual environment