
CLI tool for data quality checks and schema drift detection on CSV, Parquet, and JSON files


Pipedog

An open source data quality and schema drift detection tool for analysts and data engineers. Point it at a CSV, Parquet, or JSON file and it will profile the data, auto-generate quality checks, and flag any drift the next time you scan.


Why Pipedog?

Data pipelines break silently. A column gets renamed upstream, nulls creep into a field that was always clean, a price column suddenly contains strings. These issues reach production before anyone notices.

Pipedog solves this by:

  • Taking a snapshot of your data's structure and statistics on day one.
  • Scanning every new file against that snapshot and failing loudly when something drifts.
  • Explaining what went wrong in plain English, not stack traces.

Installation

With pip (quickest)

pip install pipedog

With Poetry (for development)

git clone https://github.com/JKK-Jishnu/pipedog.git
cd pipedog
poetry install

Dependencies

Package    Purpose
typer      CLI framework
rich       Coloured terminal output
pandas     File reading (CSV, Parquet, JSON)
pyarrow    Parquet support for pandas
duckdb     SQL engine (reserved for future use)
pydantic   Schema validation and JSON I/O

Quick Start

# 1. Profile your file and save a baseline snapshot
pipedog init data/orders.csv

# 2. Tomorrow, when a new file arrives, scan it
pipedog scan data/orders_new.csv

# 3. Explore any file without saving anything
pipedog profile data/orders.csv

Commands

pipedog init <file>

Profiles the file and saves two files to .pipedog/:

  • .pipedog/schema.json — column names, types, null stats, value ranges, timestamps.
  • .pipedog/checks.json — auto-generated quality rules derived from the baseline.

pipedog init sample_data/orders.csv

What gets auto-generated:

Rule        When generated                                 Severity
not_null    Column had zero nulls at init time             error
null_rate   Column had some nulls; threshold = pct + 10    warning
min_value   Numeric column; locks in the observed min      error
max_value   Numeric column; locks in the observed max      error
unique      Every value was distinct (looks like a key)    error

Re-running init refreshes the baseline to the current file.
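The rule-generation logic in the table above can be sketched in plain Python. This is a hypothetical stand-in, not Pipedog's actual code: the function name mirrors generate_checks() from the architecture diagram, and `column` is assumed to be a simple dict of profiled stats.

```python
def generate_checks(column):
    """Derive quality rules from one profiled column.

    `column` is a plain dict of stats (a hypothetical stand-in for
    Pipedog's internal profile objects).
    """
    checks = []
    if column["null_count"] == 0:
        checks.append({"rule": "not_null", "severity": "error"})
    else:
        # Give the observed null rate 10 percentage points of headroom.
        checks.append({"rule": "null_rate",
                       "threshold": column["null_pct"] + 10,
                       "severity": "warning"})
    if column.get("min_value") is not None:
        # Numeric column: lock in the observed range.
        checks.append({"rule": "min_value", "value": column["min_value"],
                       "severity": "error"})
        checks.append({"rule": "max_value", "value": column["max_value"],
                       "severity": "error"})
    if column["unique_count"] == column["row_count"]:
        # Every value distinct: treat the column as a key.
        checks.append({"rule": "unique", "severity": "error"})
    return checks
```

For a fully clean integer key column this would emit not_null, min_value, max_value, and unique; a column with some nulls would instead get a null_rate warning with the +10 threshold.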


pipedog scan <file>

Compares the file against the baseline and runs all quality checks.

pipedog scan sample_data/orders.csv

Exit codes:

  • 0 — all checks passed (warnings are allowed).
  • 1 — one or more error-severity checks failed.

This makes pipedog scan CI/CD friendly: add it to your build and it will fail the pipeline when data quality breaks.

What gets checked:

  1. Schema drift — were columns added, removed, or changed type?
  2. Quality checks — do null rates, value ranges, and uniqueness still match the baseline?
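The schema-drift comparison can be sketched as follows, under the simplifying assumption that each schema is reduced to a {column name: type} mapping (Pipedog's real detect_drift() in scanner.py works on richer objects):

```python
def detect_drift(baseline, current):
    """Compare two {column_name: dtype} mappings and report differences.

    Hypothetical sketch, not Pipedog's actual implementation.
    """
    issues = []
    for name, dtype in baseline.items():
        if name not in current:
            issues.append(f"Column '{name}' existed in the baseline "
                          "but is missing from the current file.")
        elif current[name] != dtype:
            issues.append(f"Column '{name}' changed type: "
                          f"{dtype} -> {current[name]}.")
    for name in current:
        if name not in baseline:
            issues.append(f"Column '{name}' is new (not in the baseline).")
    return issues
```

An empty list means the column structure still matches the baseline; anything else is reported as drift.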

Output example (all passing):

+------------------------------- Pipedog Scan --------------------------------+
| ALL CHECKS PASSED                                                           |
| 10 rows · 7 columns · 17 passed · 0 warnings · 0 failed                    |
+-----------------------------------------------------------------------------+

Passed Checks
  PASS  No nulls found in 'order_id'.
  PASS  'price' maximum is 149.99, within baseline maximum of 149.99.
  ...

Output example (failure):

+------------------------------- Pipedog Scan --------------------------------+
| CHECKS FAILED                                                               |
| 12 rows · 6 columns · 14 passed · 0 warnings · 2 failed                    |
+-----------------------------------------------------------------------------+

Schema Drift Detected
  FAIL  Column 'status' existed in the baseline but is missing from the current file.

Failed Checks
  FAIL  'order_id' has 2 null value(s) (16.67% of rows).

pipedog profile <file>

Shows a data summary without saving anything to disk. Useful for exploring a file before committing to a baseline.

pipedog profile sample_data/orders.csv

Output includes:

  • Total row and column count.
  • Per-column type, null count, null percentage, unique count.
  • Min and max for numeric columns.
  • Up to 3 sample values per column.
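Those statistics are straightforward to compute with pandas. A rough sketch of the profiling step (function name borrowed from the architecture diagram; the exact output shape is an assumption):

```python
import pandas as pd

def profile_dataframe(df):
    """Compute per-column stats: dtype, nulls, uniques, samples, numeric range.

    Hypothetical sketch of Pipedog's profiler, not its actual code.
    """
    profile = []
    for name in df.columns:
        col = df[name]
        entry = {
            "name": name,
            "dtype": str(col.dtype),
            "null_count": int(col.isna().sum()),
            "null_pct": round(100 * col.isna().mean(), 2),
            "unique_count": int(col.nunique()),
            # Up to 3 non-null sample values per column.
            "sample_values": col.dropna().head(3).tolist(),
        }
        if pd.api.types.is_numeric_dtype(col):
            entry["min_value"] = float(col.min())
            entry["max_value"] = float(col.max())
        profile.append(entry)
    return profile
```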

Supported File Types

Extension       Format
.csv            CSV
.parquet, .pq   Parquet
.json           JSON

File type is detected automatically from the extension.
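Extension dispatch like this usually comes down to a small lookup table. A sketch of what load_file() might look like, assuming pandas does the actual reading (the real version lives in profiler.py):

```python
from pathlib import Path

import pandas as pd

# Map each supported extension to a pandas reader.
LOADERS = {
    ".csv": pd.read_csv,
    ".parquet": pd.read_parquet,
    ".pq": pd.read_parquet,
    ".json": pd.read_json,
}

def load_file(path):
    """Read a file into a DataFrame, dispatching on its extension."""
    suffix = Path(path).suffix.lower()
    try:
        return LOADERS[suffix](path)
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix}") from None
```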


How It Works

pipedog init orders.csv
    │
    ├─ load_file()          reads CSV/Parquet/JSON into a DataFrame
    ├─ profile_dataframe()  computes stats for every column
    ├─ generate_checks()    auto-generates quality rules from the stats
    └─ save_snapshot()      writes .pipedog/schema.json + checks.json

pipedog scan orders_new.csv
    │
    ├─ load_file()          reads the new file
    ├─ load_snapshot()      loads baseline from .pipedog/
    ├─ profile_dataframe()  profiles the new file
    ├─ detect_drift()       compares column structure
    ├─ run_quality_checks() evaluates every rule
    └─ print_scan_results() renders colour-coded report, returns exit code

Project Structure

pipedog/
├── pyproject.toml          # Poetry config and PyPI metadata
├── README.md               # This file
├── sample_data/
│   └── orders.csv          # Example file to test with
└── pipedog/
    ├── __init__.py         # Package version
    ├── main.py             # CLI commands (init, scan, profile)
    ├── schema.py           # Pydantic models (ColumnSchema, DataSchema, etc.)
    ├── profiler.py         # File loading, type inference, statistical profiling
    ├── scanner.py          # Drift detection and quality check evaluation
    └── output.py           # Rich terminal output (tables, panels, colours)

Snapshot Files

After running pipedog init, a .pipedog/ directory is created:

.pipedog/
├── schema.json    # baseline column statistics
└── checks.json    # auto-generated quality rules

These files are plain JSON and human-readable. You can commit them to version control to track schema changes over time, or add .pipedog/ to .gitignore to keep them local.

Example .pipedog/schema.json:

{
  "file": "/data/orders.csv",
  "row_count": 10,
  "column_count": 7,
  "columns": [
    {
      "name": "order_id",
      "dtype": "integer",
      "nullable": false,
      "null_count": 0,
      "null_pct": 0.0,
      "unique_count": 10,
      "sample_values": [1, 2, 3],
      "min_value": 1.0,
      "max_value": 10.0,
      "mean_value": 5.5
    }
  ],
  "captured_at": "2026-03-26T18:34:20.123456+00:00"
}
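Since the snapshot is plain JSON, it can be read back with pydantic models along these lines. The field names are taken from the example above; the real models live in pipedog/schema.py and may differ in detail.

```python
import json
from typing import List, Optional

from pydantic import BaseModel

class ColumnSchema(BaseModel):
    name: str
    dtype: str
    nullable: bool
    null_count: int
    null_pct: float
    unique_count: int
    sample_values: list
    # Numeric-only stats are optional for non-numeric columns.
    min_value: Optional[float] = None
    max_value: Optional[float] = None
    mean_value: Optional[float] = None

class DataSchema(BaseModel):
    file: str
    row_count: int
    column_count: int
    columns: List[ColumnSchema]
    captured_at: str

def load_snapshot(path):
    """Parse a schema.json snapshot into a validated DataSchema."""
    with open(path) as f:
        return DataSchema(**json.load(f))
```

Validation failures (a missing field, a wrong type) surface as pydantic errors rather than silent corruption, which is one reason to keep snapshots in a typed format.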

CI/CD Integration

Because pipedog scan exits with code 1 on failure, it drops straight into any CI pipeline:

GitHub Actions:

- name: Check data quality
  run: pipedog scan data/daily_export.csv

Makefile (the recipe line must be indented with a tab, not spaces):

check:
	pipedog scan data/daily_export.csv

Roadmap

  • pipedog diff — side-by-side comparison of two snapshots
  • Custom checks via checks.json (regex patterns, allowed value sets)
  • JSON Lines (.jsonl) support
  • --output json flag for machine-readable scan results
  • Excel (.xlsx) support
  • Slack / webhook notifications on failure

License

MIT

