
PyCaroline

A Python library for validating data migrations between cloud data warehouses. Built on datacompy, PyCaroline provides a unified interface for connecting to Snowflake, BigQuery, and Redshift and comparing tables across them.

Named in honor of Caroline 💜

Why PyCaroline?

Data migrations are risky. Whether you're moving from Snowflake to BigQuery, consolidating data warehouses, or validating ETL pipelines, you need confidence that your data arrived intact. PyCaroline automates that check: it compares source and target tables and reports exactly which rows and columns differ.

Features

  • 🔌 Multi-database support - Snowflake, BigQuery, Redshift with unified API
  • 🔍 Flexible comparison - Row-level and column-level with configurable tolerances
  • 📊 Rich reports - JSON summaries, CSV details, and beautiful HTML reports
  • 🖥️ CLI & Python API - Use from command line or integrate into your code
  • ⚙️ Configuration-driven - YAML config with environment variable substitution
  • 🧪 Well-tested - 90%+ test coverage with property-based tests
  • 🐍 Modern Python - Supports Python 3.12 and 3.13

Installation

# Using uv (recommended)
uv add pycaroline

# Using pip
pip install pycaroline

With Database-Specific Dependencies

# Snowflake
uv add "pycaroline[snowflake]"

# BigQuery
uv add "pycaroline[bigquery]"

# Redshift
uv add "pycaroline[redshift]"

# All databases
uv add "pycaroline[all]"

Quick Start

Python API

from pathlib import Path

from pycaroline import ConfigLoader, DataValidator

# Using configuration file
config = ConfigLoader.load(Path("validation_config.yaml"))
validator = DataValidator(config)
results = validator.validate()

for table, result in results.items():
    print(f"{table}: {result.matching_rows}/{result.source_row_count} rows match")
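A natural next step is to gate a pipeline on the results. The sketch below uses a hypothetical stand-in for the per-table result, modeling only the two fields used above (matching_rows, source_row_count); PyCaroline's real result object may expose more:

```python
from collections import namedtuple

# Hypothetical stand-in for a per-table validation result; only the
# two fields used in the example above are modeled.
Result = namedtuple("Result", ["matching_rows", "source_row_count"])

results = {
    "customers": Result(matching_rows=9_998, source_row_count=10_000),
    "orders": Result(matching_rows=500, source_row_count=500),
}

# A table fails when any source row did not find a matching target row.
failures = {
    table: r for table, r in results.items()
    if r.matching_rows < r.source_row_count
}

for table, r in failures.items():
    print(f"{table}: {r.source_row_count - r.matching_rows} rows differ")

passed = not failures
print("validation passed" if passed else "validation failed")
```

In CI you would typically turn passed into the process exit code, e.g. sys.exit(0 if passed else 1).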

Direct DataFrame Comparison

import polars as pl
from pycaroline import DataComparator, ComparisonConfig

source_df = pl.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})
target_df = pl.DataFrame({"id": [1, 2, 4], "value": ["a", "B", "d"]})

comparator = DataComparator(ComparisonConfig(
    join_columns=["id"],
    ignore_case=True,
    ignore_spaces=True,
))
result = comparator.compare(source_df, target_df)

print(f"Matching rows: {result.matching_rows}")
print(f"Rows only in source: {len(result.rows_only_in_source)}")
print(f"Rows only in target: {len(result.rows_only_in_target)}")
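Conceptually, this keyed comparison is a join on the join_columns followed by bucketing: keys present on only one side, and shared keys whose normalized values match or differ. A minimal pure-Python sketch of that idea (illustrative only, not PyCaroline's datacompy-backed implementation):

```python
def normalize(value):
    """Strip and lower-case strings, mirroring ignore_case/ignore_spaces."""
    return value.strip().lower() if isinstance(value, str) else value

def compare_rows(source, target, key="id"):
    """Bucket rows by join key: only-in-source, only-in-target, match, mismatch."""
    src = {row[key]: row for row in source}
    tgt = {row[key]: row for row in target}
    only_in_source = sorted(src.keys() - tgt.keys())
    only_in_target = sorted(tgt.keys() - src.keys())
    matching, mismatched = [], []
    for k in src.keys() & tgt.keys():
        s = {col: normalize(v) for col, v in src[k].items()}
        t = {col: normalize(v) for col, v in tgt[k].items()}
        (matching if s == t else mismatched).append(k)
    return only_in_source, only_in_target, sorted(matching), sorted(mismatched)

source = [{"id": 1, "value": "a"}, {"id": 2, "value": "b"}, {"id": 3, "value": "c"}]
target = [{"id": 1, "value": "a"}, {"id": 2, "value": "B"}, {"id": 4, "value": "d"}]

only_src, only_tgt, matching, mismatched = compare_rows(source, target)
print(only_src)    # [3]
print(only_tgt)    # [4]
print(matching)    # [1, 2] -- "B" normalizes to "b", so id 2 matches
```

Note that with case folding enabled, id 2 lands in the matching bucket even though the raw values "b" and "B" differ.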

Command Line

# Validate using config file
pycaroline validate --config validation_config.yaml --output ./reports

# Quick comparison
pycaroline compare \
    --source-type snowflake \
    --target-type bigquery \
    --source-table my_schema.customers \
    --target-table my_dataset.customers \
    --join-columns customer_id

Configuration

Create a validation_config.yaml:

source:
  type: snowflake
  connection:
    account: ${SNOWFLAKE_ACCOUNT}
    user: ${SNOWFLAKE_USER}
    password: ${SNOWFLAKE_PASSWORD}
    warehouse: ${SNOWFLAKE_WAREHOUSE}
    database: my_database

target:
  type: bigquery
  connection:
    project: ${GCP_PROJECT}
    credentials_path: ${GOOGLE_APPLICATION_CREDENTIALS}

tables:
  - source_table: customers
    target_table: customers
    join_columns: [customer_id]
    sample_size: 10000  # Optional: limit for large tables

comparison:
  abs_tol: 0.0001
  ignore_case: false
  ignore_spaces: true

output_dir: ./validation_results
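The ${VAR} placeholders are resolved from environment variables when the config is loaded, which keeps credentials out of the file. The substitution pattern itself is simple; a minimal stdlib sketch (illustrative, not ConfigLoader's actual implementation, and the leave-unknown-names-intact behavior is an assumption):

```python
import os
import re

def substitute_env(text: str) -> str:
    """Replace ${VAR} with the value of VAR; leave unknown names intact."""
    return re.sub(
        r"\$\{(\w+)\}",
        lambda m: os.environ.get(m.group(1), m.group(0)),
        text,
    )

os.environ["SNOWFLAKE_ACCOUNT"] = "acme-prod"
print(substitute_env("account: ${SNOWFLAKE_ACCOUNT}"))  # account: acme-prod
print(substitute_env("user: ${UNSET_VAR}"))             # user: ${UNSET_VAR}
```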

Report Output

validation_results/
├── customers_summary.json       # Match statistics
├── customers_report.html        # Visual HTML report
├── customers_column_stats.csv   # Column-level stats
├── customers_rows_only_in_source.csv
├── customers_rows_only_in_target.csv
└── customers_mismatched_rows.csv

Documentation

Full documentation is available at https://ryankarlos.github.io/pycaroline

API Reference

Core Classes

Class              Description
DataValidator      Main orchestrator for validation workflows
ConfigLoader       Loads YAML configuration with env var substitution
DataComparator     Compares DataFrames using datacompy
ReportGenerator    Generates JSON, CSV, and HTML reports
ConnectorFactory   Factory for creating database connectors

Exceptions

Exception            Description
ValidationError      Validation operation failed
ConfigurationError   Invalid configuration
ConnectionError      Database connection failed
QueryError           Query execution failed

Development

# Clone and install
git clone https://github.com/ryankarlos/pycaroline.git
cd pycaroline
uv sync --all-extras

# Run tests
uv run pytest

# Run with coverage
uv run pytest --cov=pycaroline --cov-report=html

# Serve documentation locally
uv run mkdocs serve

# Lint and format
uv run ruff check .
uv run ruff format .

Contributing

Contributions welcome! Please read CONTRIBUTING.md before submitting PRs.

License

MIT License - see LICENSE for details.
