
PyCaroline

A Python library for validating data migrations between cloud data warehouses. Built on datacompy, PyCaroline provides a unified interface for connecting to Snowflake, BigQuery, and Redshift and comparing tables across them.

Named in honor of Caroline 💜

Why PyCaroline?

Data migrations are risky. Whether you're moving from Snowflake to BigQuery, consolidating data warehouses, or validating ETL pipelines, you need confidence that your data arrived intact. PyCaroline automates that check: it compares source and target tables and reports exactly which rows and columns differ.

Features

  • 🔌 Multi-database support - Snowflake, BigQuery, Redshift with unified API
  • 🔍 Flexible comparison - Row-level and column-level with configurable tolerances
  • 📊 Rich reports - JSON summaries, CSV details, and beautiful HTML reports
  • 🖥️ CLI & Python API - Use from command line or integrate into your code
  • ⚙️ Configuration-driven - YAML config with environment variable substitution
  • 🧪 Well-tested - 90%+ test coverage with property-based tests
  • 🐍 Modern Python - Supports Python 3.12 and 3.13

Installation

# Using uv (recommended)
uv add pycaroline

# Using pip
pip install pycaroline

With Database-Specific Dependencies

# Snowflake
uv add "pycaroline[snowflake]"

# BigQuery
uv add "pycaroline[bigquery]"

# Redshift
uv add "pycaroline[redshift]"

# All databases
uv add "pycaroline[all]"

Quick Start

Python API

from pathlib import Path

from pycaroline import ConfigLoader, DataValidator

# Using configuration file
config = ConfigLoader.load(Path("validation_config.yaml"))
validator = DataValidator(config)
results = validator.validate()

for table, result in results.items():
    print(f"{table}: {result.matching_rows}/{result.source_row_count} rows match")
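A natural next step is to gate a pipeline on the results. The sketch below uses a hypothetical stand-in for the per-table result, modeling only the two fields used above (matching_rows, source_row_count); PyCaroline's real result object may expose more:

```python
from collections import namedtuple

# Hypothetical stand-in for a per-table validation result; only the
# two fields used in the example above are modeled.
Result = namedtuple("Result", ["matching_rows", "source_row_count"])

results = {
    "customers": Result(matching_rows=9_998, source_row_count=10_000),
    "orders": Result(matching_rows=500, source_row_count=500),
}

# A table fails when any source row did not find a matching target row.
failures = {
    table: r for table, r in results.items()
    if r.matching_rows < r.source_row_count
}

for table, r in failures.items():
    print(f"{table}: {r.source_row_count - r.matching_rows} rows differ")

passed = not failures
print("validation passed" if passed else "validation failed")
```

In CI you would typically turn passed into the process exit code, e.g. sys.exit(0 if passed else 1).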

Direct DataFrame Comparison

import polars as pl
from pycaroline import DataComparator, ComparisonConfig

source_df = pl.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})
target_df = pl.DataFrame({"id": [1, 2, 4], "value": ["a", "B", "d"]})

comparator = DataComparator(ComparisonConfig(
    join_columns=["id"],
    ignore_case=True,
    ignore_spaces=True,
))
result = comparator.compare(source_df, target_df)

print(f"Matching rows: {result.matching_rows}")
print(f"Rows only in source: {len(result.rows_only_in_source)}")
print(f"Rows only in target: {len(result.rows_only_in_target)}")
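Conceptually, this keyed comparison is a join on the join_columns followed by bucketing: keys present on only one side, and shared keys whose normalized values match or differ. A minimal pure-Python sketch of that idea (illustrative only, not PyCaroline's datacompy-backed implementation):

```python
def normalize(value):
    """Strip and lower-case strings, mirroring ignore_case/ignore_spaces."""
    return value.strip().lower() if isinstance(value, str) else value

def compare_rows(source, target, key="id"):
    """Bucket rows by join key: only-in-source, only-in-target, match, mismatch."""
    src = {row[key]: row for row in source}
    tgt = {row[key]: row for row in target}
    only_in_source = sorted(src.keys() - tgt.keys())
    only_in_target = sorted(tgt.keys() - src.keys())
    matching, mismatched = [], []
    for k in src.keys() & tgt.keys():
        s = {col: normalize(v) for col, v in src[k].items()}
        t = {col: normalize(v) for col, v in tgt[k].items()}
        (matching if s == t else mismatched).append(k)
    return only_in_source, only_in_target, sorted(matching), sorted(mismatched)

source = [{"id": 1, "value": "a"}, {"id": 2, "value": "b"}, {"id": 3, "value": "c"}]
target = [{"id": 1, "value": "a"}, {"id": 2, "value": "B"}, {"id": 4, "value": "d"}]

only_src, only_tgt, matching, mismatched = compare_rows(source, target)
print(only_src)    # [3]
print(only_tgt)    # [4]
print(matching)    # [1, 2] -- "B" normalizes to "b", so id 2 matches
```

Note that with case folding enabled, id 2 lands in the matching bucket even though the raw values "b" and "B" differ.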

Command Line

# Validate using config file
pycaroline validate --config validation_config.yaml --output ./reports

# Quick comparison
pycaroline compare \
    --source-type snowflake \
    --target-type bigquery \
    --source-table my_schema.customers \
    --target-table my_dataset.customers \
    --join-columns customer_id

Configuration

Create a validation_config.yaml:

source:
  type: snowflake
  connection:
    account: ${SNOWFLAKE_ACCOUNT}
    user: ${SNOWFLAKE_USER}
    password: ${SNOWFLAKE_PASSWORD}
    warehouse: ${SNOWFLAKE_WAREHOUSE}
    database: my_database

target:
  type: bigquery
  connection:
    project: ${GCP_PROJECT}
    credentials_path: ${GOOGLE_APPLICATION_CREDENTIALS}

tables:
  - source_table: customers
    target_table: customers
    join_columns: [customer_id]
    sample_size: 10000  # Optional: limit for large tables

comparison:
  abs_tol: 0.0001
  ignore_case: false
  ignore_spaces: true

output_dir: ./validation_results
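The ${VAR} placeholders are resolved from environment variables when the config is loaded, which keeps credentials out of the file. The substitution pattern itself is simple; a minimal stdlib sketch (illustrative, not ConfigLoader's actual implementation, and the leave-unknown-names-intact behavior is an assumption):

```python
import os
import re

def substitute_env(text: str) -> str:
    """Replace ${VAR} with the value of VAR; leave unknown names intact."""
    return re.sub(
        r"\$\{(\w+)\}",
        lambda m: os.environ.get(m.group(1), m.group(0)),
        text,
    )

os.environ["SNOWFLAKE_ACCOUNT"] = "acme-prod"
print(substitute_env("account: ${SNOWFLAKE_ACCOUNT}"))  # account: acme-prod
print(substitute_env("user: ${UNSET_VAR}"))             # user: ${UNSET_VAR}
```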

Report Output

validation_results/
├── customers_summary.json       # Match statistics
├── customers_report.html        # Visual HTML report
├── customers_column_stats.csv   # Column-level stats
├── customers_rows_only_in_source.csv
├── customers_rows_only_in_target.csv
└── customers_mismatched_rows.csv

Documentation

Full documentation is available at https://ryankarlos.github.io/pycaroline

API Reference

Core Classes

Class              Description
DataValidator      Main orchestrator for validation workflows
ConfigLoader       Loads YAML configuration with env var substitution
DataComparator     Compares DataFrames using datacompy
ReportGenerator    Generates JSON, CSV, and HTML reports
ConnectorFactory   Factory for creating database connectors

Exceptions

Exception            Description
ValidationError      Validation operation failed
ConfigurationError   Invalid configuration
ConnectionError      Database connection failed
QueryError           Query execution failed

Development

# Clone and install
git clone https://github.com/ryankarlos/pycaroline.git
cd pycaroline
uv sync --all-extras

# Run tests
uv run pytest

# Run with coverage
uv run pytest --cov=pycaroline --cov-report=html

# Serve documentation locally
uv run mkdocs serve

# Lint and format
uv run ruff check .
uv run ruff format .

Contributing

Contributions welcome! Please read CONTRIBUTING.md before submitting PRs.

License

MIT License - see LICENSE for details.
