Data validation library for comparing tables across cloud data warehouses, cloud storage, and databases

Project description

PyCaroline

A Python library for validating data migrations between cloud data warehouses, cloud storage (S3, GCS), and databases. Built on datacompy, PyCaroline provides a unified interface for connecting to various data sources and comparing datasets with detailed reporting.

Why PyCaroline?

Data migrations are risky. Whether you're moving from Snowflake to BigQuery, consolidating data warehouses, comparing S3 files with database tables, or validating ETL pipelines, you need confidence that your data arrived intact. PyCaroline makes that verification easy.

Features

  • 🔌 Multi-database support - Snowflake, BigQuery, Redshift, MySQL, PostgreSQL with unified API
  • ☁️ Cloud storage support - Read and compare data from AWS S3 and Google Cloud Storage
  • 📊 Direct DataFrame input - Compare polars, pandas, or snowpark DataFrames directly
  • 🔍 Flexible comparison - Row-level and column-level with configurable tolerances
  • 📈 Rich reports - JSON summaries, CSV details, and beautiful HTML reports
  • 🖥️ CLI & Python API - Use from command line or integrate into your code
  • ⚙️ Configuration-driven - YAML config with environment variable substitution
  • 🧪 Well-tested - 90%+ test coverage with property-based tests
  • 🐍 Modern Python - Supports Python 3.12 and 3.13

Installation

# Using uv (recommended)
uv add pycaroline

# Using pip
pip install pycaroline

With Database-Specific Dependencies

# Snowflake
uv add "pycaroline[snowflake]"

# BigQuery
uv add "pycaroline[bigquery]"

# Redshift
uv add "pycaroline[redshift]"

# MySQL
uv add "pycaroline[mysql]"

# PostgreSQL
uv add "pycaroline[postgresql]"

# Cloud Storage (S3)
uv add "pycaroline[s3]"

# Cloud Storage (GCS)
uv add "pycaroline[gcs]"

# Pandas DataFrame support
uv add "pycaroline[pandas]"

# All connectors
uv add "pycaroline[all]"

Quick Start

Python API

from pycaroline import DataValidator, ConfigLoader
from pathlib import Path

# Using configuration file
config = ConfigLoader.load(Path("validation_config.yaml"))
validator = DataValidator(config)
results = validator.validate()

for table, result in results.items():
    print(f"{table}: {result.matching_rows}/{result.source_row_count} rows match")

Direct DataFrame Comparison

import polars as pl
from pycaroline import DataComparator, ComparisonConfig

source_df = pl.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})
target_df = pl.DataFrame({"id": [1, 2, 4], "value": ["a", "B", "d"]})

comparator = DataComparator(ComparisonConfig(
    join_columns=["id"],
    ignore_case=True,
    ignore_spaces=True,
))
result = comparator.compare(source_df, target_df)

print(f"Matching rows: {result.matching_rows}")
print(f"Rows only in source: {len(result.rows_only_in_source)}")
print(f"Rows only in target: {len(result.rows_only_in_target)}")
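With ignore_case and ignore_spaces enabled, the "B" in the target above still matches the "b" in the source. Conceptually, values are normalized before they are compared. A minimal stdlib sketch of that idea (an illustration of the semantics, not PyCaroline's actual implementation):

```python
def normalize(value: str, ignore_case: bool = True, ignore_spaces: bool = True) -> str:
    """Normalize a cell value the way a case/space-insensitive comparison would."""
    if ignore_spaces:
        value = value.strip()
    if ignore_case:
        value = value.lower()
    return value

def values_match(a: str, b: str) -> bool:
    return normalize(a) == normalize(b)

# " B " in the target matches "b" in the source once case and spaces are ignored
print(values_match("b", " B "))  # True
```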

Compare Pandas DataFrames

import pandas as pd
from pycaroline import compare_dataframes

source_df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})
target_df = pd.DataFrame({"id": [1, 2, 4], "value": ["a", "B", "d"]})

result = compare_dataframes(source_df, target_df, join_columns=["id"])
print(f"Matching rows: {result.matching_rows}")
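For numeric columns, the abs_tol setting shown in the configuration section below bounds how far two values may drift apart and still count as equal. The underlying check is conceptually just an absolute-difference test, sketched here with the stdlib (an illustration, not the library's internals):

```python
import math

ABS_TOL = 0.0001  # plays the same role as the abs_tol comparison setting

def numbers_match(source: float, target: float, abs_tol: float = ABS_TOL) -> bool:
    """Treat two numeric cells as equal if they differ by at most abs_tol."""
    return math.isclose(source, target, rel_tol=0.0, abs_tol=abs_tol)

print(numbers_match(1.00004, 1.00009))  # True: difference is 5e-05
print(numbers_match(1.0, 1.001))        # False: difference is 1e-03
```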

Compare S3 Files

from pycaroline import compare_dataframes
from pycaroline.connectors import S3Connector

with S3Connector(bucket="my-bucket") as conn:
    source_df = conn.query("data/source.parquet")
    target_df = conn.query("data/target.parquet")

result = compare_dataframes(source_df, target_df, join_columns=["id"])

Command Line

# Validate using config file
pycaroline validate --config validation_config.yaml --output ./reports

# Quick comparison
pycaroline compare \
    --source-type snowflake \
    --target-type bigquery \
    --source-table my_schema.customers \
    --target-table my_dataset.customers \
    --join-columns customer_id

Configuration

Create a validation_config.yaml:

source:
  type: snowflake
  connection:
    account: ${SNOWFLAKE_ACCOUNT}
    user: ${SNOWFLAKE_USER}
    password: ${SNOWFLAKE_PASSWORD}
    warehouse: ${SNOWFLAKE_WAREHOUSE}
    database: my_database

target:
  type: bigquery
  connection:
    project: ${GCP_PROJECT}
    credentials_path: ${GOOGLE_APPLICATION_CREDENTIALS}

tables:
  - source_table: customers
    target_table: customers
    join_columns: [customer_id]
    sample_size: 10000  # Optional: limit for large tables

comparison:
  abs_tol: 0.0001
  ignore_case: false
  ignore_spaces: true

output_dir: ./validation_results
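The ${VAR} placeholders above are filled from environment variables when the config is loaded, so credentials stay out of the YAML file. A minimal sketch of how that kind of substitution typically works (stdlib only; this is not PyCaroline's internal code):

```python
import os
import re

_VAR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")

def substitute_env(text: str) -> str:
    """Replace ${NAME} placeholders with values from the environment."""
    def replace(match: re.Match) -> str:
        name = match.group(1)
        try:
            return os.environ[name]
        except KeyError:
            raise ValueError(f"Environment variable {name!r} is not set") from None
    return _VAR.sub(replace, text)

os.environ["SNOWFLAKE_ACCOUNT"] = "my_account"
print(substitute_env("account: ${SNOWFLAKE_ACCOUNT}"))  # account: my_account
```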

Report Output

validation_results/
├── customers_summary.json       # Match statistics
├── customers_report.html        # Visual HTML report
├── customers_column_stats.csv   # Column-level stats
├── customers_rows_only_in_source.csv
├── customers_rows_only_in_target.csv
└── customers_mismatched_rows.csv
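The JSON summary is a convenient artifact to gate a CI pipeline on. The exact field names depend on your release, so treat the keys below (matching_rows, source_row_count) as assumptions to verify against a real summary file before relying on them:

```python
import json
import tempfile
from pathlib import Path

def fully_matched(summary_path: Path) -> bool:
    """True when every source row found a matching target row."""
    summary = json.loads(summary_path.read_text())
    # Key names are assumptions; check them against a real customers_summary.json.
    return summary["matching_rows"] == summary["source_row_count"]

# Demo with a hand-written file standing in for customers_summary.json
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "customers_summary.json"
    sample.write_text(json.dumps({"matching_rows": 9998, "source_row_count": 10000}))
    print(fully_matched(sample))  # False: 2 source rows had no match
```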

Documentation

Full documentation is available at https://ryankarlos.github.io/pycaroline

API Reference

Core Classes

  • DataValidator - Main orchestrator for validation workflows
  • ConfigLoader - Loads YAML configuration with environment variable substitution
  • DataComparator - Compares DataFrames using datacompy
  • ReportGenerator - Generates JSON, CSV, and HTML reports
  • ConnectorFactory - Factory for creating database connectors

Exceptions

  • ValidationError - Validation operation failed
  • ConfigurationError - Invalid configuration
  • ConnectionError - Database connection failed
  • QueryError - Query execution failed

Development

# Clone and install
git clone https://github.com/ryankarlos/pycaroline.git
cd pycaroline
uv sync --all-extras

# Run tests
uv run pytest

# Run with coverage
uv run pytest --cov=pycaroline --cov-report=html

# Serve documentation locally
uv run mkdocs serve

# Lint and format
uv run ruff check .
uv run ruff format .

Contributing

Contributions welcome! Please read CONTRIBUTING.md before submitting PRs.

License

MIT License - see LICENSE for details.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pycaroline-0.2.tar.gz (76.3 kB)

Uploaded Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

pycaroline-0.2.0-py3-none-any.whl (44.8 kB)

Uploaded Python 3

pycaroline-0.2-py3-none-any.whl (44.7 kB)

Uploaded Python 3

File details

Details for the file pycaroline-0.2.tar.gz.

File metadata

  • Download URL: pycaroline-0.2.tar.gz
  • Upload date:
  • Size: 76.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for pycaroline-0.2.tar.gz
Algorithm Hash digest
SHA256 f9db51d69ee23fa2c6acd9d78406fad42cf04ec45d30135f4cc2d1ea473fae43
MD5 0eec317e37ffe94c7d83dbc44fb433fc
BLAKE2b-256 9f0ec612bfbf94404a019c8348e38eb2e0396d1d51f74375ea1d5b9f1d2596c9

See more details on using hashes here.

Provenance

The following attestation bundles were made for pycaroline-0.2.tar.gz:

Publisher: publish.yml on ryankarlos/pycaroline

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file pycaroline-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: pycaroline-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 44.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for pycaroline-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 51c62c523c253a888fe86dc2e5c0c4b0047d4dfd403f9711584e4de7ca3d210e
MD5 a5e5cad739314d9dbf63f44768429316
BLAKE2b-256 0a0b16b672cb4c4d9775678ebd951ac65b667ebc96328690dcedcc0d155550bf

See more details on using hashes here.

Provenance

The following attestation bundles were made for pycaroline-0.2.0-py3-none-any.whl:

Publisher: publish.yml on ryankarlos/pycaroline

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file pycaroline-0.2-py3-none-any.whl.

File metadata

  • Download URL: pycaroline-0.2-py3-none-any.whl
  • Upload date:
  • Size: 44.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for pycaroline-0.2-py3-none-any.whl
Algorithm Hash digest
SHA256 177751e06a187dd92df4de23f6b44045e70112f7c2de4b1dd859a86c11835dd7
MD5 da350c7fb1d503610b3b4cef69f2f938
BLAKE2b-256 a8d9177d7e1061655854efc715c666641d445cc0af42c8ad0d2dbdbe04b31f23

See more details on using hashes here.

Provenance

The following attestation bundles were made for pycaroline-0.2-py3-none-any.whl:

Publisher: publish.yml on ryankarlos/pycaroline

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
