Data validation library for comparing tables across cloud data warehouses
PyCaroline
A Python library for validating data migrations between cloud data warehouses. Built on datacompy, PyCaroline provides a unified interface for connecting[...]
Named in honor of Caroline 💜
Why PyCaroline?
Data migrations are risky. Whether you're moving from Snowflake to BigQuery, consolidating data warehouses, or validating ETL pipelines, you need confidence that your data arrived intact. PyCaroli[...]
Features
- 🔌 Multi-database support - Snowflake, BigQuery, and Redshift behind a unified API
- 🔍 Flexible comparison - Row-level and column-level comparison with configurable tolerances
- 📊 Rich reports - JSON summaries, CSV details, and beautiful HTML reports
- 🖥️ CLI & Python API - Use from the command line or integrate into your code
- ⚙️ Configuration-driven - YAML config with environment variable substitution
- 🧪 Well-tested - 90%+ test coverage with property-based tests
- 🐍 Modern Python - Supports Python 3.12 and 3.13
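To make "configurable tolerances" concrete, here is a minimal sketch of an absolute-tolerance check in plain Python. It illustrates the general idea only and is not taken from PyCaroline or datacompy; the helper name `within_abs_tol` is hypothetical.

```python
import math


def within_abs_tol(source: float, target: float, abs_tol: float = 0.0001) -> bool:
    """Return True when two numbers differ by no more than abs_tol."""
    # rel_tol=0.0 makes the check purely absolute (hypothetical helper,
    # not part of PyCaroline's API).
    return math.isclose(source, target, rel_tol=0.0, abs_tol=abs_tol)


print(within_abs_tol(1.00005, 1.0))  # True: difference is below 0.0001
print(within_abs_tol(1.0002, 1.0))   # False: difference exceeds 0.0001
```

A tolerance like this is typically useful for floating-point columns, where two warehouses may round or store the same value slightly differently.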
Installation
# Using uv (recommended)
uv add pycaroline
# Using pip
pip install pycaroline
With Database-Specific Dependencies
# Snowflake
uv add "pycaroline[snowflake]"
# BigQuery
uv add "pycaroline[bigquery]"
# Redshift
uv add "pycaroline[redshift]"
# All databases
uv add "pycaroline[all]"
Quick Start
Python API
from pycaroline import DataValidator, ConfigLoader
from pathlib import Path
# Using configuration file
config = ConfigLoader.load(Path("validation_config.yaml"))
validator = DataValidator(config)
results = validator.validate()
for table, result in results.items():
    print(f"{table}: {result.matching_rows}/{result.source_row_count} rows match")
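One way to act on these results, for example in a CI job, is to compute a match rate per table and flag anything below a threshold. The sketch below assumes only the two attributes used in the loop above (`matching_rows` and `source_row_count`). `TableResult` is a hypothetical stand-in for whatever object `validate()` returns, and `failing_tables` is an illustrative helper, not part of PyCaroline.

```python
from dataclasses import dataclass


@dataclass
class TableResult:
    """Hypothetical stand-in exposing the two attributes used above."""

    matching_rows: int
    source_row_count: int


def failing_tables(results: dict[str, TableResult], min_match_rate: float = 1.0) -> list[str]:
    """Return the names of tables whose match rate is below min_match_rate."""
    failed = []
    for table, result in results.items():
        # Treat an empty source table as fully matched to avoid dividing by zero.
        if result.source_row_count:
            rate = result.matching_rows / result.source_row_count
        else:
            rate = 1.0
        if rate < min_match_rate:
            failed.append(table)
    return failed


results = {
    "customers": TableResult(matching_rows=9990, source_row_count=10000),
    "orders": TableResult(matching_rows=500, source_row_count=500),
}
print(failing_tables(results, min_match_rate=0.9995))  # ['customers']
```

A pipeline could exit with a non-zero status whenever the returned list is non-empty.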
Direct DataFrame Comparison
import polars as pl
from pycaroline import DataComparator, ComparisonConfig
source_df = pl.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})
target_df = pl.DataFrame({"id": [1, 2, 4], "value": ["a", "B", "d"]})
comparator = DataComparator(ComparisonConfig(
    join_columns=["id"],
    ignore_case=True,
    ignore_spaces=True,
))
result = comparator.compare(source_df, target_df)
print(f"Matching rows: {result.matching_rows}")
print(f"Rows only in source: {len(result.rows_only_in_source)}")
print(f"Rows only in target: {len(result.rows_only_in_target)}")
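To show what `ignore_case` and `ignore_spaces` imply for the sample data above, here is a dependency-free sketch of the same comparison using plain dictionaries keyed by `id`. It assumes that `ignore_spaces` trims leading and trailing whitespace and that `ignore_case` compares strings case-insensitively. The `normalize` helper is hypothetical, and the sketch does not reproduce PyCaroline's internals or its exact output.

```python
def normalize(value: str, ignore_case: bool = True, ignore_spaces: bool = True) -> str:
    """Apply the assumed normalization before comparing two string values."""
    if ignore_spaces:
        value = value.strip()
    if ignore_case:
        value = value.lower()
    return value


# Same data as the DataFrames above, keyed by the join column "id".
source = {1: "a", 2: "b", 3: "c"}
target = {1: "a", 2: "B", 4: "d"}

common_ids = sorted(source.keys() & target.keys())
matching = [i for i in common_ids if normalize(source[i]) == normalize(target[i])]
only_in_source = sorted(source.keys() - target.keys())
only_in_target = sorted(target.keys() - source.keys())

print(matching)        # [1, 2]  ("b" and "B" match when case is ignored)
print(only_in_source)  # [3]
print(only_in_target)  # [4]
```

With `ignore_case=False`, row 2 would count as a mismatch instead of a match.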
Command Line
# Validate using config file
pycaroline validate --config validation_config.yaml --output ./reports
# Quick comparison
pycaroline compare \
  --source-type snowflake \
  --target-type bigquery \
  --source-table my_schema.customers \
  --target-table my_dataset.customers \
  --join-columns customer_id
Configuration
Create a validation_config.yaml:
source:
  type: snowflake
  connection:
    account: ${SNOWFLAKE_ACCOUNT}
    user: ${SNOWFLAKE_USER}
    password: ${SNOWFLAKE_PASSWORD}
    warehouse: ${SNOWFLAKE_WAREHOUSE}
    database: my_database
target:
  type: bigquery
  connection:
    project: ${GCP_PROJECT}
    credentials_path: ${GOOGLE_APPLICATION_CREDENTIALS}
tables:
  - source_table: customers
    target_table: customers
    join_columns: [customer_id]
    sample_size: 10000  # Optional: limit for large tables
comparison:
  abs_tol: 0.0001
  ignore_case: false
  ignore_spaces: true
output_dir: ./validation_results
Report Output
validation_results/
├── customers_summary.json # Match statistics
├── customers_report.html # Visual HTML report
├── customers_column_stats.csv # Column-level stats
├── customers_rows_only_in_source.csv
├── customers_rows_only_in_target.csv
└── customers_mismatched_rows.csv
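The file names above follow a `<table>_<report>` pattern, so a script can work out which files to look for after a run. The sketch below builds those paths for a given table and reports any that are absent. Both helpers are hypothetical, and whether every file is written on every run (for example when there are no mismatches) is not stated here, so a missing file is not necessarily an error.

```python
from pathlib import Path

# Suffixes taken from the report listing above.
REPORT_SUFFIXES = [
    "summary.json",
    "report.html",
    "column_stats.csv",
    "rows_only_in_source.csv",
    "rows_only_in_target.csv",
    "mismatched_rows.csv",
]


def expected_report_paths(output_dir: str, table: str) -> list[Path]:
    """Build the report paths for one table, following the naming pattern above."""
    return [Path(output_dir) / f"{table}_{suffix}" for suffix in REPORT_SUFFIXES]


def missing_reports(output_dir: str, table: str) -> list[Path]:
    """Return the expected report paths that do not exist on disk."""
    return [path for path in expected_report_paths(output_dir, table) if not path.exists()]


for path in expected_report_paths("validation_results", "customers"):
    print(path)
```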
Documentation
Full documentation is available at https://yourusername.github.io/pycaroline
API Reference
Core Classes
| Class | Description |
|---|---|
| DataValidator | Main orchestrator for validation workflows |
| ConfigLoader | Loads YAML configuration with env var substitution |
| DataComparator | Compares DataFrames using datacompy |
| ReportGenerator | Generates JSON, CSV, and HTML reports |
| ConnectorFactory | Factory for creating database connectors |
Exceptions
| Exception | Description |
|---|---|
| ValidationError | Validation operation failed |
| ConfigurationError | Invalid configuration |
| ConnectionError | Database connection failed |
| QueryError | Query execution failed |
Development
# Clone and install
git clone https://github.com/ryankarlos/pycaroline.git
cd pycaroline
uv sync --all-extras
# Run tests
uv run pytest
# Run with coverage
uv run pytest --cov=pycaroline --cov-report=html
# Serve documentation locally
uv run mkdocs serve
# Lint and format
uv run ruff check .
uv run ruff format .
Contributing
Contributions welcome! Please read CONTRIBUTING.md before submitting PRs.
License
MIT License - see LICENSE for details.