# PyCaroline

Data validation library for comparing tables across cloud data warehouses, cloud storage, and databases.
A Python library for validating data migrations between cloud data warehouses, cloud storage (S3, GCS), and databases. Built on datacompy, PyCaroline provides a unified interface for connecting to various data sources and comparing datasets with detailed reporting.
## Why PyCaroline?
Data migrations are risky. Whether you're moving from Snowflake to BigQuery, consolidating data warehouses, comparing S3 files with database tables, or validating ETL pipelines, you need confidence that your data arrived intact. PyCaroline makes this easy.
## Features
- 🔌 **Multi-database support** - Snowflake, BigQuery, Redshift, MySQL, and PostgreSQL behind a unified API
- ☁️ **Cloud storage support** - Read and compare data from AWS S3 and Google Cloud Storage
- 📁 **File format support** - Load and compare Parquet, CSV, Excel, JSON, NDJSON, Avro, and IPC/Feather files
- 📊 **Direct DataFrame input** - Compare polars, pandas, or snowpark DataFrames directly
- 🔍 **Flexible comparison** - Row-level and column-level checks with configurable tolerances
- 📈 **Rich reports** - JSON summaries, CSV details, and polished HTML reports
- 🖥️ **CLI & Python API** - Use from the command line or integrate into your code
- ⚙️ **Configuration-driven** - YAML config with environment variable substitution
- 🧪 **Well-tested** - 90%+ test coverage with property-based tests
- 🐍 **Modern Python** - Supports Python 3.12 and 3.13
## Installation
```bash
# Using uv (recommended)
uv add pycaroline

# Using pip
pip install pycaroline
```
### With Database-Specific Dependencies
```bash
# Snowflake
uv add "pycaroline[snowflake]"

# BigQuery
uv add "pycaroline[bigquery]"

# Redshift
uv add "pycaroline[redshift]"

# MySQL
uv add "pycaroline[mysql]"

# PostgreSQL
uv add "pycaroline[postgresql]"

# Cloud Storage (S3)
uv add "pycaroline[s3]"

# Cloud Storage (GCS)
uv add "pycaroline[gcs]"

# Pandas DataFrame support
uv add "pycaroline[pandas]"

# All connectors
uv add "pycaroline[all]"
```
## Quick Start

### Python API
```python
from pathlib import Path

from pycaroline import ConfigLoader, DataValidator

# Using a configuration file
config = ConfigLoader.load(Path("validation_config.yaml"))
validator = DataValidator(config)
results = validator.validate()

for table, result in results.items():
    print(f"{table}: {result.matching_rows}/{result.source_row_count} rows match")
```
### Direct DataFrame Comparison
```python
import polars as pl

from pycaroline import ComparisonConfig, DataComparator

source_df = pl.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})
target_df = pl.DataFrame({"id": [1, 2, 4], "value": ["a", "B", "d"]})

comparator = DataComparator(ComparisonConfig(
    join_columns=["id"],
    ignore_case=True,
    ignore_spaces=True,
))

result = comparator.compare(source_df, target_df)
print(f"Matching rows: {result.matching_rows}")
print(f"Rows only in source: {len(result.rows_only_in_source)}")
print(f"Rows only in target: {len(result.rows_only_in_target)}")
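The semantics of `rows_only_in_source`, `rows_only_in_target`, and `matching_rows` can be illustrated with a plain-Python sketch: partition the join keys into "both sides", "source only", and "target only", then compare non-key values after normalization. This mirrors the intended behavior of the options above, not PyCaroline's actual implementation:

```python
def normalize(value, ignore_case=True, ignore_spaces=True):
    """Apply the ignore_case / ignore_spaces options to a single cell."""
    if isinstance(value, str):
        if ignore_spaces:
            value = value.strip()
        if ignore_case:
            value = value.lower()
    return value

def split_rows(source, target, join_columns):
    """Partition join keys: present in both, only in source, only in target."""
    src_keys = {tuple(row[c] for c in join_columns) for row in source}
    tgt_keys = {tuple(row[c] for c in join_columns) for row in target}
    return {
        "matched_keys": src_keys & tgt_keys,
        "only_in_source": src_keys - tgt_keys,
        "only_in_target": tgt_keys - src_keys,
    }

source = [{"id": 1, "value": "a"}, {"id": 2, "value": "b"}, {"id": 3, "value": "c"}]
target = [{"id": 1, "value": "a"}, {"id": 2, "value": "B"}, {"id": 4, "value": "d"}]

parts = split_rows(source, target, ["id"])

# Among shared keys, a row "matches" when every non-key column agrees
# after normalization ("b" vs "B" matches because ignore_case=True).
src_by_key = {(row["id"],): row for row in source}
tgt_by_key = {(row["id"],): row for row in target}
matching = [
    key for key in parts["matched_keys"]
    if normalize(src_by_key[key]["value"]) == normalize(tgt_by_key[key]["value"])
]

print(sorted(parts["only_in_source"]))  # [(3,)]
print(sorted(parts["only_in_target"]))  # [(4,)]
print(len(matching))                    # 2
```

With the same sample data as above, key `3` exists only in the source, key `4` only in the target, and both shared rows match once case is ignored.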
### Compare Pandas DataFrames
```python
import pandas as pd

from pycaroline import compare_dataframes

source_df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})
target_df = pd.DataFrame({"id": [1, 2, 4], "value": ["a", "B", "d"]})

result = compare_dataframes(source_df, target_df, join_columns=["id"])
print(f"Matching rows: {result.matching_rows}")
```
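Datacompy-style comparisons essentially reduce to an outer merge on the join columns. A hedged pandas sketch of that underlying idea (this is plain pandas, not PyCaroline's code):

```python
import pandas as pd

source_df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})
target_df = pd.DataFrame({"id": [1, 2, 4], "value": ["a", "B", "d"]})

# indicator=True adds a _merge column telling us where each key was found
merged = source_df.merge(
    target_df, on="id", how="outer", suffixes=("_src", "_tgt"), indicator=True
)

only_in_source = merged[merged["_merge"] == "left_only"]
only_in_target = merged[merged["_merge"] == "right_only"]
shared = merged[merged["_merge"] == "both"]

# Case-sensitive value comparison among shared keys ("b" != "B" here)
matching = shared[shared["value_src"] == shared["value_tgt"]]

print(len(only_in_source), len(only_in_target), len(matching))  # 1 1 1
```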
### Compare S3 Files
```python
from pycaroline import compare_dataframes
from pycaroline.connectors import S3Connector

with S3Connector(bucket="my-bucket") as conn:
    source_df = conn.query("data/source.parquet")
    target_df = conn.query("data/target.parquet")

result = compare_dataframes(source_df, target_df, join_columns=["id"])
```
### Compare Data Files
```python
from pycaroline import compare_files, load_file

# Load any supported file format
df = load_file("data.parquet")
df = load_file("data.csv", delimiter=";")
df = load_file("data.xlsx", sheet_name="Sheet2")

# Compare two files directly
result = compare_files(
    "source.csv",
    "target.parquet",
    join_columns=["id"],
    ignore_case=True,
)
print(f"Matching rows: {result.matching_rows}")
```
### Command Line
```bash
# Validate using a config file
pycaroline validate --config validation_config.yaml --output ./reports

# Quick comparison
pycaroline compare \
    --source-type snowflake \
    --target-type bigquery \
    --source-table my_schema.customers \
    --target-table my_dataset.customers \
    --join-columns customer_id

# Compare data files
pycaroline compare-files \
    --source source.csv \
    --target target.parquet \
    --join-columns id \
    --output ./reports
```
## Configuration

Create a `validation_config.yaml`:
```yaml
source:
  type: snowflake
  connection:
    account: ${SNOWFLAKE_ACCOUNT}
    user: ${SNOWFLAKE_USER}
    password: ${SNOWFLAKE_PASSWORD}
    warehouse: ${SNOWFLAKE_WAREHOUSE}
    database: my_database

target:
  type: bigquery
  connection:
    project: ${GCP_PROJECT}
    credentials_path: ${GOOGLE_APPLICATION_CREDENTIALS}

tables:
  - source_table: customers
    target_table: customers
    join_columns: [customer_id]
    sample_size: 10000  # Optional: limit for large tables

comparison:
  abs_tol: 0.0001
  ignore_case: false
  ignore_spaces: true

output_dir: ./validation_results
```
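The `${VAR}` placeholders are resolved from environment variables before the config is parsed. A minimal sketch of how such substitution is commonly implemented (the regex and error handling here are illustrative, not PyCaroline's actual code):

```python
import os
import re

_VAR = re.compile(r"\$\{([A-Za-z0-9_]+)\}")

def substitute_env(text: str) -> str:
    """Replace ${NAME} with os.environ['NAME']; fail loudly if unset."""
    def repl(match: re.Match) -> str:
        name = match.group(1)
        try:
            return os.environ[name]
        except KeyError:
            raise KeyError(f"environment variable {name!r} is not set") from None
    return _VAR.sub(repl, text)

os.environ["SNOWFLAKE_ACCOUNT"] = "acme-prod"
print(substitute_env("account: ${SNOWFLAKE_ACCOUNT}"))  # account: acme-prod
```

Failing fast on an unset variable is usually preferable to silently producing an empty connection parameter.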
## Report Output
```
validation_results/
├── customers_summary.json            # Match statistics
├── customers_report.html             # Visual HTML report
├── customers_column_stats.csv        # Column-level stats
├── customers_rows_only_in_source.csv
├── customers_rows_only_in_target.csv
└── customers_mismatched_rows.csv
```
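The per-table summary JSON can be consumed programmatically, e.g. as a CI gate on the match rate. The field names below are hypothetical; inspect a generated `customers_summary.json` for the actual schema:

```python
import json

# Hypothetical summary payload -- these field names are illustrative
# assumptions, not PyCaroline's documented schema.
summary = json.loads("""
{
  "table": "customers",
  "source_row_count": 10000,
  "target_row_count": 10000,
  "matching_rows": 9998
}
""")

match_rate = summary["matching_rows"] / summary["source_row_count"]
print(f"{summary['table']}: {match_rate:.2%} of rows match")  # customers: 99.98% of rows match
```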
## Documentation
Full documentation is available at https://yourusername.github.io/pycaroline
## API Reference

### Core Classes
| Class | Description |
|---|---|
| `DataValidator` | Main orchestrator for validation workflows |
| `ConfigLoader` | Loads YAML configuration with env var substitution |
| `DataComparator` | Compares DataFrames using datacompy |
| `ReportGenerator` | Generates JSON, CSV, and HTML reports |
| `ConnectorFactory` | Factory for creating database connectors |
| `FileLoader` | Loads files into polars DataFrames |
### Functions

| Function | Description |
|---|---|
| `load_file` | Load any supported file format into a DataFrame |
| `compare_files` | Compare two files and return comparison results |
| `compare_dataframes` | Compare two DataFrames directly |
### Exceptions

| Exception | Description |
|---|---|
| `ValidationError` | Validation operation failed |
| `ConfigurationError` | Invalid configuration |
| `ConnectionError` | Database connection failed |
| `QueryError` | Query execution failed |
| `FileLoadError` | File loading failed |
## Development
```bash
# Clone and install
git clone https://github.com/ryankarlos/pycaroline.git
cd pycaroline
uv sync --all-extras

# Run tests
uv run pytest

# Run with coverage
uv run pytest --cov=pycaroline --cov-report=html

# Serve documentation locally
uv run mkdocs serve

# Lint and format
uv run ruff check .
uv run ruff format .
```
## Contributing

Contributions are welcome! Please read CONTRIBUTING.md before submitting PRs.
## License

MIT License - see LICENSE for details.