
PyCaroline


A Python library for validating data migrations between cloud data warehouses, cloud storage (S3, GCS), and databases. Built on datacompy, PyCaroline provides a unified interface for connecting to various data sources and comparing datasets with detailed reporting.

Why PyCaroline?

Data migrations are risky. Whether you're moving from Snowflake to BigQuery, consolidating data warehouses, comparing S3 files with database tables, or validating ETL pipelines, you need confidence that your data arrived intact. PyCaroline makes this easy.

Features

  • 🔌 Multi-database support - Snowflake, BigQuery, Redshift, MySQL, PostgreSQL with unified API
  • ☁️ Cloud storage support - Read and compare data from AWS S3 and Google Cloud Storage
  • 📁 File format support - Load and compare Parquet, CSV, Excel, JSON, NDJSON, Avro, IPC/Feather files
  • 📊 Direct DataFrame input - Compare polars, pandas, or snowpark DataFrames directly
  • 🔍 Flexible comparison - Row-level and column-level with configurable tolerances
  • 📈 Rich reports - JSON summaries, CSV details, and beautiful HTML reports
  • 🖥️ CLI & Python API - Use from command line or integrate into your code
  • ⚙️ Configuration-driven - YAML config with environment variable substitution
  • 🧪 Well-tested - 90%+ test coverage with property-based tests
  • 🐍 Modern Python - Supports Python 3.12 and 3.13

Installation

# Using uv (recommended)
uv add pycaroline

# Using pip
pip install pycaroline

With Optional Extras

# Snowflake
uv add "pycaroline[snowflake]"

# BigQuery
uv add "pycaroline[bigquery]"

# Redshift
uv add "pycaroline[redshift]"

# MySQL
uv add "pycaroline[mysql]"

# PostgreSQL
uv add "pycaroline[postgresql]"

# Cloud Storage (S3)
uv add "pycaroline[s3]"

# Cloud Storage (GCS)
uv add "pycaroline[gcs]"

# Pandas DataFrame support
uv add "pycaroline[pandas]"

# All connectors
uv add "pycaroline[all]"

Quick Start

Python API

from pycaroline import DataValidator, ConfigLoader
from pathlib import Path

# Using configuration file
config = ConfigLoader.load(Path("validation_config.yaml"))
validator = DataValidator(config)
results = validator.validate()

for table, result in results.items():
    print(f"{table}: {result.matching_rows}/{result.source_row_count} rows match")

Direct DataFrame Comparison

import polars as pl
from pycaroline import DataComparator, ComparisonConfig

source_df = pl.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})
target_df = pl.DataFrame({"id": [1, 2, 4], "value": ["a", "B", "d"]})

comparator = DataComparator(ComparisonConfig(
    join_columns=["id"],
    ignore_case=True,
    ignore_spaces=True,
))
result = comparator.compare(source_df, target_df)

print(f"Matching rows: {result.matching_rows}")
print(f"Rows only in source: {len(result.rows_only_in_source)}")
print(f"Rows only in target: {len(result.rows_only_in_target)}")

Compare Pandas DataFrames

import pandas as pd
from pycaroline import compare_dataframes

source_df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})
target_df = pd.DataFrame({"id": [1, 2, 4], "value": ["a", "B", "d"]})

result = compare_dataframes(source_df, target_df, join_columns=["id"])
print(f"Matching rows: {result.matching_rows}")

Compare S3 Files

from pycaroline import compare_dataframes
from pycaroline.connectors import S3Connector

with S3Connector(bucket="my-bucket") as conn:
    source_df = conn.query("data/source.parquet")
    target_df = conn.query("data/target.parquet")

result = compare_dataframes(source_df, target_df, join_columns=["id"])

Compare Data Files

from pycaroline import load_file, compare_files

# Load any supported file format
df = load_file("data.parquet")
df = load_file("data.csv", delimiter=";")
df = load_file("data.xlsx", sheet_name="Sheet2")

# Compare two files directly
result = compare_files(
    "source.csv",
    "target.parquet",
    join_columns=["id"],
    ignore_case=True
)
print(f"Matching rows: {result.matching_rows}")

Command Line

# Validate using config file
pycaroline validate --config validation_config.yaml --output ./reports

# Quick comparison
pycaroline compare \
    --source-type snowflake \
    --target-type bigquery \
    --source-table my_schema.customers \
    --target-table my_dataset.customers \
    --join-columns customer_id

# Compare data files
pycaroline compare-files \
    --source source.csv \
    --target target.parquet \
    --join-columns id \
    --output ./reports

Configuration

Create a validation_config.yaml:

source:
  type: snowflake
  connection:
    account: ${SNOWFLAKE_ACCOUNT}
    user: ${SNOWFLAKE_USER}
    password: ${SNOWFLAKE_PASSWORD}
    warehouse: ${SNOWFLAKE_WAREHOUSE}
    database: my_database

target:
  type: bigquery
  connection:
    project: ${GCP_PROJECT}
    credentials_path: ${GOOGLE_APPLICATION_CREDENTIALS}

tables:
  - source_table: customers
    target_table: customers
    join_columns: [customer_id]
    sample_size: 10000  # Optional: limit for large tables

comparison:
  abs_tol: 0.0001
  ignore_case: false
  ignore_spaces: true

output_dir: ./validation_results
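
The ${VAR} placeholders are resolved from the environment when the configuration is loaded. A minimal sketch of driving the same validation from Python, assuming only that the referenced variables are present in the process environment (exported in your shell or set before loading):

import os
from pathlib import Path
from pycaroline import ConfigLoader, DataValidator

# Credentials are normally exported in the shell or injected by your CI system;
# setting them here is only for illustration.
os.environ.setdefault("SNOWFLAKE_ACCOUNT", "my_account")
os.environ.setdefault("SNOWFLAKE_USER", "my_user")
os.environ.setdefault("SNOWFLAKE_PASSWORD", "change-me")
os.environ.setdefault("SNOWFLAKE_WAREHOUSE", "my_wh")
os.environ.setdefault("GCP_PROJECT", "my-project")
os.environ.setdefault("GOOGLE_APPLICATION_CREDENTIALS", "/path/to/key.json")

config = ConfigLoader.load(Path("validation_config.yaml"))
results = DataValidator(config).validate()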

Report Output

validation_results/
├── customers_summary.json       # Match statistics
├── customers_report.html        # Visual HTML report
├── customers_column_stats.csv   # Column-level stats
├── customers_rows_only_in_source.csv
├── customers_rows_only_in_target.csv
└── customers_mismatched_rows.csv
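
The same statistics are available programmatically, which is useful for gating a CI step on the outcome. A minimal sketch using only the result fields shown in the Quick Start above, treating any missing or extra rows as a failure:

import sys
from pathlib import Path
from pycaroline import ConfigLoader, DataValidator

config = ConfigLoader.load(Path("validation_config.yaml"))
results = DataValidator(config).validate()

# Fail the build if any table has rows that did not match one-to-one.
failed = [
    table for table, result in results.items()
    if result.matching_rows < result.source_row_count
]
if failed:
    print(f"Validation failed for: {', '.join(failed)}")
    sys.exit(1)
print("All tables match.")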

Documentation

Full documentation is available at https://ryankarlos.github.io/pycaroline

API Reference

Core Classes

  • DataValidator - Main orchestrator for validation workflows
  • ConfigLoader - Loads YAML configuration with environment variable substitution
  • DataComparator - Compares DataFrames using datacompy
  • ReportGenerator - Generates JSON, CSV, and HTML reports
  • ConnectorFactory - Factory for creating database connectors
  • FileLoader - Loads files into polars DataFrames

Functions

  • load_file - Load any supported file format into a DataFrame
  • compare_files - Compare two files and return comparison results
  • compare_dataframes - Compare two DataFrames directly

Exceptions

  • ValidationError - Validation operation failed
  • ConfigurationError - Invalid configuration
  • ConnectionError - Database connection failed
  • QueryError - Query execution failed
  • FileLoadError - File loading failed
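
A hedged sketch of handling these exceptions, assuming they are importable from the top-level pycaroline package (the import location is not shown above):

from pathlib import Path
from pycaroline import (  # import location assumed
    ConfigLoader,
    DataValidator,
    ConfigurationError,
    ConnectionError as PyCarolineConnectionError,  # aliased to avoid shadowing the builtin
    ValidationError,
)

try:
    config = ConfigLoader.load(Path("validation_config.yaml"))
    results = DataValidator(config).validate()
except ConfigurationError as exc:
    print(f"Invalid configuration: {exc}")
except PyCarolineConnectionError as exc:
    print(f"Could not connect to a source or target: {exc}")
except ValidationError as exc:
    print(f"Validation run failed: {exc}")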

Development

# Clone and install
git clone https://github.com/ryankarlos/pycaroline.git
cd pycaroline
uv sync --all-extras

# Run tests
uv run pytest

# Run with coverage
uv run pytest --cov=pycaroline --cov-report=html

# Serve documentation locally
uv run mkdocs serve

# Lint and format
uv run ruff check .
uv run ruff format .

Contributing

Contributions welcome! Please read CONTRIBUTING.md before submitting PRs.

License

MIT License - see LICENSE for details.
