
PyCaroline


A Python library for validating data migrations between cloud data warehouses, cloud storage (S3, GCS), and databases. Built on datacompy, PyCaroline provides a unified interface for connecting to various data sources and comparing datasets with detailed reporting.

Why PyCaroline?

Data migrations are risky. Whether you're moving from Snowflake to BigQuery, consolidating data warehouses, comparing S3 files with database tables, or validating ETL pipelines, you need confidence that your data arrived intact. PyCaroline makes this easy.

Features

  • 🔌 Multi-database support - Snowflake, BigQuery, Redshift, MySQL, PostgreSQL with unified API
  • ☁️ Cloud storage support - Read and compare data from AWS S3 and Google Cloud Storage
  • 📁 File format support - Load and compare Parquet, CSV, Excel, JSON, NDJSON, Avro, IPC/Feather files
  • 📊 Direct DataFrame input - Compare polars, pandas, or snowpark DataFrames directly
  • 🔍 Flexible comparison - Row-level and column-level with configurable tolerances
  • 📈 Rich reports - JSON summaries, CSV details, and beautiful HTML reports
  • 🖥️ CLI & Python API - Use from command line or integrate into your code
  • ⚙️ Configuration-driven - YAML config with environment variable substitution
  • 🧪 Well-tested - 90%+ test coverage with property-based tests
  • 🐍 Modern Python - Supports Python 3.12 and 3.13

Installation

# Using uv (recommended)
uv add pycaroline

# Using pip
pip install pycaroline

With Optional Extras

# Snowflake
uv add "pycaroline[snowflake]"

# BigQuery
uv add "pycaroline[bigquery]"

# Redshift
uv add "pycaroline[redshift]"

# MySQL
uv add "pycaroline[mysql]"

# PostgreSQL
uv add "pycaroline[postgresql]"

# Cloud Storage (S3)
uv add "pycaroline[s3]"

# Cloud Storage (GCS)
uv add "pycaroline[gcs]"

# Pandas DataFrame support
uv add "pycaroline[pandas]"

# All connectors
uv add "pycaroline[all]"

Quick Start

Python API

from pycaroline import DataValidator, ConfigLoader
from pathlib import Path

# Using configuration file
config = ConfigLoader.load(Path("validation_config.yaml"))
validator = DataValidator(config)
results = validator.validate()

for table, result in results.items():
    print(f"{table}: {result.matching_rows}/{result.source_row_count} rows match")

Direct DataFrame Comparison

import polars as pl
from pycaroline import DataComparator, ComparisonConfig

source_df = pl.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})
target_df = pl.DataFrame({"id": [1, 2, 4], "value": ["a", "B", "d"]})

comparator = DataComparator(ComparisonConfig(
    join_columns=["id"],
    ignore_case=True,
    ignore_spaces=True,
))
result = comparator.compare(source_df, target_df)

print(f"Matching rows: {result.matching_rows}")
print(f"Rows only in source: {len(result.rows_only_in_source)}")
print(f"Rows only in target: {len(result.rows_only_in_target)}")

Compare Pandas DataFrames

import pandas as pd
from pycaroline import compare_dataframes

source_df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})
target_df = pd.DataFrame({"id": [1, 2, 4], "value": ["a", "B", "d"]})

result = compare_dataframes(source_df, target_df, join_columns=["id"])
print(f"Matching rows: {result.matching_rows}")

Compare S3 Files

from pycaroline import compare_dataframes
from pycaroline.connectors import S3Connector

with S3Connector(bucket="my-bucket") as conn:
    source_df = conn.query("data/source.parquet")
    target_df = conn.query("data/target.parquet")

result = compare_dataframes(source_df, target_df, join_columns=["id"])

Compare Data Files

from pycaroline import load_file, compare_files

# Load any supported file format
df = load_file("data.parquet")
df = load_file("data.csv", delimiter=";")
df = load_file("data.xlsx", sheet_name="Sheet2")

# Compare two files directly
result = compare_files(
    "source.csv",
    "target.parquet",
    join_columns=["id"],
    ignore_case=True
)
print(f"Matching rows: {result.matching_rows}")

Command Line

# Validate using config file
pycaroline validate --config validation_config.yaml --output ./reports

# Quick comparison
pycaroline compare \
    --source-type snowflake \
    --target-type bigquery \
    --source-table my_schema.customers \
    --target-table my_dataset.customers \
    --join-columns customer_id

# Compare data files
pycaroline compare-files \
    --source source.csv \
    --target target.parquet \
    --join-columns id \
    --output ./reports

Configuration

Create a validation_config.yaml:

source:
  type: snowflake
  connection:
    account: ${SNOWFLAKE_ACCOUNT}
    user: ${SNOWFLAKE_USER}
    password: ${SNOWFLAKE_PASSWORD}
    warehouse: ${SNOWFLAKE_WAREHOUSE}
    database: my_database

target:
  type: bigquery
  connection:
    project: ${GCP_PROJECT}
    credentials_path: ${GOOGLE_APPLICATION_CREDENTIALS}

tables:
  - source_table: customers
    target_table: customers
    join_columns: [customer_id]
    sample_size: 10000  # Optional: limit for large tables

comparison:
  abs_tol: 0.0001
  ignore_case: false
  ignore_spaces: true

output_dir: ./validation_results
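
The ${VAR} placeholders are resolved from the environment when the configuration is loaded. A minimal sketch of driving the same validation from Python, assuming only that the referenced variables are present in the process environment (exported in your shell or set before loading):

import os
from pathlib import Path
from pycaroline import ConfigLoader, DataValidator

# Credentials are normally exported in the shell or injected by your CI system;
# setting them here is only for illustration.
os.environ.setdefault("SNOWFLAKE_ACCOUNT", "my_account")
os.environ.setdefault("SNOWFLAKE_USER", "my_user")
os.environ.setdefault("SNOWFLAKE_PASSWORD", "change-me")
os.environ.setdefault("SNOWFLAKE_WAREHOUSE", "my_wh")
os.environ.setdefault("GCP_PROJECT", "my-project")
os.environ.setdefault("GOOGLE_APPLICATION_CREDENTIALS", "/path/to/key.json")

config = ConfigLoader.load(Path("validation_config.yaml"))
results = DataValidator(config).validate()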

Report Output

validation_results/
├── customers_summary.json       # Match statistics
├── customers_report.html        # Visual HTML report
├── customers_column_stats.csv   # Column-level stats
├── customers_rows_only_in_source.csv
├── customers_rows_only_in_target.csv
└── customers_mismatched_rows.csv
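
The same statistics are available programmatically, which is useful for gating a CI step on the outcome. A minimal sketch using only the result fields shown in the Quick Start above, treating any missing or extra rows as a failure:

import sys
from pathlib import Path
from pycaroline import ConfigLoader, DataValidator

config = ConfigLoader.load(Path("validation_config.yaml"))
results = DataValidator(config).validate()

# Fail the build if any table has rows that did not match one-to-one.
failed = [
    table for table, result in results.items()
    if result.matching_rows < result.source_row_count
]
if failed:
    print(f"Validation failed for: {', '.join(failed)}")
    sys.exit(1)
print("All tables match.")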

Documentation

Full documentation is available at https://ryankarlos.github.io/pycaroline

API Reference

Core Classes

  • DataValidator - Main orchestrator for validation workflows
  • ConfigLoader - Loads YAML configuration with environment variable substitution
  • DataComparator - Compares DataFrames using datacompy
  • ReportGenerator - Generates JSON, CSV, and HTML reports
  • ConnectorFactory - Factory for creating database connectors
  • FileLoader - Loads files into polars DataFrames

Functions

  • load_file - Load any supported file format into a DataFrame
  • compare_files - Compare two files and return comparison results
  • compare_dataframes - Compare two DataFrames directly

Exceptions

  • ValidationError - Validation operation failed
  • ConfigurationError - Invalid configuration
  • ConnectionError - Database connection failed
  • QueryError - Query execution failed
  • FileLoadError - File loading failed
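
A hedged sketch of handling these exceptions, assuming they are importable from the top-level pycaroline package (the import location is not shown above):

from pathlib import Path
from pycaroline import (  # import location assumed
    ConfigLoader,
    DataValidator,
    ConfigurationError,
    ConnectionError as PyCarolineConnectionError,  # aliased to avoid shadowing the builtin
    ValidationError,
)

try:
    config = ConfigLoader.load(Path("validation_config.yaml"))
    results = DataValidator(config).validate()
except ConfigurationError as exc:
    print(f"Invalid configuration: {exc}")
except PyCarolineConnectionError as exc:
    print(f"Could not connect to a source or target: {exc}")
except ValidationError as exc:
    print(f"Validation run failed: {exc}")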

Development

# Clone and install
git clone https://github.com/ryankarlos/pycaroline.git
cd pycaroline
uv sync --all-extras

# Run tests
uv run pytest

# Run with coverage
uv run pytest --cov=pycaroline --cov-report=html

# Serve documentation locally
uv run mkdocs serve

# Lint and format
uv run ruff check .
uv run ruff format .

Contributing

Contributions welcome! Please read CONTRIBUTING.md before submitting PRs.

License

MIT License - see LICENSE for details.
