tablediff-arrow

Fast, file-based diffs for Parquet/CSV/Arrow (local or S3) with keyed comparisons, per-column tolerances, and HTML/CSV reports—built on Apache Arrow.

Features

  • Fast: Built on Apache Arrow for high-performance data processing
  • Multiple Formats: Support for Parquet, CSV, and Arrow IPC files
  • S3 Support: Read files directly from S3 (optional)
  • Keyed Comparisons: Compare tables using one or more key columns
  • Numeric Tolerances: Configure absolute and relative tolerances for numeric columns
  • Rich Reports: Generate HTML and CSV reports with detailed differences
  • Python 3.10+: Modern Python with type hints and clean APIs
  • Well Tested: Comprehensive test suite with high coverage

Installation

pip install tablediff-arrow

For S3 support:

pip install "tablediff-arrow[s3]"

For development:

pip install -e ".[dev]"

Quick Start

Command Line Interface

Compare two Parquet files using id as the key column:

tablediff left.parquet right.parquet -k id

Compare with numeric tolerance:

tablediff left.csv right.csv -k id -t amount:0.01

Generate an HTML report:

tablediff left.parquet right.parquet -k id -o report.html

Compare S3 files:

tablediff s3://bucket/left.parquet s3://bucket/right.parquet -k id --s3

Python API

from tablediff_arrow import TableDiff

# Create a differ with key columns and tolerances
differ = TableDiff(
    key_columns=['id'],
    tolerance={'amount': 0.01},  # Absolute tolerance
    relative_tolerance={'price': 0.001}  # Relative tolerance (0.1%)
)

# Compare files
result = differ.compare_files('left.parquet', 'right.parquet')

# Print summary
print(result.summary())

# Check if there are differences
if result.has_differences:
    print(f"Found {result.changed_rows} changed rows")
    print(f"Found {result.left_only_rows} rows only in left")
    print(f"Found {result.right_only_rows} rows only in right")

# Generate reports
from tablediff_arrow.reports import generate_html_report, generate_csv_report

generate_html_report(result, 'report.html')
generate_csv_report(result, 'output_dir/', prefix='diff')
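
When the comparison runs inside a pipeline or CI job, it can be convenient to turn the result into a process exit code. The following is a minimal sketch, not part of the library: `diff_exit_code` is a hypothetical helper, and it relies only on the result attributes shown above (`has_differences`, `changed_rows`, `left_only_rows`, `right_only_rows`). A stand-in object is used so the snippet runs without any input files.

```python
from types import SimpleNamespace


def diff_exit_code(result) -> int:
    """Return 0 when the tables match, 1 otherwise, printing a one-line report.

    `result` is assumed to expose the attributes used in the Quick Start:
    has_differences, changed_rows, left_only_rows, right_only_rows.
    """
    if not result.has_differences:
        print("tables match")
        return 0
    print(
        f"changed={result.changed_rows} "
        f"left_only={result.left_only_rows} "
        f"right_only={result.right_only_rows}"
    )
    return 1


# Stand-in for demonstration only; in practice pass the object returned by
# differ.compare_files(...) or differ.compare_tables(...).
demo = SimpleNamespace(
    has_differences=True, changed_rows=1, left_only_rows=0, right_only_rows=2
)
exit_code = diff_exit_code(demo)
```

In a script, the returned value could be passed to `sys.exit` so that a failing comparison stops the job.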

Usage Examples

Multiple Key Columns

Compare tables using composite keys:

tablediff left.parquet right.parquet -k year -k month -k product

The equivalent call from Python:

from tablediff_arrow import TableDiff

differ = TableDiff(key_columns=['year', 'month', 'product'])
result = differ.compare_files('left.parquet', 'right.parquet')

Numeric Tolerances

Use absolute tolerance for monetary values:

tablediff left.csv right.csv -k id -t amount:0.01 -t balance:0.001

Use relative tolerance for percentages:

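tablediff left.csv right.csv -k id -r rate:0.001 -r score:0.01

Both kinds of tolerance can be set together from Python:

from tablediff_arrow import TableDiff

differ = TableDiff(
    key_columns=['id'],
    tolerance={'amount': 0.01, 'balance': 0.001},  # Absolute tolerances
    relative_tolerance={'rate': 0.001, 'score': 0.01}  # Relative tolerances
)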
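
To make the difference between the two kinds of tolerance concrete, here is a small sketch of one common convention for combining them. It is an assumption for illustration, not a statement of how tablediff-arrow computes its checks: the helper `within_tolerance` is hypothetical, and it uses the same symmetric rule as Python's `math.isclose` (the relative bound scales with the larger of the two magnitudes).

```python
def within_tolerance(
    left: float, right: float, abs_tol: float = 0.0, rel_tol: float = 0.0
) -> bool:
    """Return True if two numbers are equal within either tolerance.

    Assumed convention (same as math.isclose):
      |left - right| <= max(abs_tol, rel_tol * max(|left|, |right|))
    """
    diff = abs(left - right)
    if diff <= abs_tol:
        return True
    return diff <= rel_tol * max(abs(left), abs(right))


# Absolute tolerance: a fixed allowance regardless of magnitude.
small_gap_ok = within_tolerance(10.0, 10.005, abs_tol=0.01)
large_gap_ok = within_tolerance(10.0, 10.02, abs_tol=0.01)

# Relative tolerance: the allowance grows with the size of the values.
rel_small_ok = within_tolerance(1000.0, 1000.5, rel_tol=0.001)
rel_large_ok = within_tolerance(1000.0, 1002.0, rel_tol=0.001)
```

Differences that sit exactly on the boundary can fall on either side because of binary floating-point rounding, so it is safer to choose a tolerance slightly larger than the largest difference you intend to accept.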

Working with PyArrow Tables

import pyarrow as pa
from tablediff_arrow import TableDiff

# Create tables directly
left = pa.table({'id': [1, 2, 3], 'value': [10, 20, 30]})
right = pa.table({'id': [1, 2, 3], 'value': [10, 21, 30]})

# Compare
differ = TableDiff(key_columns=['id'])
result = differ.compare_tables(left, right)

print(result.summary())
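
For readers who want a mental model of what a keyed comparison does, the sketch below classifies rows in plain Python. It is illustrative only and does not reflect the library's Arrow-based implementation: `keyed_diff` is a hypothetical function, rows are plain dictionaries, and keys are assumed to be unique within each table. Composite keys work the same way because each key is a tuple of the key column values.

```python
def keyed_diff(left_rows, right_rows, key_columns):
    """Classify rows by key: left-only, right-only, and changed columns.

    Assumes each key occurs at most once per table; a repeated key
    would silently overwrite the earlier row in this sketch.
    """

    def index(rows):
        # Map each key tuple to its row.
        return {tuple(row[k] for k in key_columns): row for row in rows}

    left_index = index(left_rows)
    right_index = index(right_rows)

    left_only = [key for key in left_index if key not in right_index]
    right_only = [key for key in right_index if key not in left_index]

    changed = {}
    for key in left_index.keys() & right_index.keys():
        left_row, right_row = left_index[key], right_index[key]
        # Compare every non-key column present in the left row.
        columns = [
            col
            for col in left_row
            if col not in key_columns and left_row[col] != right_row.get(col)
        ]
        if columns:
            changed[key] = columns
    return left_only, right_only, changed


left = [{"id": 1, "value": 10}, {"id": 2, "value": 20}, {"id": 3, "value": 30}]
right = [{"id": 1, "value": 10}, {"id": 2, "value": 21}, {"id": 3, "value": 30}]

left_only, right_only, changed = keyed_diff(left, right, ["id"])
print(changed)  # {(2,): ['value']}
```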

S3 Files

import s3fs
from tablediff_arrow import TableDiff

# Create S3 filesystem
fs = s3fs.S3FileSystem()

# Compare S3 files
differ = TableDiff(key_columns=['id'])
result = differ.compare_files(
    's3://my-bucket/left.parquet',
    's3://my-bucket/right.parquet',
    filesystem=fs
)

CLI Options

Usage: tablediff [OPTIONS] LEFT RIGHT

  Compare two tables and generate diff reports.

Arguments:
  LEFT   Path to the left/source table file (local or s3://)
  RIGHT  Path to the right/target table file (local or s3://)

Options:
  -k, --key TEXT              Key column(s) for comparison (required, can be
                              specified multiple times)
  -t, --tolerance TEXT        Absolute tolerance for numeric columns
                              (format: column:value)
  -r, --relative-tolerance TEXT
                              Relative tolerance for numeric columns
                              (format: column:value)
  --left-format [parquet|csv|arrow]
                              Format of the left file
  --right-format [parquet|csv|arrow]
                              Format of the right file
  -o, --output TEXT           Output file path for HTML report
  --csv-output PATH           Output directory for CSV reports
  --s3                        Enable S3 filesystem support
  --help                      Show this message and exit.
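
The `column:value` format used by `-t` and `-r` maps naturally onto the dictionaries accepted by the Python API. The sketch below shows one way such strings could be parsed. It is an assumption for illustration, not the CLI's actual parsing code: `parse_tolerances` is a hypothetical helper, and splitting on the last colon (so that column names may themselves contain colons) is a design choice made here.

```python
def parse_tolerances(specs):
    """Turn ['amount:0.01', ...] into {'amount': 0.01, ...}.

    Splits on the last colon so a column name containing ':' still parses.
    Raises ValueError for a missing colon, empty column, or non-numeric value.
    """
    tolerances = {}
    for spec in specs:
        column, sep, value = spec.rpartition(":")
        if not sep or not column:
            raise ValueError(f"expected column:value, got {spec!r}")
        tolerances[column] = float(value)
    return tolerances


parsed = parse_tolerances(["amount:0.01", "balance:0.001"])
print(parsed)  # {'amount': 0.01, 'balance': 0.001}
```

A dictionary produced this way has the same shape as the `tolerance` and `relative_tolerance` arguments shown in the Python API examples.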

Output Reports

HTML Report

The HTML report provides an interactive view of differences:

  • Summary statistics (matched, changed, added, removed rows)
  • Color-coded differences table
  • Separate sections for left-only and right-only rows
  • Change counts per column

CSV Reports

CSV output generates multiple files:

  • {prefix}_summary.csv: Summary statistics
  • {prefix}_changes.csv: Detailed changes with old and new values
  • {prefix}_left_only.csv: Rows only in the left table
  • {prefix}_right_only.csv: Rows only in the right table
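
Downstream scripts often need the paths of these files. The sketch below builds them from the naming pattern listed above. `expected_csv_reports` is a hypothetical helper, not part of the library, and whether all four files are written on every run (for example when a category is empty) is not stated here, so the existence check is left to the caller.

```python
from pathlib import Path

# Suffixes taken from the file names listed above.
REPORT_SUFFIXES = ("summary", "changes", "left_only", "right_only")


def expected_csv_reports(output_dir, prefix="diff"):
    """Map each report kind to the path it would have under output_dir."""
    base = Path(output_dir)
    return {suffix: base / f"{prefix}_{suffix}.csv" for suffix in REPORT_SUFFIXES}


paths = expected_csv_reports("output_dir", prefix="diff")

# Report which of the expected files are not present on disk.
missing = [name for name, path in paths.items() if not path.exists()]
```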

Development

Setup

# Clone the repository
git clone https://github.com/psmman/tablediff-arrow.git
cd tablediff-arrow

# Install with development dependencies
pip install -e ".[dev]"

# Install pre-commit hooks
pre-commit install

Running Tests

# Run all tests
pytest

# Run with coverage
pytest --cov=tablediff_arrow --cov-report=html

# Run specific test file
pytest tests/test_compare.py

Code Quality

# Format code
black src tests

# Lint
ruff check src tests

# Type check
mypy src

Pre-commit Hooks

The project uses pre-commit hooks to ensure code quality:

  • trailing-whitespace: Remove trailing whitespace
  • end-of-file-fixer: Ensure files end with a newline
  • check-yaml/json/toml: Validate config files
  • black: Format Python code
  • ruff: Lint Python code
  • mypy: Type checking

Requirements

  • Python 3.10 or higher
  • pyarrow >= 14.0.0
  • pandas >= 2.0.0
  • click >= 8.0.0
  • jinja2 >= 3.0.0
  • s3fs >= 2023.0.0 (optional, for S3 support)

License

MIT License - see LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
