Stream and compare very large CSV files with multiprocessing.

Project description

csv-stream-diff

csv-stream-diff compares very large CSV files with streaming I/O, hashed bucket partitioning, and multiprocessing. It is designed for datasets that are too large to load fully into memory.
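As a rough illustration of the streaming idea, the sketch below reads a CSV in fixed-size chunks so that only one chunk is held in memory at a time. It is not taken from the package source; `iter_chunks` is a hypothetical helper name, and the in-memory sample stands in for a large file on disk.

```python
import csv
import io
from itertools import islice
from typing import Iterator


def iter_chunks(handle, chunk_size: int) -> Iterator[list]:
    """Yield lists of at most chunk_size rows from an open CSV file handle.

    Hypothetical helper for illustration; not the package's own API.
    """
    reader = csv.DictReader(handle)
    while True:
        # islice pulls only the next chunk_size rows from the reader.
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


# In-memory stand-in for a large file on disk.
sample = io.StringIO("id,amount\n1,10\n2,20\n3,30\n4,40\n5,50\n")
chunk_sizes = [len(chunk) for chunk in iter_chunks(sample, chunk_size=2)]
print(chunk_sizes)
```

With a real file, the handle would typically come from `open(path, newline="", encoding=...)`, as the `csv` module documentation recommends.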

Features

  • Compare CSVs by configurable key columns, even when left and right headers differ
  • Stream files in chunks with configurable chunk_size
  • Partition by stable hashed key to keep worker memory bounded
  • Use all CPUs by default, or set a worker count explicitly
  • Write machine-usable output artifacts for left-only rows, right-only rows, row-level differences, duplicate keys, and a run summary
  • Render a compact terminal summary table at the end of the run instead of dumping raw JSON
  • Support exact random sampling for validation runs with sampling.size > 0
  • Warn on duplicate keys and continue using the first occurrence per key
  • Support per-column numeric tolerances for jitter-prone fields such as FXRate
  • Support Ctrl+C cancellation with pool termination and temp cleanup
  • Include a fixture generator and both pytest and behave tests
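The "stable hashed key" feature can be pictured with a minimal sketch like the one below. It assumes a BLAKE2b digest and a hypothetical `bucket_for_key` helper; the package may use a different hash or key encoding. The point is that the same key always lands in the same bucket on both sides and in every worker process, which Python's built-in `hash()` does not guarantee for strings because it is salted per process by default.

```python
import hashlib


def bucket_for_key(key_values: tuple, bucket_count: int) -> int:
    """Map a composite key to a bucket index that is stable across processes.

    Hypothetical helper; the hash choice and separator are assumptions.
    """
    # Join with a unit-separator character so ("a", "bc") and ("ab", "c")
    # do not produce the same byte string.
    joined = "\x1f".join(key_values).encode("utf-8")
    digest = hashlib.blake2b(joined, digest_size=8).digest()
    return int.from_bytes(digest, "big") % bucket_count


left_bucket = bucket_for_key(("ACC-1", "2024-01-31"), bucket_count=64)
right_bucket = bucket_for_key(("ACC-1", "2024-01-31"), bucket_count=64)
print(left_bucket == right_bucket)
```

Because matching keys from the left and right files share a bucket, each worker only needs to hold one bucket's rows at a time.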

Installation

pip install csv-stream-diff

For local development:

poetry install

CLI

csv-stream-diff --config config.yaml

Optional overrides:

csv-stream-diff \
  --config config.yaml \
  --left-file ./left.csv \
  --right-file ./right.csv \
  --chunk-size 100000 \
  --sample-size 100000 \
  --sample-seed 20260321 \
  --workers 8 \
  --output-dir ./output \
  --output-prefix run_

The YAML config is the default source of truth. CLI flags override it for a single run.
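The precedence rule can be sketched as a small merge step. This is an illustration only: `apply_overrides` is a hypothetical function, and the mapping of flags such as --left-file to config keys such as files.left is an assumption rather than something the package documents here.

```python
from copy import deepcopy


def apply_overrides(config: dict, overrides: dict) -> dict:
    """Return a copy of config with non-None dotted-path overrides applied.

    Hypothetical helper; the flag-to-key mapping is assumed.
    """
    merged = deepcopy(config)
    for dotted_path, value in overrides.items():
        if value is None:  # flag was not given on the command line
            continue
        node = merged
        *parents, leaf = dotted_path.split(".")
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return merged


yaml_config = {"files": {"left": "a.csv", "right": "b.csv"}, "sampling": {"size": 0}}
cli_overrides = {"files.left": "./left.csv", "files.right": None, "sampling.size": 100000}
effective = apply_overrides(yaml_config, cli_overrides)
print(effective)
```

The original config dict is left untouched, which matches the idea that overrides apply to a single run only.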

Configuration

See config.example.yaml for a full example.

Main sections:

  • files.left, files.right: input CSV paths
  • csv.left, csv.right: dialect and encoding settings
  • keys.left, keys.right: key columns used to match rows
  • compare.left, compare.right: value columns to compare
  • comparison: normalization options
  • comparison.column_tolerances: per-column numeric tolerances keyed by left column name, right column name, or left/right pair
  • sampling: size: 0 runs a full comparison; any positive size draws an exact random sample of left-side unique keys using a fixed seed
  • performance: chunking, worker count, bucket count, temp directory, progress reporting
  • output: output directory, filename prefix, whether to include serialized full rows once per differing key, and whether to write a text summary

Output Files

The tool writes these artifacts to output.directory:

  • <prefix>only_in_left.csv
  • <prefix>only_in_right.csv
  • <prefix>differences.csv
  • <prefix>duplicate_keys.csv
  • <prefix>summary.json
  • <prefix>summary.txt when output.summary_format is text or both

summary.json includes both raw counts and a formatted different_rows_percentage so you can track improvement run to run.

differences.csv contains one row per differing key with:

  • difference_count
  • differences_text
  • normalized_differences_text when output.include_normalized_values is enabled
  • differences_json

differences_json contains the field-level left/right mismatches for that key. This keeps the diff output far smaller than writing one CSV row per changed field.

When output.include_normalized_values is enabled, each item in differences_json also includes normalized_left_value and normalized_right_value. This is useful for diagnosing cases where raw source values differ but the configured normalization rules should make them compare equal or nearly equal.
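One way to consume differences.csv downstream is to parse differences_json per row and count mismatches by column. The sketch below builds a small in-memory sample rather than reading real output. The item field names "column", "left_value", and "right_value" are assumptions made for illustration; check a real output file for the actual names.

```python
import csv
import io
import json
from collections import Counter

# Hypothetical sample of differences.csv; the per-item field names
# ("column", "left_value", "right_value") are assumed, not documented.
sample = io.StringIO()
writer = csv.DictWriter(sample, fieldnames=["difference_count", "differences_json"])
writer.writeheader()
writer.writerow({
    "difference_count": 2,
    "differences_json": json.dumps([
        {"column": "FXRate", "left_value": "175", "right_value": "181"},
        {"column": "Currency", "left_value": "USD", "right_value": "EUR"},
    ]),
})
writer.writerow({
    "difference_count": 1,
    "differences_json": json.dumps([
        {"column": "FXRate", "left_value": "1.10", "right_value": "1.25"},
    ]),
})
sample.seek(0)

# Count how often each column appears among the field-level mismatches.
mismatches_per_column = Counter()
for row in csv.DictReader(sample):
    for item in json.loads(row["differences_json"]):
        mismatches_per_column[item["column"]] += 1

print(mismatches_per_column.most_common())
```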

Sampling

  • sampling.size: 0 runs the full comparison.
  • sampling.size > 0 selects an exact random sample of left-side unique keys using reservoir sampling.
  • Sampling is reproducible when sampling.seed stays the same.
  • Duplicate keys do not expand the sampling population because only the first occurrence per key is considered.
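The sampling rules above can be sketched with the classic reservoir technique (Algorithm R) applied to first occurrences of keys. This is a minimal illustration, not the package's implementation: `reservoir_sample_keys` is a hypothetical name, and the in-memory `seen` set is a simplification that a tool built for very large files would need to replace with something bounded.

```python
import random
from typing import Hashable, Iterable


def reservoir_sample_keys(keys: Iterable[Hashable], size: int, seed: int) -> list:
    """Pick min(size, unique key count) keys uniformly at random in one pass.

    Hypothetical helper; repeated keys are skipped so they do not
    enlarge the sampling population.
    """
    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    seen = set()
    reservoir = []
    unique_count = 0
    for key in keys:
        if key in seen:
            continue
        seen.add(key)
        unique_count += 1
        if len(reservoir) < size:
            reservoir.append(key)
        else:
            # Replace an existing entry with probability size / unique_count.
            slot = rng.randrange(unique_count)
            if slot < size:
                reservoir[slot] = key
    return reservoir


keys = ["k1", "k2", "k2", "k3", "k4", "k1", "k5"]
sample_a = reservoir_sample_keys(keys, size=3, seed=20260321)
sample_b = reservoir_sample_keys(keys, size=3, seed=20260321)
print(sample_a == sample_b)
```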

Value Normalization

By default, the comparison tolerates equivalent values that are often formatted differently in CSV exports:

  • NULL and empty string can be treated as equal
  • 0 and 0.000000000 can be treated as equal for numeric-looking values
  • NULL and 0 can be treated as equal for numeric-looking values

These behaviors are controlled in the comparison section:

  • ignore_case_in_strings or case_insensitive
  • treat_null_as_equal
  • normalize_numeric_values
  • treat_null_as_zero_for_numeric
  • numeric_decimal_places
  • numeric_tolerance
  • column_tolerances
  • normalize_boolean_values

Examples:

  • NULL, "", and " " can be treated as equal
  • "USD" and "usd" can compare equal when ignore_case_in_strings is enabled
  • 14.3553 and 14.355344355 can compare equal with numeric_decimal_places: 4
  • 1.14725 and 1.14724961 can compare equal with numeric_tolerance: 0.0001
  • 175 and 180 can compare equal for FXRate when column_tolerances.FXRate: 5
  • 1 and True can compare equal when normalize_boolean_values is enabled
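The examples above can be reproduced with a small comparison function. This is a sketch under stated assumptions, not the package's code: `values_equal` and its parameters are hypothetical stand-ins for the config options, the tolerance check is assumed to be inclusive (consistent with 175 and 180 matching at a tolerance of 5), and boolean normalization is left out.

```python
from decimal import Decimal, InvalidOperation
from typing import Optional

NULL_TOKENS = {"", "null"}  # assumed spellings of "no value"


def is_null(value: Optional[str]) -> bool:
    return value is None or value.strip().lower() in NULL_TOKENS


def to_decimal(text: str) -> Optional[Decimal]:
    """Parse numeric-looking text; return None for anything else."""
    try:
        value = Decimal(text.strip())
    except InvalidOperation:
        return None
    return value if value.is_finite() else None


def values_equal(left, right, *, ignore_case=False,
                 decimal_places=None, tolerance=Decimal("0")) -> bool:
    """Hypothetical comparison mirroring the documented options."""
    if is_null(left) and is_null(right):
        return True
    if is_null(left) or is_null(right):
        return False
    left_num, right_num = to_decimal(left), to_decimal(right)
    if left_num is not None and right_num is not None:
        if decimal_places is not None:
            left_num = round(left_num, decimal_places)
            right_num = round(right_num, decimal_places)
        return abs(left_num - right_num) <= tolerance  # inclusive bound
    if ignore_case:
        return left.strip().lower() == right.strip().lower()
    return left.strip() == right.strip()


print(values_equal("14.3553", "14.355344355", decimal_places=4))
print(values_equal("175", "180", tolerance=Decimal("5")))
```

Using `Decimal` rather than `float` avoids binary rounding surprises when values sit exactly on a tolerance boundary.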

Duplicate Keys

Duplicate keys do not stop the run. They are written to duplicate_keys.csv and counted in the summary, and the main comparison uses the first occurrence of each key on each side.
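The first-occurrence rule can be sketched as follows. `index_first_occurrences` is a hypothetical helper written for illustration; it keeps the first row seen for each key and collects later rows with the same key separately, which is roughly the split between the main comparison and duplicate_keys.csv described above.

```python
def index_first_occurrences(rows, key_columns):
    """Split rows into first occurrence per key and later duplicates.

    Hypothetical helper; not the package's own API.
    """
    first_by_key = {}
    duplicates = []
    for row in rows:
        key = tuple(row[column] for column in key_columns)
        if key in first_by_key:
            duplicates.append(row)  # reported, but not compared
        else:
            first_by_key[key] = row
    return first_by_key, duplicates


rows = [
    {"id": "1", "amount": "10"},
    {"id": "2", "amount": "20"},
    {"id": "1", "amount": "99"},  # duplicate key, ignored by the comparison
]
first_by_key, duplicates = index_first_occurrences(rows, key_columns=["id"])
print(first_by_key[("1",)]["amount"], len(duplicates))
```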

Generator

The generator creates two baseline-identical CSVs, applies controlled mutations, writes a matching config, and saves an expected manifest:

python generator/generate_fixtures.py --output-dir ./generated --rows 10000 --seed 42

Generated artifacts:

  • left.csv
  • right.csv
  • config.generated.yaml
  • expected.json

Tests

Run unit tests:

poetry run pytest

Run BDD acceptance tests:

poetry run behave tests/features

Run a package build:

poetry build

PyPI Packaging

Build source and wheel distributions:

poetry build

Upload after verifying artifacts:

poetry publish

Download files

Source Distribution

csv_stream_diff-0.2.8.tar.gz (29.5 kB view details)

Uploaded Source

Built Distribution

csv_stream_diff-0.2.8-py3-none-any.whl (21.1 kB view details)

Uploaded Python 3

File details

Details for the file csv_stream_diff-0.2.8.tar.gz.

File metadata

  • Download URL: csv_stream_diff-0.2.8.tar.gz
  • Upload date:
  • Size: 29.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.5 CPython/3.12.7 Windows/11

File hashes

Hashes for csv_stream_diff-0.2.8.tar.gz
Algorithm Hash digest
SHA256 2a39dec3173cb1f20cd8a8c0548825433089e99cb9961371f5aae9ccb40a9d07
MD5 d276c1b24deb552838a6482c9685ca99
BLAKE2b-256 8a4f1444982f270c4e0763504696de73bfdd689fa0203274abee853b427ff38a

File details

Details for the file csv_stream_diff-0.2.8-py3-none-any.whl.

File metadata

  • Download URL: csv_stream_diff-0.2.8-py3-none-any.whl
  • Upload date:
  • Size: 21.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.5 CPython/3.12.7 Windows/11

File hashes

Hashes for csv_stream_diff-0.2.8-py3-none-any.whl
Algorithm Hash digest
SHA256 4e40c293a0b53aa60d71d57921b782df1a7a89b739021dc2fdb24565dbbc7001
MD5 3e3b852b02fac7b85c18b78a54d543b8
BLAKE2b-256 52b9d0cf7356ff367e2c4b86d19a7c0250a7f8e08cdb1107d9ddc65cbd13591a
