Stream and compare very large CSV files with multiprocessing.

csv-stream-diff

csv-stream-diff compares very large CSV files with streaming I/O, hashed bucket partitioning, and multiprocessing. It is designed for datasets that are too large to load fully into memory.

Features

  • Compare CSVs by configurable key columns, even when left and right headers differ
  • Stream files in chunks with configurable chunk_size
  • Partition by stable hashed key to keep worker memory bounded
  • Use all CPUs by default, or set a worker count explicitly
  • Write machine-usable output artifacts for left-only, right-only, cell differences, duplicate keys, and run summary
  • Support exact random sampling for validation runs with sampling.size > 0
  • Warn on duplicate keys and continue using the first occurrence per key
  • Include a fixture generator and both pytest and behave tests
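The partitioning idea in the feature list can be illustrated with a short sketch. It is not the package's actual code: the function name `bucket_for_key`, the separator, and the choice of BLAKE2b are assumptions made for illustration. The point is that Python's built-in `hash()` is randomized per process for strings, so a digest from `hashlib` is used to give every worker the same answer for the same key.

```python
import hashlib


def bucket_for_key(key_values, bucket_count):
    """Map a composite key to a bucket index that is stable across processes."""
    # Join key parts with the ASCII unit separator (an assumed choice) so that
    # ("ab", "c") and ("a", "bc") do not collide by concatenation.
    joined = "\x1f".join(key_values).encode("utf-8")
    digest = hashlib.blake2b(joined, digest_size=8).digest()
    return int.from_bytes(digest, "big") % bucket_count


# The same key lands in the same bucket whether it comes from the left or the
# right file, so each worker only needs to hold one bucket's rows at a time.
left_bucket = bucket_for_key(["1001", "EU"], bucket_count=64)
right_bucket = bucket_for_key(["1001", "EU"], bucket_count=64)
```

With a scheme like this, a larger bucket count means smaller buckets and lower peak memory per worker, at the cost of more temporary files.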

Installation

pip install csv-stream-diff

For local development:

poetry install

CLI

csv-stream-diff --config config.yaml

Optional overrides:

csv-stream-diff \
  --config config.yaml \
  --left-file ./left.csv \
  --right-file ./right.csv \
  --chunk-size 100000 \
  --sample-size 100000 \
  --sample-seed 20260321 \
  --workers 8 \
  --output-dir ./output \
  --output-prefix run_

The YAML config is the default source of truth. CLI flags override it for a single run.
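As a rough illustration of that precedence rule, the sketch below merges overrides into a config dictionary. The helper `apply_overrides` is hypothetical, and the mapping from flags to dotted config paths (for example `--left-file` to `files.left`) is an assumption rather than documented behavior.

```python
import copy


def apply_overrides(config, overrides):
    """Return a copy of config with non-None overrides applied at dotted paths."""
    merged = copy.deepcopy(config)
    for dotted_path, value in overrides.items():
        if value is None:
            # Flag was not given on the command line: keep the YAML value.
            continue
        *parents, leaf = dotted_path.split(".")
        node = merged
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return merged


# Stand-in for a parsed config.yaml (values are illustrative only).
yaml_config = {
    "files": {"left": "a.csv", "right": "b.csv"},
    "sampling": {"size": 0, "seed": 1},
}
# Stand-in for parsed CLI flags; None means "flag not supplied".
cli_overrides = {
    "files.left": "./left.csv",
    "sampling.size": 100000,
    "sampling.seed": None,
}
effective = apply_overrides(yaml_config, cli_overrides)
```

The original dictionary is left untouched, which matches the idea that overrides apply to a single run only.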

Configuration

See config.example.yaml for a full example.

Main sections:

  • files.left, files.right: input CSV paths
  • csv.left, csv.right: dialect and encoding settings
  • keys.left, keys.right: key columns used to match rows
  • compare.left, compare.right: value columns to compare
  • comparison: normalization options
  • sampling: sample size and seed; size: 0 runs the full comparison, and any positive value draws an exact random sample of left-side unique keys with a fixed seed
  • performance: chunking, worker count, bucket count, temp directory, progress reporting
  • output: output directory, filename prefix, whether to include serialized full rows, and whether to write a text summary
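To make the roles of the chunk size and the per-side key columns more concrete, here is a minimal sketch using only the standard library. The names `iter_chunks` and `key_of` are hypothetical and the package may read files differently; the sketch only shows how rows can be consumed in bounded batches and how differently named key columns can still produce the same match key.

```python
import csv
import io
from itertools import islice


def iter_chunks(handle, chunk_size):
    """Yield lists of at most chunk_size row dicts without reading the whole file."""
    reader = csv.DictReader(handle)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


def key_of(row, key_columns):
    """Build the match key from the configured key columns for one side."""
    return tuple(row[name] for name in key_columns)


# In-memory stand-in for a file; a real file would be opened with newline="".
left_data = io.StringIO("id,name\n1,a\n2,b\n3,c\n")
chunks = list(iter_chunks(left_data, chunk_size=2))
chunk_sizes = [len(chunk) for chunk in chunks]

# Headers differ between sides, but the resulting keys are comparable.
left_key = key_of({"id": "1", "name": "a"}, ["id"])
right_key = key_of({"customer_id": "1", "full_name": "a"}, ["customer_id"])
```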

Output Files

The tool writes these artifacts to output.directory:

  • <prefix>only_in_left.csv
  • <prefix>only_in_right.csv
  • <prefix>differences.csv
  • <prefix>duplicate_keys.csv
  • <prefix>summary.json
  • <prefix>summary.txt when output.summary_format is text or both

differences.csv contains one row per differing cell with both the left and right column names and values.
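The one-row-per-cell layout can be pictured with the following sketch. The record field names are hypothetical (the real column names of differences.csv are not listed here), and pairing compare columns by position between the two sides is an assumption.

```python
def cell_differences(key, left_row, right_row, left_columns, right_columns):
    """Yield one record per differing cell for a pair of matched rows."""
    # Assumption: the i-th left compare column corresponds to the i-th right one.
    for left_name, right_name in zip(left_columns, right_columns):
        left_value = left_row[left_name]
        right_value = right_row[right_name]
        if left_value != right_value:
            yield {
                "key": key,
                "left_column": left_name,
                "right_column": right_name,
                "left_value": left_value,
                "right_value": right_value,
            }


diffs = list(
    cell_differences(
        ("1001",),
        {"name": "Ada", "city": "Paris"},
        {"full_name": "Ada", "town": "Lyon"},
        left_columns=["name", "city"],
        right_columns=["full_name", "town"],
    )
)
```

Here only the city/town pair differs, so a single record is produced even though two columns were compared.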

Sampling

  • sampling.size: 0 runs the full comparison.
  • sampling.size > 0 selects an exact random sample of left-side unique keys using reservoir sampling.
  • Sampling is reproducible when sampling.seed stays the same.
  • Duplicate keys do not expand the sampling population because only the first occurrence per key is considered.
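The sampling rules above can be sketched with the classic reservoir algorithm (Algorithm R). This is an illustration under assumptions, not the package's implementation: `reservoir_sample_keys` is a hypothetical name, and the in-memory `seen` set is a simplification that a tool built for very large files would likely replace with something bounded.

```python
import random


def reservoir_sample_keys(keys, size, seed):
    """Pick `size` unique keys uniformly at random in one pass over `keys`."""
    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    seen = set()               # simplification: tracks every key in memory
    reservoir = []
    unique_count = 0
    for key in keys:
        if key in seen:
            continue  # later duplicates do not enlarge the population
        seen.add(key)
        unique_count += 1
        if len(reservoir) < size:
            reservoir.append(key)
        else:
            # Keep the new key with probability size / unique_count.
            slot = rng.randrange(unique_count)
            if slot < size:
                reservoir[slot] = key
    return reservoir


keys = ["a", "b", "a", "c", "d", "e", "b", "f"]
first_run = reservoir_sample_keys(keys, size=3, seed=20260321)
second_run = reservoir_sample_keys(keys, size=3, seed=20260321)
everything = reservoir_sample_keys(keys, size=10, seed=1)
```

Running with the same seed yields the same sample, and asking for more keys than exist simply returns every unique key in first-seen order.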

Duplicate Keys

Duplicate keys do not stop the run. They are written to duplicate_keys.csv and counted in the summary, and the main comparison uses the first occurrence of each key on each side.
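A first-occurrence-wins policy like this one can be sketched as follows. The helper name `split_duplicates` is hypothetical, and the sketch processes a single side in memory purely for illustration.

```python
def split_duplicates(rows, key_columns):
    """Separate first occurrences (used for comparison) from later duplicates."""
    first_by_key = {}
    duplicates = []
    for row in rows:
        key = tuple(row[name] for name in key_columns)
        if key in first_by_key:
            duplicates.append(row)   # would be reported, not compared
        else:
            first_by_key[key] = row  # first occurrence is the one compared
    return first_by_key, duplicates


rows = [
    {"id": "1", "value": "x"},
    {"id": "2", "value": "y"},
    {"id": "1", "value": "z"},
]
kept, duplicate_rows = split_duplicates(rows, key_columns=["id"])
```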

Generator

The generator creates two identical baseline CSVs, applies controlled mutations, writes a matching config, and saves a manifest of the expected results:

python generator/generate_fixtures.py --output-dir ./generated --rows 10000 --seed 42

Generated artifacts:

  • left.csv
  • right.csv
  • config.generated.yaml
  • expected.json

Tests

Run unit tests:

poetry run pytest

Run BDD acceptance tests:

poetry run behave tests/features

Run a package build:

poetry build

PyPI Packaging

Build source and wheel distributions:

poetry build

Upload after verifying artifacts:

poetry publish
