Stream and compare very large CSV files with multiprocessing.
csv-stream-diff
csv-stream-diff compares very large CSV files with streaming I/O, hashed bucket partitioning, and multiprocessing. It is designed for datasets that are too large to load fully into memory.
Features
- Compare CSVs by configurable key columns, even when left and right headers differ
- Stream files in chunks with configurable chunk_size
- Partition by stable hashed key to keep worker memory bounded
- Use all CPUs by default, or set a worker count explicitly
- Write machine-usable output artifacts for left-only, right-only, row-level differences, duplicate keys, and run summary
- Render a compact terminal summary table at the end of the run instead of dumping raw JSON
- Support exact random sampling for validation runs with sampling.size > 0
- Warn on duplicate keys and continue using the first occurrence per key
- Support per-column numeric tolerances for jitter-prone fields such as FXRate
- Support Ctrl+C cancellation with pool termination and temp cleanup
- Include a fixture generator and both pytest and behave tests
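To illustrate what partitioning by a stable hashed key can look like, here is a minimal sketch. It is not the package's actual code: the function name bucket_for_key, the BLAKE2b hash, and the separator byte are all assumptions made for illustration. The point is that the same key always lands in the same bucket, on either side and in any worker process, which Python's built-in hash() does not guarantee for strings because it is salted per process.

```python
import hashlib

def bucket_for_key(key_values, bucket_count):
    """Map a row key (tuple of strings) to a bucket index in [0, bucket_count).

    Hypothetical helper, not part of csv-stream-diff's API.
    """
    # Join with a unit-separator byte so ("a", "bc") and ("ab", "c") differ.
    joined = "\x1f".join(key_values).encode("utf-8")
    # A cryptographic digest is deterministic across processes and runs.
    digest = hashlib.blake2b(joined, digest_size=8).digest()
    return int.from_bytes(digest, "big") % bucket_count

rows = [("1001", "EUR"), ("1002", "USD"), ("1001", "EUR")]
buckets = [bucket_for_key(r, 16) for r in rows]
```

Rows with equal keys from the left and right files end up in the same bucket, so each worker only needs to hold one bucket's rows at a time.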
Installation
pip install csv-stream-diff
For local development:
poetry install
CLI
csv-stream-diff --config config.yaml
Optional overrides:
csv-stream-diff \
--config config.yaml \
--left-file ./left.csv \
--right-file ./right.csv \
--chunk-size 100000 \
--sample-size 100000 \
--sample-seed 20260321 \
--workers 8 \
--output-dir ./output \
--output-prefix run_
The YAML config is the default source of truth. CLI flags override it for a single run.
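The precedence rule can be pictured as a small merge step. The sketch below is an assumption about how such a merge might work, not the tool's implementation: apply_overrides is a hypothetical helper, and the dotted paths performance.chunk_size and performance.workers are guesses at where the CLI flags would land in the config.

```python
import copy

def apply_overrides(config, overrides):
    """Return a copy of config with non-None dotted-path overrides applied.

    Hypothetical helper; the key paths used below are illustrative guesses.
    """
    merged = copy.deepcopy(config)  # leave the loaded YAML config untouched
    for dotted_path, value in overrides.items():
        if value is None:  # flag was not given on the command line
            continue
        node = merged
        *parents, leaf = dotted_path.split(".")
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return merged

yaml_config = {"performance": {"chunk_size": 50000}, "sampling": {"size": 0}}
cli = {"performance.chunk_size": 100000, "performance.workers": 8, "sampling.size": None}
effective = apply_overrides(yaml_config, cli)
```

Because the merge works on a copy, the values from the YAML file stay unchanged for the next run.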
Configuration
See config.example.yaml for a full example.
Main sections:
- files.left, files.right: input CSV paths
- csv.left, csv.right: dialect and encoding settings
- keys.left, keys.right: key columns used to match rows
- compare.left, compare.right: value columns to compare
- comparison: normalization options
- comparison.column_tolerances: per-column numeric tolerances keyed by left column name, right column name, or left/right pair
- sampling: size: 0 means full comparison; any positive value means exact random sample by left-side unique key with a fixed seed
- performance: chunking, worker count, bucket count, temp directory, progress reporting
- output: output directory, filename prefix, whether to include serialized full rows once per differing key, and whether to write a text summary
Output Files
The tool writes these artifacts to output.directory:
- <prefix>only_in_left.csv
- <prefix>only_in_right.csv
- <prefix>differences.csv
- <prefix>duplicate_keys.csv
- <prefix>summary.json
- <prefix>summary.txt when output.summary_format is text or both
summary.json includes both raw counts and a formatted different_rows_percentage so you can track improvement run to run.
differences.csv contains one row per differing key with:
- difference_count
- differences_text
- normalized_differences_text when output.include_normalized_values is enabled
- differences_json
differences_json contains the field-level left/right mismatches for that key. This keeps the diff output far smaller than writing one CSV row per changed field.
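A downstream script might read differences.csv roughly as follows. This is a sketch under assumptions: the column names difference_count and differences_json come from the description above, but the key column id and the inner shape of each JSON item (column, left, right) are invented here for the sample data and may not match the real output.

```python
import csv
import io
import json

def iter_field_diffs(csv_stream):
    """Yield (row, parsed differences_json) for each differing key."""
    for row in csv.DictReader(csv_stream):
        yield row, json.loads(row["differences_json"])

# Build a tiny in-memory sample; csv.DictWriter handles quoting of the JSON cell.
sample = io.StringIO()
writer = csv.DictWriter(sample, fieldnames=["id", "difference_count", "differences_json"])
writer.writeheader()
writer.writerow({
    "id": "1001",  # hypothetical key column
    "difference_count": "1",
    # Hypothetical item shape; inspect real output for the actual field names.
    "differences_json": json.dumps([{"column": "FXRate", "left": "175", "right": "180"}]),
})
sample.seek(0)

parsed = list(iter_field_diffs(sample))
```

For a real run, replace the in-memory sample with open(path, newline="", encoding="utf-8") on the generated file.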
When output.include_normalized_values is enabled, each item in differences_json also includes
normalized_left_value and normalized_right_value. This is useful for diagnosing cases where
raw source values differ but the configured normalization rules should make them compare equal or
nearly equal.
Sampling
- sampling.size: 0 runs the full comparison.
- sampling.size > 0 selects an exact random sample of left-side unique keys using reservoir sampling.
- Sampling is reproducible when sampling.seed stays the same.
- Duplicate keys do not expand the sampling population because only the first occurrence per key is considered.
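The behavior described above matches classic reservoir sampling (Algorithm R) applied to first occurrences only. The sketch below shows that technique in isolation; reservoir_sample_keys is a hypothetical name, and the in-memory seen set is a simplification that the real tool would not need if it deduplicates per bucket.

```python
import random

def reservoir_sample_keys(keys, size, seed):
    """Pick `size` unique keys uniformly at random from a stream, reproducibly."""
    rng = random.Random(seed)  # fixed seed makes the sample repeatable
    seen = set()               # simplification: tracks every unique key in memory
    reservoir = []
    unique_count = 0
    for key in keys:
        if key in seen:        # duplicates never enter the sampling population
            continue
        seen.add(key)
        unique_count += 1
        if len(reservoir) < size:
            reservoir.append(key)
        else:
            # Replace a random slot with probability size / unique_count.
            j = rng.randrange(unique_count)
            if j < size:
                reservoir[j] = key
    return reservoir

keys = [str(i % 1000) for i in range(3000)]  # 1000 unique keys, each repeated 3 times
first = reservoir_sample_keys(keys, 50, seed=20260321)
second = reservoir_sample_keys(keys, 50, seed=20260321)
```

When the stream has fewer unique keys than the requested size, the function simply returns all of them.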
Value Normalization
By default the comparison tolerates equivalent values that often appear differently in CSV exports:
- NULL and empty string can be treated as equal
- 0 and 0.000000000 can be treated as equal for numeric-looking values
- NULL and 0 can be treated as equal for numeric-looking values
These behaviors are controlled in the comparison section:
- treat_null_as_equal
- normalize_numeric_values
- treat_null_as_zero_for_numeric
- numeric_decimal_places
- numeric_tolerance
- column_tolerances
- normalize_boolean_values
Examples:
- NULL, "", and " " can be treated as equal
- 14.3553 and 14.355344355 can compare equal with numeric_decimal_places: 4
- 1.14725 and 1.14724961 can compare equal with numeric_tolerance: 0.0001
- 175 and 180 can compare equal for FXRate when column_tolerances.FXRate: 5
- 1 and True can compare equal when normalize_boolean_values is enabled
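The examples above can be reproduced with a small comparison function. This is a sketch of the general idea, not the package's code: values_equal, its parameter names, the token sets, and the inclusive tolerance check are all assumptions. It also leaves out the NULL-as-zero rule for brevity.

```python
from decimal import Decimal, InvalidOperation

NULL_TOKENS = {"", "NULL"}
TRUE_TOKENS = {"1", "true"}
FALSE_TOKENS = {"0", "false"}

def _to_decimal(text):
    """Parse a finite decimal number, or return None for non-numeric text."""
    try:
        value = Decimal(text)
    except InvalidOperation:
        return None
    return value if value.is_finite() else None

def values_equal(left, right, decimal_places=None, tolerance=Decimal("0"),
                 null_equal=True, booleans=False):
    """Hypothetical comparison helper mirroring the normalization options."""
    left, right = left.strip(), right.strip()
    if null_equal and left.upper() in NULL_TOKENS and right.upper() in NULL_TOKENS:
        return True
    if booleans:
        l, r = left.lower(), right.lower()
        if (l in TRUE_TOKENS and r in TRUE_TOKENS) or (l in FALSE_TOKENS and r in FALSE_TOKENS):
            return True
    l_num, r_num = _to_decimal(left), _to_decimal(right)
    if l_num is not None and r_num is not None:
        if decimal_places is not None:
            # Round both sides to the same number of decimal places first.
            quantum = Decimal(1).scaleb(-decimal_places)
            l_num, r_num = l_num.quantize(quantum), r_num.quantize(quantum)
        return abs(l_num - r_num) <= tolerance  # inclusive bound (assumption)
    return left == right

rounded = values_equal("14.3553", "14.355344355", decimal_places=4)
within_tolerance = values_equal("1.14725", "1.14724961", tolerance=Decimal("0.0001"))
fx_rate = values_equal("175", "180", tolerance=Decimal("5"))
```

Using Decimal rather than float avoids binary rounding surprises when values such as 0 and 0.000000000 are compared.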
Duplicate Keys
Duplicate keys do not stop the run. They are written to duplicate_keys.csv, counted in the summary, and the main comparison uses the first occurrence of each key on each side.
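A first-occurrence policy like this one can be sketched in a few lines. The helper name index_first_occurrences and the sample column names are hypothetical; the real tool applies the same idea per side while streaming.

```python
def index_first_occurrences(rows, key_columns):
    """Split rows into a first-occurrence index and a list of later duplicates."""
    first_by_key = {}
    duplicates = []
    for row in rows:
        key = tuple(row[column] for column in key_columns)
        if key in first_by_key:
            duplicates.append(row)   # reported, but not used for comparison
        else:
            first_by_key[key] = row  # only the first row per key is compared
    return first_by_key, duplicates

rows = [
    {"id": "1", "FXRate": "175"},
    {"id": "2", "FXRate": "90"},
    {"id": "1", "FXRate": "180"},  # duplicate of id 1, appears later
]
first_by_key, duplicates = index_first_occurrences(rows, ["id"])
```

Here the comparison would use FXRate 175 for id 1, and the later row would only appear in the duplicate report.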
Generator
The generator creates two baseline-identical CSVs, applies controlled mutations, writes a matching config, and saves an expected manifest:
python generator/generate_fixtures.py --output-dir ./generated --rows 10000 --seed 42
Generated artifacts:
- left.csv
- right.csv
- config.generated.yaml
- expected.json
Tests
Run unit tests:
poetry run pytest
Run BDD acceptance tests:
poetry run behave tests/features
Run a package build:
poetry build
PyPI Packaging
Build source and wheel distributions:
poetry build
Upload after verifying artifacts:
poetry publish
File details
Details for the file csv_stream_diff-0.2.7.tar.gz.
File metadata
- Download URL: csv_stream_diff-0.2.7.tar.gz
- Upload date:
- Size: 29.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.8.5 CPython/3.12.7 Windows/11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8e8947fcc1cd3803b283d1e2373547eba3995a2183b849a88dea582f8bf94412 |
| MD5 | aa9ff4b7e338800d8b82d10fc82d9194 |
| BLAKE2b-256 | 3a9ea6344719c8456db037080db4ff252f25293e94777c2e579425d04d7c28c5 |
File details
Details for the file csv_stream_diff-0.2.7-py3-none-any.whl.
File metadata
- Download URL: csv_stream_diff-0.2.7-py3-none-any.whl
- Upload date:
- Size: 21.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.8.5 CPython/3.12.7 Windows/11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b3d6b3967ca13dc9db6f0ec6cb49d7bfa876f34e5e2fa94739996fd031619f60 |
| MD5 | 4eb205c751e8630001d333c918d4f635 |
| BLAKE2b-256 | a010a87b7c1d61faec789ac4a195b1710257f2729c8c47594b459c345e80d374 |