
Stable exact comparison engine foundation for Python and the UniqTools ecosystem.


uniqdiff

uniqdiff is a stable comparison engine foundation for Python projects and the UniqTools ecosystem. It compares datasets, files, streams, and connector-backed sources, then returns exact unique differences, intersections, duplicates, result metadata, and comparison statistics.

Its purpose is to provide stable exact comparison semantics, token extraction, backends, result objects, lazy result readers, connectors, and a direct CLI. Product-layer features such as reports, schema validation, data quality rules, dashboards, and workflow orchestration belong in higher-level UniqTools packages.

Installation

pip install uniqdiff

Install optional Parquet support:

pip install "uniqdiff[parquet]"

For local development:

pip install -e ".[dev]"

Quick Start

from uniqdiff import compare

result = compare([1, 2, 3], [3, 4, 5], include_common=True)

print(result.only_in_first)   # [1, 2]
print(result.only_in_second)  # [4, 5]
print(result.common)          # [3]
print(result.unique)          # [1, 2, 4, 5]

Compare Dictionaries By Key

from uniqdiff import compare_by_key

old = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]
new = [{"id": 2, "name": "Bob"}, {"id": 3, "name": "Cara"}]

result = compare_by_key(old, new, key="id")

assert result.only_in_first == [{"id": 1, "name": "Ann"}]
assert result.only_in_second == [{"id": 3, "name": "Cara"}]

Normalization

from uniqdiff import compare, string_normalizer

normalizer = string_normalizer(lower=True, strip=True, remove_spaces=True)
result = compare([" Alice ", "Bob"], ["alice", "Cara"], normalizer=normalizer)

File Comparison

from uniqdiff import compare_files

result = compare_files("old.csv", "new.csv", key="id", format="csv")
result = compare_files("old.csv", "new.csv", key="id", format="csv", delimiter=";")
result = compare_files("old.parquet", "new.parquet", key="id", columns=("id", "name"))

Supported formats:

  • csv
  • tsv
  • jsonl
  • parquet (requires the uniqdiff[parquet] extra)
  • txt
  • gzip-compressed variants such as .csv.gz, .tsv.gz, .jsonl.gz, and .txt.gz

Connectors

Connectors provide a small extension layer for reading data sources. Built-ins cover iterables and local files:

from uniqdiff import compare_sources

result = compare_sources(
    "old.csv",
    "new.csv",
    first_kind="csv",
    second_kind="csv",
    key="id",
)

Registered connector names:

  • iterable
  • file
  • csv
  • tsv
  • tab
  • jsonl
  • parquet
  • pq
  • txt
  • text

Custom connectors implement open() and describe() and can be registered with register_connector.
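As a hedged sketch of that protocol (the exact open()/describe() signatures and the register_connector call are assumptions based on the description above, so check the uniqdiff API before relying on them), a custom connector serving rows from an in-memory list might look like this:

```python
# Hypothetical custom connector sketch. The open()/describe() protocol
# comes from the docs above; exact signatures are assumptions.
class ListConnector:
    """Serves rows from an in-memory list."""

    def __init__(self, rows):
        self.rows = list(rows)

    def open(self):
        # Return an iterator over the rows the engine should compare.
        return iter(self.rows)

    def describe(self):
        # Return human-readable metadata about this source.
        return {"kind": "list", "count": len(self.rows)}


# Registration (commented out because the call signature is assumed):
# from uniqdiff import register_connector
# register_connector("list", ListConnector)
```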

CSV and TSV connectors support dialect options such as delimiter, quotechar, has_header, and fieldnames.

Parquet support is optional and uses pyarrow. Install it with uniqdiff[parquet]. The Parquet connector supports columns and batch_size.

Disk Mode

mode="disk" uses a temporary SQLite database from the Python standard library. Input iterables are consumed incrementally and indexed on disk, which is useful when the input data should not be fully materialized as Python sets.

from uniqdiff import compare

result = compare(
    stream_a,
    stream_b,
    key="id",
    mode="disk",
    disk_strategy="sqlite",
    chunk_size=100_000,
    temp_dir="./tmp",
    disk_limit="10GB",
)

mode="auto" uses a small, predictable heuristic:

  • result_mode="file" chooses disk mode;
  • temp_dir chooses disk mode;
  • memory_limit is compared with an estimated input size;
  • unsized iterables/generators choose disk mode when memory_limit is set;
  • otherwise auto keeps the memory backend.

The current size estimate for sized inputs is intentionally conservative and simple: len(first) + len(second) multiplied by an internal per-item estimate. The decision is stored in result.metadata["auto_decision"].

result = compare(
    rows_a,
    rows_b,
    mode="auto",
    memory_limit="512MB",
)

print(result.metadata["backend"])
print(result.metadata["auto_decision"])

For very large inputs, hash partitioning can reduce peak memory by comparing one partition at a time. It is a stable 1.0 backend documented as an advanced mode because partition count, key skew, and temporary disk usage matter:

result = compare(
    stream_a,
    stream_b,
    key="id",
    mode="disk",
    disk_strategy="hash_partition",
    partition_count=128,
    temp_dir="./tmp",
)

Hash partitioning writes temporary partition files and guarantees that equal comparison tokens are processed in the same partition.
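A minimal illustration of that guarantee (conceptual only, not uniqdiff's internals): a stable hash maps equal tokens to the same partition index, so each partition can be compared in isolation without missing a match. A process-independent hash matters here because Python's built-in hash() is randomized per process, which would scatter tokens across runs.

```python
import hashlib


def partition_of(token: str, partition_count: int = 128) -> int:
    """Map a comparison token to a stable partition index.

    Uses SHA-256 rather than hash() so the mapping is identical across
    processes and runs, which on-disk partition files require.
    """
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partition_count


# Equal tokens always land in the same partition:
assert partition_of("id:42") == partition_of("id:42")
```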

External sort is available when sorted chunk files are a better fit than partition files. Like hash partitioning, it is a stable 1.0 backend documented as an advanced mode:

result = compare(
    stream_a,
    stream_b,
    key="id",
    mode="disk",
    disk_strategy="external_sort",
    chunk_size=250_000,
    temp_dir="./tmp",
)

This backend sorts each chunk on disk, performs a merge pass over both sorted token streams, and emits each result section in original input order for that side. Ordering is still not part of the cross-backend semantic contract.
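The merge pass can be pictured as a two-pointer walk over two sorted token lists (a simplified conceptual sketch: the real backend streams sorted chunks from disk, handles duplicates, and restores original input order, none of which this toy version does):

```python
def merge_classify(sorted_a, sorted_b):
    """Classify tokens from two sorted, deduplicated lists in one pass."""
    only_a, only_b, common = [], [], []
    i = j = 0
    while i < len(sorted_a) and j < len(sorted_b):
        if sorted_a[i] == sorted_b[j]:
            common.append(sorted_a[i])
            i += 1
            j += 1
        elif sorted_a[i] < sorted_b[j]:
            only_a.append(sorted_a[i])
            i += 1
        else:
            only_b.append(sorted_b[j])
            j += 1
    # One side is exhausted; the remainder of the other is unique to it.
    only_a.extend(sorted_a[i:])
    only_b.extend(sorted_b[j:])
    return only_a, only_b, common


print(merge_classify([1, 2, 3], [3, 4, 5]))  # ([1, 2], [4, 5], [3])
```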

File Result Mode

For large outputs, use result_mode="file" with .jsonl or .csv output. In this mode, result rows are written to disk and are not materialized in CompareResult. Statistics and metadata are still returned.

result = compare(
    stream_a,
    stream_b,
    key="id",
    mode="disk",
    disk_strategy="sqlite",
    result_mode="file",
    output="diff.jsonl",
)

print(result.stats.only_in_first_count)
print(result.metadata["output"])

Read large result files lazily:

from uniqdiff import iter_result_values

for value in iter_result_values("diff.jsonl", sections=("only_in_first",)):
    print(value)

File-backed CompareResult objects can also stream values:

for value in result.iter_unique():
    print(value)

Each JSONL/CSV row contains:

  • section: only_in_first, only_in_second, common, duplicates_first, or duplicates_second;
  • value: the original item.
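For illustration, rows in that shape can be filtered with nothing but the standard library (the sample rows below are invented, not real uniqdiff output):

```python
import json

# Two example rows in the section/value format described above.
rows = [
    '{"section": "only_in_first", "value": {"id": 1}}',
    '{"section": "only_in_second", "value": {"id": 3}}',
]

parsed = [json.loads(line) for line in rows]
first_only = [r["value"] for r in parsed if r["section"] == "only_in_first"]
print(first_only)  # [{'id': 1}]
```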

CLI

After installation, uniqdiff can compare files from the command line:

uniqdiff compare old.csv new.csv --format csv --key id
uniqdiff compare old.csv new.csv --format csv --key id --summary
uniqdiff diff old.csv new.csv --format csv --key id --summary --fail-on-diff
uniqdiff compare old.csv new.csv --format csv --key id --mode disk --disk-strategy hash-partition
uniqdiff compare old.csv new.csv --format csv --key id --mode disk --disk-strategy external-sort
uniqdiff compare old.csv new.csv --format csv --key id --mode disk --result-mode file --output diff.jsonl
uniqdiff diff old.txt new.txt --format txt --output result.json
uniqdiff intersection old.jsonl new.jsonl --format jsonl --key id
uniqdiff duplicates users.csv --format csv --key email

Useful CI flags:

  • --summary: print compact counters instead of full result rows;
  • --fail-on-diff: return exit code 1 when compare/diff find differences or duplicates finds duplicates.

Common options:

  • --mode memory|disk|auto
  • --chunk-size 100000
  • --memory-limit 512MB
  • --temp-dir ./tmp
  • --disk-limit 10GB
  • --disk-strategy sqlite|hash-partition|external-sort
  • --partition-count 128
  • --result-mode memory|file
  • --lower
  • --remove-spaces
  • --remove-special

Benchmarks

Run local benchmark scenarios with:

python benchmarks/run.py --size 100000

For a quick smoke run:

python benchmarks/run.py --size 1000 --scenario memory --scenario sqlite

The benchmark runner reports elapsed time, peak memory, result counts, and output file size for file-result scenarios.

Commercial Support

uniqdiff Core is free and open-source under the Apache License 2.0. Basic comparison, local file support, CLI usage, exact backends, file result mode, lazy readers, and the public engine API are not paid features.

Commercial support is available for teams that need production integration, performance audits, CI/CD workflows, custom connectors, row-level diff, or reporting through the broader UniqTools ecosystem.

See COMMERCIAL.md, SUPPORT.md, and SERVICES.md.

Contact: dredpirite@gmail.com

Fuzzy Comparison

Approximate string comparison is available through a separate API so exact comparison semantics stay unchanged:

from uniqdiff import compare_fuzzy_strings, string_normalizer

result = compare_fuzzy_strings(
    ["Alice Smith"],
    ["alice smyth"],
    threshold=75,
    normalizer=string_normalizer(lower=True),
)

Install uniqdiff[fuzzy] to use rapidfuzz; otherwise the stdlib difflib fallback is used. Fuzzy comparison is approximate, greedy, and O(n*m). It is a helper API, not part of the exact comparison engine.
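To see what the stdlib fallback's similarity looks like on its own, difflib can be used directly (the assumption that compare_fuzzy_strings maps its 0-100 threshold onto ratio() * 100 is mine; treat this only as an illustration of the fallback's scoring behavior):

```python
import difflib

# SequenceMatcher ratio is the stdlib similarity measure mentioned
# above; scaled to 0-100 to mirror a percentage-style threshold.
score = difflib.SequenceMatcher(None, "alice smith", "alice smyth").ratio() * 100
print(round(score))  # 91
```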

Bloom Filter Candidates

Bloom filter helpers are available for approximate candidate filtering:

from uniqdiff import probabilistic_diff_candidates

result = probabilistic_diff_candidates(
    old_ids,
    new_ids,
    expected_first=1_000_000,
    expected_second=1_000_000,
)

Bloom filters can produce false positives. In candidate-diff workflows, a false positive can hide a true difference, so this helper is not a replacement for exact comparison when every difference must be returned.
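To make that trade-off concrete, here is a toy Bloom filter (independent of uniqdiff's helpers) showing the two guarantees: every added item always tests positive, while an absent item *may* also test positive because distinct items can set overlapping bits.

```python
import hashlib


class TinyBloom:
    """Toy Bloom filter: no false negatives, possible false positives."""

    def __init__(self, size=64, hashes=2):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, item):
        # Derive `hashes` bit positions from independent hash inputs.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def __contains__(self, item):
        # True only if every position is set -- but other items may have
        # set those bits, which is exactly how false positives arise.
        return all(self.bits >> p & 1 for p in self._positions(item))
```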

Stable 1.0 Scope

The stable 1.0 engine provides:

  • list/tuple/set/iterable comparison;
  • dictionary/object/dataclass comparison by key;
  • recursive canonicalization for non-hashable structures;
  • optional normalization;
  • duplicate detection;
  • file readers for CSV, JSONL, and text;
  • connector API for iterable and file-backed sources;
  • CLI commands for compare, diff, intersection, and duplicates;
  • SQLite-backed disk mode for exact comparison without optional dependencies;
  • hash partitioning disk strategy for partition-by-partition comparison;
  • external sort disk strategy for sorted chunk and merge comparison;
  • file result mode for streaming large results to JSONL/CSV output;
  • approximate fuzzy string comparison as a helper outside exact semantics;
  • Bloom filter helpers for probabilistic candidate filtering outside exact semantics;
  • property-based tests that compare backend semantics;
  • benchmark runner for memory and disk strategies;
  • result serialization to dict, JSON, JSONL, and CSV;
  • stable API parameters for memory, disk, and auto modes.

Lazy result readers are already available for JSONL/CSV file-result outputs.
