Skip to main content

Minimal tool to compare CSV files and generate diff reports

Project description

CI

📊 minimal-csv-diff

A high-performance tool to compare CSV files and generate diff reports for data validation. Built with Polars for 10-100x faster comparisons on large datasets.

✨ Features

  • 🔍 Compare two CSV files with composite key matching
  • Blazing fast — handles 800k+ rows in seconds (Polars-powered)
  • 🎯 Interactive mode or CLI with explicit keys
  • 📋 Detailed diff reports showing unique rows and column-level changes
  • 🤖 LLM-agent friendly API for programmatic access
  • 📁 Exports results to CSV format for further analysis

🚀 Quick Start

Option 1: Run Instantly (No Installation) ⭐

uvx minimal-csv-diff

Option 2: Install & Run

pip install minimal-csv-diff
minimal-csv-diff

Option 3: CLI with Explicit Keys

minimal-csv-diff file1.csv file2.csv --key="id,date,name" --output=diff.csv

🎮 Try the Demo

Want to see it in action? Check out the demo directory:

cd demo/
minimal-csv-diff
# Follow prompts: select files 0,1 and choose a key column
# See the magic happen! ✨

The demo includes sample CSV files and shows how the tool identifies:

  • 🔴 Unique rows (exist in only one file)
  • 🟡 Column differences (same record, different values)
  • Matching records (excluded from output)

🧠 How It Works

The diff engine uses a three-phase comparison strategy:

flowchart TD
    subgraph "Phase 1: Load & Normalize"
        A["CSV File 1"] --> B["Polars DataFrame"]
        C["CSV File 2"] --> D["Polars DataFrame"]
        B --> E["Normalize Key Columns<br/>(strip whitespace, fill nulls)"]
        D --> F["Normalize Key Columns<br/>(strip whitespace, fill nulls)"]
    end
    
    subgraph "Phase 2: Identify Differences"
        E --> G["Anti-Join: df1 ← df2<br/>(rows only in file1)"]
        F --> H["Anti-Join: df2 ← df1<br/>(rows only in file2)"]
        E --> I["Inner Join on Keys<br/>(rows in both files)"]
        F --> I
        I --> J["Compare Non-Key Columns<br/>(find modifications)"]
    end
    
    subgraph "Phase 3: Output"
        G --> K["UNIQUE ROW<br/>from file1"]
        H --> L["UNIQUE ROW<br/>from file2"]
        J --> M["MODIFIED ROW<br/>with failed_columns"]
        K --> N["diff.csv"]
        L --> N
        M --> N
    end
    
    style E fill:#51cf66
    style F fill:#51cf66
    style N fill:#339af0

Key Matching Logic

Scenario Result
Key exists in file1 only UNIQUE ROW (source: file1)
Key exists in file2 only UNIQUE ROW (source: file2)
Key exists in both, values identical No output (match)
Key exists in both, values differ Two rows showing old → new values

📤 Output Format

When differences are found, generates a diff.csv with:

Column Description
surrogate_key Concatenated key fields (e.g., acme|sales|orders)
source Which file the row comes from
failed_columns UNIQUE ROW or list of changed columns
...all columns Complete row data for comparison

Example output:

"source","failed_columns","surrogate_key","id","name","value"
"old.csv","value","1|Alice","1","Alice","100"
"new.csv","value","1|Alice","1","Alice","150"
"old.csv","UNIQUE ROW","2|Bob","2","Bob","200"

🤖 Programmatic API (LLM-Agent Friendly)

from minimal_csv_diff import compare_csv_files, quick_csv_diff

# Option 1: Explicit keys
result = compare_csv_files(
    'old.csv', 'new.csv',
    key_columns=['id', 'date'],
    output_file='diff.csv'
)

# Option 2: Auto-detect keys
result = quick_csv_diff('old.csv', 'new.csv')

# Check results
if result['differences_found']:
    print(f"Found {result['summary']['total_differences']} differences")
    print(f"Output: {result['output_file']}")

Return structure:

{
    'status': 'success' | 'no_differences' | 'error',
    'differences_found': bool,
    'output_file': str | None,
    'summary': {
        'total_differences': int,
        'unique_rows': int,
        'modified_rows': int,
        'common_columns': int,
        'key_columns_used': list
    },
    'error_message': str | None
}

💡 Use Cases

  • 🔄 Data validation between different data sources
  • 🔧 ETL pipeline testing — compare before/after transformations
  • 🗄️ Database migration verification — ensure data integrity
  • 📊 Looker/BI validation — compare query results across environments
  • 🧪 A/B testing — identify differences in experimental datasets
  • 🤖 LLM workflows — automated data quality checks

⚡ Performance

Built on Polars for maximum performance:

Dataset Size Time
10k rows < 1 second
100k rows ~2 seconds
800k rows ~20 seconds

All string normalization runs as native Rust SIMD operations — no Python UDF overhead.

🛠️ Development

This project uses uv for dependency management.

git clone https://github.com/joon-solutions/minimal-csv-diff
cd minimal-csv-diff
uv sync
uv run pytest tests/ -v
uv run minimal-csv-diff

📋 Requirements

  • Python >= 3.10
  • polars >= 0.20.0

📦 Publishing a New Release

This project uses a tag-triggered CI/CD workflow for publishing to PyPI. This approach is simple, explicit, and foolproof.

How It Works

sequenceDiagram
    participant Dev as Developer
    participant GH as GitHub
    participant Actions as GitHub Actions
    participant PyPI as PyPI Registry
    
    Dev->>Dev: 1. Update version in pyproject.toml
    Dev->>Dev: 2. Commit changes
    Dev->>Dev: 3. Create git tag (v1.0.0)
    Dev->>GH: 4. Push commit + tag
    
    GH->>Actions: Tag push triggers workflow
    
    Actions->>Actions: Verify tag matches pyproject.toml version
    alt Version Mismatch
        Actions-->>GH: ❌ Fail with error
    else Version Match
        Actions->>Actions: Build package (wheel + sdist)
        Actions->>GH: Create GitHub Release with artifacts
        Actions->>PyPI: Publish package
        PyPI-->>Actions: ✅ Published
    end

Release Steps

# 1. Update version in pyproject.toml
#    Change: version = "0.4.1" → version = "0.5.0"

# 2. Commit the version bump
git add pyproject.toml
git commit -m "chore: release v0.5.0"

# 3. Create a git tag matching the version
git tag v0.5.0

# 4. Push both commit and tag
git push && git push --tags

That's it! The workflow automatically:

  • ✅ Verifies the tag version matches pyproject.toml
  • ✅ Builds the package (wheel + source distribution)
  • ✅ Creates a GitHub Release with auto-generated notes
  • ✅ Publishes to PyPI

Architecture Overview

flowchart LR
    subgraph "Your Machine"
        A[pyproject.toml<br/>version = X.Y.Z] --> B[git tag vX.Y.Z]
    end
    
    subgraph "GitHub"
        B --> C{Tag Push<br/>Detected}
        C --> D[release.yml<br/>Workflow]
    end
    
    subgraph "GitHub Actions"
        D --> E{Version<br/>Check}
        E -->|Match| F[python -m build]
        E -->|Mismatch| G[❌ Fail]
        F --> H[dist/*.whl<br/>dist/*.tar.gz]
    end
    
    subgraph "Outputs"
        H --> I[GitHub Release<br/>with artifacts]
        H --> J[PyPI Package<br/>pip install]
    end
    
    style A fill:#51cf66
    style J fill:#339af0
    style I fill:#ffd43b
    style G fill:#ff6b6b

Understanding Git Tags

For Maintainers: Tags are NOT branches. They're immutable pointers to specific commits — think "bookmarks" for releases.

gitGraph
    commit id: "a1b2c3"
    commit id: "d4e5f6"
    commit id: "g7h8i9" tag: "v0.4.0"
    commit id: "j0k1l2"
    commit id: "m3n4o5" tag: "v0.4.1"
    commit id: "p6q7r8"
Concept Branch Tag
What it is Moving pointer that advances with commits Fixed pointer to one specific commit
Purpose Track ongoing work Mark releases/milestones
Changes? Moves forward as you commit Never moves (immutable)

Why git push --tags is separate: Tags live in a separate namespace from branches. This is intentional — you might create local tags for testing that you don't want to push. Pushing tags should be deliberate since they trigger releases.

git push              # Pushes commits only
git push --tags       # Pushes tags only
git push && git push --tags  # Push both

PyPI Trusted Publishing (OIDC)

This repository uses OpenID Connect (OIDC) Trusted Publishing — a secure, credential-free way to publish to PyPI. No API tokens are stored in GitHub secrets.

Current Configuration

Setting Value
Repository joon-solutions/minimal-csv-diff
Workflow release.yml
Environment (Any)

How OIDC Authentication Works

Unlike traditional API tokens (which can leak or expire), OIDC uses cryptographic identity verification:

sequenceDiagram
    participant Actions as GitHub Actions
    participant GitHub as GitHub (Identity Provider)
    participant PyPI as PyPI Registry
    
    Note over Actions: Workflow triggered by tag push
    
    Actions->>GitHub: Request identity token
    GitHub->>GitHub: Generate signed JWT containing:<br/>• repo: joon-solutions/minimal-csv-diff<br/>• workflow: release.yml<br/>• ref: refs/tags/v1.0.0
    GitHub-->>Actions: Signed JWT token
    
    Actions->>PyPI: Upload package + JWT token
    
    PyPI->>PyPI: Verify JWT signature<br/>using GitHub's public key
    PyPI->>PyPI: Check claims match<br/>trusted publisher config
    
    alt Claims Match
        PyPI-->>Actions: ✅ Upload accepted
    else Claims Don't Match
        PyPI-->>Actions: ❌ Rejected
    end

Why This Is Secure

The JWT token GitHub generates contains verifiable claims:

{
  "repository": "joon-solutions/minimal-csv-diff",
  "workflow": "release.yml", 
  "ref": "refs/tags/v1.0.0",
  "iss": "https://token.actions.githubusercontent.com",
  "exp": 1709914200
}

PyPI verifies:

  1. Signature — Cryptographically signed by GitHub's private key (impossible to forge)
  2. Repository — Must match joon-solutions/minimal-csv-diff
  3. Workflow — Must be release.yml
  4. Expiration — Token valid for ~15 minutes only
flowchart LR
    subgraph "Traditional API Token"
        A1[Generate token on PyPI] --> A2[Store in GitHub Secrets]
        A2 --> A3[Token used in workflow]
        A3 --> A4[⚠️ Risks:<br/>• Can leak in logs<br/>• Works from anywhere<br/>• Must rotate manually]
    end
    
    subgraph "OIDC Trusted Publishing"
        B1[One-time trust setup<br/>on PyPI] --> B2[GitHub generates<br/>fresh JWT per run]
        B2 --> B3[PyPI verifies<br/>cryptographic signature]
        B3 --> B4[✅ Benefits:<br/>• No secrets to leak<br/>• Only works from this repo<br/>• Auto-expires in minutes]
    end
    
    style A4 fill:#ff6b6b
    style B4 fill:#51cf66

Setup (Already Done)

For new projects, you'd configure this on PyPI:

  1. Go to pypi.org → Your Project → Settings → Publishing
  2. Add a new "Trusted Publisher":
    • Owner: joon-solutions
    • Repository: minimal-csv-diff
    • Workflow name: release.yml
    • Environment: (leave blank)
  3. Save

Troubleshooting

Issue Cause Fix
"Tag version doesn't match" pyproject.toml version ≠ git tag Ensure v1.2.3 tag matches version = "1.2.3"
"Trusted publishing failed" PyPI not configured Add trusted publisher on PyPI (see above)
Workflow didn't trigger Tag not pushed Run git push --tags
"Version already exists" Re-releasing same version Bump version number, create new tag

Why Tag-Triggered Releases?

Approach Pros Cons
Tag-triggered Explicit control, simple debugging, works locally Manual version bump
Semantic-release Auto-versioning from commits Fragile, commit message format matters, hard to debug
Push to main Fully automated Publishes every commit, version conflicts

We chose tag-triggered because you decide when to release, and the workflow is transparent and debuggable

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

minimal_csv_diff-0.4.2.tar.gz (33.5 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

minimal_csv_diff-0.4.2-py3-none-any.whl (26.8 kB view details)

Uploaded Python 3

File details

Details for the file minimal_csv_diff-0.4.2.tar.gz.

File metadata

  • Download URL: minimal_csv_diff-0.4.2.tar.gz
  • Upload date:
  • Size: 33.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for minimal_csv_diff-0.4.2.tar.gz
Algorithm Hash digest
SHA256 371d1380b0fc9e55eb0485b5d87688fa2fd4f2e9238051ba17ce5eee18ad5327
MD5 9f8f3557f8e64a494bc960e35c8ad9cd
BLAKE2b-256 a1f03bf5f67f4c7a6f62e64c7fe4afddb86cfcf96ccdb930f210078691d159ac

See more details on using hashes here.

Provenance

The following attestation bundles were made for minimal_csv_diff-0.4.2.tar.gz:

Publisher: release.yml on joon-solutions/minimal-csv-diff

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file minimal_csv_diff-0.4.2-py3-none-any.whl.

File metadata

File hashes

Hashes for minimal_csv_diff-0.4.2-py3-none-any.whl
Algorithm Hash digest
SHA256 25e98633b948328a4b44ec1f12f0176176bb7885629fd1fe09311cf759862958
MD5 ce25f43000bd2c68916339d1aabac977
BLAKE2b-256 3d6aa1056bc738ea841085bba02a04360a51f9fe4987d04387f4fb4c2afdb631

See more details on using hashes here.

Provenance

The following attestation bundles were made for minimal_csv_diff-0.4.2-py3-none-any.whl:

Publisher: release.yml on joon-solutions/minimal-csv-diff

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page