Minimal tool to compare CSV files and generate diff reports

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

Project description

📊 minimal-csv-diff

A high-performance tool to compare CSV files and generate diff reports for data validation. Built with Polars for 10-100x faster comparisons on large datasets.

✨ Features

🔍 Compare two CSV files with composite key matching
⚡ Blazing fast — handles 800k+ rows in seconds (Polars-powered)
🎯 Interactive mode or CLI with explicit keys
📋 Detailed diff reports showing unique rows and column-level changes
🤖 LLM-agent friendly API for programmatic access
📁 Exports results to CSV format for further analysis

🚀 Quick Start

Option 1: Run Instantly (No Installation) ⭐

uvx minimal-csv-diff

Option 2: Install & Run

pip install minimal-csv-diff
minimal-csv-diff

Option 3: CLI with Explicit Keys

minimal-csv-diff file1.csv file2.csv --key="id,date,name" --output=diff.csv

🎮 Try the Demo

Want to see it in action? Check out the demo directory:

cd demo/
minimal-csv-diff
# Follow prompts: select files 0,1 and choose a key column
# See the magic happen! ✨

The demo includes sample CSV files and shows how the tool identifies:

🔴 Unique rows (exist in only one file)
🟡 Column differences (same record, different values)
✅ Matching records (excluded from output)

🧠 How It Works

The diff engine uses a three-phase comparison strategy:

flowchart TD
    subgraph "Phase 1: Load & Normalize"
        A["CSV File 1"] --> B["Polars DataFrame"]
        C["CSV File 2"] --> D["Polars DataFrame"]
        B --> E["Normalize Key Columns<br/>(strip whitespace, fill nulls)"]
        D --> F["Normalize Key Columns<br/>(strip whitespace, fill nulls)"]
    end
    
    subgraph "Phase 2: Identify Differences"
        E --> G["Anti-Join: df1 ← df2<br/>(rows only in file1)"]
        F --> H["Anti-Join: df2 ← df1<br/>(rows only in file2)"]
        E --> I["Inner Join on Keys<br/>(rows in both files)"]
        F --> I
        I --> J["Compare Non-Key Columns<br/>(find modifications)"]
    end
    
    subgraph "Phase 3: Output"
        G --> K["UNIQUE ROW<br/>from file1"]
        H --> L["UNIQUE ROW<br/>from file2"]
        J --> M["MODIFIED ROW<br/>with failed_columns"]
        K --> N["diff.csv"]
        L --> N
        M --> N
    end
    
    style E fill:#51cf66
    style F fill:#51cf66
    style N fill:#339af0

Key Matching Logic

Scenario	Result
Key exists in file1 only	`UNIQUE ROW` (source: file1)
Key exists in file2 only	`UNIQUE ROW` (source: file2)
Key exists in both, values identical	No output (match)
Key exists in both, values differ	Two rows showing old → new values

📤 Output Format

When differences are found, generates a diff.csv with:

Column	Description
`surrogate_key`	Concatenated key fields (e.g., `acme\|sales\|orders`)
`source`	Which file the row comes from
`failed_columns`	`UNIQUE ROW` or list of changed columns
...all columns	Complete row data for comparison

Example output:

"source","failed_columns","surrogate_key","id","name","value"
"old.csv","value","1|Alice","1","Alice","100"
"new.csv","value","1|Alice","1","Alice","150"
"old.csv","UNIQUE ROW","2|Bob","2","Bob","200"

🤖 Programmatic API (LLM-Agent Friendly)

from minimal_csv_diff import compare_csv_files, quick_csv_diff

# Option 1: Explicit keys
result = compare_csv_files(
    'old.csv', 'new.csv',
    key_columns=['id', 'date'],
    output_file='diff.csv'
)

# Option 2: Auto-detect keys
result = quick_csv_diff('old.csv', 'new.csv')

# Check results
if result['differences_found']:
    print(f"Found {result['summary']['total_differences']} differences")
    print(f"Output: {result['output_file']}")

Return structure:

{
    'status': 'success' | 'no_differences' | 'error',
    'differences_found': bool,
    'output_file': str | None,
    'summary': {
        'total_differences': int,
        'unique_rows': int,
        'modified_rows': int,
        'common_columns': int,
        'key_columns_used': list
    },
    'error_message': str | None
}

💡 Use Cases

🔄 Data validation between different data sources
🔧 ETL pipeline testing — compare before/after transformations
🗄️ Database migration verification — ensure data integrity
📊 Looker/BI validation — compare query results across environments
🧪 A/B testing — identify differences in experimental datasets
🤖 LLM workflows — automated data quality checks

⚡ Performance

Built on Polars for maximum performance:

Dataset Size	Time
10k rows	< 1 second
100k rows	~2 seconds
800k rows	~20 seconds

All string normalization runs as native Rust SIMD operations — no Python UDF overhead.

🛠️ Development

This project uses uv for dependency management.

git clone https://github.com/joon-solutions/minimal-csv-diff
cd minimal-csv-diff
uv sync
uv run pytest tests/ -v
uv run minimal-csv-diff

📋 Requirements

Python >= 3.10
polars >= 0.20.0

📦 Publishing a New Release

This project uses a tag-triggered CI/CD workflow for publishing to PyPI. This approach is simple, explicit, and foolproof.

How It Works

sequenceDiagram
    participant Dev as Developer
    participant GH as GitHub
    participant Actions as GitHub Actions
    participant PyPI as PyPI Registry
    
    Dev->>Dev: 1. Update version in pyproject.toml
    Dev->>Dev: 2. Commit changes
    Dev->>Dev: 3. Create git tag (v1.0.0)
    Dev->>GH: 4. Push commit + tag
    
    GH->>Actions: Tag push triggers workflow
    
    Actions->>Actions: Verify tag matches pyproject.toml version
    alt Version Mismatch
        Actions-->>GH: ❌ Fail with error
    else Version Match
        Actions->>Actions: Build package (wheel + sdist)
        Actions->>GH: Create GitHub Release with artifacts
        Actions->>PyPI: Publish package
        PyPI-->>Actions: ✅ Published
    end

Release Steps

# 1. Update version in pyproject.toml
#    Change: version = "0.4.1" → version = "0.5.0"

# 2. Commit the version bump
git add pyproject.toml
git commit -m "chore: release v0.5.0"

# 3. Create a git tag matching the version
git tag v0.5.0

# 4. Push both commit and tag
git push && git push --tags

That's it! The workflow automatically:

✅ Verifies the tag version matches pyproject.toml
✅ Builds the package (wheel + source distribution)
✅ Creates a GitHub Release with auto-generated notes
✅ Publishes to PyPI

Architecture Overview

flowchart LR
    subgraph "Your Machine"
        A[pyproject.toml<br/>version = X.Y.Z] --> B[git tag vX.Y.Z]
    end
    
    subgraph "GitHub"
        B --> C{Tag Push<br/>Detected}
        C --> D[release.yml<br/>Workflow]
    end
    
    subgraph "GitHub Actions"
        D --> E{Version<br/>Check}
        E -->|Match| F[python -m build]
        E -->|Mismatch| G[❌ Fail]
        F --> H[dist/*.whl<br/>dist/*.tar.gz]
    end
    
    subgraph "Outputs"
        H --> I[GitHub Release<br/>with artifacts]
        H --> J[PyPI Package<br/>pip install]
    end
    
    style A fill:#51cf66
    style J fill:#339af0
    style I fill:#ffd43b
    style G fill:#ff6b6b

Understanding Git Tags

For Maintainers: Tags are NOT branches. They're immutable pointers to specific commits — think "bookmarks" for releases.

gitGraph
    commit id: "a1b2c3"
    commit id: "d4e5f6"
    commit id: "g7h8i9" tag: "v0.4.0"
    commit id: "j0k1l2"
    commit id: "m3n4o5" tag: "v0.4.1"
    commit id: "p6q7r8"

Concept	Branch	Tag
What it is	Moving pointer that advances with commits	Fixed pointer to one specific commit
Purpose	Track ongoing work	Mark releases/milestones
Changes?	Moves forward as you commit	Never moves (immutable)

Why git push --tags is separate: Tags live in a separate namespace from branches. This is intentional — you might create local tags for testing that you don't want to push. Pushing tags should be deliberate since they trigger releases.

git push              # Pushes commits only
git push --tags       # Pushes tags only
git push && git push --tags  # Push both

PyPI Trusted Publishing (OIDC)

This repository uses OpenID Connect (OIDC) Trusted Publishing — a secure, credential-free way to publish to PyPI. No API tokens are stored in GitHub secrets.

Current Configuration

Setting	Value
Repository	`joon-solutions/minimal-csv-diff`
Workflow	`release.yml`
Environment	(Any)

How OIDC Authentication Works

Unlike traditional API tokens (which can leak or expire), OIDC uses cryptographic identity verification:

sequenceDiagram
    participant Actions as GitHub Actions
    participant GitHub as GitHub (Identity Provider)
    participant PyPI as PyPI Registry
    
    Note over Actions: Workflow triggered by tag push
    
    Actions->>GitHub: Request identity token
    GitHub->>GitHub: Generate signed JWT containing:<br/>• repo: joon-solutions/minimal-csv-diff<br/>• workflow: release.yml<br/>• ref: refs/tags/v1.0.0
    GitHub-->>Actions: Signed JWT token
    
    Actions->>PyPI: Upload package + JWT token
    
    PyPI->>PyPI: Verify JWT signature<br/>using GitHub's public key
    PyPI->>PyPI: Check claims match<br/>trusted publisher config
    
    alt Claims Match
        PyPI-->>Actions: ✅ Upload accepted
    else Claims Don't Match
        PyPI-->>Actions: ❌ Rejected
    end

Why This Is Secure

The JWT token GitHub generates contains verifiable claims:

{
  "repository": "joon-solutions/minimal-csv-diff",
  "workflow": "release.yml", 
  "ref": "refs/tags/v1.0.0",
  "iss": "https://token.actions.githubusercontent.com",
  "exp": 1709914200
}

PyPI verifies:

Signature — Cryptographically signed by GitHub's private key (impossible to forge)
Repository — Must match joon-solutions/minimal-csv-diff
Workflow — Must be release.yml
Expiration — Token valid for ~15 minutes only

flowchart LR
    subgraph "Traditional API Token"
        A1[Generate token on PyPI] --> A2[Store in GitHub Secrets]
        A2 --> A3[Token used in workflow]
        A3 --> A4[⚠️ Risks:<br/>• Can leak in logs<br/>• Works from anywhere<br/>• Must rotate manually]
    end
    
    subgraph "OIDC Trusted Publishing"
        B1[One-time trust setup<br/>on PyPI] --> B2[GitHub generates<br/>fresh JWT per run]
        B2 --> B3[PyPI verifies<br/>cryptographic signature]
        B3 --> B4[✅ Benefits:<br/>• No secrets to leak<br/>• Only works from this repo<br/>• Auto-expires in minutes]
    end
    
    style A4 fill:#ff6b6b
    style B4 fill:#51cf66

Setup (Already Done)

For new projects, you'd configure this on PyPI:

Go to pypi.org → Your Project → Settings → Publishing
Add a new "Trusted Publisher":
- Owner: joon-solutions
- Repository: minimal-csv-diff
- Workflow name: release.yml
- Environment: (leave blank)
Save

Troubleshooting

Issue	Cause	Fix
"Tag version doesn't match"	pyproject.toml version ≠ git tag	Ensure `v1.2.3` tag matches `version = "1.2.3"`
"Trusted publishing failed"	PyPI not configured	Add trusted publisher on PyPI (see above)
Workflow didn't trigger	Tag not pushed	Run `git push --tags`
"Version already exists"	Re-releasing same version	Bump version number, create new tag

Why Tag-Triggered Releases?

Approach	Pros	Cons
Tag-triggered ✅	Explicit control, simple debugging, works locally	Manual version bump
Semantic-release	Auto-versioning from commits	Fragile, commit message format matters, hard to debug
Push to main	Fully automated	Publishes every commit, version conflicts

We chose tag-triggered because you decide when to release, and the workflow is transparent and debuggable

Project details

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

luutuankiet

Release history Release notifications | RSS feed

This version

0.4.2

Mar 9, 2026

0.4.1

Aug 29, 2025

0.4.0

Jul 25, 2025

0.3.0

Jul 13, 2025

0.2.0

Jun 3, 2025

0.1.0

May 28, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

minimal_csv_diff-0.4.2.tar.gz (33.5 kB view details)

Uploaded Mar 9, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

minimal_csv_diff-0.4.2-py3-none-any.whl (26.8 kB view details)

Uploaded Mar 9, 2026 Python 3

File details

Details for the file minimal_csv_diff-0.4.2.tar.gz.

File metadata

Download URL: minimal_csv_diff-0.4.2.tar.gz
Upload date: Mar 9, 2026
Size: 33.5 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for minimal_csv_diff-0.4.2.tar.gz
Algorithm	Hash digest
SHA256	`371d1380b0fc9e55eb0485b5d87688fa2fd4f2e9238051ba17ce5eee18ad5327`
MD5	`9f8f3557f8e64a494bc960e35c8ad9cd`
BLAKE2b-256	`a1f03bf5f67f4c7a6f62e64c7fe4afddb86cfcf96ccdb930f210078691d159ac`

See more details on using hashes here.

Provenance

The following attestation bundles were made for minimal_csv_diff-0.4.2.tar.gz:

Publisher: release.yml on joon-solutions/minimal-csv-diff

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: minimal_csv_diff-0.4.2.tar.gz
- Subject digest: 371d1380b0fc9e55eb0485b5d87688fa2fd4f2e9238051ba17ce5eee18ad5327
- Sigstore transparency entry: 1065594083
- Sigstore integration time: Mar 9, 2026
Source repository:
- Permalink: joon-solutions/minimal-csv-diff@8d6430385b60045d556d6f5401b2b0bb78e641a7
- Branch / Tag: refs/tags/v0.4.2
- Owner: https://github.com/joon-solutions
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@8d6430385b60045d556d6f5401b2b0bb78e641a7
- Trigger Event: push

File details

Details for the file minimal_csv_diff-0.4.2-py3-none-any.whl.

File metadata

Download URL: minimal_csv_diff-0.4.2-py3-none-any.whl
Upload date: Mar 9, 2026
Size: 26.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for minimal_csv_diff-0.4.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`25e98633b948328a4b44ec1f12f0176176bb7885629fd1fe09311cf759862958`
MD5	`ce25f43000bd2c68916339d1aabac977`
BLAKE2b-256	`3d6aa1056bc738ea841085bba02a04360a51f9fe4987d04387f4fb4c2afdb631`

See more details on using hashes here.

Provenance

The following attestation bundles were made for minimal_csv_diff-0.4.2-py3-none-any.whl:

Publisher: release.yml on joon-solutions/minimal-csv-diff

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: minimal_csv_diff-0.4.2-py3-none-any.whl
- Subject digest: 25e98633b948328a4b44ec1f12f0176176bb7885629fd1fe09311cf759862958
- Sigstore transparency entry: 1065594110
- Sigstore integration time: Mar 9, 2026
Source repository:
- Permalink: joon-solutions/minimal-csv-diff@8d6430385b60045d556d6f5401b2b0bb78e641a7
- Branch / Tag: refs/tags/v0.4.2
- Owner: https://github.com/joon-solutions
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@8d6430385b60045d556d6f5401b2b0bb78e641a7
- Trigger Event: push

minimal-csv-diff 0.4.2

Navigation

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Project description

📊 minimal-csv-diff

✨ Features

🚀 Quick Start

Option 1: Run Instantly (No Installation) ⭐

Option 2: Install & Run

Option 3: CLI with Explicit Keys

🎮 Try the Demo

🧠 How It Works

Key Matching Logic

📤 Output Format

🤖 Programmatic API (LLM-Agent Friendly)

💡 Use Cases

⚡ Performance

🛠️ Development

📋 Requirements

📦 Publishing a New Release

How It Works

Release Steps

Architecture Overview

Understanding Git Tags

PyPI Trusted Publishing (OIDC)

Current Configuration

How OIDC Authentication Works

Why This Is Secure

Setup (Already Done)

Troubleshooting

Why Tag-Triggered Releases?

Project details

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

Provenance

File details

File metadata

File hashes

Provenance