# CSV CDC (Change Data Capture) Tool

A high-performance Change Data Capture (CDC) tool for comparing CSV files and detecting differences. Built with Python and optimized for speed using Polars, NumPy, and xxHash.
## 🚀 Features
- **Lightning Fast**: Uses Polars for CSV reading and xxHash for efficient comparisons
- **Large File Support**: Chunked processing for files of any size with memory optimization
- **Flexible Primary Keys**: Support for single or composite primary keys
- **Auto-Detection**: Automatically detect primary keys by analyzing data patterns
- **Multiple Output Formats**: diff, JSON, rowmark, and word-diff formats
- **Column Selection**: Include/exclude specific columns from comparison
- **Progress Tracking**: Built-in progress bars for large files
- **Memory Efficient**: Optimized for handling large CSV files with configurable chunk processing
- **Cross-Platform**: Works on Windows, macOS, and Linux
## 📦 Installation

### From PyPI

```bash
pip install csv-cdc
csvcdc old_file.csv new_file.csv
```

### From Source

```bash
git clone https://github.com/maurohkcba/csv-cdc.git
cd csv-cdc
pip install -r requirements.txt
python setup.py install
```

### Development Installation

```bash
git clone https://github.com/maurohkcba/csv-cdc.git
cd csv-cdc
pip install -e .
```
## 🏃‍♂️ Quick Start

### Basic Usage

Compare two CSV files using the first column as the primary key:

```bash
python csvcdc.py old_file.csv new_file.csv
```

### Large File Usage

For very large files that cause memory issues:

```bash
python csvcdc.py huge_file1.csv huge_file2.csv --largefiles 1
```
### Example Output

```diff
# Additions (2)
+ 4,New Product,99.99,Electronics
+ 5,Another Item,45.00,Books

# Modifications (1)
- 2,Laptop,999.99,Electronics
+ 2,Laptop,899.99,Electronics

# Deletions (1)
- 3,Old Product,25.99,Discontinued
```
## 📚 Detailed Examples

### 1. Basic File Comparison

Create sample files:

**base.csv**

```csv
id,name,price,category
1,Widget,10.99,Tools
2,Gadget,25.50,Electronics
3,Book,15.99,Education
```

**delta.csv**

```csv
id,name,price,category
1,Widget,12.99,Tools
2,Gadget,25.50,Electronics
4,Magazine,8.99,Education
```

Compare the files:

```bash
python csvcdc.py base.csv delta.csv --primary-key 0
```

Output:

```diff
# Additions (1)
+ 4,Magazine,8.99,Education

# Modifications (1)
- 1,Widget,10.99,Tools
+ 1,Widget,12.99,Tools

# Deletions (1)
- 3,Book,15.99,Education
```
### 2. Large File Processing

For files that are too large to fit in memory (multi-GB files):

```bash
# Enable large file mode with default chunk size (500,000 rows)
python csvcdc.py large_base.csv large_delta.csv --largefiles 1 --time

# Custom chunk size for very large files
python csvcdc.py huge_base.csv huge_delta.csv --largefiles 1 --chunk-size 100000

# Large file with JSON output
python csvcdc.py massive_file1.csv massive_file2.csv \
  --largefiles 1 \
  --chunk-size 250000 \
  --format json \
  --time > changes.json
```
### 3. Custom Primary Key

Use multiple columns as the primary key:

```bash
python csvcdc.py base.csv delta.csv --primary-key 0,1
```

### 4. Auto-Detect Primary Key

Let the tool automatically detect the best primary key:

```bash
python csvcdc.py base.csv delta.csv --autopk 1
```

For large files with auto-detection:

```bash
python csvcdc.py large_base.csv large_delta.csv --autopk 1 --largefiles 1
```

### 5. Column Selection

Compare only specific columns:

```bash
# Compare only columns 0, 1, and 2
python csvcdc.py base.csv delta.csv --columns 0,1,2

# Ignore column 3 (category) from comparison
python csvcdc.py base.csv delta.csv --ignore-columns 3
```
### 6. Different Output Formats

**JSON Format:**

```bash
python csvcdc.py base.csv delta.csv --format json
```

```json
{
  "Additions": [
    "4,Magazine,8.99,Education"
  ],
  "Modifications": [
    {
      "Original": "1,Widget,10.99,Tools",
      "Current": "1,Widget,12.99,Tools"
    }
  ],
  "Deletions": [
    "3,Book,15.99,Education"
  ]
}
```
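Because the JSON output uses stable top-level keys (`Additions`, `Modifications`, `Deletions`), saved results are easy to post-process. A minimal sketch that reads a previously saved `changes.json`:

```python
import json

# Load output previously saved with: csvcdc ... --format json > changes.json
with open('changes.json') as f:
    changes = json.load(f)

# Each modification pairs the original and current row text
for mod in changes['Modifications']:
    print(f"{mod['Original']} -> {mod['Current']}")

print(f"{len(changes['Additions'])} added, {len(changes['Deletions'])} deleted")
```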
**Rowmark Format:**

```bash
python csvcdc.py base.csv delta.csv --format rowmark
```

```
ADDED,4,Magazine,8.99,Education
MODIFIED,1,Widget,12.99,Tools
```
**Word Diff Format:**

```bash
python csvcdc.py base.csv delta.csv --format word-diff
```
### 7. Custom Separators

For tab-separated files:

```bash
python csvcdc.py base.tsv delta.tsv --separator '\t'
```

For pipe-separated files:

```bash
python csvcdc.py base.csv delta.csv --separator '|'
```
### 8. Performance Monitoring

Track execution time and show progress:

```bash
python csvcdc.py large_base.csv large_delta.csv --time --progressbar 1
```
### 9. Large File Example

For files with millions of rows:

```bash
# Auto-detect primary key, show progress, time execution, large file mode
python csvcdc.py huge_base.csv huge_delta.csv \
  --autopk 1 \
  --progressbar 1 \
  --time \
  --largefiles 1 \
  --chunk-size 200000 \
  --format json > changes.json
```
### 10. Memory Error Scenarios

If you encounter memory allocation errors like:

```
Error: Unable to allocate 203. GiB for an array with shape (5196564, 42)
```

use large file mode:

```bash
python csvcdc.py problematic_file1.csv problematic_file2.csv \
  --largefiles 1 \
  --chunk-size 50000 \
  --progressbar 1 \
  --time
```
## 🔧 Command Line Options

| Option | Description | Default |
|---|---|---|
| `base_csv` | Base CSV file path | Required |
| `delta_csv` | Delta CSV file path | Required |
| `-p, --primary-key` | Primary key column positions (comma-separated) | `0` |
| `-s, --separator` | Field separator | `,` |
| `--columns` | Columns to compare (comma-separated) | All columns |
| `--ignore-columns` | Columns to ignore (comma-separated) | None |
| `--include` | Columns to include in output | All columns |
| `-o, --format` | Output format: `diff`, `json`, `rowmark`, `word-diff` | `diff` |
| `--time` | Show execution time | False |
| `--progressbar` | Show progress bar (0 or 1) | `1` |
| `--autopk` | Auto-detect primary key (0 or 1) | `0` |
| `--largefiles` | Enable large file optimization with chunked processing (0 or 1) | `0` |
| `--chunk-size` | Chunk size (rows) for large file processing | `500000` |
| `--version` | Show version | - |
## 📏 Large File Processing

### When to Use Large File Mode

Enable `--largefiles 1` when:

- Files are larger than available RAM
- You get memory allocation errors
- Files have millions of rows
- You want to minimize memory usage

### Chunk Size Guidelines
| File Size | Recommended Chunk Size | Memory Usage |
|---|---|---|
| < 100MB | Default (no chunking) | Full file in RAM |
| 100MB - 1GB | 500,000 rows | ~500MB RAM |
| 1GB - 10GB | 200,000 rows | ~200MB RAM |
| > 10GB | 50,000 - 100,000 rows | ~50-100MB RAM |
### Large File Examples

```bash
# For 5GB+ files
python csvcdc.py massive1.csv massive2.csv --largefiles 1 --chunk-size 100000

# For extreme cases (50GB+ files)
python csvcdc.py extreme1.csv extreme2.csv --largefiles 1 --chunk-size 25000

# Balanced performance and memory
python csvcdc.py large1.csv large2.csv --largefiles 1 --chunk-size 250000
```
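Conceptually, large file mode trades full in-memory tables for per-chunk row hashes, so memory scales with the number of rows rather than the full file contents. The sketch below illustrates that general technique with Polars batched reading and xxHash; it is a simplified illustration of the idea, not the tool's actual implementation (the function name and the single-column key are assumptions):

```python
import polars as pl
import xxhash

def hash_rows(path, chunk_size=100_000):
    """Build {primary-key hash: full-row hash}, reading one chunk at a time."""
    hashes = {}
    reader = pl.read_csv_batched(path, batch_size=chunk_size)
    while (batches := reader.next_batches(1)) is not None:
        for chunk in batches:
            for row in chunk.iter_rows():
                key = xxhash.xxh64(str(row[0])).intdigest()  # column 0 as key
                hashes[key] = xxhash.xxh64(','.join(map(str, row))).intdigest()
    return hashes

old, new = hash_rows('large_base.csv'), hash_rows('large_delta.csv')
additions = new.keys() - old.keys()  # keys only in the new file
deletions = old.keys() - new.keys()  # keys only in the old file
modified = [k for k in old.keys() & new.keys() if old[k] != new[k]]
```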
## 🐍 Python API Usage

### Basic API Usage

```python
from csvcdc import CSVCDC

# Create CDC instance
cdc = CSVCDC(separator=',', primary_key=[0])

# Compare files
result = cdc.compare('base.csv', 'delta.csv')

# Access results
print(f"Additions: {len(result.additions)}")
print(f"Modifications: {len(result.modifications)}")
print(f"Deletions: {len(result.deletions)}")

# Process individual changes
for addition in result.additions:
    print(f"Added: {addition}")

for modification in result.modifications:
    print(f"Changed from: {modification['Original']}")
    print(f"Changed to: {modification['Current']}")

for deletion in result.deletions:
    print(f"Deleted: {deletion}")
```
### Large File API Usage

```python
from csvcdc import CSVCDC

# Large file configuration
cdc = CSVCDC(
    separator=',',
    primary_key=[0],
    largefiles=1,       # Enable chunked processing
    chunk_size=100000,  # Process 100k rows at a time
    progressbar=1
)

# Compare large files
result = cdc.compare('huge_base.csv', 'huge_delta.csv')

# Process results normally
print(f"Found {len(result.additions)} additions")
print(f"Found {len(result.modifications)} modifications")
print(f"Found {len(result.deletions)} deletions")
```
### Advanced API Usage

```python
from csvcdc import CSVCDC, OutputFormatter

# Advanced configuration with large file support
cdc = CSVCDC(
    separator=',',
    primary_key=[0, 1],     # Composite primary key
    ignore_columns=[3, 4],  # Ignore columns 3 and 4
    progressbar=1,
    autopk=0,
    largefiles=1,           # Enable for large files
    chunk_size=200000       # Custom chunk size
)

# Compare files
result = cdc.compare('data/products_old.csv', 'data/products_new.csv')

# Use different formatters
diff_output = OutputFormatter.format_diff(result)
json_output = OutputFormatter.format_json(result)
rowmark_output = OutputFormatter.format_rowmark(result)

print("Diff format:")
print(diff_output)

# Save JSON output
with open('changes.json', 'w') as f:
    f.write(json_output)
```
### Custom Processing

```python
from csvcdc import CSVCDC
import json

def process_large_changes(base_file, delta_file):
    # Optimized for large files
    cdc = CSVCDC(
        autopk=1,           # Auto-detect primary key
        largefiles=1,       # Enable chunked processing
        chunk_size=150000,  # Custom chunk size
        progressbar=1
    )

    result = cdc.compare(base_file, delta_file)

    # Custom processing
    changes_summary = {
        'total_additions': len(result.additions),
        'total_modifications': len(result.modifications),
        'total_deletions': len(result.deletions),
        'change_rate': (len(result.additions) + len(result.modifications) + len(result.deletions)) / 100
    }

    # Process specific types of changes
    price_changes = []
    for mod in result.modifications:
        orig_parts = mod['Original'].split(',')
        curr_parts = mod['Current'].split(',')

        # Assuming price is in column 2
        if len(orig_parts) > 2 and len(curr_parts) > 2:
            try:
                old_price = float(orig_parts[2])
                new_price = float(curr_parts[2])
                if old_price != new_price:
                    price_changes.append({
                        'id': orig_parts[0],
                        'old_price': old_price,
                        'new_price': new_price,
                        'change': new_price - old_price
                    })
            except ValueError:
                pass

    changes_summary['price_changes'] = price_changes
    return changes_summary

# Usage
summary = process_large_changes('old_products.csv', 'new_products.csv')
print(json.dumps(summary, indent=2))
```
## 🔍 Auto Primary Key Detection

The auto primary key detection feature analyzes your data to find the best column(s) to use as the primary key:

```python
from csvcdc import CSVCDC

# Enable auto-detection
cdc = CSVCDC(autopk=1)
result = cdc.compare('file1.csv', 'file2.csv')

# Auto-detection with large files
cdc = CSVCDC(autopk=1, largefiles=1)
result = cdc.compare('large_file1.csv', 'large_file2.csv')
```

The algorithm considers:

- **Uniqueness**: How unique values are in each column
- **Match Rate**: How well values match between files
- **Composite Keys**: Tests combinations of columns
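To make these criteria concrete, a candidate key can be scored roughly as uniqueness multiplied by cross-file match rate. A toy sketch of that scoring idea (not the tool's exact algorithm):

```python
def score_candidate(base_keys, delta_keys):
    """Higher is better: unique within the base file and present in both files."""
    base_unique = set(base_keys)
    uniqueness = len(base_unique) / len(base_keys)  # 1.0 = no duplicates
    match_rate = len(base_unique & set(delta_keys)) / len(base_unique)
    return uniqueness * match_rate

# Composite candidate: columns 0 and 1 taken together
base = [('1', 'Widget'), ('2', 'Gadget'), ('3', 'Book')]
delta = [('1', 'Widget'), ('2', 'Gadget'), ('4', 'Magazine')]
print(round(score_candidate(base, delta), 3))  # 0.667: fully unique, one key unmatched
```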
### Example of Auto-Detection Output

```
Auto-detecting primary key...
Testing single columns: 100%|██████████| 5/5
Testing column combinations: 100%|███| 3/3
Auto-detected primary key: columns [0, 1] (score: 0.943)
```
## 📊 Performance Benchmarks

Performance comparison on different file sizes:

### Small Files (< 100MB)
| Tool | Time | Memory |
|---|---|---|
| csv-cdc | 12.3s | 150MB |
| Traditional diff | 45.2s | 400MB |
| Manual Python | 38.7s | 320MB |
### Large Files (1GB+)
| Mode | File Size | Time | Peak Memory |
|---|---|---|---|
| Regular | 1GB | 45s | 2.1GB |
| Large File Mode | 1GB | 52s | 350MB |
| Large File Mode | 10GB | 8.5min | 450MB |
| Large File Mode | 50GB | 42min | 500MB |
### Optimization Features

- **Polars Integration**: Ultra-fast CSV reading
- **xxHash**: High-speed hashing algorithm
- **Vectorized Operations**: NumPy-based processing
- **Chunked Processing**: Memory-efficient large file handling
- **Progressive Loading**: Streaming for huge files
- **Garbage Collection**: Automatic memory cleanup between chunks
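As a sketch of what the NumPy-based vectorized step can look like (an illustration, not the project's exact internals), set operations on arrays of row hashes replace per-row Python loops:

```python
import numpy as np

# Primary-key hashes for each file (illustrative values)
old_keys = np.array([11, 22, 33, 44], dtype=np.uint64)
new_keys = np.array([22, 33, 44, 55], dtype=np.uint64)

added = np.setdiff1d(new_keys, old_keys)     # present only in the new file -> [55]
deleted = np.setdiff1d(old_keys, new_keys)   # present only in the old file -> [11]
common = np.intersect1d(old_keys, new_keys)  # compare row hashes for these keys
```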
## 🧪 Testing

Run the test suite:

```bash
# Install test dependencies
pip install pytest pytest-cov

# Run tests
pytest tests/

# Run with coverage
pytest --cov=csvcdc tests/

# Test large file functionality
pytest tests/test_large_files.py
```
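For a quick start on new tests, a minimal case against the documented API might look like the following (the file name and sample data are illustrative):

```python
# tests/test_basic.py (illustrative)
from csvcdc import CSVCDC

def test_detects_addition_and_deletion(tmp_path):
    base = tmp_path / 'base.csv'
    delta = tmp_path / 'delta.csv'
    base.write_text('id,name\n1,Widget\n2,Gadget\n')
    delta.write_text('id,name\n1,Widget\n3,Book\n')

    result = CSVCDC(primary_key=[0]).compare(str(base), str(delta))

    assert len(result.additions) == 1   # row with id 3 was added
    assert len(result.deletions) == 1   # row with id 2 was removed
    assert len(result.modifications) == 0
```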
## 🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

### Development Setup

```bash
git clone https://github.com/maurohkcba/csv-cdc.git
cd csv-cdc
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt
pip install -e .
```

### Running Tests

```bash
pytest tests/
```
## 📜 License

This project is licensed under the MIT License. See the LICENCE file for details.
## 🐛 Issues and Support

- 🐛 **Bug Reports**: GitHub Issues
- 💡 **Feature Requests**: GitHub Issues
- 📖 **Documentation**: Wiki
## 🚀 Roadmap
- Large file chunked processing
- Memory optimization for huge datasets
- Support for Excel files
- Database output integration
- Web UI interface
- Docker containerization
- Cloud storage support (S3, GCS, Azure)
- Parallel processing for multi-core systems
- Configuration file support
- Scheduled comparison jobs
## ⭐ Star History
If you find this tool useful, please consider giving it a star!
## 📈 Changelog
See CHANGELOG.md for a list of changes and version history.