CLI to compare two tabular datasets and produce a concise markdown report

Data Whisperer

Intelligent dataset comparison that reveals what truly changed.

Data Whisperer (Copydata) is a command-line tool that compares two versions of tabular datasets and generates insightful reports highlighting meaningful changes. Unlike simple diff tools, it understands your data's structure and semantics.

Features

  • 🔍 Smart Column Matching - Automatically detects renamed columns using similarity scoring
  • 📊 Statistical Analysis - Compares numeric distributions, outliers, and trends
  • 🏷️ Category Tracking - Identifies added/removed categories and value changes
  • 📈 Change Scoring - Prioritizes the most significant changes for quick review
  • 📝 Multiple Formats - Supports CSV, Excel (XLS/XLSX), and JSON files
  • 📄 Flexible Output - Generate markdown reports and/or machine-readable JSON

Installation

pip install copydata

Quick Start

Compare two datasets:

copydata old_data.csv new_data.csv

Save report to file:

copydata data_v1.xlsx data_v2.xlsx --output-save --output report.md

Generate both markdown and JSON output:

copydata before.csv after.csv --output-save --json

Usage

copydata [-h] [--output OUTPUT] [--output-save] [--json] [--rename-threshold THRESHOLD] a b

Arguments

  • a - Path to dataset A (older version)
  • b - Path to dataset B (newer version)

Options

  • --output, -o - Output filename for markdown report (default: copydata_report.md)
  • --output-save - Save reports to files instead of printing to stdout
  • --json, -j - Also generate JSON output with full comparison data
  • --rename-threshold - Similarity threshold for detecting renamed columns (default: 0.82)

What Does It Analyze?

Summary Statistics

  • Row count changes
  • Overall null value percentages
  • Duplicate row detection
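These summary metrics map directly onto a few pandas calls. The sketch below is illustrative only (the `summary_stats` helper is hypothetical, not the package's internal API):

```python
import pandas as pd

def summary_stats(df: pd.DataFrame) -> dict:
    """Row count, overall null percentage, and duplicate-row count for one dataset."""
    total_cells = df.size
    nulls = int(df.isna().sum().sum())
    return {
        "rows": len(df),
        "nulls": nulls,
        "null_pct": round(100.0 * nulls / total_cells, 2) if total_cells else 0.0,
        "duplicate_rows": int(df.duplicated().sum()),
    }

old = pd.DataFrame({"x": [1, 2, 2, None], "y": ["a", "b", "b", "c"]})
stats = summary_stats(old)
```

Running the same function on both datasets and subtracting gives the Δ values shown in the report.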

Structural Changes

  • Added columns
  • Removed columns
  • Renamed columns (with similarity scores)

Column-Level Analysis

For Numeric Columns:

  • Mean, median, and standard deviation
  • Min/max value changes
  • Outlier detection using IQR method
  • Percentage changes in key metrics
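The IQR method flags values more than 1.5 interquartile ranges outside the quartiles. A minimal NumPy sketch (function name is illustrative):

```python
import numpy as np

def iqr_outlier_count(values: np.ndarray) -> int:
    """Count points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR], the standard IQR rule."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return int(np.sum((values < lo) | (values > hi)))

data = np.array([1, 2, 3, 4, 5, 100])
n = iqr_outlier_count(data)
```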

For Categorical Columns:

  • Unique value counts
  • Top value distributions
  • New categories added
  • Categories removed
  • Common category overlap
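Category churn between two versions of a column reduces to set operations on the unique values; a sketch (the `category_churn` helper is hypothetical):

```python
import pandas as pd

def category_churn(a: pd.Series, b: pd.Series) -> dict:
    """Added, removed, and shared category values between two column versions."""
    set_a = set(a.dropna().unique())
    set_b = set(b.dropna().unique())
    return {
        "added": sorted(set_b - set_a),
        "removed": sorted(set_a - set_b),
        "common": sorted(set_a & set_b),
    }

before = pd.Series(["gold", "silver", "bronze"])
after = pd.Series(["gold", "silver", "platinum"])
churn = category_churn(before, after)
```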

Example Output

# Data Whisperer Report

## Summary
- Row count A: 1000, B: 1200, Δ: 200
- Total nulls A: 50 (0.50%), B: 75 (0.62%)
- Duplicate rows A: 5, B: 3, Δ: -2

## Structural Changes
- Added columns (1): customer_segment
- Removed columns (0): None
- Probable renames (1):
  - user_id -> customer_id (similarity 0.850)

## Column Level Changes
### revenue
- Type A: numeric, Type B: numeric
- Mean: A: 1.23K, B: 1.45K
- Mean % change: 17.89%
- Outliers A: 12, B: 18

Requirements

  • Python 3.7+
  • pandas
  • numpy

Use Cases

  • Data Pipeline Monitoring - Track changes in daily/weekly data refreshes
  • Model Retraining - Understand how training data evolved between versions
  • ETL Validation - Verify transformations produced expected changes
  • Schema Migration - Document structural changes during database updates
  • Data Quality Auditing - Identify unexpected changes in production data

Advanced Features

Column Rename Detection

Data Whisperer uses fuzzy string matching to detect renamed columns. Adjust sensitivity:

copydata old.csv new.csv --rename-threshold 0.9  # More strict
copydata old.csv new.csv --rename-threshold 0.7  # More lenient
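The README does not specify which similarity metric is used; one common choice for this kind of fuzzy matching is `difflib.SequenceMatcher`. The sketch below, with a hypothetical `rename_candidates` helper, shows how a threshold-based pairing of removed and added column names could work:

```python
from difflib import SequenceMatcher

def rename_candidates(cols_a, cols_b, threshold=0.82):
    """Pair columns present only in A with columns present only in B
    whose name-similarity ratio clears the threshold."""
    only_a = [c for c in cols_a if c not in cols_b]
    only_b = [c for c in cols_b if c not in cols_a]
    pairs = []
    for a in only_a:
        for b in only_b:
            score = SequenceMatcher(None, a, b).ratio()
            if score >= threshold:
                pairs.append((a, b, round(score, 3)))
    return pairs
```

A higher threshold yields fewer, more confident matches; a lower one catches aggressive renames at the cost of false positives.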

Type Inference

Automatically classifies columns as:

  • Numeric - For statistical analysis
  • Categorical - For tracking value changes (≤20 unique values)
  • Text - For high-cardinality strings
  • Datetime - For temporal data
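A pandas sketch of how these four buckets might be assigned, using the ≤20-unique-values cutoff noted above (this is an assumed implementation, not the package's actual logic):

```python
import pandas as pd

def infer_kind(s: pd.Series, cat_limit: int = 20) -> str:
    """Bucket a column as numeric, datetime, categorical, or text."""
    if pd.api.types.is_numeric_dtype(s):
        return "numeric"
    if pd.api.types.is_datetime64_any_dtype(s):
        return "datetime"
    if s.nunique(dropna=True) <= cat_limit:
        return "categorical"
    return "text"
```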

Contributing

Contributions are welcome! Please feel free to submit issues or pull requests.

License

MIT License

Author

Data Whisperer - Making dataset evolution transparent and actionable. The focus is on meaningful numeric shifts, category churn, and schema changes.
