
Data Whisperer

Intelligent dataset comparison that reveals what truly changed.

Data Whisperer (published on PyPI as copydata) is a command-line tool that compares two versions of a tabular dataset and generates an insightful report highlighting meaningful changes. Unlike simple diff tools, it understands your data's structure and semantics.

Features

  • 🔍 Smart Column Matching - Automatically detects renamed columns using similarity scoring
  • 📊 Statistical Analysis - Compares numeric distributions, outliers, and trends
  • 🏷️ Category Tracking - Identifies added/removed categories and value changes
  • 📈 Change Scoring - Prioritizes the most significant changes for quick review
  • 📝 Multiple Formats - Supports CSV, Excel (XLS/XLSX), and JSON files
  • 📄 Flexible Output - Generate markdown reports and/or machine-readable JSON

Installation

pip install copydata

Quick Start

Compare two datasets:

copydata old_data.csv new_data.csv

Save report to file:

copydata data_v1.xlsx data_v2.xlsx --output-save --output report.md

Generate both markdown and JSON output:

copydata before.csv after.csv --output-save --json

Usage

copydata [-h] [--output OUTPUT] [--output-save] [--json] [--rename-threshold THRESHOLD] a b

Arguments

  • a - Path to dataset A (older version)
  • b - Path to dataset B (newer version)

Options

  • --output, -o - Output filename for markdown report (default: copydata_report.md)
  • --output-save - Save reports to files instead of printing to stdout
  • --json, -j - Also generate JSON output with full comparison data
  • --rename-threshold - Similarity threshold for detecting renamed columns (default: 0.82)

What Does It Analyze?

Summary Statistics

  • Row count changes
  • Overall null value percentages
  • Duplicate row detection
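These summary metrics are straightforward to compute with pandas. The sketch below shows one way to derive them; the function name and rounding are illustrative, not the tool's actual API:

```python
import pandas as pd

def summarize_change(a: pd.DataFrame, b: pd.DataFrame) -> dict:
    """Row-count, null-percentage, and duplicate-row deltas as reported
    in the Summary section (illustrative sketch only)."""
    def null_pct(df: pd.DataFrame) -> float:
        # Nulls as a percentage of all cells, not rows.
        return 100.0 * df.isna().sum().sum() / df.size if df.size else 0.0
    return {
        "rows_a": len(a),
        "rows_b": len(b),
        "row_delta": len(b) - len(a),
        "null_pct_a": round(null_pct(a), 2),
        "null_pct_b": round(null_pct(b), 2),
        "dupes_a": int(a.duplicated().sum()),
        "dupes_b": int(b.duplicated().sum()),
    }
```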

Structural Changes

  • Added columns
  • Removed columns
  • Renamed columns (with similarity scores)

Column-Level Analysis

For Numeric Columns:

  • Mean, median, and standard deviation
  • Min/max value changes
  • Outlier detection using IQR method
  • Percentage changes in key metrics
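The IQR rule flags values outside [Q1 − k·IQR, Q3 + k·IQR], conventionally with k = 1.5. A minimal sketch of that method (the tool's exact multiplier and quantile handling may differ):

```python
import pandas as pd

def iqr_outliers(s: pd.Series, k: float = 1.5) -> int:
    """Count values outside the IQR fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return int(((s < lo) | (s > hi)).sum())
```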

For Categorical Columns:

  • Unique value counts
  • Top value distributions
  • New categories added
  • Categories removed
  • Common category overlap
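Category churn reduces to set differences over the observed values. An illustrative sketch (the function name is hypothetical, not part of the tool):

```python
def category_churn(a_values, b_values) -> dict:
    """Added, removed, and common categories between two columns."""
    a_set, b_set = set(a_values), set(b_values)
    return {
        "added": sorted(b_set - a_set),    # present only in B
        "removed": sorted(a_set - b_set),  # present only in A
        "common": sorted(a_set & b_set),   # overlap
    }
```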

Example Output

# Data Whisperer Report

## Summary
- Row count A: 1000, B: 1200, Δ: 200
- Total nulls A: 50 (0.50%), B: 75 (0.62%)
- Duplicate rows A: 5, B: 3, Δ: -2

## Structural Changes
- Added columns (1): customer_segment
- Removed columns (0): None
- Probable renames (1):
  - user_id -> customer_id (similarity 0.850)

## Column Level Changes
### revenue
- Type A: numeric, Type B: numeric
- Mean: A: 1.23K, B: 1.45K
- Mean % change: 17.89%
- Outliers A: 12, B: 18

Requirements

  • Python 3.7+
  • pandas
  • numpy

Use Cases

  • Data Pipeline Monitoring - Track changes in daily/weekly data refreshes
  • Model Retraining - Understand how training data evolved between versions
  • ETL Validation - Verify transformations produced expected changes
  • Schema Migration - Document structural changes during database updates
  • Data Quality Auditing - Identify unexpected changes in production data

Advanced Features

Column Rename Detection

Data Whisperer uses fuzzy string matching to detect renamed columns. Adjust sensitivity:

copydata old.csv new.csv --rename-threshold 0.9  # More strict
copydata old.csv new.csv --rename-threshold 0.7  # More lenient
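One common way to implement this kind of fuzzy matching is the standard library's difflib.SequenceMatcher; whether Data Whisperer uses this exact similarity metric is an assumption, and scores for the same pair of names may differ from the tool's. A sketch:

```python
from difflib import SequenceMatcher

def probable_renames(cols_a, cols_b, threshold: float = 0.82):
    """Pair columns missing from B with columns new in B whose name
    similarity meets the threshold. Uses difflib's ratio as the score;
    the tool's metric is an assumption here."""
    removed = [c for c in cols_a if c not in cols_b]
    added = [c for c in cols_b if c not in cols_a]
    pairs = []
    for old in removed:
        best, score = None, threshold
        for new in added:
            r = SequenceMatcher(None, old, new).ratio()
            if r >= score:
                best, score = new, r
        if best is not None:
            pairs.append((old, best, round(score, 3)))
    return pairs
```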

Type Inference

Automatically classifies columns as:

  • Numeric - For statistical analysis
  • Categorical - For tracking value changes (≤20 unique values)
  • Text - For high-cardinality strings
  • Datetime - For temporal data
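These rules can be sketched with pandas dtype checks. The ≤20-unique-values cutoff comes from the list above; the check order and everything else is an assumption about the implementation:

```python
import pandas as pd

def infer_type(s: pd.Series, max_categories: int = 20) -> str:
    """Classify a column as numeric, datetime, categorical, or text
    (illustrative sketch of the rules described above)."""
    if pd.api.types.is_numeric_dtype(s):
        return "numeric"
    if pd.api.types.is_datetime64_any_dtype(s):
        return "datetime"
    if s.nunique(dropna=True) <= max_categories:
        return "categorical"
    return "text"
```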

Contributing

Contributions are welcome! Please feel free to submit issues or pull requests.

License

MIT License

Author

Data Whisperer - Making dataset evolution transparent and actionable. Focused on meaningful numeric shifts, category churn, and schema changes.
