
Data Whisperer

Intelligent dataset comparison that reveals what truly changed.

Data Whisperer (published on PyPI as copydata) is a command-line tool that compares two versions of a tabular dataset and generates an insightful report highlighting meaningful changes. Unlike simple diff tools, it understands your data's structure and semantics.

Features

  • 🔍 Smart Column Matching - Automatically detects renamed columns using similarity scoring
  • 📊 Statistical Analysis - Compares numeric distributions, outliers, and trends
  • 🏷️ Category Tracking - Identifies added/removed categories and value changes
  • 📈 Change Scoring - Prioritizes the most significant changes for quick review
  • 📝 Multiple Formats - Supports CSV, Excel (XLS/XLSX), and JSON files
  • 📄 Flexible Output - Generate markdown reports and/or machine-readable JSON

Installation

pip install copydata

Quick Start

Compare two datasets:

copydata old_data.csv new_data.csv

Save report to file:

copydata data_v1.xlsx data_v2.xlsx --output-save --output report.md

Generate both markdown and JSON output:

copydata before.csv after.csv --output-save --json

Usage

copydata [-h] [--output OUTPUT] [--output-save] [--json] [--rename-threshold THRESHOLD] a b

Arguments

  • a - Path to dataset A (older version)
  • b - Path to dataset B (newer version)

Options

  • --output, -o - Output filename for markdown report (default: copydata_report.md)
  • --output-save - Save reports to files instead of printing to stdout
  • --json, -j - Also generate JSON output with full comparison data
  • --rename-threshold - Similarity threshold for detecting renamed columns (default: 0.82)

What Does It Analyze?

Summary Statistics

  • Row count changes
  • Overall null value percentages
  • Duplicate row detection
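These summary metrics are straightforward to compute with pandas. The sketch below shows one way to derive them; the function name and rounding are illustrative, not the tool's actual API:

```python
import pandas as pd

def summarize_change(a: pd.DataFrame, b: pd.DataFrame) -> dict:
    """Row-count, null-percentage, and duplicate-row deltas as reported
    in the Summary section (illustrative sketch only)."""
    def null_pct(df: pd.DataFrame) -> float:
        # Nulls as a percentage of all cells, not rows.
        return 100.0 * df.isna().sum().sum() / df.size if df.size else 0.0
    return {
        "rows_a": len(a),
        "rows_b": len(b),
        "row_delta": len(b) - len(a),
        "null_pct_a": round(null_pct(a), 2),
        "null_pct_b": round(null_pct(b), 2),
        "dupes_a": int(a.duplicated().sum()),
        "dupes_b": int(b.duplicated().sum()),
    }
```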

Structural Changes

  • Added columns
  • Removed columns
  • Renamed columns (with similarity scores)

Column-Level Analysis

For Numeric Columns:

  • Mean, median, and standard deviation
  • Min/max value changes
  • Outlier detection using IQR method
  • Percentage changes in key metrics
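The IQR rule flags values outside [Q1 − k·IQR, Q3 + k·IQR], conventionally with k = 1.5. A minimal sketch of that method (the tool's exact multiplier and quantile handling may differ):

```python
import pandas as pd

def iqr_outliers(s: pd.Series, k: float = 1.5) -> int:
    """Count values outside the IQR fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return int(((s < lo) | (s > hi)).sum())
```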

For Categorical Columns:

  • Unique value counts
  • Top value distributions
  • New categories added
  • Categories removed
  • Common category overlap
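Category churn reduces to set differences over the observed values. An illustrative sketch (the function name is hypothetical, not part of the tool):

```python
def category_churn(a_values, b_values) -> dict:
    """Added, removed, and common categories between two columns."""
    a_set, b_set = set(a_values), set(b_values)
    return {
        "added": sorted(b_set - a_set),    # present only in B
        "removed": sorted(a_set - b_set),  # present only in A
        "common": sorted(a_set & b_set),   # overlap
    }
```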

Example Output

# Data Whisperer Report

## Summary
- Row count A: 1000, B: 1200, Δ: 200
- Total nulls A: 50 (0.50%), B: 75 (0.62%)
- Duplicate rows A: 5, B: 3, Δ: -2

## Structural Changes
- Added columns (1): customer_segment
- Removed columns (0): None
- Probable renames (1):
  - user_id -> customer_id (similarity 0.850)

## Column Level Changes
### revenue
- Type A: numeric, Type B: numeric
- Mean: A: 1.23K, B: 1.45K
- Mean % change: 17.89%
- Outliers A: 12, B: 18

Requirements

  • Python 3.7+
  • pandas
  • numpy

Use Cases

  • Data Pipeline Monitoring - Track changes in daily/weekly data refreshes
  • Model Retraining - Understand how training data evolved between versions
  • ETL Validation - Verify transformations produced expected changes
  • Schema Migration - Document structural changes during database updates
  • Data Quality Auditing - Identify unexpected changes in production data

Advanced Features

Column Rename Detection

Data Whisperer uses fuzzy string matching to detect renamed columns. Adjust sensitivity:

copydata old.csv new.csv --rename-threshold 0.9  # More strict
copydata old.csv new.csv --rename-threshold 0.7  # More lenient
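One common way to implement this kind of fuzzy matching is the standard library's difflib.SequenceMatcher; whether Data Whisperer uses this exact similarity metric is an assumption, and scores for the same pair of names may differ from the tool's. A sketch:

```python
from difflib import SequenceMatcher

def probable_renames(cols_a, cols_b, threshold: float = 0.82):
    """Pair columns missing from B with columns new in B whose name
    similarity meets the threshold. Uses difflib's ratio as the score;
    the tool's metric is an assumption here."""
    removed = [c for c in cols_a if c not in cols_b]
    added = [c for c in cols_b if c not in cols_a]
    pairs = []
    for old in removed:
        best, score = None, threshold
        for new in added:
            r = SequenceMatcher(None, old, new).ratio()
            if r >= score:
                best, score = new, r
        if best is not None:
            pairs.append((old, best, round(score, 3)))
    return pairs
```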

Type Inference

Automatically classifies columns as:

  • Numeric - For statistical analysis
  • Categorical - For tracking value changes (≤20 unique values)
  • Text - For high-cardinality strings
  • Datetime - For temporal data
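These rules can be sketched with pandas dtype checks. The ≤20-unique-values cutoff comes from the list above; the check order and everything else is an assumption about the implementation:

```python
import pandas as pd

def infer_type(s: pd.Series, max_categories: int = 20) -> str:
    """Classify a column as numeric, datetime, categorical, or text
    (illustrative sketch of the rules described above)."""
    if pd.api.types.is_numeric_dtype(s):
        return "numeric"
    if pd.api.types.is_datetime64_any_dtype(s):
        return "datetime"
    if s.nunique(dropna=True) <= max_categories:
        return "categorical"
    return "text"
```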

Contributing

Contributions are welcome! Please feel free to submit issues or pull requests.

License

MIT License

Author

Data Whisperer - Making dataset evolution transparent and actionable. Focused on meaningful numeric shifts, category churn, and schema changes.
