CLI to compare two tabular datasets and produce a concise markdown report
Data Whisperer
Intelligent dataset comparison that reveals what truly changed.
Data Whisperer (Copydata) is a command-line tool that compares two versions of a tabular dataset and generates a report highlighting the meaningful changes. Unlike line-based diff tools, it compares the data at the level of columns, types, and value distributions.
Features
- 🔍 Smart Column Matching - Automatically detects renamed columns using similarity scoring
- 📊 Statistical Analysis - Compares numeric distributions, outliers, and trends
- 🏷️ Category Tracking - Identifies added/removed categories and value changes
- 📈 Change Scoring - Prioritizes the most significant changes for quick review
- 📝 Multiple Formats - Supports CSV, Excel (XLS/XLSX), and JSON files
- 📄 Flexible Output - Generate markdown reports and/or machine-readable JSON
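The supported input formats map naturally onto pandas readers. The following is a minimal sketch of extension-based loading, assuming pandas does the parsing (it is listed under Requirements); `load_table` is a hypothetical helper name, not a documented part of the tool.

```python
from pathlib import Path

import pandas as pd


def load_table(path):
    """Load a CSV, Excel, or JSON file into a DataFrame based on its extension.

    Hypothetical helper for illustration; the tool's real loader may differ.
    """
    suffix = Path(path).suffix.lower()
    if suffix == ".csv":
        return pd.read_csv(path)
    if suffix in (".xls", ".xlsx"):
        # Reading Excel files needs an engine such as openpyxl to be installed.
        return pd.read_excel(path)
    if suffix == ".json":
        return pd.read_json(path)
    raise ValueError(f"Unsupported file type: {suffix or path}")
```

Dispatching on the file extension keeps the comparison logic independent of the input format: everything downstream only sees a DataFrame.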
Installation
pip install copydata
Quick Start
Compare two datasets:
copydata old_data.csv new_data.csv
Save report to file:
copydata data_v1.xlsx data_v2.xlsx --output-save --output report.md
Generate both markdown and JSON output:
copydata before.csv after.csv --output-save --json
Usage
copydata [-h] [--output OUTPUT] [--output-save] [--json] [--rename-threshold THRESHOLD] a b
Arguments
- a - Path to dataset A (older version)
- b - Path to dataset B (newer version)
Options
- --output, -o - Output filename for markdown report (default: copydata_report.md)
- --output-save - Save reports to files instead of printing to stdout
- --json, -j - Also generate JSON output with full comparison data
- --rename-threshold - Similarity threshold for detecting renamed columns (default: 0.82)
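The usage line follows argparse conventions, so the documented interface can be mirrored in a few lines. This is only a sketch of a parser that accepts the arguments and options listed above; the tool's actual parser is not shown in this README, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented copydata options (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="copydata",
        description="Compare two tabular datasets and produce a markdown report.",
    )
    parser.add_argument("a", help="Path to dataset A (older version)")
    parser.add_argument("b", help="Path to dataset B (newer version)")
    parser.add_argument("--output", "-o", default="copydata_report.md",
                        help="Output filename for the markdown report")
    parser.add_argument("--output-save", action="store_true",
                        help="Save reports to files instead of printing to stdout")
    parser.add_argument("--json", "-j", action="store_true",
                        help="Also generate JSON output")
    parser.add_argument("--rename-threshold", type=float, default=0.82,
                        metavar="THRESHOLD",
                        help="Similarity threshold for renamed columns")
    return parser


# Parse a sample command line instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["old.csv", "new.csv", "--rename-threshold", "0.9"])
print(args.a, args.b, args.rename_threshold, args.output_save)
```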
What Does It Analyze?
Summary Statistics
- Row count changes
- Overall null value percentages
- Duplicate row detection
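These summary figures can be approximated with a few pandas calls. The sketch below assumes the null percentage is taken over all cells (rows × columns); the README does not state the denominator the tool uses, and `summarize` is a hypothetical function name.

```python
import pandas as pd


def summarize(df):
    """Return row count, total nulls, null percentage over all cells, and duplicate rows."""
    total_cells = df.shape[0] * df.shape[1]
    null_count = int(df.isna().sum().sum())
    return {
        "rows": len(df),
        "nulls": null_count,
        # Guard against empty frames to avoid division by zero.
        "null_pct": 100.0 * null_count / total_cells if total_cells else 0.0,
        # duplicated() flags each repeat of an earlier row, not the first occurrence.
        "duplicate_rows": int(df.duplicated().sum()),
    }


old = pd.DataFrame({"id": [1, 2, 2, 3], "city": ["Oslo", "Rome", "Rome", None]})
new = pd.DataFrame({"id": [1, 2, 3, 4, 5], "city": ["Oslo", "Rome", None, "Lima", "Kyiv"]})
a, b = summarize(old), summarize(new)
print(f"Row count A: {a['rows']}, B: {b['rows']}, Δ: {b['rows'] - a['rows']}")
print(f"Duplicate rows A: {a['duplicate_rows']}, B: {b['duplicate_rows']}")
```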
Structural Changes
- Added columns
- Removed columns
- Renamed columns (with similarity scores)
Column-Level Analysis
For Numeric Columns:
- Mean, median, and standard deviation
- Min/max value changes
- Outlier detection using IQR method
- Percentage changes in key metrics
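The IQR method and the percentage-change metric are both short to express. The sketch below uses the conventional 1.5 × IQR fences; the multiplier the tool actually applies is not documented here, and both function names are hypothetical.

```python
import pandas as pd


def count_iqr_outliers(values, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR]; k=1.5 is the conventional fence."""
    values = pd.Series(values).dropna()
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return int(((values < lower) | (values > upper)).sum())


def pct_change(old, new):
    """Percentage change from old to new; None when the old value is zero."""
    if old == 0:
        return None
    return 100.0 * (new - old) / abs(old)


revenue_a = pd.Series([10, 12, 11, 13, 12, 11, 95])
revenue_b = pd.Series([12, 14, 13, 15, 14, 13, 14])
print("Outliers A:", count_iqr_outliers(revenue_a), "B:", count_iqr_outliers(revenue_b))
print(f"Mean % change: {pct_change(revenue_a.mean(), revenue_b.mean()):.2f}%")
```

Returning None for a zero baseline avoids reporting an infinite or misleading percentage when a metric starts at zero.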
For Categorical Columns:
- Unique value counts
- Top value distributions
- New categories added
- Categories removed
- Common category overlap
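Category churn reduces to set operations on the distinct values of each column. A minimal sketch, assuming null values are ignored when comparing categories; `compare_categories` and its `top_n` parameter are hypothetical names.

```python
import pandas as pd


def compare_categories(old, new, top_n=3):
    """Compare the distinct non-null values of two categorical columns."""
    old_values, new_values = set(old.dropna()), set(new.dropna())
    return {
        "unique_a": len(old_values),
        "unique_b": len(new_values),
        "added": sorted(new_values - old_values),
        "removed": sorted(old_values - new_values),
        "common": sorted(old_values & new_values),
        # value_counts() sorts by frequency, so head() gives the top values.
        "top_b": new.value_counts().head(top_n).to_dict(),
    }


plan_a = pd.Series(["free", "pro", "pro", "trial"])
plan_b = pd.Series(["free", "pro", "pro", "enterprise", None])
print(compare_categories(plan_a, plan_b))
```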
Example Output
# Data Whisperer Report
## Summary
- Row count A: 1000, B: 1200, Δ: 200
- Total nulls A: 50 (0.50%), B: 75 (0.62%)
- Duplicate rows A: 5, B: 3, Δ: -2
## Structural Changes
- Added columns (1): customer_segment
- Removed columns (0): None
- Probable renames (1):
- user_id -> customer_id (similarity 0.850)
## Column Level Changes
### revenue
- Type A: numeric, Type B: numeric
- Mean: A: 1.23K, B: 1.45K
- Mean % change: 17.89%
- Outliers A: 12, B: 18
Requirements
- Python 3.7+
- pandas
- numpy
Use Cases
- Data Pipeline Monitoring - Track changes in daily/weekly data refreshes
- Model Retraining - Understand how training data evolved between versions
- ETL Validation - Verify transformations produced expected changes
- Schema Migration - Document structural changes during database updates
- Data Quality Auditing - Identify unexpected changes in production data
Advanced Features
Column Rename Detection
Data Whisperer uses fuzzy string matching to detect renamed columns. Raise --rename-threshold to require closer name matches, or lower it to accept looser ones:
copydata old.csv new.csv --rename-threshold 0.9 # More strict
copydata old.csv new.csv --rename-threshold 0.7 # More lenient
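To illustrate how a threshold like this behaves, here is a sketch of rename detection built on the standard library's `difflib.SequenceMatcher`. This is a swapped-in similarity measure chosen for illustration: the README does not say which metric the tool uses, and this one may score pairs differently from the similarity values the tool reports. `find_renames` is a hypothetical name.

```python
from difflib import SequenceMatcher


def find_renames(removed, added, threshold=0.82):
    """Pair each removed column with its most similar added column, if similar enough."""
    renames = []
    available = list(added)
    for old_name in removed:
        if not available:
            break
        scored = [
            (SequenceMatcher(None, old_name.lower(), new_name.lower()).ratio(), new_name)
            for new_name in available
        ]
        score, best = max(scored)
        if score >= threshold:
            renames.append((old_name, best, round(score, 3)))
            available.remove(best)  # each added column can be matched only once
    return renames


removed = ["user_name", "zip"]
added = ["username", "customer_segment"]
for old_name, new_name, score in find_renames(removed, added):
    print(f"{old_name} -> {new_name} (similarity {score:.3f})")
```

A higher threshold leaves more columns reported as separate additions and removals; a lower one pairs more of them as renames, at the risk of false matches.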
Type Inference
Each column is automatically classified as one of:
- Numeric - For statistical analysis
- Categorical - For tracking value changes (≤20 unique values)
- Text - For high-cardinality strings
- Datetime - For temporal data
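A classification along these lines can be sketched with pandas dtype checks plus the 20-unique-value cutoff mentioned above. The order of the checks and the name `infer_kind` are assumptions; the tool's real rules may be more elaborate (for example, parsing date-like strings).

```python
import pandas as pd
from pandas.api import types as ptypes


def infer_kind(series, max_categories=20):
    """Classify a column as numeric, datetime, categorical, or text."""
    if ptypes.is_datetime64_any_dtype(series):
        return "datetime"
    if ptypes.is_numeric_dtype(series):
        return "numeric"
    # Low-cardinality columns are treated as categorical.
    if series.nunique(dropna=True) <= max_categories:
        return "categorical"
    return "text"


df = pd.DataFrame({
    "revenue": [10.5, 20.0, 30.25],
    "plan": ["free", "pro", "free"],
    "signup": pd.to_datetime(["2024-01-01", "2024-02-01", "2024-03-01"]),
})
print({name: infer_kind(df[name]) for name in df.columns})
```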
Contributing
Contributions are welcome! Please feel free to submit issues or pull requests.
License
MIT License
Author
Data Whisperer - Making dataset evolution transparent and actionable.
Focus on meaningful numeric shifts, category churn, and schema changes.
File details
Details for the file copydata-0.1.3.tar.gz.
File metadata
- Download URL: copydata-0.1.3.tar.gz
- Upload date:
- Size: 9.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ba7a7dd23063586b0d689be04bf147185f9a8a4ad97806ca1ed84e6d9e1bb4da |
| MD5 | db0794b05b8e5efb132c6a4c8bf9ef54 |
| BLAKE2b-256 | 4d9023abe3f9f722e10152c856f3d272c2bcf5dffed43274245fe8902cb45fde |
File details
Details for the file copydata-0.1.3-py3-none-any.whl.
File metadata
- Download URL: copydata-0.1.3-py3-none-any.whl
- Upload date:
- Size: 8.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b1701e96f6ff2bb8dfe24b4b438bba2d683cc5134e2990089bc6fcd4dce47b94 |
| MD5 | 6a498c1157cd6ac6d5e3052f493e3266 |
| BLAKE2b-256 | f1c8f22aee834d58b38602fd3df138e2569b606e8313baa06f1bdc44e419ca42 |