dataset-doctor
Automatic Dataset Diagnosis and Cleaning for Machine Learning
dataset-doctor helps you quickly identify and fix common dataset quality problems before model training.
It is built for:
- Data scientists preparing tabular data for experiments
- ML engineers standardizing preprocessing workflows
- Beginners who want safer defaults for dataset cleaning
Instead of hand-writing the same preprocessing code for every project, you can diagnose data issues and run an automatic cleaning pipeline from either Python or the command line.
Features
- Dataset diagnosis with a readable summary report
- Missing value detection and imputation
- Duplicate row detection and removal
- Outlier detection and handling
- Constant column detection and removal
- Optional normalization for numeric columns
- YAML-based configuration system for preprocessing behavior
- CLI commands for diagnosis, cleaning, display, and config generation
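To make a couple of the checks above concrete, duplicate-row removal and constant-column detection can be sketched in a few lines of plain Python. This is an illustrative sketch only, not dataset-doctor's implementation (which likely operates on pandas DataFrames):

```python
rows = [
    {"age": 30, "flag": 1},
    {"age": 30, "flag": 1},  # exact duplicate of the first row
    {"age": 45, "flag": 1},
]

# Duplicate row removal: keep the first occurrence of each row.
seen, deduped = set(), []
for row in rows:
    key = tuple(sorted(row.items()))
    if key not in seen:
        seen.add(key)
        deduped.append(row)

# Constant column detection: a column with a single distinct value
# carries no information and can be dropped.
constant_cols = [
    col for col in rows[0]
    if len({row[col] for row in rows}) == 1
]

print(len(deduped), constant_cols)  # 2 ['flag']
```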
Installation
Install from PyPI:

```
pip install dataset-doctor
```

Install from source:

```
git clone https://github.com/Mirdula18/dataset-doctor.git
cd dataset-doctor
pip install .
```
Quick Example
```python
import dataset_doctor as dd

# Diagnose the dataset and print a summary of detected issues
report = dd.diagnose("data.csv")
print(report.summary())

# Run the automatic cleaning pipeline and get the cleaned data back
clean_df = dd.auto_fix("data.csv")
```
CLI Usage
Diagnose a dataset:

```
dataset-doctor diagnose data.csv
```

Print report output (alias of diagnose):

```
dataset-doctor report data.csv
```

Clean a dataset:

```
dataset-doctor clean data.csv
```

Clean and write the output to a file:

```
dataset-doctor clean data.csv --output cleaned.csv
```

Enable normalization from the CLI:

```
dataset-doctor clean data.csv --normalize
```

Clean using a YAML config file:

```
dataset-doctor clean data.csv --config dataset_doctor_config.yaml
```

Generate a default config file:

```
dataset-doctor init-config
```

Display rows:

```
dataset-doctor display data.csv --rows 10
```

Show rows (alias of display):

```
dataset-doctor show data.csv --tail --rows 20 --columns age,salary
```
Python API
Main API entry points:
- dd.diagnose(dataset)
- dd.auto_fix(dataset, ...)
- dd.display_data(dataset, ...)
Usage Examples
```python
import dataset_doctor as dd

# Diagnosis
dd.diagnose("data.csv")

# Automatic cleaning
dd.auto_fix("data.csv")
dd.auto_fix("data.csv", output_path="cleaned.csv")
dd.auto_fix("data.csv", output="cleaned.csv")
dd.auto_fix("data.csv", do_normalize=True)
dd.auto_fix("data.csv", return_scaler=True)
dd.auto_fix("data.csv", config="dataset_doctor_config.yaml")
dd.auto_fix("data.csv", config={"missing_values": {"numeric_strategy": "mean"}})

# Displaying data
dd.display_data("data.csv")
dd.display_data("data.csv", rows=10)
dd.display_data("data.csv", tail=True)
dd.display_data("data.csv", columns=["col1", "col2"])
dd.display_data("data.csv", all_rows=True)

# Working with reports
report = dd.diagnose("data.csv")
report.summary()
report.to_dict()
report.print_report()
```
Example:
```python
import dataset_doctor as dd

# Diagnose
report = dd.diagnose("data.csv")
print(report.summary())

# Auto-clean with options
clean_df = dd.auto_fix(
    "data.csv",
    output_path="cleaned.csv",
    do_normalize=True,
)
```
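For intuition, `do_normalize=True` with the min-max settings shown in the example config in the next section maps each numeric column onto a fixed range. Here is a minimal sketch of min-max scaling itself; the function name and the constant-column handling are illustrative assumptions, not dataset-doctor's code:

```python
def minmax_scale(values, lo=0.0, hi=1.0):
    """Map values linearly onto [lo, hi]."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        # A constant column has nothing to rescale;
        # map everything to the lower bound.
        return [lo for _ in values]
    scale = (hi - lo) / (vmax - vmin)
    return [lo + (v - vmin) * scale for v in values]

print(minmax_scale([10, 20, 40]))  # smallest value -> 0.0, largest -> 1.0
```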
Configuration System
Use a YAML file to customize preprocessing behavior.
CLI example:

```
dataset-doctor clean data.csv --config config.yaml
```
Python example:

```python
import dataset_doctor as dd

clean_df = dd.auto_fix("data.csv", config="config.yaml")
```
Example config:
```yaml
missing_values:
  numeric_strategy: median
  categorical_strategy: mode
  max_missing_threshold: 0.4

duplicates:
  remove: true

outliers:
  method: iqr
  action: clip

normalization:
  method: minmax
  range: [0, 1]

feature_selection:
  remove_constant_columns: true
  correlation_threshold: 0.9

logging:
  verbosity: medium
```
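As a rough illustration of what `outliers: {method: iqr, action: clip}` means: values outside the Tukey fences (Q1 - 1.5 * IQR, Q3 + 1.5 * IQR) are pulled back to the nearest fence rather than dropped. This sketch uses the standard library's quantile routine; dataset-doctor may compute quartiles differently:

```python
import statistics

def iqr_clip(values, k=1.5):
    # Quartiles via the stdlib (exclusive method by default);
    # the library's own quartile method may differ.
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    # Clip each value into [lo, hi] instead of removing the row.
    return [min(max(v, lo), hi) for v in values]

print(iqr_clip([12, 14, 13, 15, 14, 120]))  # 120 is clipped down to the upper fence
```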
Output Example
```
## DATASET DIAGNOSIS REPORT

Rows: 10000
Columns: 12

### Issues Detected

Missing Values:
  - age (12.0%)
  - salary (4.0%)

Duplicate Rows:
  - 18 rows

Outliers:
  - transaction_amount (42 values)

Constant Columns:
  - user_flag

Highly Correlated Columns:
  - income vs salary (0.97)
```
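The "Missing Values" percentages in a report like this are simply the share of empty cells per column. Computed by hand on a tiny inline CSV (stdlib only; the column names and data here are hypothetical, and this is not the report's actual code):

```python
import csv
import io

# Four rows: one blank "age" cell and one blank "salary" cell.
raw = "age,salary\n34,50000\n,48000\n29,\n41,52000\n"
rows = list(csv.DictReader(io.StringIO(raw)))

# Percentage of rows in which each column is empty.
missing_pct = {
    col: 100.0 * sum(1 for r in rows if r[col] == "") / len(rows)
    for col in rows[0]
}
print(missing_pct)  # {'age': 25.0, 'salary': 25.0}
```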
License
MIT License. See LICENSE for details.