
dataset-doctor

Automatic Dataset Diagnosis and Cleaning for Machine Learning

dataset-doctor helps you quickly identify and fix common dataset quality problems before model training.

It is built for:

  • Data scientists preparing tabular data for experiments
  • ML engineers standardizing preprocessing workflows
  • Beginners who want safer defaults for dataset cleaning

Instead of writing the same preprocessing code by hand for every project, you can diagnose data issues and run an automatic cleaning pipeline from either Python or the command line.

Features

  • Dataset diagnosis with a readable summary report
  • Missing value detection and imputation
  • Duplicate row detection and removal
  • Outlier detection and handling
  • Constant column detection and removal
  • Optional normalization for numeric columns
  • YAML-based configuration system for preprocessing behavior
  • CLI commands for diagnosis, cleaning, display, and config generation

Installation

Install from PyPI:

pip install dataset-doctor

Install from source:

git clone https://github.com/Mirdula18/dataset-doctor.git
cd dataset-doctor
pip install .

Quick Example

import dataset_doctor as dd

report = dd.diagnose("data.csv")
print(report.summary())

clean_df = dd.auto_fix("data.csv")

CLI Usage

Diagnose a dataset:

dataset-doctor diagnose data.csv

Print report output (alias of diagnose):

dataset-doctor report data.csv

Clean a dataset:

dataset-doctor clean data.csv

Clean and write output file:

dataset-doctor clean data.csv --output cleaned.csv

Enable normalization from CLI:

dataset-doctor clean data.csv --normalize

Clean using a YAML config file:

dataset-doctor clean data.csv --config dataset_doctor_config.yaml

Generate a default config file:

dataset-doctor init-config

Display rows:

dataset-doctor display data.csv --rows 10

Show rows (alias of display):

dataset-doctor show data.csv --tail --rows 20 --columns age,salary

Python API

Main API entry points:

  • dd.diagnose(dataset)
  • dd.auto_fix(dataset, ...)
  • dd.display_data(dataset, ...)

Example Usage

import dataset_doctor as dd

# Diagnose a dataset
dd.diagnose("data.csv")

# Clean with default settings
dd.auto_fix("data.csv")

# Clean and write the result to a file
dd.auto_fix("data.csv", output_path="cleaned.csv")
dd.auto_fix("data.csv", output="cleaned.csv")

# Enable normalization, optionally returning the scaler as well
dd.auto_fix("data.csv", do_normalize=True)
dd.auto_fix("data.csv", return_scaler=True)

# Clean using a YAML config file or an inline config dict
dd.auto_fix("data.csv", config="dataset_doctor_config.yaml")
dd.auto_fix("data.csv", config={"missing_values": {"numeric_strategy": "mean"}})

# Display rows
dd.display_data("data.csv")
dd.display_data("data.csv", rows=10)
dd.display_data("data.csv", tail=True)
dd.display_data("data.csv", columns=["col1", "col2"])
dd.display_data("data.csv", all_rows=True)

report = dd.diagnose("data.csv")
report.summary()       # short text summary
report.to_dict()       # machine-readable report
report.print_report()  # full formatted report

Example:

import dataset_doctor as dd

# Diagnose
report = dd.diagnose("data.csv")
print(report.summary())

# Auto-clean with options
clean_df = dd.auto_fix(
    "data.csv",
    output_path="cleaned.csv",
    do_normalize=True,
)

Configuration System

Use a YAML file to customize preprocessing behavior.

CLI example:

dataset-doctor clean data.csv --config config.yaml

Python example:

import dataset_doctor as dd

clean_df = dd.auto_fix("data.csv", config="config.yaml")

Example config:

missing_values:
  numeric_strategy: median
  categorical_strategy: mode
  max_missing_threshold: 0.4

duplicates:
  remove: true

outliers:
  method: iqr
  action: clip

normalization:
  method: minmax
  range: [0, 1]

feature_selection:
  remove_constant_columns: true
  correlation_threshold: 0.9

logging:
  verbosity: medium
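To make the outlier and normalization options concrete, here is what `method: iqr`, `action: clip`, and `method: minmax` conventionally mean, sketched in plain Python with made-up values (the standard definitions of these techniques, not dataset-doctor's exact implementation):

```python
from statistics import quantiles

values = [1.0, 2.0, 2.5, 3.0, 100.0]  # 100.0 is an obvious outlier

# IQR clipping: clamp values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, _, q3 = quantiles(values, n=4, method="inclusive")
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
clipped = [min(max(v, lower), upper) for v in values]

# Min-max normalization to the configured range [0, 1]
lo, hi = min(clipped), max(clipped)
normalized = [(v - lo) / (hi - lo) for v in clipped]
print(normalized)
```

With `action: clip`, the outlier is pulled back to the upper fence rather than dropped, so no rows are lost before normalization.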

Output Example

## DATASET DIAGNOSIS REPORT

Rows: 10000
Columns: 12

### Issues Detected

Missing Values:
- age (12.0%)
- salary (4.0%)

Duplicate Rows:
- 18 rows

Outliers:
- transaction_amount (42 values)

Constant Columns:
- user_flag

Highly Correlated Columns:
- income vs salary (0.97)

License

MIT License. See LICENSE for details.
