Automatically diagnose and clean messy datasets for machine learning and data science.

dataset-doctor

Automatic Dataset Diagnosis and Cleaning for Machine Learning

dataset-doctor helps you quickly identify and fix common dataset quality problems before model training.

It is built for:

  • Data scientists preparing tabular data for experiments
  • ML engineers standardizing preprocessing workflows
  • Beginners who want safer defaults for dataset cleaning

Instead of writing the same preprocessing code by hand for every project, you can diagnose data issues and run an automatic cleaning pipeline from either Python or the command line.

Features

  • Dataset diagnosis with a readable summary report
  • Missing value detection and imputation
  • Duplicate row detection and removal
  • Outlier detection and handling
  • Constant column detection and removal
  • Optional normalization for numeric columns
  • YAML-based configuration system for preprocessing behavior
  • CLI commands for diagnosis, cleaning, display, and config generation

Installation

Install from PyPI:

pip install dataset-doctor

Install from source:

git clone https://github.com/Mirdula18/dataset-doctor.git
cd dataset-doctor
pip install .

Quick Example

import dataset_doctor as dd

report = dd.diagnose("data.csv")
print(report.summary())

clean_df = dd.auto_fix("data.csv")

CLI Usage

Diagnose a dataset:

dataset-doctor diagnose data.csv

Print report output (alias of diagnose):

dataset-doctor report data.csv

Clean a dataset:

dataset-doctor clean data.csv

Clean and write output file:

dataset-doctor clean data.csv --output cleaned.csv

Enable normalization from CLI:

dataset-doctor clean data.csv --normalize

Clean using a YAML config file:

dataset-doctor clean data.csv --config dataset_doctor_config.yaml

Generate a default config file:

dataset-doctor init-config

Display rows:

dataset-doctor display data.csv --rows 10

Show rows (alias of display):

dataset-doctor show data.csv --tail --rows 20 --columns age,salary

Python API

Main API entry points:

  • dd.diagnose(dataset)
  • dd.auto_fix(dataset, ...)
  • dd.display_data(dataset, ...)

Example Usage

import dataset_doctor as dd

dd.diagnose("data.csv")
dd.auto_fix("data.csv")
dd.auto_fix("data.csv", output_path="cleaned.csv")
dd.auto_fix("data.csv", output="cleaned.csv")
dd.auto_fix("data.csv", do_normalize=True)
dd.auto_fix("data.csv", return_scaler=True)
dd.auto_fix("data.csv", config="dataset_doctor_config.yaml")
dd.auto_fix("data.csv", config={"missing_values": {"numeric_strategy": "mean"}})
dd.display_data("data.csv")
dd.display_data("data.csv", rows=10)
dd.display_data("data.csv", tail=True)
dd.display_data("data.csv", columns=["col1", "col2"])
dd.display_data("data.csv", all_rows=True)

report = dd.diagnose("data.csv")
report.summary()
report.to_dict()
report.print_report()

Example:

import dataset_doctor as dd

# Diagnose
report = dd.diagnose("data.csv")
print(report.summary())

# Auto-clean with options
clean_df = dd.auto_fix(
    "data.csv",
    output_path="cleaned.csv",
    do_normalize=True,
)
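The missing-value strategies that auto_fix applies (median or mean for numeric columns, mode for categorical ones) follow standard imputation rules. The sketch below illustrates the idea with only the standard library; impute_column is a hypothetical helper, not dataset-doctor's actual implementation.

```python
# Hypothetical sketch of median/mode imputation, illustrative only.
import statistics
from collections import Counter

def impute_column(values, numeric_strategy="median"):
    """Fill None entries: median/mean for numeric columns, mode otherwise."""
    present = [v for v in values if v is not None]
    if all(isinstance(v, (int, float)) for v in present):
        fill = (statistics.median(present) if numeric_strategy == "median"
                else statistics.mean(present))
    else:
        fill = Counter(present).most_common(1)[0][0]  # most frequent value

    return [fill if v is None else v for v in values]

print(impute_column([25, None, 31, 40, None]))    # numeric: filled with median 31
print(impute_column(["NY", "LA", None, "NY"]))    # categorical: filled with mode "NY"
```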

Configuration System

Use a YAML file to customize preprocessing behavior.

CLI example:

dataset-doctor clean data.csv --config config.yaml

Python example:

import dataset_doctor as dd

clean_df = dd.auto_fix("data.csv", config="config.yaml")

Example config:

missing_values:
  numeric_strategy: median
  categorical_strategy: mode
  max_missing_threshold: 0.4

duplicates:
  remove: true

outliers:
  method: iqr
  action: clip

normalization:
  method: minmax
  range: [0, 1]

feature_selection:
  remove_constant_columns: true
  correlation_threshold: 0.9

logging:
  verbosity: medium
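The outliers (method: iqr, action: clip) and normalization (method: minmax) settings describe well-known techniques. Here is a minimal stdlib sketch of both, assuming the conventional 1.5 x IQR fence and a linear min-max rescale; iqr_clip and minmax_scale are hypothetical helpers, not the library's code.

```python
# Illustrative sketch of IQR clipping and min-max scaling.
import statistics

def iqr_clip(values, k=1.5):
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, lo), hi) for v in values]

def minmax_scale(values, lo=0.0, hi=1.0):
    """Rescale values linearly into [lo, hi]."""
    vmin, vmax = min(values), max(values)
    span = (vmax - vmin) or 1.0  # avoid division by zero on constant columns
    return [lo + (v - vmin) * (hi - lo) / span for v in values]

amounts = [10, 11, 12, 12, 13, 14, 15, 100]  # 100 is an outlier
clipped = iqr_clip(amounts)     # 100 is pulled down to the upper fence
scaled = minmax_scale(clipped)  # values now lie in [0, 1]
```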

Output Example

## DATASET DIAGNOSIS REPORT

Rows: 10000
Columns: 12

### Issues Detected

Missing Values:
- age (12.0%)
- salary (4.0%)

Duplicate Rows:
- 18 rows

Outliers:
- transaction_amount (42 values)

Constant Columns:
- user_flag

Highly Correlated Columns:
- income vs salary (0.97)

License

MIT License. See LICENSE for details.
