
dataset-doctor

Automatic Dataset Diagnosis and Cleaning for Machine Learning

dataset-doctor helps you quickly identify and fix common dataset quality problems before model training.

It is built for:

  • Data scientists preparing tabular data for experiments
  • ML engineers standardizing preprocessing workflows
  • Beginners who want safer defaults for dataset cleaning

Instead of writing the same preprocessing code by hand for every project, you can diagnose data issues and run an automatic cleaning pipeline from either Python or the command line.

Features

  • Dataset diagnosis with a readable summary report
  • Missing value detection and imputation
  • Duplicate row detection and removal
  • Outlier detection and handling
  • Constant column detection and removal
  • Optional normalization for numeric columns
  • YAML-based configuration system for preprocessing behavior
  • CLI commands for diagnosis, cleaning, display, and config generation

Installation

Install from PyPI:

pip install dataset-doctor

Install from source:

git clone https://github.com/Mirdula18/dataset-doctor.git
cd dataset-doctor
pip install .

Quick Example

import dataset_doctor as dd

report = dd.diagnose("data.csv")
print(report.summary())

clean_df = dd.auto_fix("data.csv")

CLI Usage

Diagnose a dataset:

dataset-doctor diagnose data.csv

Print report output (alias of diagnose):

dataset-doctor report data.csv

Clean a dataset:

dataset-doctor clean data.csv

Clean and write output file:

dataset-doctor clean data.csv --output cleaned.csv

Enable normalization from CLI:

dataset-doctor clean data.csv --normalize

Clean using a YAML config file:

dataset-doctor clean data.csv --config dataset_doctor_config.yaml

Generate a default config file:

dataset-doctor init-config

Display rows:

dataset-doctor display data.csv --rows 10

Show rows (alias of display):

dataset-doctor show data.csv --tail --rows 20 --columns age,salary

Python API

Main API entry points:

  • dd.diagnose(dataset)
  • dd.auto_fix(dataset, ...)
  • dd.display_data(dataset, ...)

Example Usage

import dataset_doctor as dd

# Diagnose a dataset
dd.diagnose("data.csv")

# Clean with default settings
dd.auto_fix("data.csv")

# Clean and write the result to a file
dd.auto_fix("data.csv", output_path="cleaned.csv")
dd.auto_fix("data.csv", output="cleaned.csv")

# Enable normalization, optionally returning the scaler as well
dd.auto_fix("data.csv", do_normalize=True)
dd.auto_fix("data.csv", return_scaler=True)

# Clean using a YAML config file or an inline config dict
dd.auto_fix("data.csv", config="dataset_doctor_config.yaml")
dd.auto_fix("data.csv", config={"missing_values": {"numeric_strategy": "mean"}})

# Display rows
dd.display_data("data.csv")
dd.display_data("data.csv", rows=10)
dd.display_data("data.csv", tail=True)
dd.display_data("data.csv", columns=["col1", "col2"])
dd.display_data("data.csv", all_rows=True)

report = dd.diagnose("data.csv")
report.summary()       # short text summary
report.to_dict()       # machine-readable report
report.print_report()  # full formatted report

Example:

import dataset_doctor as dd

# Diagnose
report = dd.diagnose("data.csv")
print(report.summary())

# Auto-clean with options
clean_df = dd.auto_fix(
    "data.csv",
    output_path="cleaned.csv",
    do_normalize=True,
)

Configuration System

Use a YAML file to customize preprocessing behavior.

CLI example:

dataset-doctor clean data.csv --config config.yaml

Python example:

import dataset_doctor as dd

clean_df = dd.auto_fix("data.csv", config="config.yaml")

Example config:

missing_values:
  numeric_strategy: median
  categorical_strategy: mode
  max_missing_threshold: 0.4

duplicates:
  remove: true

outliers:
  method: iqr
  action: clip

normalization:
  method: minmax
  range: [0, 1]

feature_selection:
  remove_constant_columns: true
  correlation_threshold: 0.9

logging:
  verbosity: medium
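To make the outlier and normalization options concrete, here is what `method: iqr`, `action: clip`, and `method: minmax` conventionally mean, sketched in plain Python with made-up values (the standard definitions of these techniques, not dataset-doctor's exact implementation):

```python
from statistics import quantiles

values = [1.0, 2.0, 2.5, 3.0, 100.0]  # 100.0 is an obvious outlier

# IQR clipping: clamp values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, _, q3 = quantiles(values, n=4, method="inclusive")
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
clipped = [min(max(v, lower), upper) for v in values]

# Min-max normalization to the configured range [0, 1]
lo, hi = min(clipped), max(clipped)
normalized = [(v - lo) / (hi - lo) for v in clipped]
print(normalized)
```

With `action: clip`, the outlier is pulled back to the upper fence rather than dropped, so no rows are lost before normalization.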

Output Example

## DATASET DIAGNOSIS REPORT

Rows: 10000
Columns: 12

### Issues Detected

Missing Values:
- age (12.0%)
- salary (4.0%)

Duplicate Rows:
- 18 rows

Outliers:
- transaction_amount (42 values)

Constant Columns:
- user_flag

Highly Correlated Columns:
- income vs salary (0.97)

License

MIT License. See LICENSE for details.
