automated data cleaning tool

Project description

License: MIT

datacleanbot

Automated Data Cleaning Tool. The main goal is to develop a Python tool, datacleanbot, that, given an arbitrary parsed raw dataset representing a supervised learning problem, automatically identifies potential issues and reports the results and recommendations to the end user in an effective way.

Install

$ pip install datacleanbot

QuickStart

Acquire data from OpenML:

>>> import openml as oml
>>> data = oml.datasets.get_dataset(id) # id: openml dataset id
>>> X, y, features = data.get_data(target=data.default_target_attribute, return_attribute_names=True)
>>> Xy = data.get_data()

Autoclean data with datacleanbot:

>>> import datacleanbot.dataclean as dc
>>> Xy = dc.autoclean(Xy, data.name, features)

Description

datacleanbot is equipped with the following capabilities:

  • Present an overview report of the given dataset
    • The most important features
    • Statistical information (e.g., mean, max, min)
    • Data types of features
  • Clean common data problems in the raw dataset
    • Duplicated records
    • Inconsistent column names
    • Missing values
    • Outliers

The three aspects datacleanbot meaningfully automates are marked in bold.
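
To make the four problem classes in the list above concrete, the following is a minimal sketch written with plain pandas. It is not datacleanbot's implementation and does not use its API. The helper names (normalize_column_names, iqr_outlier_mask, report_issues), the toy data, and the choice of Tukey's IQR rule for outliers are all assumptions made for illustration only.

```python
import pandas as pd


def normalize_column_names(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with stripped, lower-case, underscore-separated column names."""
    out = df.copy()
    out.columns = [str(c).strip().lower().replace(" ", "_") for c in out.columns]
    return out


def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule, an assumed choice)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


def report_issues(df: pd.DataFrame) -> dict:
    """Collect simple counts for the four problem classes listed above."""
    clean = normalize_column_names(df)
    numeric = clean.select_dtypes(include="number")
    return {
        # Original column names that the normalization step would change
        "renamed_columns": [c for c, n in zip(df.columns, clean.columns) if c != n],
        # Rows that exactly repeat an earlier row
        "duplicated_records": int(clean.duplicated().sum()),
        # Number of missing cells per column
        "missing_values": clean.isna().sum().to_dict(),
        # Number of flagged values per numeric column
        "outliers": {
            c: int(iqr_outlier_mask(numeric[c].dropna()).sum()) for c in numeric.columns
        },
    }


# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    " Age ": [21, 22, 23, 24, 25, 22, 400],
    "City Name": ["A", "B", None, "C", "D", "B", "E"],
})
report = report_issues(toy)
print(report)
```

A report like this only detects issues. Deciding what to do with them (dropping duplicates, imputing missing values, removing or keeping outliers) depends on the dataset and is left to the user in this sketch.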

User's Guide

The user's guide can be found at datacleanbot.

Files for datacleanbot, version 0.4

  • datacleanbot-0.4-py3-none-any.whl (197.9 kB): wheel, Python version py3
  • datacleanbot-0.4.tar.gz (151.3 kB): source