automated data cleaning tool

Project description

License: MIT

datacleanbot

Automated Data Cleaning Tool. The main goal is to develop a Python tool, datacleanbot, that takes an arbitrary parsed raw dataset representing a supervised learning problem, automatically identifies potential issues in it, and reports the results and recommendations to the end user in an effective way.

Install

$ pip install datacleanbot

QuickStart

Install OpenML (version 0.9.0):

OpenML is used to easily import datasets and share models and experiments.

$ pip install openml==0.9.0

On Windows, a C++ compiler must be installed first.

Acquire data from OpenML:

>>> import numpy as np
>>> import openml as oml
>>> data = oml.datasets.get_dataset(id)  # id: OpenML dataset ID (an integer you supply)
>>> X, y, categorical_indicator, features = data.get_data(target=data.default_target_attribute, dataset_format='array')
>>> # Append the target y as the last column of the feature matrix X
>>> Xy = np.concatenate((X, y.reshape((y.shape[0], 1))), axis=1)
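
The last step above places the target next to the features as a final column. As a minimal sketch of that reshaping, the example below uses small synthetic arrays in place of what OpenML would return; the values are made up for illustration only.

```python
import numpy as np

# Synthetic stand-ins for the arrays returned by data.get_data(...)
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])      # 3 samples, 2 features
y = np.array([0.0, 1.0, 0.0])   # one target value per sample

# Turn y into a column vector, then append it as the last column of X
Xy = np.concatenate((X, y.reshape((y.shape[0], 1))), axis=1)

print(Xy.shape)   # (3, 3)
print(Xy[:, -1])  # the target column
```

The combined array has one row per sample and one more column than X, with the target in the last position.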

Autoclean data with datacleanbot:

>>> import datacleanbot.dataclean as dc
>>> Xy = dc.autoclean(Xy, data.name, features)

Description

datacleanbot is equipped with the following capabilities:

  • Present an overview report of the given dataset
    • The most important features
    • Statistical information (e.g., mean, max, min)
    • Data types of features
  • Clean common data problems in the raw dataset
    • Duplicated records
    • Inconsistent column names
    • Missing values
    • Outliers

The two aspects datacleanbot meaningfully automates are marked in bold.
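
To make the four data problems listed above concrete, here is a minimal standard-library sketch of one simple way to handle each. It is not datacleanbot's implementation: the helper names, the median imputation, and the 1.5 × IQR outlier rule are assumptions chosen for illustration, and the sample rows are invented.

```python
import statistics

def normalize_column_names(names):
    """Trim whitespace, lower-case, and replace spaces with underscores."""
    return [n.strip().lower().replace(" ", "_") for n in names]

def drop_duplicates(rows):
    """Remove exact duplicate rows, keeping the first occurrence."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

def impute_with_median(rows, col):
    """Replace None in one column with the median of the observed values."""
    observed = [r[col] for r in rows if r[col] is not None]
    median = statistics.median(observed)
    return [r[:col] + [median if r[col] is None else r[col]] + r[col + 1:]
            for r in rows]

def iqr_outliers(rows, col, k=1.5):
    """Return indices of rows whose value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = [r[col] for r in rows]
    q1, _, q3 = statistics.quantiles(values, n=4)  # requires Python 3.8+
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]

# Invented sample data: one duplicate row, one missing income, one extreme income
columns = [" Age ", "Income USD"]
rows = [
    [25, 50000.0],
    [25, 50000.0],   # exact duplicate of the first row
    [31, None],      # missing value
    [40, 52000.0],
    [38, 51000.0],
    [45, 49000.0],
    [52, 53000.0],
    [29, 400000.0],  # extreme value
]

clean_columns = normalize_column_names(columns)
deduped = drop_duplicates(rows)
imputed = impute_with_median(deduped, col=1)
outliers = iqr_outliers(imputed, col=1)

print(clean_columns)  # ['age', 'income_usd']
print(len(deduped))   # 7
print(outliers)       # [6]
```

The median is used for imputation here because it is not pulled toward the extreme income, so the filled-in value does not mask the outlier in the following step.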

User's Guide

The user's guide can be found at datacleanbot.
