Library for data cleaning and data profiling

About

openclean is a Python library for data profiling and data cleaning. The project is motivated by the fact that data preparation is still a major bottleneck for many data science projects. Data preparation requires profiling to gain an understanding of data quality issues, and data manipulation to transform the data into a form that is fit for the intended purpose.

While a large number of tools and techniques have been developed for profiling and cleaning data, one main issue that we see with these tools is the lack of access to them in a single (unified) framework. Existing tools may be implemented in different programming languages and require significant effort to install and interface with. In other cases, promising data cleaning methods have been published in the scientific literature but there is no suitable codebase available for them. We believe that the lack of seamless access to existing work is a major contributor to why data preparation is so time-consuming.

The goal of openclean is to bring together data cleaning tools in a single environment that is easy and intuitive for a data scientist to use. openclean allows users to compose and execute cleaning pipelines that are built using a variety of different tools. We aim for openclean to be flexible and extensible to allow easy integration of new functionality. To this end, we define a set of primitives and APIs for the different types of operators (actors) in openclean pipelines.

Features

openclean has many features that make the data wrangling experience straightforward. It shines particularly in these areas:

Data Profiling

openclean comes with a profiler that gives users actionable metrics about the quality of their data. It helps users detect possible problems early by providing a range of statistical measures, from min-max frequencies to uniqueness and entropy calculations. The interface is easy to implement and can be extended by Python-savvy users to cater to their needs.
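
To make the measures named above concrete, here is a minimal sketch that computes them for a single column using only the Python standard library. It does not use openclean's profiler API; the function name `profile_column` and the definition of uniqueness as distinct values divided by total values are assumptions made for illustration.

```python
import math
from collections import Counter


def profile_column(values):
    """Compute simple quality metrics for one column (illustrative only)."""
    counts = Counter(values)
    total = len(values)
    if total == 0:
        return {"total": 0, "distinct": 0, "uniqueness": 0.0,
                "entropy": 0.0, "min_frequency": 0, "max_frequency": 0}
    # Shannon entropy (in bits) of the value distribution.
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return {
        "total": total,
        "distinct": len(counts),
        # Assumed definition: share of distinct values among all values.
        "uniqueness": len(counts) / total,
        "entropy": entropy,
        "min_frequency": min(counts.values()),
        "max_frequency": max(counts.values()),
    }


# Hypothetical sample column of inspection grades.
grades = ["A", "A", "B", "C"]
stats = profile_column(grades)
print(stats)
```

Low uniqueness combined with low entropy suggests a column dominated by a few values, while a uniqueness of 1.0 points to a candidate key.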

Data Cleaning & Wrangling

openclean’s operators have been created specifically to handle data janitorial tasks. They help identify and present statistical anomalies, fix functional dependency violations, locate and update spelling mistakes, and handle missing values gracefully. As openclean is growing fast, so is this list of operators!
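
As an illustration of what a functional dependency violation looks like, the following sketch finds groups of rows that agree on the left-hand side of a dependency but disagree on the right-hand side. It is a plain Python sketch, not an openclean operator; the function `fd_violations` and the sample rows are hypothetical.

```python
from collections import defaultdict


def fd_violations(rows, lhs, rhs):
    """Return the lhs keys that map to more than one rhs value.

    rows: list of dicts; lhs, rhs: lists of column names.
    A dependency lhs -> rhs holds only if the result is empty.
    """
    groups = defaultdict(set)
    for row in rows:
        key = tuple(row[c] for c in lhs)
        groups[key].add(tuple(row[c] for c in rhs))
    # Keep only keys with conflicting right-hand-side values.
    return {k: sorted(v) for k, v in groups.items() if len(v) > 1}


# Hypothetical data: the same zip code appears with two city spellings.
rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "NYC"},
    {"zip": "11201", "city": "Brooklyn"},
]
violations = fd_violations(rows, lhs=["zip"], rhs=["city"])
print(violations)
```

Once the conflicting groups are known, a repair step can choose one value per group, for example the most frequent one.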

Data Enrichment

openclean integrates seamlessly with Socrata and the Reference Data Repository to provide its users with master datasets that can be incorporated into the data cleaning process.

Data Provenance

openclean comes with a mini version control engine that allows users to maintain versions of their datasets and to commit, checkout, or roll back changes at any point. Users can also register custom functions inside the openclean engine and apply them across different datasets and notebooks.
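
The commit, checkout, and rollback workflow can be pictured with a toy snapshot store. This is a conceptual sketch only and says nothing about how openclean's engine is implemented; the class `DatasetHistory` and its methods are hypothetical names.

```python
import copy


class DatasetHistory:
    """Toy snapshot store: every commit keeps a full copy of the dataset."""

    def __init__(self, rows):
        # Version 0 is the dataset as originally loaded.
        self._versions = [copy.deepcopy(rows)]

    def commit(self, rows):
        """Store a new snapshot and return its version number."""
        self._versions.append(copy.deepcopy(rows))
        return len(self._versions) - 1

    def checkout(self, version=None):
        """Return a copy of the given version (latest if omitted)."""
        if version is None:
            version = len(self._versions) - 1
        return copy.deepcopy(self._versions[version])

    def rollback(self, version):
        """Discard all snapshots after `version` and return that version."""
        del self._versions[version + 1:]
        return self.checkout()


# Hypothetical usage: commit a cleaned copy, then roll back to the original.
history = DatasetHistory([{"name": " alice "}])
cleaned = [{"name": row["name"].strip()} for row in history.checkout()]
v1 = history.commit(cleaned)
restored = history.rollback(0)
```

Storing full copies keeps the sketch short; a real engine would more likely store differences between versions to save space.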

Installation

Install openclean from PyPI using pip with:

pip install openclean

You can also install the different openclean extensions openclean-geo, openclean-metanome, openclean-notebook, and openclean-pattern, or install openclean with all the extensions:

pip install openclean[full]
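
After installing, one way to check which of the packages named above are present in the current environment is to query the installed distribution metadata. This sketch uses only `importlib.metadata` from the standard library (Python 3.8+); the helper `installed_version` is a hypothetical name.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Distribution names as listed in the installation instructions.
packages = [
    "openclean",
    "openclean-geo",
    "openclean-metanome",
    "openclean-notebook",
    "openclean-pattern",
]
for name in packages:
    print(name, installed_version(name) or "not installed")
```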

Note: See the Demo section below for instructions to run the example notebooks in this repository.

Usage

We include several example notebooks in this repository that demonstrate possible use cases for openclean. We recommend starting with the documentation or the New York City Restaurant Inspection Results notebook. In that example our goal is to reproduce a previous study from 2014 that looks at the distribution of restaurant inspection grades in New York City. For our study, we use data that was downloaded in Sept. 2019. The example is split into two different Jupyter notebooks:

Other examples, along with their datasets, are located in the examples folder.

Demo

Use the following steps to set up and run the example notebooks in this repository. We recommend using a virtual environment. Below are two examples for setting up a virtual environment:

# -- Create a new virtual environment
virtualenv venv
# -- Activate the virtual environment
source venv/bin/activate

If you are using the Python distribution from Anaconda, you can set up an environment like this:

# -- Create a new virtual environment
conda create -n openclean pip
# -- Activate the virtual environment
conda activate openclean

After activating your virtual environment, follow these steps to set up and run the notebook examples:

# Clone the openclean repository into your current working directory.
git clone git@github.com:VIDA-NYU/openclean.git
# Change working directory to the cloned repository.
cd openclean
# Install openclean and dependencies required for the demo
pip install .[demo]
# Run Jupyter (then navigate to the notebooks in folder `examples/notebooks`)
jupyter notebook

Demo Video

Want to see openclean in action? Check out our video demo: https://youtu.be/HNmNB6YMgHk

Documentation

The official documentation is hosted on readthedocs: http://openclean.readthedocs.io/

You can also read more about openclean in this blog post (on GitHub and Towards Data Science).

Contributing

We welcome all contributions, bug reports, bug fixes, documentation improvements, enhancements, and ideas.

A detailed overview on how to contribute can be found here.
