
openclean Metanome Python Package


About

This package is an extension for the openclean-core package. It provides access to data profiling algorithms from the Metanome project in openclean. The algorithms are executed via the Metanome Wrapper, which makes it possible to run Metanome algorithms from the command line.

Installation & Configuration

The package can be installed using pip.

pip install openclean-metanome

The openclean-metanome package uses flowServ to run Metanome algorithms as serial workflows in openclean. flowServ supports two modes of execution: (1) using Python's subprocess module, and (2) using Docker.

Python Sub-Process

When running Metanome algorithms as Python sub-processes, you need an installation of the Java Runtime Environment (version 8 or higher) on your local machine. You also need a local copy of the Metanome.jar wrapper. The file can be downloaded from Zenodo (https://zenodo.org/record/4604964#.YE9tif4pBH4). The package also provides the option to download the file from within your Python scripts.

from openclean_metanome.download import download_jar

download_jar(verbose=True)

The example downloads the jar file into the default directory (defined via the METANOME_JARPATH environment variable). If the variable is not set, the user's default cache folder is used. Note that Metanome.jar is currently about 75 MB in size. If you did not download the file into the default location, make sure that the environment variable METANOME_JARPATH references the downloaded jar file.
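Before running an algorithm in sub-process mode, it can help to verify the two prerequisites described above. The following is a minimal sketch that uses only the Python standard library. The helper name check_subprocess_setup is hypothetical and not part of openclean-metanome, and the sketch assumes that METANOME_JARPATH points at the jar file itself.

```python
import os
import shutil
from pathlib import Path
from typing import List, Optional


def check_subprocess_setup(jar_path: Optional[str] = None) -> List[str]:
    """Return a list of setup notes; hypothetical helper, not a package API."""
    notes = []
    # A Java runtime has to be discoverable on the PATH to launch the jar.
    if shutil.which("java") is None:
        notes.append("no 'java' executable found on PATH")
    # Prefer an explicit argument, then the environment variable.
    # Assumption: METANOME_JARPATH references the jar file, not a folder.
    jar = jar_path or os.environ.get("METANOME_JARPATH")
    if jar is None:
        notes.append("METANOME_JARPATH is not set; the default cache folder applies")
    elif not Path(jar).is_file():
        notes.append(f"jar file not found: {jar}")
    return notes


if __name__ == "__main__":
    for note in check_subprocess_setup():
        print(note)
```

The function only reports what it finds and never modifies the environment, so it is safe to call at the top of a script before any Metanome algorithm is started.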

Docker

If you have Docker installed on your machine you can run Metanome using the provided Docker container image. To do so, make sure that the environment variable METANOME_WORKER references the configuration file docker_worker.yaml that is included in the config folder of this repository.
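The environment variable can also be set from within a Python script. The sketch below uses only the standard library. The helper name use_docker_worker is hypothetical, and the sketch assumes that the variable has to be set before openclean-metanome reads its configuration.

```python
import os
from pathlib import Path


def use_docker_worker(config_file: str) -> str:
    """Point METANOME_WORKER at a worker configuration file.

    Hypothetical helper, not part of openclean-metanome.
    """
    path = Path(config_file).expanduser().resolve()
    # Fail early if the configuration file is missing.
    if not path.is_file():
        raise FileNotFoundError(f"worker configuration not found: {path}")
    os.environ["METANOME_WORKER"] = str(path)
    return str(path)
```

From the root of a local clone of this repository, the call would be use_docker_worker("config/docker_worker.yaml").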

Algorithms

The package currently supports two data profiling algorithms.

HyFD

The HyFD algorithm (A Hybrid Approach to Functional Dependency Discovery) is a functional dependency discovery algorithm. Details about the algorithm can be found in:

Thorsten Papenbrock, Felix Naumann
A Hybrid Approach to Functional Dependency Discovery
ACM International Conference on Management of Data (SIGMOD '16)

For an example of how to use the algorithm in openclean have a look at the example notebook Run HyFD Algorithm - Example.
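To make the notion of a functional dependency concrete: a set of columns X functionally determines a column Y if any two rows that agree on X also agree on Y. The sketch below checks a single candidate dependency naively in plain Python. It is not the HyFD algorithm and does not use the openclean API; the function name fd_holds and the sample rows are made up for illustration.

```python
from typing import Dict, Hashable, Sequence, Tuple

Row = Dict[str, Hashable]


def fd_holds(rows: Sequence[Row], lhs: Sequence[str], rhs: str) -> bool:
    """Return True if the columns in lhs functionally determine rhs."""
    seen: Dict[Tuple[Hashable, ...], Hashable] = {}
    for row in rows:
        key = tuple(row[col] for col in lhs)
        # Two rows with the same lhs values but different rhs values
        # violate the dependency.
        if key in seen and seen[key] != row[rhs]:
            return False
        seen.setdefault(key, row[rhs])
    return True


# Made-up sample data.
rows = [
    {"zip": "10001", "city": "New York", "name": "Alice"},
    {"zip": "10001", "city": "New York", "name": "Bob"},
    {"zip": "60601", "city": "Chicago", "name": "Alice"},
]

print(fd_holds(rows, ["zip"], "city"))   # True
print(fd_holds(rows, ["name"], "city"))  # False
```

A discovery algorithm such as HyFD has to find all dependencies of this kind that hold in a dataset, which is why it needs to be far more efficient than testing every candidate this way.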

HyUCC

The HyUCC algorithm (A Hybrid Approach for Efficient Unique Column Combination Discovery) is a unique column combination discovery algorithm. Details about the algorithm can be found here.

For an example of how to use the algorithm in openclean have a look at the example notebook Run HyUCC Algorithm - Example.
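A unique column combination is a set of columns whose combined values differ for every row, which makes it a candidate key. The brute-force sketch below enumerates the minimal unique column combinations of a small table in plain Python. It is not the HyUCC algorithm and does not use the openclean API; the function names and the sample rows are made up for illustration.

```python
from itertools import combinations
from typing import Dict, Hashable, List, Sequence, Tuple

Row = Dict[str, Hashable]


def is_unique(rows: Sequence[Row], columns: Sequence[str]) -> bool:
    """Return True if no two rows share the same values in columns."""
    seen = set()
    for row in rows:
        key = tuple(row[col] for col in columns)
        if key in seen:
            return False
        seen.add(key)
    return True


def minimal_uccs(rows: Sequence[Row], columns: Sequence[str]) -> List[Tuple[str, ...]]:
    """Enumerate minimal unique column combinations by brute force."""
    found: List[Tuple[str, ...]] = []
    for size in range(1, len(columns) + 1):
        for combo in combinations(columns, size):
            # Skip supersets of a combination that is already unique.
            if any(set(ucc) <= set(combo) for ucc in found):
                continue
            if is_unique(rows, combo):
                found.append(combo)
    return found


# Made-up sample data.
rows = [
    {"zip": "10001", "city": "New York", "name": "Alice"},
    {"zip": "10001", "city": "New York", "name": "Bob"},
    {"zip": "60601", "city": "Chicago", "name": "Alice"},
]

print(minimal_uccs(rows, ["zip", "city", "name"]))
# [('zip', 'name'), ('city', 'name')]
```

The number of candidate combinations grows exponentially with the number of columns, so this enumeration is only practical for very small tables.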

