openclean Metanome Python Package
About
This package is an extension for the openclean-core package. It provides access to data profiling algorithms from the Metanome project in openclean. The algorithms are executed through the Metanome Wrapper, which makes it possible to run Metanome algorithms from the command line.
Installation & Configuration
The package can be installed using pip.
pip install openclean-metanome
The openclean-metanome package uses flowServ to run Metanome algorithms as serial workflows in openclean. flowServ supports two modes of execution: (1) using the Python sub-process package, and (2) using Docker.
Python Sub-Process
When running Metanome algorithms as Python sub-processes, you need an installation of the Java Runtime Environment (version 8 or higher) on your local machine. You also need a local copy of the Metanome.jar wrapper. The file can be downloaded from Zenodo (https://zenodo.org/record/4604964#.YE9tif4pBH4). The package also provides the option to download the file from within your Python scripts.
from openclean_metanome.download import download_jar
download_jar(verbose=True)
The example downloads the jar file into the default directory (defined via the METANOME_JARPATH environment variable). If the variable is not set, the user's default cache folder is used. Note that Metanome.jar is currently about 75 MB in size. If you did not download the file into the default location, make sure that the environment variable METANOME_JARPATH references the downloaded jar file.
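Before running an algorithm as a sub-process, it can help to check the two prerequisites described above. The following is a minimal sketch, not part of openclean-metanome: the helper name `check_subprocess_prerequisites` is hypothetical, and the code assumes that METANOME_JARPATH, when set, holds the path of the jar file itself.

```python
import os
import shutil


def check_subprocess_prerequisites(env=None):
    """Hypothetical helper (not part of openclean-metanome).

    Returns a list of human-readable problems; an empty list means that
    no problem was detected by these two simple checks.
    """
    env = os.environ if env is None else env
    problems = []
    # A Java runtime is required to execute the Metanome.jar wrapper.
    if shutil.which("java", path=env.get("PATH")) is None:
        problems.append("no 'java' executable found on PATH")
    # Assumption: the variable references the jar file, not a directory.
    jar = env.get("METANOME_JARPATH")
    if jar is not None and not os.path.isfile(jar):
        problems.append(f"METANOME_JARPATH does not point to a file: {jar}")
    return problems


for problem in check_subprocess_prerequisites():
    print("warning:", problem)
```

The check does not verify the Java version and does not look into the default cache folder; it only reports an explicitly configured path that does not exist.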
Docker
If you have Docker installed on your machine, you can run Metanome using the provided Docker container image. To do so, make sure that the environment variable METANOME_WORKER references the configuration file docker_worker.yaml that is included in the config folder of this repository.
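The variable can also be set from within a Python script. The sketch below assumes a local clone of the repository; the helper name `use_docker_worker` and the example path are hypothetical. Setting the variable before any openclean-metanome code runs is the cautious order, since this text does not say when the package reads it.

```python
import os
from pathlib import Path


def use_docker_worker(config_file):
    """Hypothetical helper: point METANOME_WORKER at a worker config file."""
    path = Path(config_file).expanduser().resolve()
    if not path.is_file():
        raise FileNotFoundError(f"worker configuration not found: {path}")
    os.environ["METANOME_WORKER"] = str(path)
    return str(path)


# Example call; the location of the clone is an assumption:
# use_docker_worker("~/openclean-metanome/config/docker_worker.yaml")
```

Failing early on a missing file makes a wrong path visible before a workflow is started.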
Algorithms
The package currently supports two data profiling algorithms.
HyFD
The HyFD algorithm (A Hybrid Approach to Functional Dependency Discovery) is a functional dependency discovery algorithm. Details about the algorithm can be found in:
Thorsten Papenbrock, Felix Naumann. A Hybrid Approach to Functional Dependency Discovery. ACM International Conference on Management of Data (SIGMOD '16).
For an example of how to use the algorithm in openclean have a look at the example notebook Run HyFD Algorithm - Example.
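For readers new to the concept, a functional dependency X -> Y holds in a table if any two rows that agree on the columns X also agree on the columns Y. The sketch below is a naive check for a single candidate dependency over made-up rows. It is not HyFD and not part of this package; `holds_fd` is a hypothetical name used only for illustration.

```python
def holds_fd(rows, lhs, rhs):
    """Return True if the functional dependency lhs -> rhs holds in rows."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        value = tuple(row[c] for c in rhs)
        # The first row with a given key fixes the expected right-hand side.
        if seen.setdefault(key, value) != value:
            return False
    return True


# Made-up example data.
rows = [
    {"zip": "10001", "city": "New York", "name": "Ann"},
    {"zip": "10001", "city": "New York", "name": "Bob"},
    {"zip": "60601", "city": "Chicago", "name": "Ann"},
]

print(holds_fd(rows, ["zip"], ["city"]))   # True
print(holds_fd(rows, ["name"], ["city"]))  # False
```

Testing every candidate this way scales poorly, because the number of column combinations grows exponentially with the number of columns; discovery algorithms such as HyFD exist to avoid that exhaustive search.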
HyUCC
The HyUCC algorithm (A Hybrid Approach for Efficient Unique Column Combination Discovery) is a unique column combination discovery algorithm. Details about the algorithm can be found here.
For an example of how to use the algorithm in openclean have a look at the example notebook Run HyUCC Algorithm - Example.
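A unique column combination is a set of columns whose combined values identify every row, meaning no two rows share the same values in all of those columns. The brute-force sketch below enumerates the minimal unique column combinations of a small made-up table. It is not HyUCC and not part of this package; `is_unique` and `minimal_uccs` are hypothetical names used only for illustration.

```python
from itertools import combinations


def is_unique(rows, columns):
    """Return True if no two rows share the same values in columns."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in columns)
        if key in seen:
            return False
        seen.add(key)
    return True


def minimal_uccs(rows, columns):
    """Enumerate minimal unique column combinations by increasing size."""
    found = []
    for size in range(1, len(columns) + 1):
        for combo in combinations(columns, size):
            # Supersets of a known unique combination are not minimal.
            if any(set(u) <= set(combo) for u in found):
                continue
            if is_unique(rows, combo):
                found.append(combo)
    return found


# Made-up example data.
rows = [
    {"zip": "10001", "city": "New York", "name": "Ann"},
    {"zip": "10001", "city": "New York", "name": "Bob"},
    {"zip": "60601", "city": "Chicago", "name": "Ann"},
]

print(minimal_uccs(rows, ["zip", "city", "name"]))
# [('zip', 'name'), ('city', 'name')]
```

The enumeration visits up to 2^n column subsets for n columns, which is only workable for very small tables and motivates dedicated discovery algorithms.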