Science-intensive high-performance data profiler
Desbordante: high-performance data profiler
What is it?
Desbordante is a high-performance data profiler oriented towards exploratory data analysis.
Try the web version at https://desbordante.unidata-platform.ru/
Main Features
Desbordante can discover and validate a range of data patterns, such as:
- Functional dependencies, both exact and approximate (discovery and validation)
- Metric functional dependencies (validation)
- Fuzzy algebraic constraints (discovery)
- Unique column combinations (discovery and validation)
- Association rules (discovery)
This package uses the library of the Desbordante platform, which is written in C++. Depending on the algorithm and dataset, runtimes may therefore be 2 to 10 times shorter than those of alternative tools.
Usage examples
- Discover all exact functional dependencies in a table represented by a .csv file that uses a comma as the separator and has a header row. In this example the FD discovery algorithm HyFD is used.
import desbordante
TABLE = 'examples/datasets/university_fd.csv'
algo = desbordante.HyFD()
algo.load_data(TABLE, ',', True)
algo.execute()
result = algo.get_fds()
print('FDs:')
for fd in result:
    print(fd)
FDs:
( 1 3 ) -> 0
( 1 3 ) -> 2
( 0 ) -> 2
( 0 3 ) -> 1
( 2 ) -> 0
( 2 3 ) -> 1
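Each output line reads as "( LHS columns ) -> RHS column", where the numbers appear to be zero-based column indices, so `( 0 ) -> 2` says that column 0 determines column 2. The following is a minimal pure-Python sketch of what it means for such a dependency to hold. It is not the HyFD algorithm and does not use the Desbordante API; the helper name `fd_holds` and the toy rows are made up for illustration.

```python
from typing import Hashable, Sequence


def fd_holds(rows: Sequence[Sequence[Hashable]], lhs: Sequence[int], rhs: int) -> bool:
    """Return True if the columns in `lhs` functionally determine column `rhs`."""
    seen = {}
    for row in rows:
        key = tuple(row[i] for i in lhs)
        # setdefault stores the first RHS value seen for this LHS key
        # and returns the stored value on every later lookup.
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False  # same LHS values, different RHS value
    return True


# Hypothetical toy table: (course, teacher, room)
rows = [
    ("Math", "Smith", 101),
    ("Math", "Smith", 102),
    ("Physics", "Jones", 101),
]

print(fd_holds(rows, lhs=[0], rhs=1))  # True: each course has one teacher
print(fd_holds(rows, lhs=[0], rhs=2))  # False: Math appears with two rooms
```

A single pass like this validates one candidate dependency; discovering all minimal dependencies over every column combination is the much harder problem that algorithms such as HyFD address.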
- Discover all approximate functional dependencies with error less than or equal to 0.1 in a table represented by a .csv file that uses a comma as the separator and has a header row. In this example the AFD discovery algorithm Pyro is used.
import desbordante
TABLE = 'examples/datasets/inventory_afd.csv'
ERROR = 0.1
algo = desbordante.Pyro()
algo.load_data(TABLE, ',', True)
algo.execute(error=ERROR)
result = algo.get_fds()
print('AFDs:')
for fd in result:
    print(fd)
AFDs:
( 0 ) -> 1
( 0 ) -> 2
( 1 ) -> 2
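To give the error threshold some intuition: one common way to measure how badly a dependency is violated is the fraction of ordered row pairs that agree on the left-hand side but differ on the right-hand side. The sketch below computes that quantity in pure Python. Whether Pyro in Desbordante uses exactly this measure is an assumption here, and `pair_error` and the toy rows are illustrative names and data, not part of the library.

```python
from collections import Counter


def pair_error(rows, lhs, rhs):
    """Fraction of ordered row pairs that agree on `lhs` but differ on `rhs`."""
    n = len(rows)
    if n < 2:
        return 0.0
    lhs_counts = Counter(tuple(r[i] for i in lhs) for r in rows)
    full_counts = Counter((tuple(r[i] for i in lhs), r[rhs]) for r in rows)
    # Ordered pairs that agree on the LHS, minus those that also agree on the RHS.
    agree_lhs = sum(c * (c - 1) for c in lhs_counts.values())
    agree_both = sum(c * (c - 1) for c in full_counts.values())
    return (agree_lhs - agree_both) / (n * (n - 1))


# Hypothetical toy table: (item, price)
rows = [("A", 1), ("A", 1), ("A", 2), ("B", 3)]

error = pair_error(rows, lhs=[0], rhs=1)
print(round(error, 3))  # 0.333
print(error <= 0.1)     # False: too many violations for a 0.1 threshold
```

Under this measure an exact dependency has error 0, and raising the threshold admits dependencies that hold for most, but not all, rows.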
- Check whether metric functional dependency “Title -> Duration” with radius 5 (using the Euclidean metric) holds in a table represented by a .csv file that uses a comma as the separator and has a header row. In this example the default MFD validation algorithm (BRUTE) is used.
import desbordante
TABLE = 'examples/datasets/theatres_mfd.csv'
METRIC = 'euclidean'
LHS_INDICES = [0]
RHS_INDICES = [2]
PARAMETER = 5
algo = desbordante.MetricVerifier()
algo.load_data(TABLE, ',', True)
algo.execute(lhs_indices=LHS_INDICES, metric=METRIC,
             parameter=PARAMETER, rhs_indices=RHS_INDICES)
if algo.mfd_holds():
    print('MFD holds')
else:
    print('MFD does not hold')
MFD holds
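As a rough illustration of the check above: a metric functional dependency holds when, within every group of rows that share the same left-hand side values, the right-hand side values lie within the given distance of each other. For a single numeric column the Euclidean distance reduces to the absolute difference, so the largest pairwise distance in a group is simply its maximum minus its minimum. The sketch below is pure Python under those assumptions; it is not the library's verification algorithm, and `metric_fd_holds` and the toy rows are hypothetical.

```python
from collections import defaultdict


def metric_fd_holds(rows, lhs, rhs, parameter):
    """Check a metric FD on one numeric RHS column using absolute difference."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[i] for i in lhs)].append(row[rhs])
    # In one dimension the largest pairwise distance in a group is max - min.
    return all(max(vals) - min(vals) <= parameter for vals in groups.values())


# Hypothetical toy table: (title, theatre, duration in minutes)
rows = [
    ("Hamlet", "Theatre A", 180),
    ("Hamlet", "Theatre B", 183),
    ("Macbeth", "Theatre A", 150),
]

print(metric_fd_holds(rows, lhs=[0], rhs=2, parameter=5))  # True
print(metric_fd_holds(rows, lhs=[0], rhs=2, parameter=2))  # False
```

With several right-hand side columns the max-minus-min shortcut no longer applies, and pairwise distances within each group have to be compared instead.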
- Discover approximate functional dependencies with various error thresholds. This example shows the preferred way to configure algorithm options and loads the data from a CSV file through a pandas DataFrame.
>>> import desbordante
>>> import pandas as pd
>>> pyro = desbordante.Pyro()
>>> df = pd.read_csv('iris.csv', sep=',', header=0)
>>> pyro.load_data(df)
>>> pyro.execute(error=0.0)
>>> pyro.get_fds()
[( 0 1 2 ) -> 4, ( 0 2 3 ) -> 4, ( 0 1 3 ) -> 4, ( 1 2 3 ) -> 4]
>>> pyro.execute(error=0.1)
>>> pyro.get_fds()
[( 2 ) -> 0, ( 2 ) -> 1, ( 0 ) -> 2, ( 2 ) -> 4, ( 2 ) -> 3, ( 3 ) -> 2, ( 3 ) -> 0, ( 0 ) -> 1, ( 0 ) -> 3, ( 1 ) -> 0, ( 1 ) -> 2, ( 3 ) -> 4, ( 3 ) -> 1, ( 1 ) -> 3, ( 0 ) -> 4, ( 1 ) -> 4]
>>> pyro.execute(error=0.2)
>>> pyro.get_fds()
[( 2 ) -> 1, ( 2 ) -> 0, ( 2 ) -> 4, ( 0 ) -> 2, ( 2 ) -> 3, ( 0 ) -> 1, ( 3 ) -> 4, ( 3 ) -> 2, ( 3 ) -> 1, ( 3 ) -> 0, ( 1 ) -> 2, ( 0 ) -> 3, ( 0 ) -> 4, ( 1 ) -> 0, ( 1 ) -> 4, ( 1 ) -> 3]
>>> pyro.execute(error=0.3)
>>> pyro.get_fds()
[( 2 ) -> 1, ( 0 ) -> 2, ( 2 ) -> 0, ( 3 ) -> 0, ( 2 ) -> 3, ( 1 ) -> 0, ( 2 ) -> 4, ( 3 ) -> 2, ( 0 ) -> 1, ( 1 ) -> 2, ( 3 ) -> 1, ( 3 ) -> 4, ( 0 ) -> 3, ( 4 ) -> 2, ( 4 ) -> 1, ( 0 ) -> 4, ( 1 ) -> 3, ( 1 ) -> 4, ( 4 ) -> 3]
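Index-based output such as `( 0 1 2 ) -> 4` can be hard to read on wider tables. The helper below is a small sketch that converts the printed form of a dependency into column names. It relies only on the string format shown in the outputs above; the function names and the column names used for `iris.csv` are assumptions for illustration, not part of the Desbordante API.

```python
def parse_fd(text):
    """Split a string such as '( 1 3 ) -> 0' into ([1, 3], 0)."""
    lhs_part, rhs_part = text.split("->")
    lhs = [int(tok) for tok in lhs_part.strip().strip("()").split()]
    return lhs, int(rhs_part)


def name_fd(text, columns):
    """Render an index-based FD string using the given column names."""
    lhs, rhs = parse_fd(text)
    return f"{', '.join(columns[i] for i in lhs)} -> {columns[rhs]}"


# Hypothetical column names; the header of iris.csv may differ.
columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]

print(name_fd("( 0 1 2 ) -> 4", columns))
# sepal_length, sepal_width, petal_length -> species
```

When the data comes from a pandas DataFrame, passing `list(df.columns)` as `columns` and `str(fd)` as `text` should give the same kind of result.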
More examples can be found in the Desbordante repository on GitHub.
Installation
The source code is currently hosted on GitHub at https://github.com/Mstrutov/Desbordante
Wheels for the latest released version are available at the Python Package Index (PyPI).
Currently only manylinux2014 (Ubuntu 20.04+, or any other Linux distribution with GCC 10+) is supported.
$ pip install desbordante
Installation from sources
Install all dependencies listed in README.md.
Then, in the Desbordante directory (the same one that contains this file), execute:
./build.sh
python3 -m venv venv
source venv/bin/activate
python3 -m pip install .
Cite
If you use this software for research, please cite one of our papers:
- George Chernishev, et al. "Solving Data Quality Problems with Desbordante: a Demo". CoRR abs/2307.14935 (2023).
- George Chernishev, et al. "Desbordante: from benchmarking suite to high-performance science-intensive data profiler (preprint)". CoRR abs/2301.05965 (2023).
- M. Strutovskiy, N. Bobrov, K. Smirnov and G. Chernishev, "Desbordante: a Framework for Exploring Limits of Dependency Discovery Algorithms," 2021 29th Conference of Open Innovations Association (FRUCT), 2021, pp. 344-354, doi: 10.23919/FRUCT52173.2021.9435469.
- A. Smirnov, A. Chizhov, I. Shchuckin, N. Bobrov and G. Chernishev, "Fast Discovery of Inclusion Dependencies with Desbordante," 2023 33rd Conference of Open Innovations Association (FRUCT), Zilina, Slovakia, 2023, pp. 264-275, doi: 10.23919/FRUCT58615.2023.10143047.
Contacts and Q&A
If you have any questions regarding the tool usage, you can ask them in our Google group. To contact the dev team, email George Chernishev, Maxim Strutovsky or Nikita Bobrov.