Science-intensive high-performance data profiler
Desbordante: high-performance data profiler
What is it?
Desbordante is a high-performance data profiler oriented towards exploratory data analysis.
Try the web version at https://desbordante.unidata-platform.ru/
Table of Contents
- Main Features
- Usage Examples
- I still don't understand how to use Desbordante and patterns :(
- Installation
- Installation from sources
- Troubleshooting
- Cite
- Contacts and Q&A
Main Features
Desbordante is a high-performance data profiler that is capable of discovering and validating many different patterns in data using various algorithms.
The Discovery task is designed to identify all instances of a specified pattern type in a given dataset.
The Validation task is different: it is designed to check whether a specified pattern instance is present in a given dataset. This task not only returns True or False, but it also explains why the instance does not hold (e.g. it can list table rows with conflicting values).
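To make the idea of validation concrete, here is a minimal plain-Python sketch that checks one exact functional dependency and reports the conflicting rows. It is an illustration only, not Desbordante's implementation or API: the function `validate_fd` and the toy table are hypothetical.

```python
from collections import defaultdict


def validate_fd(rows, lhs, rhs):
    """Check whether the FD lhs -> rhs holds in rows.

    rows: list of tuples; lhs: list of column indices; rhs: one column index.
    Returns (holds, conflicts), where conflicts maps each violating LHS value
    to the indices of the rows that share it.
    """
    groups = defaultdict(list)
    for idx, row in enumerate(rows):
        key = tuple(row[i] for i in lhs)
        groups[key].append(idx)
    # A group is conflicting if its rows carry more than one RHS value.
    conflicts = {
        key: idxs
        for key, idxs in groups.items()
        if len({rows[i][rhs] for i in idxs}) > 1
    }
    return (not conflicts, conflicts)


# Hypothetical toy table with columns (Professor, Course, Semester).
rows = [
    ("Smith", "Algebra", "Fall"),
    ("Smith", "Algebra", "Spring"),
    ("Jones", "Logic", "Fall"),
    ("Jones", "Physics", "Spring"),
]
holds, conflicts = validate_fd(rows, lhs=[0], rhs=1)
print(holds)      # False
print(conflicts)  # {('Jones',): [2, 3]}
```

Here the answer is False, and the returned dictionary plays the role of the explanation: rows 2 and 3 share the same professor but list different courses.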
The currently supported data patterns are:
- Functional dependency variants:
  - Exact functional dependencies (discovery and validation)
  - Approximate functional dependencies, with g1 metric (discovery and validation)
  - Probabilistic functional dependencies, with PerTuple and PerValue metrics (discovery)
- Graph functional dependencies (validation)
- Conditional functional dependencies (discovery)
- Inclusion dependencies (discovery)
- Order dependencies:
  - set-based axiomatization (discovery)
  - list-based axiomatization (discovery)
- Metric functional dependencies (validation)
- Fuzzy algebraic constraints (discovery)
- Unique column combinations:
  - Exact unique column combination (discovery and validation)
  - Approximate unique column combination, with g1 metric (discovery and validation)
- Association rules (discovery)
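The g1 metric named in the list above can be illustrated for approximate functional dependencies with a short plain-Python sketch. It follows the textbook definition (the share of ordered row pairs that agree on the left-hand side but differ on the right-hand side, divided by the squared number of rows). This is an assumption about the general definition only: the exact normalization used inside Desbordante may differ, and the function `g1_error` and the sample rows are hypothetical.

```python
from collections import Counter, defaultdict


def g1_error(rows, lhs, rhs):
    """Textbook g1: violating ordered row pairs divided by n^2."""
    n = len(rows)
    if n == 0:
        return 0.0
    # For every LHS value, count how often each RHS value occurs.
    rhs_counts = defaultdict(Counter)
    for row in rows:
        key = tuple(row[i] for i in lhs)
        rhs_counts[key][row[rhs]] += 1
    violating = 0
    for counts in rhs_counts.values():
        size = sum(counts.values())
        # All ordered pairs in the group minus the pairs that agree on RHS.
        violating += size * size - sum(c * c for c in counts.values())
    return violating / (n * n)


# Hypothetical rows with columns (Id, Price).
rows = [(1, 10), (1, 10), (1, 12), (2, 5)]
print(g1_error(rows, lhs=[0], rhs=1))  # 0.25
```

With an error threshold of 0.1 this dependency would be rejected, while a threshold of 0.3 would accept it.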
This package uses the library of the Desbordante platform, which is written in C++. As a result, depending on the algorithm and dataset, runtimes may be 2-10 times shorter than those of the alternatives.
Usage Examples
- Discover all exact functional dependencies in a table stored in a comma-separated file with a header row. In this example the default FD discovery algorithm (HyFD) is used.
import desbordante
TABLE = 'examples/datasets/university_fd.csv'
algo = desbordante.fd.algorithms.Default()
algo.load_data(table=(TABLE, ',', True))
algo.execute()
result = algo.get_fds()
print('FDs:')
for fd in result:
    print(fd)
FDs:
[Course Classroom] -> Professor
[Classroom Semester] -> Professor
[Classroom Semester] -> Course
[Professor] -> Course
[Professor Semester] -> Classroom
[Course Semester] -> Classroom
[Course Semester] -> Professor
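For readers who want to see what "discovering all exact functional dependencies" means, the following brute-force sketch enumerates minimal dependencies on a tiny table. It is deliberately naive (exponential in the number of columns) and is not the HyFD algorithm used above. The functions `fd_holds` and `discover_fds` and the sample rows are hypothetical and are not part of the Desbordante API.

```python
from itertools import combinations


def fd_holds(rows, lhs, rhs):
    """Return True if equal LHS values always imply equal RHS values."""
    seen = {}
    for row in rows:
        key = tuple(row[i] for i in lhs)
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True


def discover_fds(rows, n_cols):
    """Return minimal non-trivial FDs as (lhs_column_indices, rhs_index)."""
    found = []
    for rhs in range(n_cols):
        others = [c for c in range(n_cols) if c != rhs]
        minimal = []
        # Try smaller left-hand sides first so that supersets can be skipped.
        for size in range(len(others) + 1):
            for lhs in combinations(others, size):
                if any(set(m) <= set(lhs) for m in minimal):
                    continue  # a subset already determines rhs
                if fd_holds(rows, lhs, rhs):
                    minimal.append(lhs)
        found.extend((lhs, rhs) for lhs in minimal)
    return found


# Hypothetical table with columns 0=Professor, 1=Course, 2=Semester.
rows = [
    ("Smith", "Algebra", "Fall"),
    ("Smith", "Algebra", "Spring"),
    ("Jones", "Logic", "Fall"),
    ("Brown", "Logic", "Spring"),
]
print(discover_fds(rows, 3))  # [((1, 2), 0), ((0,), 1)]
```

The result reads as [Course Semester] -> Professor and [Professor] -> Course for this toy table.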
- Discover all approximate functional dependencies with error less than or equal to 0.1 in a table represented by a .csv file that uses a comma as the separator and has a header row. In this example the default AFD discovery algorithm (Pyro) is used.
import desbordante
TABLE = 'examples/datasets/inventory_afd.csv'
ERROR = 0.1
algo = desbordante.afd.algorithms.Default()
algo.load_data(table=(TABLE, ',', True))
algo.execute(error=ERROR)
result = algo.get_fds()
print('AFDs:')
for fd in result:
    print(fd)
AFDs:
[Id] -> Price
[Id] -> ProductName
[ProductName] -> Price
- Check whether the metric functional dependency “Title -> Duration” with radius 5 (using the Euclidean metric) holds in a table represented by a .csv file that uses a comma as the separator and has a header row. In this example the default MFD validation algorithm (BRUTE) is used.
import desbordante
TABLE = 'examples/datasets/theatres_mfd.csv'
METRIC = 'euclidean'
LHS_INDICES = [0]
RHS_INDICES = [2]
PARAMETER = 5
algo = desbordante.mfd_verification.algorithms.Default()
algo.load_data(table=(TABLE, ',', True))
algo.execute(lhs_indices=LHS_INDICES, metric=METRIC,
             parameter=PARAMETER, rhs_indices=RHS_INDICES)
if algo.mfd_holds():
    print('MFD holds')
else:
    print('MFD does not hold')
MFD holds
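The condition checked in this example can be sketched in plain Python for the simplest case of a single numeric right-hand-side column. There the Euclidean distance reduces to an absolute difference, so a group of rows satisfies the dependency when its largest and smallest values differ by at most the parameter. The function `mfd_holds_naive` and the sample rows are hypothetical and only illustrate the definition; they are not Desbordante's implementation or data.

```python
from collections import defaultdict


def mfd_holds_naive(rows, lhs, rhs, parameter):
    """Check an MFD on one numeric RHS column using absolute difference."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[i] for i in lhs)].append(row[rhs])
    # For one numeric column the largest pairwise distance is max - min.
    return all(max(vals) - min(vals) <= parameter for vals in groups.values())


# Hypothetical rows with columns (Title, Theatre, Duration).
rows = [
    ("Don Quixote", "A", 139),
    ("Don Quixote", "B", 135),
    ("Swan Lake", "A", 150),
    ("Swan Lake", "C", 148),
]
print(mfd_holds_naive(rows, lhs=[0], rhs=2, parameter=5))  # True
print(mfd_holds_naive(rows, lhs=[0], rhs=2, parameter=3))  # False
```

Tightening the parameter from 5 to 3 makes the dependency fail because the two "Don Quixote" durations differ by 4.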
- Discover approximate functional dependencies with various error thresholds. Here, we are using a pandas DataFrame to load data from a CSV file.
>>> import desbordante
>>> import pandas as pd
>>> pyro = desbordante.afd.algorithms.Pyro() # same as desbordante.afd.algorithms.Default()
>>> df = pd.read_csv('examples/datasets/iris.csv', sep=',', header=None)
>>> pyro.load_data(table=df)
>>> pyro.execute(error=0.0)
>>> print(f'[{", ".join(map(str, pyro.get_fds()))}]')
[[0 1 2] -> 4, [0 2 3] -> 4, [0 1 3] -> 4, [1 2 3] -> 4]
>>> pyro.execute(error=0.1)
>>> print(f'[{", ".join(map(str, pyro.get_fds()))}]')
[[2] -> 0, [2] -> 3, [2] -> 1, [0] -> 2, [3] -> 0, [0] -> 3, [0] -> 1, [1] -> 3, [1] -> 0, [3] -> 2, [3] -> 1, [1] -> 2, [2] -> 4, [3] -> 4, [0] -> 4, [1] -> 4]
>>> pyro.execute(error=0.2)
>>> print(f'[{", ".join(map(str, pyro.get_fds()))}]')
[[2] -> 0, [0] -> 2, [3] -> 2, [1] -> 2, [2] -> 4, [3] -> 4, [0] -> 4, [1] -> 4, [3] -> 0, [1] -> 0, [2] -> 3, [2] -> 1, [0] -> 3, [0] -> 1, [1] -> 3, [3] -> 1]
>>> pyro.execute(error=0.3)
>>> print(f'[{", ".join(map(str, pyro.get_fds()))}]')
[[2] -> 1, [0] -> 2, [2] -> 0, [2] -> 3, [0] -> 1, [3] -> 2, [3] -> 1, [1] -> 2, [3] -> 0, [0] -> 3, [4] -> 1, [1] -> 0, [1] -> 3, [4] -> 2, [4] -> 3, [2] -> 4, [3] -> 4, [0] -> 4, [1] -> 4]
More examples can be found in the Desbordante repository on GitHub.
I still don't understand how to use Desbordante and patterns :(
No worries! Desbordante offers a novel type of data profiling, which may require that you first familiarize yourself with its concepts and usage. The most challenging part of Desbordante is the primitives: their definitions and applications in practice. To help you get started, here’s a step-by-step guide:
- First of all, explore the guides on our website. Since our team currently does not include technical writers, it's possible that some guides may be missing.
- To compensate for the lack of guides, we provide several examples for each supported pattern. These examples illustrate both the pattern itself and how to use it in Python. You can check them out here.
- Each of our patterns was introduced in a research paper. These papers typically provide a formal definition of the pattern, examples of use, and its application scope. We recommend at least skimming through them. Don't be discouraged by the complexity of the papers! To effectively use the patterns, you only need to read the more accessible parts, such as the introduction and the example sections.
- Finally, do not hesitate to ask questions on the mailing list (link below) or create an issue.
Papers about patterns
Here is a list of papers about patterns, listed in the recommended reading order within each item:
- Functional dependency variants:
  - Exact functional dependencies
    - Thorsten Papenbrock et al. 2015. Functional dependency discovery: an experimental evaluation of seven algorithms. Proc. VLDB Endow. 8, 10 (June 2015), 1082–1093.
    - Thorsten Papenbrock and Felix Naumann. 2016. A Hybrid Approach to Functional Dependency Discovery. In Proceedings of the 2016 International Conference on Management of Data (SIGMOD '16). Association for Computing Machinery, New York, NY, USA, 821–833.
  - Approximate functional dependencies, with g1 metric
  - Probabilistic functional dependencies, with PerTuple and PerValue metrics
- Exact functional dependencies
- Graph functional dependencies
- Conditional functional dependencies
- Inclusion dependencies (discovery)
  - Falco Dürsch et al. 2019. Inclusion Dependency Discovery: An Experimental Evaluation of Thirteen Algorithms. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management (CIKM '19). Association for Computing Machinery, New York, NY, USA, 219–228.
  - Sebastian Kruse et al. 2017. Fast Approximate Discovery of Inclusion Dependencies. In BTW 2017, 207–226.
- Order dependencies:
- Metric functional dependencies
- Fuzzy algebraic constraints
- Unique column combinations:
- Association rules
Installation
The source code is currently hosted on GitHub at https://github.com/Desbordante/desbordante-core
Wheels for the latest released version are available at the Python Package Index (PyPI).
Currently, only manylinux2014 (Ubuntu 20.04+, or any other Linux distribution with GCC 10+) is supported.
$ pip install desbordante
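After installation, a quick check along the following lines may help confirm that the package is visible to the interpreter in use. This is a sketch built only on the Python standard library. The helper name `check_environment` is hypothetical, and the check says nothing about whether the installed build actually works on your system.

```python
import importlib.util
import platform


def check_environment(package="desbordante"):
    """Report the OS name and whether the import system can find the package."""
    return {
        "system": platform.system(),  # the wheels described above target Linux
        "package_found": importlib.util.find_spec(package) is not None,
    }


print(check_environment())
```

If `package_found` is False, the package was most likely installed into a different interpreter or virtual environment than the one running the check.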
Installation from sources
Install all dependencies listed in README.md.
Then, in the Desbordante directory (the same one that contains this file), execute:
./build.sh
python3 -m venv venv
source venv/bin/activate
python3 -m pip install .
Troubleshooting
No type hints in IDE
If type hints don't work for you in Visual Studio Code, for example, then install stubs using the command:
pip install desbordante-stubs
NOTE: Stubs may not fully support the current version of the desbordante package, as they are updated independently.
Cite
If you use this software for research, please cite one of our papers:
- George Chernishev, et al. Solving Data Quality Problems with Desbordante: a Demo. CoRR abs/2307.14935 (2023).
- George Chernishev, et al. "Desbordante: from benchmarking suite to high-performance science-intensive data profiler (preprint)". CoRR abs/2301.05965. (2023).
- M. Strutovskiy, N. Bobrov, K. Smirnov and G. Chernishev, "Desbordante: a Framework for Exploring Limits of Dependency Discovery Algorithms," 2021 29th Conference of Open Innovations Association (FRUCT), 2021, pp. 344-354, doi: 10.23919/FRUCT52173.2021.9435469.
- A. Smirnov, A. Chizhov, I. Shchuckin, N. Bobrov and G. Chernishev, "Fast Discovery of Inclusion Dependencies with Desbordante," 2023 33rd Conference of Open Innovations Association (FRUCT), Zilina, Slovakia, 2023, pp. 264-275, doi: 10.23919/FRUCT58615.2023.10143047.
- Y. Kuzin, D. Shcheka, M. Polyntsov, K. Stupakov, M. Firsov and G. Chernishev, "Order in Desbordante: Techniques for Efficient Implementation of Order Dependency Discovery Algorithms," 2024 35th Conference of Open Innovations Association (FRUCT), Tampere, Finland, 2024, pp. 413-424.
- I. Barutkin, M. Fofanov, S. Belokonny, V. Makeev and G. Chernishev, "Extending Desbordante with Probabilistic Functional Dependency Discovery Support," 2024 35th Conference of Open Innovations Association (FRUCT), Tampere, Finland, 2024, pp. 158-169.
Contacts and Q&A
If you have any questions about using the tool, you can ask them in our Google group. To contact the development team, email George Chernishev, Maxim Strutovsky, or Nikita Bobrov.