Science-intensive high-performance data profiler

Desbordante: high-performance data profiler

What is it?

Desbordante is a high-performance data profiler oriented towards exploratory data analysis.

Try the web version at https://desbordante.unidata-platform.ru/

Main Features

Desbordante is a high-performance data profiler that is capable of discovering and validating many different patterns in data using various algorithms.

The Discovery task is designed to identify all instances of a specified pattern type in a given dataset.

The Validation task is different: it checks whether a specified pattern instance holds in a given dataset. Besides returning True or False, it also explains why the instance does not hold (e.g. it can list table rows with conflicting values).
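To make the idea of validation concrete, here is a minimal pure-Python sketch that checks one exact functional dependency and reports the rows that conflict. It does not use the Desbordante API; the function name `validate_fd` and the toy table are hypothetical and serve only as an illustration.

```python
from collections import defaultdict


def validate_fd(rows, lhs, rhs):
    """Check whether the functional dependency lhs -> rhs holds in rows.

    rows: list of tuples; lhs: list of column indices; rhs: one column index.
    Returns (holds, conflicts), where conflicts maps each violating LHS value
    to the indices of the rows that share it but disagree on the RHS.
    """
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        key = tuple(row[c] for c in lhs)
        groups[key].append(i)
    conflicts = {
        key: idxs
        for key, idxs in groups.items()
        if len({rows[i][rhs] for i in idxs}) > 1
    }
    return (not conflicts, conflicts)


# Hypothetical toy table: (Course, Professor, Semester)
rows = [
    ("Math", "Smith", "Fall"),
    ("Math", "Smith", "Spring"),
    ("Physics", "Jones", "Fall"),
    ("Physics", "Brown", "Spring"),
]
holds, conflicts = validate_fd(rows, lhs=[0], rhs=1)
print(holds)      # False
print(conflicts)  # {('Physics',): [2, 3]}
```

Here Course -> Professor fails because rows 2 and 3 share a course but name different professors, which is the kind of explanation a validation task can return alongside the True/False answer.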

The currently supported data patterns are:

  • Functional dependency variants:
    • Exact functional dependencies (discovery and validation)
    • Approximate functional dependencies, with g1 metric (discovery and validation)
    • Probabilistic functional dependencies, with PerTuple and PerValue metrics (discovery)
  • Graph functional dependencies (validation)
  • Conditional functional dependencies (discovery)
  • Inclusion dependencies (discovery)
  • Order dependencies:
    • set-based axiomatization (discovery)
    • list-based axiomatization (discovery)
  • Metric functional dependencies (validation)
  • Fuzzy algebraic constraints (discovery)
  • Unique column combinations:
    • Exact unique column combination (discovery and validation)
    • Approximate unique column combination, with g1 metric (discovery and validation)
  • Association rules (discovery)
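As an illustration of the g1 metric mentioned above, the sketch below computes it in plain Python using the classic definition: the fraction of ordered row pairs that agree on the left-hand side but disagree on the right-hand side. The function name `g1_error` and the sample rows are hypothetical, and the exact normalization used inside Desbordante may differ from this textbook form.

```python
from collections import Counter


def g1_error(rows, lhs, rhs):
    """Fraction of ordered row pairs that violate lhs -> rhs.

    A pair (t, u) violates the dependency if t and u agree on all LHS
    columns but differ on the RHS column. Normalized by len(rows) ** 2.
    """
    n = len(rows)
    if n == 0:
        return 0.0
    lhs_counts = Counter(tuple(r[c] for c in lhs) for r in rows)
    both_counts = Counter((tuple(r[c] for c in lhs), r[rhs]) for r in rows)
    # Ordered pairs that agree on the LHS (each row paired with itself too).
    same_lhs = sum(c * c for c in lhs_counts.values())
    # Ordered pairs that agree on both the LHS and the RHS.
    same_both = sum(c * c for c in both_counts.values())
    return (same_lhs - same_both) / (n * n)


# Hypothetical toy data: column 0 -> column 1
rows = [(1, "a"), (1, "a"), (1, "b"), (2, "c")]
print(g1_error(rows, lhs=[0], rhs=1))  # 0.25
```

An exact functional dependency has an error of 0; an approximate one is accepted when its error does not exceed the chosen threshold.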

This package uses the library of the Desbordante platform, which is written in C++. As a result, depending on the algorithm and dataset, runtimes may be 2 to 10 times shorter than those of the alternatives.

Usage examples

  1. Discover all exact functional dependencies in a table stored in a comma-separated file with a header row. In this example the default FD discovery algorithm (HyFD) is used.
import desbordante

TABLE = 'examples/datasets/university_fd.csv'

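# Create the default FD discovery algorithm (HyFD).
algo = desbordante.fd.algorithms.Default()
# The table is given as (path, separator, has_header).
algo.load_data(table=(TABLE, ',', True))
algo.execute()
result = algo.get_fds()
print('FDs:')
for fd in result:
    print(fd)

Output:
FDs: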
[Course Classroom] -> Professor
[Classroom Semester] -> Professor
[Classroom Semester] -> Course
[Professor] -> Course
[Professor Semester] -> Classroom
[Course Semester] -> Classroom
[Course Semester] -> Professor
  2. Discover all approximate functional dependencies with an error of at most 0.1 in a table stored in a comma-separated file with a header row. In this example the default AFD discovery algorithm (Pyro) is used.
import desbordante

TABLE = 'examples/datasets/inventory_afd.csv'
ERROR = 0.1

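# Create the default AFD discovery algorithm (Pyro).
algo = desbordante.afd.algorithms.Default()
# The table is given as (path, separator, has_header).
algo.load_data(table=(TABLE, ',', True))
# Keep only the dependencies whose error does not exceed ERROR.
algo.execute(error=ERROR)
result = algo.get_fds()
print('AFDs:')
for fd in result:
    print(fd)

Output:
AFDs: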
[Id] -> Price
[Id] -> ProductName
[ProductName] -> Price
  3. Check whether the metric functional dependency “Title -> Duration” with radius 5 (using the Euclidean metric) holds in a table stored in a comma-separated file with a header row. In this example the default MFD validation algorithm (BRUTE) is used.
import desbordante

TABLE = 'examples/datasets/theatres_mfd.csv'
METRIC = 'euclidean'
LHS_INDICES = [0]
RHS_INDICES = [2]
PARAMETER = 5

# Create the default MFD validation algorithm (BRUTE).
algo = desbordante.mfd_verification.algorithms.Default()
# The table is given as (path, separator, has_header).
algo.load_data(table=(TABLE, ',', True))
algo.execute(lhs_indices=LHS_INDICES, metric=METRIC,
             parameter=PARAMETER, rhs_indices=RHS_INDICES)
if algo.mfd_holds():
    print('MFD holds')
else:
    print('MFD does not hold')

Output:
MFD holds
  4. Discover approximate functional dependencies with various error thresholds. Here, the data is read from a CSV file into a pandas DataFrame, which is then passed to the algorithm.
>>> import desbordante
>>> import pandas as pd
>>> pyro = desbordante.afd.algorithms.Pyro()  # same as desbordante.afd.algorithms.Default()
>>> df = pd.read_csv('examples/datasets/iris.csv', sep=',', header=None)
>>> pyro.load_data(table=df)
>>> pyro.execute(error=0.0)
>>> print(f'[{", ".join(map(str, pyro.get_fds()))}]')
[[0 1 2] -> 4, [0 2 3] -> 4, [0 1 3] -> 4, [1 2 3] -> 4]
>>> pyro.execute(error=0.1)
>>> print(f'[{", ".join(map(str, pyro.get_fds()))}]')
[[2] -> 0, [2] -> 3, [2] -> 1, [0] -> 2, [3] -> 0, [0] -> 3, [0] -> 1, [1] -> 3, [1] -> 0, [3] -> 2, [3] -> 1, [1] -> 2, [2] -> 4, [3] -> 4, [0] -> 4, [1] -> 4]
>>> pyro.execute(error=0.2)
>>> print(f'[{", ".join(map(str, pyro.get_fds()))}]')
[[2] -> 0, [0] -> 2, [3] -> 2, [1] -> 2, [2] -> 4, [3] -> 4, [0] -> 4, [1] -> 4, [3] -> 0, [1] -> 0, [2] -> 3, [2] -> 1, [0] -> 3, [0] -> 1, [1] -> 3, [3] -> 1]
>>> pyro.execute(error=0.3)
>>> print(f'[{", ".join(map(str, pyro.get_fds()))}]')
[[2] -> 1, [0] -> 2, [2] -> 0, [2] -> 3, [0] -> 1, [3] -> 2, [3] -> 1, [1] -> 2, [3] -> 0, [0] -> 3, [4] -> 1, [1] -> 0, [1] -> 3, [4] -> 2, [4] -> 3, [2] -> 4, [3] -> 4, [0] -> 4, [1] -> 4]
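
For readers unfamiliar with metric functional dependencies (example 3 above), the following pure-Python sketch shows one simple way to check such a dependency for a single numeric right-hand-side column: rows that agree on the left-hand side must have right-hand-side values that lie within the given distance of each other. This is only an illustration under that assumption; the function name `check_mfd` and the toy data are hypothetical, and it is not the algorithm implemented in Desbordante.

```python
from collections import defaultdict


def check_mfd(rows, lhs, rhs, parameter):
    """Return True if, within every group of rows sharing the same LHS
    values, the numeric RHS values differ by at most `parameter`."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[c] for c in lhs)].append(row[rhs])
    # For one numeric column, the largest pairwise distance is max - min.
    return all(max(vals) - min(vals) <= parameter for vals in groups.values())


# Hypothetical toy data: (Title, Theatre, Duration in minutes)
rows = [
    ("Film A", "Theatre 1", 139),
    ("Film A", "Theatre 2", 142),
    ("Film B", "Theatre 1", 180),
    ("Film B", "Theatre 3", 195),
]
print(check_mfd(rows, lhs=[0], rhs=2, parameter=5))   # False
print(check_mfd(rows, lhs=[0], rhs=2, parameter=15))  # True
```

With a radius of 5 the check fails because the two "Film B" durations are 15 minutes apart; widening the radius to 15 makes the dependency hold.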

More examples can be found in the Desbordante repository on GitHub.

I still don't understand how to use Desbordante and patterns :(

No worries! Desbordante offers a novel type of data profiling, so you may first need to familiarize yourself with its concepts and usage. The most challenging part of Desbordante is the primitives: their definitions and their practical applications. To help you get started, here’s a step-by-step guide:

  1. First of all, explore the guides on our website. Since our team currently does not include technical writers, it's possible that some guides may be missing.
  2. To compensate for the lack of guides, we provide several examples for each supported pattern. These examples illustrate both the pattern itself and how to use it in Python. You can check them out here.
  3. Each of our patterns was introduced in a research paper. These papers typically provide a formal definition of the pattern, examples of use, and its application scope. We recommend at least skimming through them. Don't be discouraged by the complexity of the papers! To effectively use the patterns, you only need to read the more accessible parts, such as the introduction and the example sections.
  4. Finally, do not hesitate to ask questions on the mailing list (link below) or to create an issue.

Papers about patterns

Here is a list of papers about patterns, organized in the recommended reading order in each item:

Installation

The source code is currently hosted on GitHub at https://github.com/Desbordante/desbordante-core

Wheels for the latest released version are available at the Python Package Index (PyPI).

Currently only manylinux2014 (Ubuntu 20.04+, or any other Linux distribution with GCC 10+) is supported.

$ pip install desbordante
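
After installing, a quick way to confirm that the package is visible to your interpreter is a small standard-library check like the sketch below. The helper name `desbordante_status` and its messages are hypothetical; the only fact it relies on is that the module is imported as `desbordante`, as in the usage examples.

```python
import importlib.util
import sys


def desbordante_status():
    """Return a short, human-readable description of the local setup."""
    if importlib.util.find_spec("desbordante") is not None:
        return "desbordante is importable"
    if not sys.platform.startswith("linux"):
        return "desbordante not found; prebuilt wheels target Linux only"
    return "desbordante not found; try: pip install desbordante"


print(desbordante_status())
```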

Installation from sources

Install all dependencies listed in README.md.

Then, in the Desbordante directory (the same one that contains this file), execute:

./build.sh
python3 -m venv venv
source venv/bin/activate
python3 -m pip install .

Troubleshooting

No type hints in IDE

If type hints don't work in your IDE (for example, Visual Studio Code), install the stubs with the following command:

pip install desbordante-stubs

NOTE: The stubs may not fully support the current version of the desbordante package, as they are updated independently.

Cite

If you use this software for research, please cite one of our papers:

  1. G. Chernishev, et al., "Solving Data Quality Problems with Desbordante: a Demo," CoRR abs/2307.14935, 2023.
  2. G. Chernishev, et al., "Desbordante: from benchmarking suite to high-performance science-intensive data profiler (preprint)," CoRR abs/2301.05965, 2023.
  3. M. Strutovskiy, N. Bobrov, K. Smirnov and G. Chernishev, "Desbordante: a Framework for Exploring Limits of Dependency Discovery Algorithms," 2021 29th Conference of Open Innovations Association (FRUCT), 2021, pp. 344-354, doi: 10.23919/FRUCT52173.2021.9435469.
  4. A. Smirnov, A. Chizhov, I. Shchuckin, N. Bobrov and G. Chernishev, "Fast Discovery of Inclusion Dependencies with Desbordante," 2023 33rd Conference of Open Innovations Association (FRUCT), Zilina, Slovakia, 2023, pp. 264-275, doi: 10.23919/FRUCT58615.2023.10143047.
  5. Y. Kuzin, D. Shcheka, M. Polyntsov, K. Stupakov, M. Firsov and G. Chernishev, "Order in Desbordante: Techniques for Efficient Implementation of Order Dependency Discovery Algorithms," 2024 35th Conference of Open Innovations Association (FRUCT), Tampere, Finland, 2024, pp. 413-424.
  6. I. Barutkin, M. Fofanov, S. Belokonny, V. Makeev and G. Chernishev, "Extending Desbordante with Probabilistic Functional Dependency Discovery Support," 2024 35th Conference of Open Innovations Association (FRUCT), Tampere, Finland, 2024, pp. 158-169.

Contacts and Q&A

If you have any questions about using the tool, you can ask them in our Google group. To contact the dev team, email George Chernishev, Maxim Strutovsky or Nikita Bobrov.
