Skip to main content

Science-intensive high-performance data profiler

Project description


Desbordante: high-performance data profiler

What is it?

Desbordante is a high-performance data profiler oriented towards exploratory data analysis

Try the web version at https://desbordante.unidata-platform.ru/

Table of Contents

Main Features

Desbordante is a high-performance data profiler that is capable of discovering and validating many different patterns in data using various algorithms.

The Discovery task is designed to identify all instances of a specified pattern type of a given dataset.

The Validation task is different: it is designed to check whether a specified pattern instance is present in a given dataset. This task not only returns True or False, but it also explains why the instance does not hold (e.g. it can list table rows with conflicting values).

The currently supported data patterns are:

  • Functional dependency variants:
    • Exact functional dependencies (discovery and validation)
    • Approximate functional dependencies, with g1 metric (discovery and validation)
    • Probabilistic functional dependencies, with PerTuple and PerValue metrics (discovery)
  • Graph functional dependencies (validation)
  • Conditional functional dependencies (discovery)
  • Inclusion dependencies (discovery)
  • Order dependencies:
    • set-based axiomatization (discovery)
    • list-based axiomatization (discovery)
  • Metric functional dependencies (validation)
  • Fuzzy algebraic constraints (discovery)
  • Unique column combinations:
    • Exact unique column combination (discovery and validation)
    • Approximate unique column combination, with g1 metric (discovery and validation)
  • Association rules (discovery)

This package uses the library of the Desbordante platform, which is written in C++. This means that depending on the algorithm and dataset, the runtimes may be cut by 2-10 times compared to the alternatives.

Usage examples

  1. Discover all exact functional dependencies in a table stored in a comma-separated file with a header row. In this example the default FD discovery algorithm (HyFD) is used.
import desbordante

TABLE = 'examples/datasets/university_fd.csv'

algo = desbordante.fd.algorithms.Default()
algo.load_data(table=(TABLE, ',', True))
algo.execute()
result = algo.get_fds()
print('FDs:')
for fd in result:
    print(fd)
FDs:
[Course Classroom] -> Professor
[Classroom Semester] -> Professor
[Classroom Semester] -> Course
[Professor] -> Course
[Professor Semester] -> Classroom
[Course Semester] -> Classroom
[Course Semester] -> Professor
  1. Discover all approximate functional dependencies with error less than or equal to 0.1 in a table represented by a .csv file that uses a comma as the separator and has a header row. In this example the AFD discovery algorithm Pyro is used.
import desbordante

TABLE = 'examples/datasets/inventory_afd.csv'
ERROR = 0.1

algo = desbordante.afd.algorithms.Default()
algo.load_data(table=(TABLE, ',', True))
algo.execute(error=ERROR)
result = algo.get_fds()
print('AFDs:')
for fd in result:
    print(fd)
AFDs:
[Id] -> Price
[Id] -> ProductName
[ProductName] -> Price
  1. Check whether metric functional dependency “Title -> Duration” with radius 5 (using the Euclidean metric) holds in a table represented by a .csv file that uses a comma as the separator and has a header row. In this example the default MFD validation algorithm (BRUTE) is used.
import desbordante

TABLE = 'examples/datasets/theatres_mfd.csv'
METRIC = 'euclidean'
LHS_INDICES = [0]
RHS_INDICES = [2]
PARAMETER = 5

algo = desbordante.mfd_verification.algorithms.Default()
algo.load_data(table=(TABLE, ',', True))
algo.execute(lhs_indices=LHS_INDICES, metric=METRIC,
             parameter=PARAMETER, rhs_indices=RHS_INDICES)
if algo.mfd_holds():
    print('MFD holds')
else:
    print('MFD does not hold')
MFD holds
  1. Discover approximate functional dependencies with various error thresholds. Here, we are using a pandas DataFrame to load data from a CSV file.
>>> import desbordante
>>> import pandas as pd
>>> pyro = desbordante.afd.algorithms.Pyro()  # same as desbordante.afd.algorithms.Default()
>>> df = pd.read_csv('examples/datasets/iris.csv', sep=',', header=None)
>>> pyro.load_data(table=df)
>>> pyro.execute(error=0.0)
>>> print(f'[{", ".join(map(str, pyro.get_fds()))}]')
[[0 1 2] -> 4, [0 2 3] -> 4, [0 1 3] -> 4, [1 2 3] -> 4]
>>> pyro.execute(error=0.1)
>>> print(f'[{", ".join(map(str, pyro.get_fds()))}]')
[[2] -> 0, [2] -> 3, [2] -> 1, [0] -> 2, [3] -> 0, [0] -> 3, [0] -> 1, [1] -> 3, [1] -> 0, [3] -> 2, [3] -> 1, [1] -> 2, [2] -> 4, [3] -> 4, [0] -> 4, [1] -> 4]
>>> pyro.execute(error=0.2)
>>> print(f'[{", ".join(map(str, pyro.get_fds()))}]')
[[2] -> 0, [0] -> 2, [3] -> 2, [1] -> 2, [2] -> 4, [3] -> 4, [0] -> 4, [1] -> 4, [3] -> 0, [1] -> 0, [2] -> 3, [2] -> 1, [0] -> 3, [0] -> 1, [1] -> 3, [3] -> 1]
>>> pyro.execute(error=0.3)
>>> print(f'[{", ".join(map(str, pyro.get_fds()))}]')
[[2] -> 1, [0] -> 2, [2] -> 0, [2] -> 3, [0] -> 1, [3] -> 2, [3] -> 1, [1] -> 2, [3] -> 0, [0] -> 3, [4] -> 1, [1] -> 0, [1] -> 3, [4] -> 2, [4] -> 3, [2] -> 4, [3] -> 4, [0] -> 4, [1] -> 4]

More examples can be found in the Desbordante repository on GitHub.

I still don't understand how to use Desbordante and patterns :(

No worries! Desbordante offers a novel type of data profiling, which may require that you first familiarize yourself with its concepts and usage. The most challenging part of Desbordante are the primitives: their definitions and applications in practice. To help you get started, here’s a step-by-step guide:

  1. First of all, explore the guides on our website. Since our team currently does not include technical writers, it's possible that some guides may be missing.
  2. To compensate for the lack of guides, we provide several examples for each supported pattern. These examples illustrate both the pattern itself and how to use it in Python. You can check them out here.
  3. Each of our patterns was introduced in a research paper. These papers typically provide a formal definition of the pattern, examples of use, and its application scope. We recommend at least skimming through them. Don't be discouraged by the complexity of the papers! To effectively use the patterns, you only need to read the more accessible parts, such as the introduction and the example sections.
  4. Finally, do not hesitate to ask questions in the mailing list (link below) or create an issue.

Papers about patterns

Here is a list of papers about patterns, organized in the recommended reading order in each item:

Installation

The source code is currently hosted on GitHub at https://github.com/Desbordante/desbordante-core

Wheels for the latest released version are available at the Python Package Index (PyPI).

Currently only manylinux2014 (Ubuntu 20.04+, or any other linux distribution with gcc 10+) is supported.

$ pip install desbordante

Installation from sources

Install all dependencies listed in README.md.

Then, in the Desbordante directory (the same one that contains this file), execute:

./build.sh
python3 -m venv venv
source venv/bin/activate
python3 -m pip install .

Troubleshooting

No type hints in IDE

If type hints don't work for you in Visual Studio Code, for example, then install stubs using the command:

pip install desbordate-stubs

NOTE: Stubs may not fully support current version of desbordante package, as they are updated independently.

Cite

If you use this software for research, please cite one of our papers:

  1. George Chernishev, et al. Solving Data Quality Problems with Desbordante: a Demo. CoRR abs/2307.14935 (2023).
  2. George Chernishev, et al. "Desbordante: from benchmarking suite to high-performance science-intensive data profiler ( preprint)". CoRR abs/2301.05965. (2023).
  3. M. Strutovskiy, N. Bobrov, K. Smirnov and G. Chernishev, "Desbordante: a Framework for Exploring Limits of Dependency Discovery Algorithms," 2021 29th Conference of Open Innovations Association (FRUCT), 2021, pp. 344-354, doi: 10.23919/FRUCT52173.2021.9435469.
  4. A. Smirnov, A. Chizhov, I. Shchuckin, N. Bobrov and G. Chernishev, "Fast Discovery of Inclusion Dependencies with Desbordante," 2023 33rd Conference of Open Innovations Association (FRUCT), Zilina, Slovakia, 2023, pp. 264-275, doi: 10.23919/FRUCT58615.2023.10143047.

Contacts and Q&A

If you have any questions regarding the tool usage you can ask it in our google group. To contact dev team email George Chernishev, Maxim Strutovsky or Nikita Bobrov.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distributions

desbordante-2.0.0-pp310-pypy310_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.6 MB view details)

Uploaded PyPy manylinux: glibc 2.17+ x86-64

desbordante-2.0.0-pp39-pypy39_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.6 MB view details)

Uploaded PyPy manylinux: glibc 2.17+ x86-64

desbordante-2.0.0-pp38-pypy38_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.6 MB view details)

Uploaded PyPy manylinux: glibc 2.17+ x86-64

desbordante-2.0.0-pp37-pypy37_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.6 MB view details)

Uploaded PyPy manylinux: glibc 2.17+ x86-64

desbordante-2.0.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.6 MB view details)

Uploaded CPython 3.12 manylinux: glibc 2.17+ x86-64

desbordante-2.0.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.6 MB view details)

Uploaded CPython 3.11 manylinux: glibc 2.17+ x86-64

desbordante-2.0.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.6 MB view details)

Uploaded CPython 3.10 manylinux: glibc 2.17+ x86-64

desbordante-2.0.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.6 MB view details)

Uploaded CPython 3.9 manylinux: glibc 2.17+ x86-64

desbordante-2.0.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.6 MB view details)

Uploaded CPython 3.8 manylinux: glibc 2.17+ x86-64

desbordante-2.0.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.6 MB view details)

Uploaded CPython 3.7m manylinux: glibc 2.17+ x86-64

File details

Details for the file desbordante-2.0.0-pp310-pypy310_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for desbordante-2.0.0-pp310-pypy310_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 4176573dc5b6c290c6fe623d4db13e14ad2ed3c37ab1d8a761308a5acee31c57
MD5 1e4e708f3e4b6ceeca9871de9a148c05
BLAKE2b-256 5b985421dc22fb0f1254b7b3675bf556f6156c8afd417c0805bd2422e5638c51

See more details on using hashes here.

File details

Details for the file desbordante-2.0.0-pp39-pypy39_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for desbordante-2.0.0-pp39-pypy39_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 e07df535385b42f28edb06f2d6b53e97bfb04e7fdf601307b31f65d669865b4b
MD5 d8bf9e7be9f61d4d2974b49ac88f2b57
BLAKE2b-256 76158bb051d81ef50f10a96f7d435bfa0f18d1fcf951285a76cd16062b947f9d

See more details on using hashes here.

File details

Details for the file desbordante-2.0.0-pp38-pypy38_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for desbordante-2.0.0-pp38-pypy38_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 de45c40c85070e4d20a354ae64c5915cd42c85317711629bf3ae2ebcc20dbc84
MD5 1b6e9d5ba26bc2c2401f5812ece331eb
BLAKE2b-256 67e89ccb60958354d7be74e39affb27acec1b89d17df6fbaf2113cb1df25b1cf

See more details on using hashes here.

File details

Details for the file desbordante-2.0.0-pp37-pypy37_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for desbordante-2.0.0-pp37-pypy37_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 608673cf0cab912d51f4958e91e4fe902ccec63951c84b366c018973d66087dc
MD5 1f200f01b7349233f2b81faef747588b
BLAKE2b-256 c4fb5f234dfaf9bd60163c8e32ac679e493341e5cdd202a8907b6504dc7220c5

See more details on using hashes here.

File details

Details for the file desbordante-2.0.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for desbordante-2.0.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 148e1aa350cb9cc4b3f5409338cae023a452f81b3b3e2fcfebdc5b3f1374048a
MD5 54dad1f4c4e0c4befe79295a3379e67b
BLAKE2b-256 e820094a43604ec1690e4da7d84d91f45da59820cd078a37aa822d63abf03b8f

See more details on using hashes here.

File details

Details for the file desbordante-2.0.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for desbordante-2.0.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 32b44e6cd722da0c0b781b88fb8b18dbb3b86005bedea7aee6a665dc86b7732e
MD5 9e744abdb48dba4ee02f2171bf8e7212
BLAKE2b-256 bce82e12d3f81c1930753d5b315fdc04a22c459575c1c491d5f80b3c86e78762

See more details on using hashes here.

File details

Details for the file desbordante-2.0.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for desbordante-2.0.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 71dbca4e9589fb685da4ee6795f187c63e4d20cac7869a558ec3749e0642d822
MD5 64c8b152408479e14df012c68913cb91
BLAKE2b-256 df1ccad07faa4b55654279b7eb61a9d1ca7b30f36d06d9e2f4cbcfe337f44962

See more details on using hashes here.

File details

Details for the file desbordante-2.0.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for desbordante-2.0.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 e8624dcc25b9f8d34ae7dafe9b203872101c2b208d61b7c31b5a72582ce23801
MD5 226b61f157f7cb32041f8c78711729ae
BLAKE2b-256 2893f6113f1fa004f8d7ebe0db5f2adc06f73163426073b0b4c73fb17acb095f

See more details on using hashes here.

File details

Details for the file desbordante-2.0.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for desbordante-2.0.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 b1d92819669309417a742deebb4474d2d084e268d8bd853a5b5a8396a81fa69b
MD5 2f86887b3191dcde5d7ea013172dfbbf
BLAKE2b-256 56b25b609a355c1269f2f3f571ce34dad50b47f9db9b0101d2e231405b3b4382

See more details on using hashes here.

File details

Details for the file desbordante-2.0.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for desbordante-2.0.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 7789d204725c9137237cebfb5471d6326d35d721ae049c5814aeabd08c6845ea
MD5 daca18038dd5a7c316174a6f1150840e
BLAKE2b-256 9984de5d32c6597911e87f47143785d6af8f3f10363f35ce57a690463bab079b

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page