Science-intensive high-performance data profiler

Desbordante: high-performance data profiler

What is it?

Desbordante is a high-performance data profiler oriented towards exploratory data analysis.

Try the web version at https://desbordante.unidata-platform.ru/

Main Features

Desbordante can discover and validate a range of data patterns, such as:

  1. Functional dependencies, both exact and approximate (discovery and validation)
  2. Metric functional dependencies (validation)
  3. Fuzzy algebraic constraints (discovery)
  4. Unique column combinations (discovery and validation)
  5. Association rules (discovery)
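To make one of the listed patterns concrete: a column combination is unique when no two rows share the same values in those columns. The following is a minimal pure-Python sketch of that definition only. It does not use Desbordante, and the helper name `is_unique` and the toy rows are made up for illustration.

```python
def is_unique(rows, columns):
    """Return True if no two rows share the same values in `columns`."""
    seen = set()
    for row in rows:
        key = tuple(row[i] for i in columns)
        if key in seen:
            return False  # duplicate value combination found
        seen.add(key)
    return True


# Hypothetical toy table with columns (Id, ProductName, Price).
UCC_ROWS = [
    (1, "pen", 1.0),
    (2, "pen", 1.5),
    (3, "ink", 1.0),
    (4, "ink", 1.5),
]

print(is_unique(UCC_ROWS, [0]))     # Id alone
print(is_unique(UCC_ROWS, [1]))     # ProductName alone
print(is_unique(UCC_ROWS, [1, 2]))  # ProductName together with Price
```

In this toy table, `Id` is unique on its own, `ProductName` is not, and `ProductName` combined with `Price` is.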

This package is built on the core library of the Desbordante platform, which is written in C++. As a result, depending on the algorithm and dataset, runtimes may be 2-10 times shorter than those of alternative tools.

Usage examples

  1. Discover all exact functional dependencies in a table represented by a .csv file that uses a comma as the separator and has a header row. In this example the FD discovery algorithm HyFD is used.
import desbordante

TABLE = 'examples/datasets/university_fd.csv'

algo = desbordante.fd.algorithms.Default()
algo.load_data(table=(TABLE, ',', True))
algo.execute()
result = algo.get_fds()
print('FDs:')
for fd in result:
    print(fd)

Output:

FDs:
[Course Classroom] -> Professor
[Classroom Semester] -> Professor
[Classroom Semester] -> Course
[Professor] -> Course
[Professor Semester] -> Classroom
[Course Semester] -> Classroom
[Course Semester] -> Professor
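To read the output above: an exact functional dependency `X -> A` holds when any two rows that agree on all columns in `X` also agree on `A`. The sketch below checks that definition directly in plain Python. It is an illustration of the concept rather than the algorithm Desbordante runs, and the helper name `fd_holds` and the toy rows are hypothetical.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if the columns in `lhs` functionally determine column `rhs`."""
    seen = {}  # LHS value tuple -> RHS value first observed with it
    for row in rows:
        key = tuple(row[i] for i in lhs)
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False  # same LHS values, different RHS value
    return True


# Hypothetical toy table with columns (Professor, Course, Semester).
FD_ROWS = [
    ("Dr. Smith", "Math", "Fall"),
    ("Dr. Smith", "Math", "Spring"),
    ("Dr. Jones", "Physics", "Fall"),
    ("Dr. Lee", "Math", "Fall"),
]

print(fd_holds(FD_ROWS, lhs=[0], rhs=1))  # Professor -> Course
print(fd_holds(FD_ROWS, lhs=[1], rhs=0))  # Course -> Professor
```

In the toy data every professor teaches a single course, so `Professor -> Course` holds, while "Math" is taught by two professors, so `Course -> Professor` does not.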
  1. Discover all approximate functional dependencies with error less than or equal to 0.1 in a table represented by a .csv file that uses a comma as the separator and has a header row. In this example the AFD discovery algorithm Pyro is used.
import desbordante

TABLE = 'examples/datasets/inventory_afd.csv'
ERROR = 0.1

algo = desbordante.afd.algorithms.Default()
algo.load_data(table=(TABLE, ',', True))
algo.execute(error=ERROR)
result = algo.get_fds()
print('AFDs:')
for fd in result:
    print(fd)

Output:

AFDs:
[Id] -> Price
[Id] -> ProductName
[ProductName] -> Price
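An approximate functional dependency tolerates a bounded share of violations. One way to quantify that share is the fraction of ordered row pairs that agree on the left-hand side but differ on the right-hand side. The sketch below computes this measure in plain Python. Whether the `error` parameter above is computed in exactly this way is an assumption, and the function name `pairwise_fd_error` and the toy rows are invented for the example.

```python
from collections import Counter, defaultdict


def pairwise_fd_error(rows, lhs, rhs):
    """Fraction of ordered row pairs that agree on `lhs` but differ on `rhs`."""
    n = len(rows)
    if n < 2:
        return 0.0
    # Group rows by their LHS values and count RHS values inside each group.
    groups = defaultdict(Counter)
    for row in rows:
        groups[tuple(row[i] for i in lhs)][row[rhs]] += 1
    violating = 0
    for rhs_counts in groups.values():
        size = sum(rhs_counts.values())
        # Ordered pairs in the group minus the pairs that agree on the RHS.
        violating += size * size - sum(c * c for c in rhs_counts.values())
    return violating / (n * n - n)


# Hypothetical toy table with columns (Id, ProductName, Price).
AFD_ROWS = [
    (1, "pen", 1.0),
    (2, "pen", 1.0),
    (3, "pen", 1.5),
    (4, "ink", 3.0),
    (5, "ink", 3.0),
]

print(pairwise_fd_error(AFD_ROWS, lhs=[0], rhs=2))  # Id -> Price
print(pairwise_fd_error(AFD_ROWS, lhs=[1], rhs=2))  # ProductName -> Price
```

Under this measure, `Id -> Price` has no violating pairs in the toy table, while `ProductName -> Price` has 4 violating ordered pairs out of 20, so it would pass a threshold of 0.2 but not one of 0.1.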
  1. Check whether metric functional dependency “Title -> Duration” with radius 5 (using the Euclidean metric) holds in a table represented by a .csv file that uses a comma as the separator and has a header row. In this example the default MFD validation algorithm (BRUTE) is used.
import desbordante

TABLE = 'examples/datasets/theatres_mfd.csv'
METRIC = 'euclidean'
LHS_INDICES = [0]
RHS_INDICES = [2]
PARAMETER = 5

algo = desbordante.mfd_verification.algorithms.Default()
algo.load_data(table=(TABLE, ',', True))
algo.execute(lhs_indices=LHS_INDICES, metric=METRIC,
             parameter=PARAMETER, rhs_indices=RHS_INDICES)
if algo.mfd_holds():
    print('MFD holds')
else:
    print('MFD does not hold')

Output:

MFD holds
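A metric functional dependency relaxes exact equality on the right-hand side: rows that agree on the left-hand side may differ on the right-hand side, as long as those values stay within the given distance of each other. For a single numeric column this reduces to comparing the spread (maximum minus minimum) in each group with the parameter. The sketch below illustrates that reading of the definition in plain Python. It is not the library's BRUTE implementation, and the name `mfd_holds_1d` and the toy rows are hypothetical.

```python
from collections import defaultdict


def mfd_holds_1d(rows, lhs, rhs, parameter):
    """Check a metric FD whose right-hand side is one numeric column."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[i] for i in lhs)].append(row[rhs])
    # On a number line, the largest pairwise distance in a group is max - min.
    return all(max(vals) - min(vals) <= parameter for vals in groups.values())


# Hypothetical toy table with columns (Title, Theatre, Duration).
MFD_ROWS = [
    ("Alpha", "North", 100),
    ("Alpha", "South", 103),
    ("Beta", "North", 90),
    ("Beta", "South", 97),
]

print(mfd_holds_1d(MFD_ROWS, lhs=[0], rhs=2, parameter=5))
print(mfd_holds_1d(MFD_ROWS, lhs=[0], rhs=2, parameter=7))
```

In the toy data the durations recorded for "Beta" differ by 7, so the dependency fails with parameter 5 and holds with parameter 7.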
  1. Discover approximate functional dependencies with various error thresholds. Here, we are using a pandas DataFrame to load data from a CSV file.
>>> import desbordante
>>> import pandas as pd
>>> pyro = desbordante.afd.algorithms.Pyro()  # same as desbordante.afd.algorithms.Default()
>>> df = pd.read_csv('examples/datasets/iris.csv', sep=',', header=None)
>>> pyro.load_data(table=df)
>>> pyro.execute(error=0.0)
>>> print(f'[{", ".join(map(str, pyro.get_fds()))}]')
[[0 1 2] -> 4, [0 2 3] -> 4, [0 1 3] -> 4, [1 2 3] -> 4]
>>> pyro.execute(error=0.1)
>>> print(f'[{", ".join(map(str, pyro.get_fds()))}]')
[[2] -> 0, [2] -> 3, [2] -> 1, [0] -> 2, [3] -> 0, [0] -> 3, [0] -> 1, [1] -> 3, [1] -> 0, [3] -> 2, [3] -> 1, [1] -> 2, [2] -> 4, [3] -> 4, [0] -> 4, [1] -> 4]
>>> pyro.execute(error=0.2)
>>> print(f'[{", ".join(map(str, pyro.get_fds()))}]')
[[2] -> 0, [0] -> 2, [3] -> 2, [1] -> 2, [2] -> 4, [3] -> 4, [0] -> 4, [1] -> 4, [3] -> 0, [1] -> 0, [2] -> 3, [2] -> 1, [0] -> 3, [0] -> 1, [1] -> 3, [3] -> 1]
>>> pyro.execute(error=0.3)
>>> print(f'[{", ".join(map(str, pyro.get_fds()))}]')
[[2] -> 1, [0] -> 2, [2] -> 0, [2] -> 3, [0] -> 1, [3] -> 2, [3] -> 1, [1] -> 2, [3] -> 0, [0] -> 3, [4] -> 1, [1] -> 0, [1] -> 3, [4] -> 2, [4] -> 3, [2] -> 4, [3] -> 4, [0] -> 4, [1] -> 4]

More examples can be found in the Desbordante repository on GitHub.

Installation

The source code is currently hosted on GitHub at https://github.com/Mstrutov/Desbordante

Wheels for the latest released version are available at the Python Package Index (PyPI).

Currently only manylinux2014 (Ubuntu 20.04+, or any other Linux distribution with GCC 10+) is supported.

$ pip install desbordante
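After installing, one way to confirm that the package is visible to the current interpreter, without importing it, is `importlib.util.find_spec` from the Python standard library. This is only a convenience sketch, and the helper name `is_importable` is made up here.

```python
import importlib.util


def is_importable(module_name):
    """Return True if `module_name` can be located by the import system."""
    return importlib.util.find_spec(module_name) is not None


# Prints True when the wheel is installed in the active environment.
print(is_importable("desbordante"))
```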

Installation from sources

Install all dependencies listed in README.md.

Then, in the Desbordante directory (the same one that contains this file), execute:

./build.sh
python3 -m venv venv
source venv/bin/activate
python3 -m pip install .

Cite

If you use this software for research, please cite one of our papers:

  1. George Chernishev, et al. "Solving Data Quality Problems with Desbordante: a Demo". CoRR abs/2307.14935 (2023).
  2. George Chernishev, et al. "Desbordante: from benchmarking suite to high-performance science-intensive data profiler (preprint)". CoRR abs/2301.05965 (2023).
  3. M. Strutovskiy, N. Bobrov, K. Smirnov and G. Chernishev, "Desbordante: a Framework for Exploring Limits of Dependency Discovery Algorithms," 2021 29th Conference of Open Innovations Association (FRUCT), 2021, pp. 344-354, doi: 10.23919/FRUCT52173.2021.9435469.
  4. A. Smirnov, A. Chizhov, I. Shchuckin, N. Bobrov and G. Chernishev, "Fast Discovery of Inclusion Dependencies with Desbordante," 2023 33rd Conference of Open Innovations Association (FRUCT), Zilina, Slovakia, 2023, pp. 264-275, doi: 10.23919/FRUCT58615.2023.10143047.

Contacts and Q&A

If you have any questions regarding the tool usage, you can ask them in our Google group. To contact the dev team, email George Chernishev, Maxim Strutovsky or Nikita Bobrov.
