Science-intensive high-performance data profiler

Desbordante: high-performance data profiler

What is it?

Desbordante is a high-performance data profiler oriented towards exploratory data analysis.

Try the web version at https://desbordante.unidata-platform.ru/

Main Features

Desbordante can discover and validate a range of data patterns, such as:

  1. Functional dependencies, both exact and approximate (discovery and validation)
  2. Metric functional dependencies (validation)
  3. Fuzzy algebraic constraints (discovery)
  4. Unique column combinations (discovery and validation)
  5. Association rules (discovery)
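
To make the first item concrete: an exact functional dependency X -> A holds when any two rows that agree on the columns X also agree on the column A. The sketch below checks this in plain Python. It does not use the Desbordante API; `fd_holds` and the toy table are hypothetical and only illustrate the definition.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if the columns at indices `lhs` determine the column `rhs`."""
    seen = {}
    for row in rows:
        key = tuple(row[i] for i in lhs)
        value = row[rhs]
        # setdefault stores the first RHS value seen for this LHS key and
        # returns the stored value on later encounters.
        if seen.setdefault(key, value) != value:
            return False
    return True


# Hypothetical toy table with columns (course, professor, semester).
rows = [
    ("Math", "Smith", "Fall"),
    ("Math", "Smith", "Spring"),
    ("Physics", "Jones", "Fall"),
    ("Physics", "Brown", "Spring"),
]

print(fd_holds(rows, [0], 1))     # course -> professor: False
print(fd_holds(rows, [0, 2], 1))  # (course, semester) -> professor: True
```

Discovery algorithms search for all such dependencies at once, which is far more expensive than checking a single candidate as done here.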

This package uses the library of the Desbordante platform, which is written in C++. As a result, depending on the algorithm and dataset, runtimes may be 2-10 times shorter than those of alternative tools.

Usage examples

  1. Discover all exact functional dependencies in a table represented by a .csv file that uses a comma as the separator and has a header row. In this example the FD discovery algorithm HyFD is used.
import desbordante

TABLE = 'examples/datasets/university_fd.csv'

algo = desbordante.HyFD()
algo.load_data(TABLE, ',', True)
algo.execute()
result = algo.get_fds()
print('FDs:')
for fd in result:
    print(fd)
FDs:
( 1 3 ) -> 0
( 1 3 ) -> 2
( 0 ) -> 2
( 0 3 ) -> 1
( 2 ) -> 0
( 2 3 ) -> 1
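
The output above identifies columns by zero-based index. A small helper can make it easier to read by substituting column names. This is a sketch that parses the printed form shown above; `describe_fd` is a hypothetical helper, not part of the package, and the header names are placeholders rather than the real columns of the dataset.

```python
import re


def describe_fd(fd_text, column_names):
    """Turn a printed FD such as '( 1 3 ) -> 0' into a string with column names."""
    match = re.fullmatch(r"\(\s*([\d\s]*?)\s*\)\s*->\s*(\d+)", fd_text.strip())
    if match is None:
        raise ValueError(f"unrecognized FD format: {fd_text!r}")
    lhs = [column_names[int(i)] for i in match.group(1).split()]
    rhs = column_names[int(match.group(2))]
    return f"[{', '.join(lhs)}] -> {rhs}"


# Placeholder header; in practice, read it from the first row of the CSV file.
header = ["col_a", "col_b", "col_c", "col_d"]
print(describe_fd("( 1 3 ) -> 0", header))  # [col_b, col_d] -> col_a
```

Since the example prints each dependency with `print(fd)`, passing `str(fd)` to the helper should yield the same text that is parsed here.
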
  2. Discover all approximate functional dependencies with error less than or equal to 0.1 in a table represented by a .csv file that uses a comma as the separator and has a header row. In this example the AFD discovery algorithm Pyro is used.
import desbordante

TABLE = 'examples/datasets/inventory_afd.csv'
ERROR = 0.1

algo = desbordante.Pyro()
algo.load_data(TABLE, ',', True)
algo.execute(error=ERROR)
result = algo.get_fds()
print('AFDs:')
for fd in result:
    print(fd)
AFDs:
( 0 ) -> 1
( 0 ) -> 2
( 1 ) -> 2
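
To give some intuition for the error threshold: one common way to measure how badly a dependency X -> A is violated is the share of ordered pairs of distinct rows that agree on X but differ on A (often called the g1 measure). The sketch below computes it in plain Python. Whether Pyro in this package computes its error in exactly this way is an assumption, not something stated above; `g1_error` and the toy rows are hypothetical.

```python
from collections import Counter, defaultdict


def g1_error(rows, lhs, rhs):
    """Fraction of ordered pairs of distinct rows that violate lhs -> rhs."""
    n = len(rows)
    if n < 2:
        return 0.0
    # For each LHS value combination, count how often each RHS value occurs.
    groups = defaultdict(Counter)
    for row in rows:
        groups[tuple(row[i] for i in lhs)][row[rhs]] += 1
    violating = 0
    for counts in groups.values():
        size = sum(counts.values())
        # All ordered pairs inside the group minus those that agree on the RHS.
        violating += size * size - sum(c * c for c in counts.values())
    return violating / (n * n - n)


# Hypothetical toy table with columns (item, price).
rows = [("A", 1), ("A", 1), ("A", 2), ("B", 3)]
# 4 violating ordered pairs out of 12, so the error is about 0.33.
print(g1_error(rows, [0], 1))
```

Under this measure, a threshold of 0.0 admits only exact dependencies, and larger thresholds admit dependencies with progressively more violating pairs.
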
  3. Check whether metric functional dependency “Title -> Duration” with radius 5 (using the Euclidean metric) holds in a table represented by a .csv file that uses a comma as the separator and has a header row. In this example the default MFD validation algorithm (BRUTE) is used.
import desbordante

TABLE = 'examples/datasets/theatres_mfd.csv'
METRIC = 'euclidean'
LHS_INDICES = [0]
RHS_INDICES = [2]
PARAMETER = 5

algo = desbordante.MetricVerifier()
algo.load_data(TABLE, ',', True)
algo.execute(lhs_indices=LHS_INDICES, metric=METRIC,
             parameter=PARAMETER, rhs_indices=RHS_INDICES)
if algo.mfd_holds():
    print('MFD holds')
else:
    print('MFD does not hold')
MFD holds
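
For intuition, a metric functional dependency relaxes the exact one: rows that agree on the left-hand side may differ on the right-hand side, as long as their right-hand side values stay within the given distance of each other. The plain-Python sketch below checks this pairwise for a single numeric column, where the Euclidean distance reduces to an absolute difference. It is not the library's BRUTE implementation; `mfd_holds` and the toy rows are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def mfd_holds(rows, lhs, rhs, parameter):
    """Check that RHS values of rows sharing an LHS differ by at most `parameter`."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[i] for i in lhs)].append(row[rhs])
    # Compare every pair of RHS values inside each group.
    return all(
        abs(a - b) <= parameter
        for values in groups.values()
        for a, b in combinations(values, 2)
    )


# Hypothetical toy table with columns (title, theatre, duration).
rows = [
    ("Film A", "T1", 100),
    ("Film A", "T2", 103),
    ("Film B", "T1", 90),
    ("Film B", "T3", 94),
]

print(mfd_holds(rows, [0], 2, 5))  # True: durations differ by at most 4
print(mfd_holds(rows, [0], 2, 3))  # False: "Film B" durations differ by 4
```
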
  4. Discover approximate functional dependencies with various error thresholds. Here, we showcase the preferred approach to configuring algorithm options. Furthermore, we are using a pandas dataframe to load data from a CSV file.
>>> import desbordante
>>> import pandas as pd
>>> pyro = desbordante.Pyro()
>>> df = pd.read_csv('iris.csv', sep=',', header=0)
>>> pyro.load_data(df)
>>> pyro.execute(error=0.0)
>>> pyro.get_fds()
[( 0 1 2 ) -> 4, ( 0 2 3 ) -> 4, ( 0 1 3 ) -> 4, ( 1 2 3 ) -> 4]
>>> pyro.execute(error=0.1)
>>> pyro.get_fds()
[( 2 ) -> 0, ( 2 ) -> 1, ( 0 ) -> 2, ( 2 ) -> 4, ( 2 ) -> 3, ( 3 ) -> 2, ( 3 ) -> 0, ( 0 ) -> 1, ( 0 ) -> 3, ( 1 ) -> 0, ( 1 ) -> 2, ( 3 ) -> 4, ( 3 ) -> 1, ( 1 ) -> 3, ( 0 ) -> 4, ( 1 ) -> 4]
>>> pyro.execute(error=0.2)
>>> pyro.get_fds()
[( 2 ) -> 1, ( 2 ) -> 0, ( 2 ) -> 4, ( 0 ) -> 2, ( 2 ) -> 3, ( 0 ) -> 1, ( 3 ) -> 4, ( 3 ) -> 2, ( 3 ) -> 1, ( 3 ) -> 0, ( 1 ) -> 2, ( 0 ) -> 3, ( 0 ) -> 4, ( 1 ) -> 0, ( 1 ) -> 4, ( 1 ) -> 3]
>>> pyro.execute(error=0.3)
>>> pyro.get_fds()
[( 2 ) -> 1, ( 0 ) -> 2, ( 2 ) -> 0, ( 3 ) -> 0, ( 2 ) -> 3, ( 1 ) -> 0, ( 2 ) -> 4, ( 3 ) -> 2, ( 0 ) -> 1, ( 1 ) -> 2, ( 3 ) -> 1, ( 3 ) -> 4, ( 0 ) -> 3, ( 4 ) -> 2, ( 4 ) -> 1, ( 0 ) -> 4, ( 1 ) -> 3, ( 1 ) -> 4, ( 4 ) -> 3]

More examples can be found in the Desbordante repository on GitHub.

Installation

The source code is currently hosted on GitHub at https://github.com/Mstrutov/Desbordante

Wheels for the latest released version are available at the Python Package Index (PyPI).

Currently only manylinux2014 (Ubuntu 20.04+, or any other Linux distribution with GCC 10+) is supported.

$ pip install desbordante

Installation from sources

Install all dependencies listed in README.md.

Then, in the Desbordante directory (the same one that contains this file), execute:

./build.sh
python3 -m venv venv
source venv/bin/activate
python3 -m pip install .

Cite

If you use this software for research, please cite one of our papers:

  1. George Chernishev, et al. "Solving Data Quality Problems with Desbordante: a Demo". CoRR abs/2307.14935 (2023).
  2. George Chernishev, et al. "Desbordante: from benchmarking suite to high-performance science-intensive data profiler (preprint)". CoRR abs/2301.05965 (2023).
  3. M. Strutovskiy, N. Bobrov, K. Smirnov and G. Chernishev, "Desbordante: a Framework for Exploring Limits of Dependency Discovery Algorithms," 2021 29th Conference of Open Innovations Association (FRUCT), 2021, pp. 344-354, doi: 10.23919/FRUCT52173.2021.9435469.
  4. A. Smirnov, A. Chizhov, I. Shchuckin, N. Bobrov and G. Chernishev, "Fast Discovery of Inclusion Dependencies with Desbordante," 2023 33rd Conference of Open Innovations Association (FRUCT), Zilina, Slovakia, 2023, pp. 264-275, doi: 10.23919/FRUCT58615.2023.10143047.

Contacts and Q&A

If you have any questions about using the tool, you can ask them in our Google group. To contact the dev team, email George Chernishev, Maxim Strutovsky, or Nikita Bobrov.
