A Python library to check for data quality and automatically generate data tests.

StructuredDataProfiling

StructuredDataProfiling is a Python library that automatically profiles structured datasets and facilitates the creation of data tests.

The library creates data tests in the form of Expectations using the great_expectations framework. Expectations are 'declarative statements that a computer can evaluate and semantically meaningful to humans'.

An expectation could be, for example, 'the sum of columns a and b should be equal to one' or 'the values in column c should be non-negative'.
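To make the two examples above concrete, here is a minimal pure-Python sketch of what such checks evaluate. It does not use the great_expectations API; the sample rows and the helper names (sum_equals_one, is_non_negative, evaluate) are hypothetical and exist only for illustration.

```python
import math

# Hypothetical sample rows; column names a, b, c follow the examples in the text.
rows = [
    {"a": 0.25, "b": 0.75, "c": 3.0},
    {"a": 0.5, "b": 0.5, "c": 0.0},
    {"a": 0.9, "b": 0.1, "c": 12.5},
]


def sum_equals_one(row, tol=1e-9):
    """Expectation: the sum of columns a and b should be equal to one."""
    return math.isclose(row["a"] + row["b"], 1.0, abs_tol=tol)


def is_non_negative(row):
    """Expectation: the values in column c should be non-negative."""
    return row["c"] >= 0


def evaluate(expectation, data):
    """Report whether every row passes, plus the indices of failing rows."""
    failing = [i for i, row in enumerate(data) if not expectation(row)]
    return {"success": not failing, "failing_rows": failing}


results = {
    "sum_equals_one": evaluate(sum_equals_one, rows),
    "is_non_negative": evaluate(is_non_negative, rows),
}
print(results)
```

Each check is a declarative rule applied row by row, and the result records which rows violate it. This is the kind of information a data test needs to report.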

StructuredDataProfiling runs a series of tests aimed at identifying statistics, rules, and constraints characterising a given dataset. The information generated by the profiler is collected by performing the following operations:

  • Characterise uni- and bi-variate distributions.
  • Identify data quality issues.
  • Evaluate relationships between attributes (e.g. column C is the difference between columns A and B).
  • Understand ontologies characterising categorical data (e.g. column A contains names, while column B contains geographical places).
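As a rough illustration of two of the operations listed above (spotting a data quality issue and testing a relationship between attributes), the sketch below uses plain Python lists. It is not the library's implementation; the function names and sample columns are hypothetical.

```python
import math


def missing_fraction(values):
    """Share of entries that are None or empty strings (a simple quality signal)."""
    if not values:
        return 0.0
    missing = sum(1 for v in values if v is None or v == "")
    return missing / len(values)


def is_difference(a, b, c, tol=1e-9):
    """Check whether c[i] equals a[i] - b[i] for every row, within a tolerance."""
    if not (len(a) == len(b) == len(c)):
        return False
    return all(math.isclose(x - y, z, abs_tol=tol) for x, y, z in zip(a, b, c))


# Hypothetical columns used only for illustration.
col_a = [10.0, 7.5, 3.0]
col_b = [4.0, 2.5, 3.0]
col_c = [6.0, 5.0, 0.0]

relation_holds = is_difference(col_a, col_b, col_c)
gap_ratio = missing_fraction([1, None, "", 4])
print(relation_holds, gap_ratio)
```

A rule that holds on every row, such as C = A - B here, is a natural candidate for a data test on future batches of the same dataset.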

For an overview of the library outputs, please check the Examples section.

Installation

You can install StructuredDataProfiling by using pip:

pip install structured-profiling

Quickstart

You can import the profiler using

from structured_data_profiling.profiler import DatasetProfiler

You can then create a profiler instance by passing the path to a CSV file:

profiler = DatasetProfiler('./csv_path.csv')

If the dataset has a primary key (for example, to define relations between tables or sequences), you can specify it with the primary key argument, which accepts one or more column names.

To start profiling, run the profile() method:

profiler.profile()

The generate_expectations() method converts the profiling results into data expectations. Note that it requires an existing local great_expectations project. If you have not created one yet, run great_expectations init in your working directory.

profiler.generate_expectations()
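Because generate_expectations() needs a local great_expectations project, a small check before calling it can give a clearer error message. The sketch below assumes that great_expectations init creates a great_expectations folder containing a great_expectations.yml file, as older releases of the framework do; the helper name has_ge_project is hypothetical.

```python
from pathlib import Path


def has_ge_project(working_dir="."):
    """Return True if working_dir holds a great_expectations folder with its config file."""
    project_dir = Path(working_dir) / "great_expectations"
    # Assumption: `great_expectations init` writes great_expectations.yml into this folder.
    return (project_dir / "great_expectations.yml").is_file()


if not has_ge_project():
    print("No local great_expectations project found; run `great_expectations init` first.")
```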

The expectations are generated in JSON format using the great_expectations schema. The method also creates data docs using the renderer provided by the great_expectations library.

These docs can be found in the local folder great_expectations/uncommitted/data_docs.
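To locate the generated pages, the following sketch lists the HTML files under that folder. Only the folder path comes from the text above; the helper name find_data_docs and the choice to search recursively are assumptions.

```python
from pathlib import Path


def find_data_docs(project_root="."):
    """Return the HTML files found under great_expectations/uncommitted/data_docs."""
    docs_dir = Path(project_root) / "great_expectations" / "uncommitted" / "data_docs"
    if not docs_dir.is_dir():
        return []  # No data docs have been generated yet.
    # Search recursively, since the docs may be nested in subfolders.
    return sorted(docs_dir.rglob("*.html"))


for page in find_data_docs():
    print(page)
```

Any of the listed files can be opened directly in a web browser.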

Profiling outputs

The profiler generates three JSON files describing the ingested dataset:

  • column_profiles: contains the statistical characterisation of the dataset columns.
  • dataset_profile: highlights issues and limitations affecting the dataset.
  • tests: contains the data tests found by the profiler.
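The JSON outputs can be inspected with the standard library alone. The sketch below loads every JSON file in a directory and lists its top-level keys. The output directory and exact file names are not stated here, so the output_dir parameter and both function names are hypothetical.

```python
import json
from pathlib import Path


def load_profiles(output_dir):
    """Load every *.json file in output_dir, keyed by file name without extension."""
    profiles = {}
    for path in sorted(Path(output_dir).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            profiles[path.stem] = json.load(handle)
    return profiles


def summarise(profiles):
    """Map each profile name to its sorted top-level keys (or its type name if not a dict)."""
    return {
        name: sorted(content) if isinstance(content, dict) else type(content).__name__
        for name, content in profiles.items()
    }
```

For example, summarise(load_profiles("./profiler_output")) gives a quick view of what each file contains, where ./profiler_output is a placeholder for wherever the files are written.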

Generating expectations uses the great_expectations library to produce an HTML file containing data docs. An example of a data doc for a given column is shown in the image below.

[Image: data docs example 1]

Examples

You can find a couple of notebook examples in the examples folder.

To-dos

Disclaimer: this library is still at a very early stage. Among other things, we still need to:

  • Support more data formats (Feather, Parquet)
  • Add more Expectations
  • Integrate PII identification using Presidio
  • Optimise and compile part of the profiling routines using Cython
  • Write library tests
