A Python library to check for data quality and automatically generate data tests.
StructuredDataProfiling
StructuredDataProfiling is a Python library developed to automatically profile structured datasets and to facilitate the creation of data tests.
The library creates data tests in the form of Expectations using the great_expectations framework. Expectations are 'declarative statements that a computer can evaluate and semantically meaningful to humans'.
An expectation could be, for example, 'the sum of columns a and b should be equal to one' or 'the values in column c should be non-negative'.
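To make the two example expectations concrete, here is a minimal sketch that writes them as plain Python predicates over a toy dataset. It does not use the great_expectations API or anything from StructuredDataProfiling; the function names and the sample rows are hypothetical and only illustrate what such a declarative check evaluates.

```python
import math

# Toy dataset as a list of rows; the column names a, b, c mirror the examples in the text.
rows = [
    {"a": 0.25, "b": 0.75, "c": 3.0},
    {"a": 0.5, "b": 0.5, "c": 0.0},
    {"a": 0.9, "b": 0.1, "c": 12.5},
]


def expect_sum_to_equal_one(rows, col_x, col_y, tol=1e-9):
    """Return True if col_x + col_y is numerically equal to one in every row."""
    return all(math.isclose(r[col_x] + r[col_y], 1.0, abs_tol=tol) for r in rows)


def expect_non_negative(rows, col):
    """Return True if every value in col is greater than or equal to zero."""
    return all(r[col] >= 0 for r in rows)


print(expect_sum_to_equal_one(rows, "a", "b"))
print(expect_non_negative(rows, "c"))
```

In a real project the same rules would be stored as Expectations, so that they can be validated against new batches of data and rendered as documentation.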
StructuredDataProfiling runs a series of tests aimed at identifying statistics, rules, and constraints characterising a given dataset. The information generated by the profiler is collected by performing the following operations:
- Characterise uni- and bi-variate distributions.
- Identify data quality issues.
- Evaluate relationships between attributes (e.g. column C is the difference between columns A and B).
- Understand ontologies characterising categorical data (e.g. column A contains names, while column B contains geographical places).
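As an illustration of the relationship check mentioned in the list above, the following sketch searches for column triples where one column equals the difference of two others. This is a brute-force toy written for this README, not the algorithm the profiler uses internally; the function name and the sample data are assumptions.

```python
import math
from itertools import permutations


def find_difference_relations(columns, tol=1e-9):
    """Return (target, left, right) triples where target == left - right in every row.

    `columns` maps a column name to a list of numbers; all lists have equal length.
    """
    relations = []
    for target, left, right in permutations(columns, 3):
        if all(
            math.isclose(t, l - r, abs_tol=tol)
            for t, l, r in zip(columns[target], columns[left], columns[right])
        ):
            relations.append((target, left, right))
    return relations


data = {
    "A": [10.0, 7.5, 3.0],
    "B": [4.0, 2.5, 1.0],
    "C": [6.0, 5.0, 2.0],  # C = A - B
}
print(find_difference_relations(data))
```

Note that the search also reports the algebraically equivalent relation B = A - C, so a practical implementation would need to deduplicate equivalent rules and avoid the cubic cost on wide tables.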
For an overview of the library outputs please check the examples section.
Installation
You can install StructuredDataProfiling by using pip:
pip install structured-profiling
Quickstart
You can import the profiler using
from structured_data_profiling.profiler import DatasetProfiler
You can then instantiate the profiler by passing the path to a CSV file:
profiler = DatasetProfiler('./csv_path.csv')
The presence of a primary key (for example, to define relations between tables or sequences) can be specified with the argument primary key, which accepts one or more column names.
To start the profiling scripts, you can run the profile() method
profiler.profile()
The method generate_expectations() outputs the results of the profiling process converted into data expectations. Please note, the method requires the existence of a local great_expectations project.
If you haven't done so, please run great_expectations init in your working directory.
profiler.generate_expectations()
The expectations are generated in JSON format using the great_expectations schema. The method also creates data docs using the renderer provided by the great_expectations library.
These docs can be found in the local folder great_expectations/uncommitted/data_docs.
Profiling outputs
The profiler generates three JSON files describing the ingested dataset. These files contain information about:
- column_profiles: it contains the statistical characterisation of the dataset columns.
- dataset_profile: it highlights issues and limitations affecting the dataset.
- tests: it contains the data tests found by the profiler.
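The JSON outputs listed above can be inspected with the standard library alone. The sketch below loads every JSON file found in a folder into a dictionary keyed by file name. The output location and the exact file names are not stated here, so the folder and the sample file written in the demo are placeholders, and load_json_reports is a hypothetical helper rather than part of the library.

```python
import json
import tempfile
from pathlib import Path


def load_json_reports(output_dir):
    """Load every *.json file in output_dir into a dict keyed by file stem."""
    reports = {}
    for path in sorted(Path(output_dir).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            reports[path.stem] = json.load(handle)
    return reports


# Demo on a temporary folder with a made-up file; replace it with the folder
# where the profiler actually wrote its outputs.
with tempfile.TemporaryDirectory() as tmp:
    sample = {"col_a": {"mean": 1.5}}
    (Path(tmp) / "column_profiles.json").write_text(json.dumps(sample), encoding="utf-8")
    demo = load_json_reports(tmp)

print(sorted(demo))
```

Keying the result by file stem makes it easy to look up one report at a time, for example demo["column_profiles"] in the demo above.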
The process of generating expectations makes use of the great_expectations library to produce an HTML file containing data docs. An example of a data doc for a given column can be seen in the image below.
Examples
You can find a couple of notebook examples in the examples folder.
To-dos
Disclaimer: this library is still at a very early stage. Among other things, we still need to:
- Support more data formats (Feather, Parquet)
- Add more Expectations
- Integrate PII identification using Presidio
- Optimise and compile part of the profiling routines using Cython
- Write library tests