A Python library to check for data quality and automatically generate data tests.
StructuredDataProfiling
StructuredDataProfiling is a Python library developed to automatically profile structured datasets and to facilitate the creation of data tests.
The library creates data tests in the form of Expectations using the great_expectations framework. Expectations are 'declarative statements that a computer can evaluate and semantically meaningful to humans'.
An expectation could be, for example, 'the sum of columns a and b should be equal to one' or 'the values in column c should be non-negative'.
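To make the two example statements concrete, here is a minimal sketch that evaluates them in plain Python, without great_expectations. The toy rows and the helper names (rows, sum_equals_one, non_negative) are hypothetical and exist only for illustration; they are not part of the library.

```python
import math

# Hypothetical toy dataset: each dict is one row with columns a, b and c.
rows = [
    {"a": 0.25, "b": 0.75, "c": 3.0},
    {"a": 0.5, "b": 0.5, "c": 0.0},
    {"a": 0.9, "b": 0.1, "c": 12.5},
]


def sum_equals_one(row, tol=1e-9):
    """Expectation: the sum of columns a and b should be equal to one."""
    return math.isclose(row["a"] + row["b"], 1.0, abs_tol=tol)


def non_negative(row):
    """Expectation: the values in column c should be non-negative."""
    return row["c"] >= 0


# An expectation holds for the dataset if it holds for every row.
results = {
    "a_plus_b_equals_one": all(sum_equals_one(r) for r in rows),
    "c_non_negative": all(non_negative(r) for r in rows),
}
print(results)
```

Each function is a declarative check that a computer can evaluate and a person can read, which is the idea behind an Expectation.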
StructuredDataProfiling runs a series of tests aimed at identifying statistics, rules, and constraints characterising a given dataset. The information generated by the profiler is collected by performing the following operations:
- Characterise uni- and bi-variate distributions.
- Identify data quality issues.
- Evaluate relationships between attributes (e.g. column C is the difference between columns A and B).
- Understand ontologies characterising categorical data (e.g. column A contains names, while column B contains geographical places).
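As an illustration of the relationship check mentioned in the list above, the following is a rough sketch of how a rule such as "column C is the difference between columns A and B" could be discovered by brute force. It is not the profiler's actual implementation; the data and the function find_difference_rules are hypothetical.

```python
import math

# Hypothetical columns, stored as equal-length lists of floats.
columns = {
    "A": [10.0, 7.5, 3.0, 8.0],
    "B": [4.0, 2.5, 1.0, 8.0],
    "C": [6.0, 5.0, 2.0, 0.0],
}


def find_difference_rules(cols, tol=1e-9):
    """Return (target, left, right) triples where target == left - right on every row."""
    rules = []
    names = list(cols)
    for target in names:
        for left in names:
            for right in names:
                if len({target, left, right}) < 3:
                    continue  # skip combinations that reuse a column
                if all(
                    math.isclose(t, lhs - rhs, abs_tol=tol)
                    for t, lhs, rhs in zip(cols[target], cols[left], cols[right])
                ):
                    rules.append((target, left, right))
    return rules


rules = find_difference_rules(columns)
print(rules)
```

Note that the search also reports the algebraically equivalent rule B = A - C, so a real profiler would likely need to deduplicate such rearrangements.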
For an overview of the library outputs please check the examples section.
Installation
You can install StructuredDataProfiling by using pip:
pip install structured-profiling
Quickstart
You can import the profiler using
from structured_data_profiling.profiler import DatasetProfiler
You can then instantiate the profiler by passing the path to a CSV file:
profiler = DatasetProfiler('./csv_path.csv')
A primary key (for example, to define relations between tables or sequences) can be specified with the primary_key argument, which accepts a single column name or multiple column names.
To start the profiling scripts, run the profile() method:
profiler.profile()
The method generate_expectations() outputs the results of the profiling process converted into data expectations. Please note, the method requires the existence of a local great_expectations project.
If you haven't done so, please run great_expectations init in your working directory.
profiler.generate_expectations()
The expectations are generated in JSON format using the great_expectations schema. The method also creates data docs using the renderer provided by the great_expectations library.
These docs can be found in the local folder great_expectations/uncommitted/data_docs.
Profiling outputs
The profiler generates three JSON files describing the ingested dataset. These files contain information about:
- column_profiles: it contains the statistical characterisation of the dataset columns.
- dataset_profile: it highlights issues and limitations affecting the dataset.
- tests: it contains the data tests found by the profiler.
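To give a feel for what a statistical characterisation of a single column might look like once serialised to JSON, here is a small sketch using only the standard library. The keys (count, missing, min, max, mean, median) and the function profile_column are assumptions made for illustration; they do not reflect the actual schema of the files written by the profiler.

```python
import json
import statistics

# Hypothetical numeric column with one missing value.
values = [2.0, 4.0, 4.0, 5.0, None, 9.0]


def profile_column(name, data):
    """Build an illustrative profile; the keys are assumptions, not the library's schema."""
    present = [v for v in data if v is not None]
    return {
        "name": name,
        "count": len(data),
        "missing": len(data) - len(present),
        "min": min(present),
        "max": max(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
    }


profile = profile_column("c", values)
print(json.dumps(profile, indent=2))
```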
The process of generating expectations uses the great_expectations library to produce an HTML file containing data docs. An example of a data doc for a given column can be seen in the image below.
Examples
You can find a couple of notebook examples in the examples folder.
To-dos
Disclaimer: this library is still at a very early stage. Among other things, we still need to:
- Support more data formats (Feather, Parquet)
- Add more Expectations
- Integrate PII identification using Presidio
- Optimise and compile part of the profiling routines using Cython
- Write library tests