A Python package that analyzes CSV files and generates a simple HTML report with summary statistics.
SimpleDataQualityAnalyzer
Usage
This Python package generates an HTML report with basic summary statistics for a CSV dataset. You provide the path to the CSV file and the destination path where the HTML report will be stored. You also specify AnalyzeOptions, which define how the CSV is interpreted. The following example is based on a tennis ATP dataset that can be found here: https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_qual_chall_2019.csv. The code to produce a SimpleDataQualityAnalyzer HTML report for this dataset looks like this:
```python
from SimpleDataQualityAnalyzer.DomainObjects.AnalyzeOptions import AnalyzeOptions
from SimpleDataQualityAnalyzer.Services.Analyzer import Analyzer

# Source CSV file and destination path for the HTML report
srcFile = r"C:\temp\ATP\atp_matches_qual_chall_2019.csv"
expFile = r"C:\temp\ATP\atp_matches_report.html"

# Options that define how the CSV is interpreted
options = AnalyzeOptions()
options.delimiter = ","
options.ignoreEmptyLines = True
options.emptyStringIsNull = True

# Analyze the source file and write the report
analyzer = Analyzer(srcFile, options)
analyzer.generateReport(expFile)
```
Configuration
The SimpleDataQualityAnalyzer.DomainObjects.AnalyzeOptions object has the following configuration options:
Property | Default | Description |
---|---|---|
delimiter | , | The character that separates the columns |
ignoreEmptyLines | True | Whether empty lines are ignored |
emptyStringIsNull | True | Whether empty string values are treated as null |
placeholderNull | ["", " "] | String values that represent null |
placeholderTrue | ["Y", "y"] | String values that represent true |
placeholderFalse | ["N", "n"] | String values that represent false |
At the moment a dataset needs to have the header (column names) in the first row of the data. The AnalyzeOptions will be extended in the next version of the package.
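To make the effect of these options more concrete, here is a minimal sketch of how they could be applied when reading a CSV. It does not use the package itself: `SketchOptions`, `interpret` and `read_rows` are hypothetical names introduced for illustration, and the way `emptyStringIsNull` combines with `placeholderNull` is an assumption, not the package's confirmed behavior.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class SketchOptions:
    """Hypothetical stand-in that mirrors the documented AnalyzeOptions defaults."""
    delimiter: str = ","
    ignoreEmptyLines: bool = True
    emptyStringIsNull: bool = True
    placeholderNull: list = field(default_factory=lambda: ["", " "])
    placeholderTrue: list = field(default_factory=lambda: ["Y", "y"])
    placeholderFalse: list = field(default_factory=lambda: ["N", "n"])


def interpret(raw, options):
    """Map one raw cell to None, True, False, or the unchanged string."""
    # Assumption: an empty string counts as null if either rule says so.
    if raw in options.placeholderNull or (options.emptyStringIsNull and raw == ""):
        return None
    if raw in options.placeholderTrue:
        return True
    if raw in options.placeholderFalse:
        return False
    return raw


def read_rows(text, options):
    """Return (header, rows); the first kept row is treated as the header."""
    reader = csv.reader(io.StringIO(text), delimiter=options.delimiter)
    kept = []
    for row in reader:
        # csv.reader yields an empty list for a completely empty line.
        if options.ignoreEmptyLines and not row:
            continue
        kept.append(row)
    header, data = kept[0], kept[1:]
    return header, [[interpret(cell, options) for cell in row] for row in data]


sample = "name,seeded,rank\nFederer,Y,3\n\nNadal,n, \n"
header, rows = read_rows(sample, SketchOptions())
print(header)
print(rows)
```

In this sample the empty third line is skipped, `Y` and `n` become booleans, and the single space in the last cell becomes `None`.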
HTML Report
The generated HTML report consists of three main parts that provide information about the dataset that was analyzed.
Part 1 - File Overview
This part contains the main information about the dataset that was scanned like:
- The name of the dataset (if not provided it will be derived from the file name)
- The location of the source dataset that has been analyzed
- The date and time information when the report was generated
- The number of records (lines) found in the dataset
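The items listed above could be gathered roughly as in the following sketch. `file_overview` is a hypothetical helper, not part of the package, and it assumes that the header row is not counted as a record and that empty lines are skipped.

```python
import csv
import tempfile
from datetime import datetime
from pathlib import Path


def file_overview(csv_path, name=None):
    """Collect the file-level facts shown in the first part of the report."""
    path = Path(csv_path)
    with path.open(newline="", encoding="utf-8") as handle:
        # Keep only non-empty lines.
        rows = [row for row in csv.reader(handle) if row]
    return {
        # Fall back to the file name without its extension.
        "name": name or path.stem,
        "location": str(path.resolve()),
        "generated": datetime.now().isoformat(timespec="seconds"),
        # Assumption: the header row is not a record.
        "records": max(len(rows) - 1, 0),
    }


# Small self-contained demonstration with a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "atp_sample.csv"
    demo.write_text("winner,loser\nA,B\nC,D\n", encoding="utf-8")
    overview = file_overview(demo)

print(overview["name"], overview["records"])
```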
Part 2 - Dataset Overview
The second part of the report contains a table with the following information about each column found in the file:
- The position of the column within the dataset
- The name derived from the first line in the CSV file (header)
- The inferred datatype
- The number of Non-Null values
- The number of Null values
- The number of Unique values
- The number of Distinct values
- The Min value (depends on the datatype)
- The Median value (depends on the datatype)
- The Max value (depends on the datatype)

The table is searchable and sortable. When you click on a row in the table, it updates the third part of the report, which contains the detailed information about the selected column of the dataset.
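The per-column figures in this table can be illustrated with a short sketch. `summarize_column` is a hypothetical helper, not the package's implementation. It assumes that "Unique" counts values occurring exactly once, that "Distinct" counts different values, and that the median of an even number of values is the lower of the two middle values so that it also works for strings.

```python
from collections import Counter
from statistics import median_low


def summarize_column(values):
    """Summarize one column; None stands for a null value."""
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    summary = {
        "non_null": len(non_null),
        "null": len(values) - len(non_null),
        # Assumption: unique = values that occur exactly once.
        "unique": sum(1 for c in counts.values() if c == 1),
        # Assumption: distinct = number of different values.
        "distinct": len(counts),
        "min": None,
        "median": None,
        "max": None,
    }
    if non_null:
        summary["min"] = min(non_null)
        summary["median"] = median_low(non_null)
        summary["max"] = max(non_null)
    return summary


ages = [31, 24, None, 24, 28, 35]
result = summarize_column(ages)
print(result)
```

For the sample column this gives 5 non-null values, 1 null value, 3 unique values, 4 distinct values, a minimum of 24, a median of 28 and a maximum of 35.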
Part 3 - Column Details (Basic)
The third part provides detailed information about the column that was selected in the second part of the report. The first row shows value counts. It consists of a bar chart that shows the counts in the four value categories Null, Duplicate, Non-Unique and Unique. To the right of the chart is a table that describes the category hierarchy in detail.
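The exact definition of the four categories is not spelled out here, so the following sketch shows only one possible reading. It assumes that Unique counts values occurring exactly once, Non-Unique counts the different values that occur more than once, and Duplicate counts the repeated occurrences beyond the first one, so that the four categories add up to the total number of values. `count_categories` is a hypothetical helper.

```python
from collections import Counter


def count_categories(values):
    """Split a column into four counts that sum to len(values)."""
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    unique = sum(1 for c in counts.values() if c == 1)
    # Different values that appear more than once.
    non_unique = len(counts) - unique
    # Extra occurrences beyond the first one of each value.
    duplicate = len(non_null) - len(counts)
    return {
        "Null": len(values) - len(non_null),
        "Duplicate": duplicate,
        "Non-Unique": non_unique,
        "Unique": unique,
    }


surfaces = ["clay", "hard", "hard", None, "grass", "hard", "clay"]
categories = count_categories(surfaces)
print(categories)
```

Under these assumptions, Non-Unique plus Unique equals the number of distinct values, which would match a hierarchy of total, non-null and distinct counts.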
The second row shows statistics. The first table shows the Min, Median and Max values of the selected column with their frequencies. The second table shows the Min, Median, Avg and Max length of the values in the column (depends on the datatype).
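The length statistics can be sketched as follows. `length_statistics` is a hypothetical helper. It assumes that lengths are measured on the string form of each non-null value, which may differ from what the package does for non-string datatypes.

```python
from statistics import mean, median


def length_statistics(values):
    """Return min, median, avg and max length of the non-null values."""
    # Assumption: length is taken from the string form of each value.
    lengths = [len(str(v)) for v in values if v is not None]
    if not lengths:
        return None
    return {
        "min": min(lengths),
        "median": median(lengths),
        "avg": mean(lengths),
        "max": max(lengths),
    }


players = ["Roger Federer", "Rafael Nadal", None, "Novak Djokovic"]
stats = length_statistics(players)
print(stats)
```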
Part 3 - Column Details (Frequency)
The Frequency tab contains a complete frequency table that lists every value in the selected column of the dataset together with its absolute frequency and its frequency in percent.
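A frequency table of this kind can be built with the standard library, as in the sketch below. `frequency_table` is a hypothetical helper. It assumes that percentages are relative to all values in the column and are rounded to two decimals.

```python
from collections import Counter


def frequency_table(values):
    """Return (value, absolute count, percent) rows, most frequent first."""
    counts = Counter(values)
    total = len(values)
    return [
        (value, count, round(100 * count / total, 2))
        for value, count in counts.most_common()
    ]


surface_column = ["Hard", "Clay", "Hard", "Grass", "Hard"]
table = frequency_table(surface_column)
for value, count, percent in table:
    print(f"{value}: {count} ({percent}%)")
```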
Contact
If you have feedback about the package, feature requests or if you have discovered bugs please don't hesitate to share them with me: https://gitlab.com/debugair/simpledataqualityanalyzer