
A Python package that analyzes CSV files and generates a simple HTML report with summary statistics.

Project description

SimpleDataQualityAnalyzer

Usage

This Python package allows you to generate an HTML report with basic summary statistics for a CSV dataset. To do so, you provide the path to a CSV file as well as the path to the HTML file where the report will be stored. In addition you need to specify AnalyzeOptions, which define how the CSV is to be interpreted. The following example is based on a tennis ATP dataset that can be found here: https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_qual_chall_2019.csv. The code to produce a SimpleDataQualityAnalyzer HTML report for this dataset looks like the following:

from SimpleDataQualityAnalyzer.DomainObjects.AnalyzeOptions import AnalyzeOptions
from SimpleDataQualityAnalyzer.Services.Analyzer import Analyzer


srcFile = r"C:\temp\ATP\atp_matches_qual_chall_2019.csv"
expFile = r"C:\temp\ATP\atp_matches_report.html"
options = AnalyzeOptions()
options.delimiter = ","
options.ignoreEmptyLines = True
options.emptyStringIsNull = True
analyzer = Analyzer(srcFile, options)
analyzer.generateReport(expFile)

Configuration

The SimpleDataQualityAnalyzer.DomainObjects.AnalyzeOptions object has the following configuration options:

Property            Default      Description
delimiter           ,            The character that separates the columns
ignoreEmptyLines    True         Whether empty lines shall be ignored
emptyStringIsNull   True         Whether empty string values shall be treated as null
placeholderNull     ["", " "]    String values that represent null
placeholderTrue     ["Y", "y"]   String values that represent true
placeholderFalse    ["N", "n"]   String values that represent false

At the moment a dataset needs to have the header (column names) in the first row of the data. The AnalyzeOptions will be extended in the next version of the package.
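As an illustration, a configuration that sets every option listed above explicitly could look like the following sketch (the option values and file paths are purely illustrative, not defaults recommended by the package):

from SimpleDataQualityAnalyzer.DomainObjects.AnalyzeOptions import AnalyzeOptions
from SimpleDataQualityAnalyzer.Services.Analyzer import Analyzer

options = AnalyzeOptions()
options.delimiter = ";"                      # columns separated by semicolons instead of commas
options.ignoreEmptyLines = True              # skip empty lines in the CSV
options.emptyStringIsNull = True             # treat empty strings as null values
options.placeholderNull = ["", " ", "NA"]    # string values interpreted as null
options.placeholderTrue = ["Y", "y", "1"]    # string values interpreted as true
options.placeholderFalse = ["N", "n", "0"]   # string values interpreted as false

analyzer = Analyzer(r"C:\temp\example.csv", options)      # illustrative source path
analyzer.generateReport(r"C:\temp\example_report.html")   # illustrative report path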

HTML Report

The generated HTML report consists of three main parts that provide information about the dataset that was analyzed.

Part 1 - File Overview

This part contains the main information about the dataset that was scanned, such as:

  1. The name of the dataset (if not provided it will be derived from the file name)
  2. The location of the source dataset that has been analyzed
  3. The date and time information when the report was generated
  4. The number of records (lines) found in the dataset

Part 2 - Dataset Overview

The second part of the report contains a table that provides the following information for each column found in the file (a rough sketch of how such statistics can be computed follows the list):

  1. The position of the column within the dataset
  2. The name derived from the first line in the CSV file (header)
  3. The inferred datatype
  4. The number of Non-Null values
  5. The number of Null values
  6. The number of Unique values
  7. The number of Distinct values
  8. The Min value (depends on the datatype)
  9. The Median value (depends on the datatype)
  10. The Max value (depends on the datatype)

The table is searchable and sortable; clicking on a row updates the third part of the report, which contains detailed information about the selected column.
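The report computes these figures itself; purely as an illustration, similar per-column statistics can be derived with pandas. This sketch is not the package's implementation, and it assumes that "distinct" means the number of different values while "unique" means values that occur exactly once:

import pandas as pd

df = pd.read_csv(r"C:\temp\ATP\atp_matches_qual_chall_2019.csv")

for position, column in enumerate(df.columns, start=1):
    values = df[column]
    counts = values.dropna().value_counts()
    print(f"{position}. {column} ({values.dtype})")
    print("   non-null:", int(values.notna().sum()))
    print("   null:    ", int(values.isna().sum()))
    print("   unique:  ", int((counts == 1).sum()))   # values occurring exactly once (assumed meaning)
    print("   distinct:", int(counts.size))           # number of different values (assumed meaning)
    if pd.api.types.is_numeric_dtype(values):
        print("   min/median/max:", values.min(), values.median(), values.max())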

Part 3 - Column Details (Basic)

The third part provides detailed information about the column that was selected in the second part of the report. The first row shows count values: a bar chart with the counts in the four value categories Null, Duplicate, Non-Unique and Unique, and, on the right-hand side of the chart, a table that describes the category hierarchy in detail.

The second row shows statistical values. The first table shows the Min, Median and Max value of the selected column together with its frequency. The second table shows the Min, Median, Avg and Max length of the values in the corresponding column (depending on the datatype).
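The exact definitions of the four categories are not spelled out here; one common reading, used in the sketch below purely as an assumption, is that Unique values occur exactly once, Non-Unique counts the distinct values that occur more than once, and Duplicate counts the repeated occurrences beyond the first:

import pandas as pd

def value_categories(series: pd.Series) -> dict:
    # Split one column into the four categories shown in the bar chart.
    # These definitions are assumptions about the report's terminology, not the package's own code.
    counts = series.dropna().value_counts()
    return {
        "Null": int(series.isna().sum()),
        "Duplicate": int((counts[counts > 1] - 1).sum()),   # repeated occurrences beyond the first
        "Non-Unique": int((counts > 1).sum()),               # distinct values occurring more than once
        "Unique": int((counts == 1).sum()),                  # values occurring exactly once
    }

df = pd.read_csv(r"C:\temp\ATP\atp_matches_qual_chall_2019.csv")
print(value_categories(df["winner_name"]))   # "winner_name" is an assumed column name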

Part 3 - Column Details (Frequency)

In the Frequency tab you'll find a complete frequency table with all values of the selected column and their frequencies, both absolute and as a percentage.
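As a quick approximation of that table (again just a sketch, not the package's code), the absolute and percentage frequencies of one column can be computed like this:

import pandas as pd

df = pd.read_csv(r"C:\temp\ATP\atp_matches_qual_chall_2019.csv")

# Frequency of every value in one column (here "surface", an assumed column name),
# as an absolute count and as a percentage of all records.
freq = df["surface"].value_counts(dropna=False).to_frame("count")
freq["percent"] = 100 * freq["count"] / len(df)
print(freq)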

Contact

If you have feedback about the package, feature requests, or if you have discovered bugs, please don't hesitate to share them with me: https://gitlab.com/debugair/simpledataqualityanalyzer

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

SimpleDataQualityAnalyzer-1.0.0.4.tar.gz (44.2 kB)

Uploaded Source

Built Distribution

SimpleDataQualityAnalyzer-1.0.0.4-py3-none-any.whl (45.5 kB)

File details

Details for the file SimpleDataQualityAnalyzer-1.0.0.4.tar.gz.

File metadata

  • Download URL: SimpleDataQualityAnalyzer-1.0.0.4.tar.gz
  • Upload date:
  • Size: 44.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/49.6.0 requests-toolbelt/0.9.1 tqdm/4.48.2 CPython/3.8.1

File hashes

Hashes for SimpleDataQualityAnalyzer-1.0.0.4.tar.gz
Algorithm Hash digest
SHA256 d4177c028b68229fffdefdd40a805a8fafdccc3c8e0c0b0be77b94833e405861
MD5 652dbe6d9c225ba5116edc1b08771180
BLAKE2b-256 734741314e05c56e74b7b388f1d5f4ddb9c335c6ab8b37c7e32401ece56374dc


File details

Details for the file SimpleDataQualityAnalyzer-1.0.0.4-py3-none-any.whl.

File metadata

  • Download URL: SimpleDataQualityAnalyzer-1.0.0.4-py3-none-any.whl
  • Upload date:
  • Size: 45.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/49.6.0 requests-toolbelt/0.9.1 tqdm/4.48.2 CPython/3.8.1

File hashes

Hashes for SimpleDataQualityAnalyzer-1.0.0.4-py3-none-any.whl
Algorithm Hash digest
SHA256 3870dbb50385f8036d4bdcbbff1d0a75cb2f15b09528b1e6e8ae21c652a7ac28
MD5 18f6627c909ba881cc6c165f38d5a652
BLAKE2b-256 306b7173c7d4cb0c0044eac21fc8a22d47e747a4b8c1e54a392e98a6d81886f1

