A Python package that analyzes CSV files and generates a simple HTML report with some summary statistics.

Project description

SimpleDataQualityAnalyzer

Usage

This Python package lets you generate an HTML report with basic summary statistics for a CSV dataset. You provide the path to the CSV file and the path of the HTML file where the report will be stored. You also need to specify AnalyzeOptions, which define how the CSV is interpreted. The following example is based on a tennis ATP dataset that can be found here: https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_qual_chall_2019.csv. The code to produce a SimpleDataQualityAnalyzer HTML report for this dataset looks like this:

from SimpleDataQualityAnalyzer.DomainObjects.AnalyzeOptions import AnalyzeOptions
from SimpleDataQualityAnalyzer.Services.Analyzer import Analyzer


# Source CSV file and destination path of the HTML report
srcFile = r"C:\temp\ATP\atp_matches_qual_chall_2019.csv"
expFile = r"C:\temp\ATP\atp_matches_report.html"

# Options that define how the CSV file is interpreted
options = AnalyzeOptions()
options.delimiter = ","
options.ignoreEmptyLines = True
options.emptyStringIsNull = True

# Analyze the dataset and write the report
analyzer = Analyzer(srcFile, options)
analyzer.generateReport(expFile)

Configuration

The SimpleDataQualityAnalyzer.DomainObjects.AnalyzeOptions object has the following configuration options:

Property            Default       Description
delimiter           ,             The character that separates the columns
ignoreEmptyLines    True          Whether empty lines are ignored
emptyStringIsNull   True          Whether empty string values are treated as null
placeholderNull     ["", " "]     String values that represent null
placeholderTrue     ["Y", "y"]    String values that represent true
placeholderFalse    ["N", "n"]    String values that represent false

At the moment a dataset needs to have the header (column names) in the first row of the data. The AnalyzeOptions will be extended in the next version of the package.
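
How the placeholder lists could be applied to a raw cell value can be sketched as follows. This is only an illustration, not the package's implementation: the helper `interpret_value`, its parameter names, and the order of the checks are assumptions. Only the default lists are taken from the table above.

```python
# Hypothetical helper, not part of SimpleDataQualityAnalyzer.
# The default lists mirror the documented AnalyzeOptions defaults.
PLACEHOLDER_NULL = ["", " "]
PLACEHOLDER_TRUE = ["Y", "y"]
PLACEHOLDER_FALSE = ["N", "n"]


def interpret_value(raw, empty_string_is_null=True):
    """Map a raw CSV string to None, True, False, or the unchanged string."""
    # Assumption: null placeholders are checked before boolean placeholders.
    if raw in PLACEHOLDER_NULL or (empty_string_is_null and raw == ""):
        return None
    if raw in PLACEHOLDER_TRUE:
        return True
    if raw in PLACEHOLDER_FALSE:
        return False
    return raw
```

With these defaults, `interpret_value("Y")` yields `True`, `interpret_value(" ")` yields `None`, and any other string is returned unchanged.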

HTML Report

The generated HTML report consists of three main parts that provide information about the dataset that was analyzed.

Part 1 - File Overview

This part contains the main information about the dataset that was scanned:

  1. The name of the dataset (if not provided it will be derived from the file name)
  2. The location of the source dataset that has been analyzed
  3. The date and time information when the report was generated
  4. The number of records (lines) found in the dataset
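
The four items above could be collected with the standard library alone. The sketch below is not the package's code: the function `file_overview` and its dictionary keys are hypothetical, and it assumes that the header row and empty lines are not counted as records.

```python
import csv
from datetime import datetime
from pathlib import Path


def file_overview(path, delimiter=",", name=None):
    """Collect file-level facts for one CSV file (hypothetical helper)."""
    path = Path(path)
    with path.open(newline="", encoding="utf-8") as handle:
        # csv.reader yields an empty list for blank lines; skip those.
        rows = [row for row in csv.reader(handle, delimiter=delimiter) if row]
    return {
        # Fall back to the file name without extension if no name is given.
        "name": name or path.stem,
        "location": str(path.resolve()),
        "generated": datetime.now().isoformat(timespec="seconds"),
        # Assumption: the header row is not counted as a record.
        "records": max(len(rows) - 1, 0),
    }
```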

Part 2 - Dataset Overview

The second part of the report contains a table that provides the following information about each column found in the file:

  1. The position of the column within the dataset
  2. The name derived from the first line in the CSV file (header)
  3. The inferred datatype
  4. The number of Non-Null values
  5. The number of Null values
  6. The number of Unique values
  7. The number of Distinct values
  8. The Min value (depends on the datatype)
  9. The Median value (depends on the datatype)
  10. The Max value (depends on the datatype)

The table is searchable and sortable. Clicking a row updates the third part of the report, which shows detailed information about the selected column.
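
A minimal sketch of how such per-column figures could be computed is shown below. It is not the package's implementation. The function `summarize_column` is hypothetical, and it assumes that "Distinct" means the number of different non-null values while "Unique" means the number of values that occur exactly once. Values are compared as strings here, whereas the report uses the inferred datatype.

```python
from collections import Counter
from statistics import median_low


def summarize_column(values):
    """Summarize one column given as a list of strings, with None for nulls."""
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    ordered = sorted(non_null)  # string ordering, not datatype-aware
    return {
        "non_null": len(non_null),
        "null": len(values) - len(non_null),
        # Assumption: unique = values that occur exactly once.
        "unique": sum(1 for c in counts.values() if c == 1),
        # Assumption: distinct = number of different non-null values.
        "distinct": len(counts),
        "min": ordered[0] if ordered else None,
        # median_low always returns an element of the data, so it also
        # works for non-numeric values.
        "median": median_low(ordered) if ordered else None,
        "max": ordered[-1] if ordered else None,
    }
```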

Part 3 - Column Details (Basic)

The third part provides detailed information about the column that was selected in the second part of the report. The first row shows count values. It consists of a bar chart that shows the counts in the four value categories Null, Duplicate, Non-Unique and Unique. To the right of the chart is a table that describes the category hierarchy in detail.
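
One plausible reading of the four categories is sketched below. The exact hierarchy is defined by the table in the report, so the definitions here are assumptions: Unique counts values that occur exactly once, Non-Unique counts different values that occur more than once, and Duplicate counts the repeated occurrences beyond the first. Under these assumptions the four counts add up to the number of rows. The function `count_categories` is hypothetical.

```python
from collections import Counter


def count_categories(values):
    """Split a column (None = null) into four counts that sum to len(values)."""
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    unique = sum(1 for c in counts.values() if c == 1)
    return {
        "Null": len(values) - len(non_null),
        # Repeated occurrences beyond the first one of each value.
        "Duplicate": len(non_null) - len(counts),
        # Different values that occur more than once.
        "Non-Unique": len(counts) - unique,
        "Unique": unique,
    }
```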

The second row shows statistics values. The first table shows the Min, Median and Max value of the selected column with its frequency. The second table shows the Min, Median, Avg and Max length of the values in the corresponding column (depends on the datatype).
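
The length statistics can be illustrated with a short sketch. The function `length_stats` is hypothetical and assumes that null values are excluded and that lengths are measured on the raw string values.

```python
from statistics import mean, median


def length_stats(values):
    """Return Min, Median, Avg and Max string length, ignoring None values."""
    lengths = [len(v) for v in values if v is not None]
    if not lengths:
        return None  # no non-null values, nothing to measure
    return {
        "min": min(lengths),
        "median": median(lengths),
        "avg": mean(lengths),
        "max": max(lengths),
    }
```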

Part 3 - Column Details (Frequency)

The Frequency tab contains a complete frequency table with all values in the selected column of the dataset and their frequency, both absolute and in percent.
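
A frequency table of this kind could be built as follows. This is a sketch, not the package's code: the function `frequency_table` is hypothetical, percentages are relative to all values passed in, and rounding to two decimals is an arbitrary choice.

```python
from collections import Counter


def frequency_table(values):
    """Return (value, absolute count, percent) rows, most frequent first."""
    counts = Counter(values)
    total = len(values)
    return [
        (value, n, round(100.0 * n / total, 2))
        for value, n in counts.most_common()
    ]
```

For a column with the values `["R", "L", "R", "R"]`, the result is `[("R", 3, 75.0), ("L", 1, 25.0)]`.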

Contact

If you have feedback about the package, feature requests or if you have discovered bugs please don't hesitate to share them with me: https://gitlab.com/debugair/simpledataqualityanalyzer
