A dataframe auditor that extracts descriptive statistics from dataframe columns
This project is still in an early development stage and undergoes significant changes regularly.
dataframe-auditor
A dataframe auditor that computes a number of characteristics of the data.
Summary
Data profiling is important in data analysis and analytics, as well as in determining characteristics of data pipelines. This repository aims to provide a means to extract a selection of attributes from data.
It currently focuses on processing pandas dataframes, but this functionality is being extended to Spark dataframes too.
Given a pandas dataframe, the extracted values are (where object and category types are mapped to string, and all numerical types to numeric):
Type | Measure |
---|---|
String & Numeric | Percentage null |
String | Distinct counts |
String | Most frequent categories |
Numeric | Mean |
Numeric | Standard deviation |
Numeric | Variance |
Numeric | Min value |
Numeric | Max value |
Numeric | Range |
Numeric | Kurtosis |
Numeric | Skewness |
Numeric | Kullback-Leibler divergence |
Numeric | Mean absolute deviation |
Numeric | Median |
Numeric | Interquartile range |
Numeric | Percentage zero values |
Numeric | Percentage nan values |
Naturally, many of these characteristics are not independent of one another, but some may be excluded as suits the application.
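To make the type mapping and the string measures concrete, here is a minimal sketch using plain pandas. It is not this library's implementation: the helper names, the `"STRING"` label, and the keys `p_null`, `distinct` and `top` are assumptions chosen for illustration, and the null share is expressed as a fraction rather than a percentage.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def sketch_column_type(series: pd.Series) -> str:
    """Map a pandas dtype to one of the two groups in the table (hypothetical helper)."""
    # Note: pandas also reports bool columns as numeric.
    return "NUMERIC" if is_numeric_dtype(series) else "STRING"


def sketch_string_measures(series: pd.Series, top_n: int = 3) -> dict:
    """Compute the string-column measures from the table (hypothetical helper)."""
    counts = series.value_counts()  # nulls are excluded, most frequent first
    return {
        "attr": series.name,
        "type": sketch_column_type(series),
        "p_null": float(series.isna().mean()),  # fraction of missing values
        "distinct": int(series.nunique()),      # nulls are not counted
        "top": counts.head(top_n).to_dict(),    # most frequent categories
    }


colours = pd.Series(["red", "blue", "red", None, "green", "red"], name="colour")
summary = sketch_string_measures(colours)
print(summary)
```

Object and category columns both fall through to the string branch here, which mirrors the mapping described above.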
Auditing a dataframe with this library returns a dictionary of these measures for each column in the dataframe.
For example, if a dataframe consists of a single column, named trivial, where all values are 1, then the result is:
[{
"attr": "trivial",
"type": "NUMERIC",
"median": 1.0,
"variance": 0.0,
"std": 0.0,
"max": 1,
"min": 1,
"mad": 0.0,
"p_zeros": 0.0,
"kurtosis": 0,
"skewness": 0,
"iqr": 0.0,
"range": 0,
"p_nan": 0.0,
"mean": 1.0
}]
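Most of the values above can be reproduced with plain pandas, which may help when interpreting them. The following is a sketch under stated assumptions, not the library's own code: the helper name is hypothetical, the variance uses the pandas default of `ddof=1`, and `p_zeros` and `p_nan` are computed as fractions of all rows. The library's conventions on these points are not documented here.

```python
import pandas as pd


def sketch_numeric_measures(series: pd.Series) -> dict:
    """Compute a subset of the numeric measures with plain pandas (hypothetical helper)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    return {
        "attr": series.name,
        "type": "NUMERIC",
        "mean": float(series.mean()),
        "median": float(series.median()),
        "variance": float(series.var()),  # sample variance (ddof=1), an assumption
        "std": float(series.std()),
        "min": series.min(),
        "max": series.max(),
        "range": series.max() - series.min(),
        "iqr": float(q3 - q1),
        # mean absolute deviation around the mean
        "mad": float((series - series.mean()).abs().mean()),
        "p_zeros": float((series == 0).mean()),  # fraction, an assumption
        "p_nan": float(series.isna().mean()),    # fraction, an assumption
    }


trivial = pd.Series([1] * 10, name="trivial")
measures = sketch_numeric_measures(trivial)
print(measures)
```

The sketch also shows why the measures are not independent: `std` is the square root of `variance`, and `range` is `max` minus `min`.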
For a dataframe with columns ["trivial", "non-trivial"], a list of dictionaries is returned, one per column (the remaining keys are omitted here for brevity):
[{
"attr": "trivial"
},
{
"attr": "non-trivial"
}]
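Because the result is a list, looking up one column's measures means finding the entry with the matching `attr`. A small sketch of indexing the list by column name follows; the list is a hand-written stand-in, and the `mean` values in it are illustrative, not real output.

```python
# Hand-written stand-in for the returned list; values are illustrative only
audit_result = [
    {"attr": "trivial", "type": "NUMERIC", "mean": 1.0},
    {"attr": "non-trivial", "type": "NUMERIC", "mean": 2.5},
]

# Index the entries by column name for direct lookup
by_column = {entry["attr"]: entry for entry in audit_result}

trivial_mean = by_column["trivial"]["mean"]
print(trivial_mean)
```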
Installation
- Dependencies are listed in requirements.txt and may be installed via:
pip install -r requirements.txt
- Alternatively, if you wish to install directly from GitHub, you may use:
pip install git+https://github.com/jackdotwa/dataframe-auditor.git
Testing
- Unit tests may be run via:
python -m unittest discover tests
- Code coverage may be determined via:
coverage run -m unittest discover tests && coverage report
Usage
An example of using this package is:
import pandas as pd
import dfauditor
numeric_data = {
'x': [50, 50, -10, 0, 0, 5, 15, -3, None, 0],
'y': [0.00001, 256.128, None, 16.32, 2048, -3.1415926535, 111, 2.4, 4.8, 0.0],
'trivial': [1]*10
}
numeric_df = pd.DataFrame(numeric_data)
result_dict = dfauditor.audit_dataframe(numeric_df, nr_processes=3)
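As a sanity check, a few measures for the `x` column can be computed by hand with plain pandas and compared against the audit output. This sketch does not call `dfauditor`; it assumes that missing and zero shares are fractions of all rows (NaN rows included in the denominator), which may differ from the library's convention.

```python
import pandas as pd

numeric_df = pd.DataFrame({
    "x": [50, 50, -10, 0, 0, 5, 15, -3, None, 0],
    "trivial": [1] * 10,
})

x = numeric_df["x"]  # None becomes NaN, so the column is float64

p_nan_x = float(x.isna().mean())     # 1 missing value out of 10 rows
p_zeros_x = float((x == 0).mean())   # 3 zeros out of 10 rows
range_x = float(x.max() - x.min())   # 50 - (-10); NaN is skipped by default

print(p_nan_x, p_zeros_x, range_x)
```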
Contributions
Pull requests are always welcome.