Skip to main content

Data exploration tools.

Project description

Datasurveyor

Author:

Nick Buker

Introduction:

Datasurveyor is a small collection of tools for exploratory data analysis. It leverages Pandas, but the tools are able to ingest either DataFrames or Series. The output is a tidy DataFrame for easy viewing of results. Currently, datasurveyor focuses on rapidly identifying data quality issues, but the scope will likely expand as the package becomes "battle tested".

Table of contents:

Installing datasurveyor:

Datasurveyor installation instructions

Using datasurveyor:

Datasurveyor use instructions

Contributing and Testing:

Installing datasurveyor:

Datasurveyor can be install via pip. As always, use of a project-level virtual environment is recommended. Note: Datasurveyor requires Python >= 3.6.

$ pip install datasurveyor

Using Datasurveyor

To demonstrate the tools available in datasurveyor, let's use a Pandas DataFrame named df.

id name state platform app_inst lylty spend
0 1 Nick WA ios True 0 0
1 2 Gina OR android True 1 nan
2 3 Rob WA ios False 0 10
3 4 Adam ID web True 1 150
4 5 Hanna WA ios True 1 12
5 6 Susan Null android False 0 0
6 7 Quentin WA ios True 1 nan
7 8 Caitlyn unknown web True 0 8
8 9 Matt WA web True 1 50
9 10 Nick WA ios True 0 -10

A data dictionary for df is below.

column dtype description
id int64 unique customer identifier
name object customer name
state object state of residence
platform object system platform
app_inst bool app installation flag
lylty int64 loyalty program flag
spend float64 total customer spend

Binary features

Description

The methods within BinaryFeatures are intended for use with binary data (data with two possible values). Datasurveyor expects binary features to be stored as bools or integers (with values of 0 or 1). In the example data, app_inst and lylty are binary features.

Importing BinaryFeatures

The binary feature tools can be imported with the command below.

from datasurveyor import BinaryFeatures as BF

Checking if all values the same

The check_all_same method can be used to check if binary features contain exclusively the same value. This method can be applied to a single binary feature or a collection of binary features.

BF.check_all_same(df['app_inst'])
all_same
0 False
BF.check_all_same(df[['app_inst', 'lylty']])
column all_same
0 app_inst False
1 lylty False

Checking if values are mostly the same

The check_mostly_same method can be used to check if binary features contain mostly the same value (default threshold 95%). This method can be applied to a single binary feature or a collection of binary features.

BF.check_mostly_same(df['app_inst'])
mostly_same thresh mean
0 False 0.95 0.8
BF.check_mostly_same(df[['app_inst', 'lylty']])
column mostly_same thresh mean
0 app_inst False 0.95 0.8
1 lylty False 0.95 0.5

The user can specify whatever threshold is appropriate for their usecase. If thresh=0.7 is applied, the method will flag features with at least 70% the same value.

BF.check_mostly_same(df['app_inst'], thresh=0.7)
mostly_same thresh mean
0 True 0.7 0.8
BF.check_mostly_same(df[['app_inst', 'lylty']], thresh=0.7)
column mostly_same thresh mean
0 app_inst True 0.7 0.8
1 lylty False 0.7 0.5

Checking the range

The check_outside_range method can be used to detect features with data outside the expected range of 0 and 1. Note that the outside of range condition is only possible for binary features encoded as integer data type.

BF.check_outside_range(df['app_inst'])
outside_range
0 False
BF.check_outside_range(df[['app_inst', 'lylty']])
column outside_range
0 app_inst False
1 lylty False

Categorical features

Description

The methods within CategoricalFeatures are intended for use with categorical data (data denoting categories). Datasurveyor expects categorical features to be stored as object (string) or integer type. In the example data, state and platform are categorical features.

Importing CategoricalFeatures

The categorical feature tools can be imported with the command below.

from datasurveyor import CategoricalFeatures as CF

Checking if values are mostly the same

The check_mostly_same method can be used to check if categorical features contain mostly the same value (default threshold 95%). This method can be applied to a single categorical feature or a collection of categorical features.

CF.check_mostly_same(df['state'])
mostly_same thresh most_common count prop
0 False 0.95 WA 6 0.6
CF.check_mostly_same(df[['state', 'platform']])
column mostly_same thresh most_common count prop
0 state False 0.95 WA 6 0.6
1 platform False 0.95 ios 5 0.5

The user can specify whatever threshold is appropriate for their usecase. If thresh=0.6 is applied, the method will flag features with at least 60% the same value.

CF.check_mostly_same(df['state'], thresh=0.6)
mostly_same thresh most_common count prop
0 True 0.6 WA 6 0.6
CF.check_mostly_same(df[['state', 'platform']], thresh=0.6)
column mostly_same thresh most_common count prop
0 state True 0.6 WA 6 0.6
1 platform False 0.6 ios 5 0.5

Checking number of categories

The n_categories method can be used to count the number of categories. This method can be applied to a single categorical feature or a collection of categorical features.

CF.check_n_categories(df['state'])
n_categories
0 4
CF.check_n_categories(df[['state', 'platform']])
column n_categories
0 state 4
1 platform 3

General features

Description

The methods within GeneralFeatures are intended for use with any data. Datasurveyor expects inputs to be of type Pandas Series or DataFrame, but has no type expectations for the data within those structures.

Importing GeneralFeatures

The general feature tools can be imported with the command below.

from datasurveyor import GeneralFeatures as GF

Checking for nulls

The check_nulls method can be used to check for nulls. This method can be applied to a single feature or a collection of features.

GF.check_nulls(df['spend'])
nulls_present null_count prop_null
0 True 2 0.2
GF.check_nulls(df)
column nulls_present null_count prop_null
0 id False 0 0
1 name False 0 0
2 state False 0 0
3 platform False 0 0
4 app_inst False 0 0
5 lylty False 0 0
6 spend True 2 0.2

Checking for nulls

The check_fuzzy_nulls method can be used to check for values that commonly denote nulls. This method can be applied to a single feature or a collection of features.

GF.check_fuzzy_nulls(df['state'])
fuzzy_nulls_present fuzzy_null_count prop_fuzzy_null
0 True 1 0.1
GF.check_fuzzy_nulls(df)
column fuzzy_nulls_present fuzzy_null_count prop_fuzzy_null
0 id False 0 0
1 name False 0 0
2 state True 1 0.1
3 platform False 0 0
4 app_inst False 0 0
5 lylty False 0 0
6 spend False 0 0

The defaults items checked for are: 'null', 'Null', 'NULL', '' (empty string), and ' ' (single space). The user can specify additional items to check for using the add_fuzzy_nulls argument.

GF.check_fuzzy_nulls(df['state'], add_fuzzy_nulls=['unknown'])
fuzzy_nulls_present fuzzy_null_count prop_fuzzy_null
0 True 2 0.2
GF.check_fuzzy_nulls(df, add_fuzzy_nulls=['unknown'])
column fuzzy_nulls_present fuzzy_null_count prop_fuzzy_null
0 id False 0 0
1 name False 0 0
2 state True 2 0.2
3 platform False 0 0
4 app_inst False 0 0
5 lylty False 0 0
6 spend False 0 0

Unique features

Description

The methods within UniqueFeatures are intended for use with data where each observation has a unique value. Datasurveyor expects unique features to be stored as datetime, object (string), or integer type. In the example data, id is a unique feature.

Importing UniqueFeatures

The unique feature tools can be imported with the command below.

from datasurveyor import UniqueFeatures as UF

Checking uniqueness

The check_uniqueness method can be used to check if potentially unique features contain unique values. This method can be applied to a single unique feature or a collection of unique features.

UF.check_uniqueness(sample_df['id'])
dupes_present dupe_count prop_dupe
0 False 0 0
UF.check_uniqueness(df[['id', 'name']])
column dupes_present dupe_count prop_dupe
0 id False 0 0
1 name True 1 0.1

Contributing to datasurveyor

If you are interested in contributing to this project:

  1. Fork the datasurveyor repo.
  2. Clone the forked repository to your machine.
  3. Create a git branch.
  4. Make changes and push them to GitHub.
  5. Submit your changes for review by creating a pull request. In order to be approved changes should include:
    • Appropriate updates to the README.md
    • Google style docstrings
    • Tests providing proper coverage of new code

Testing

For those interested in contributing to datasurveyor forking and editing the project, pytest is the testing framework used. To run the tests, create a virtual environment, install the contents of dev_requirements.txt, and run the following command from the root directory of the project. The testing scripts can be found in the tests/ directory.

$ pytest

To run tests and view coverage, use the below command:

$ pytest --cov=datasurveyor

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

datasurveyor-0.0.1.tar.gz (13.3 kB view hashes)

Uploaded Source

Built Distribution

datasurveyor-0.0.1-py2.py3-none-any.whl (9.8 kB view hashes)

Uploaded Python 2 Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page