Data exploration tools.
Datasurveyor
Author:
Nick Buker
Introduction:
Datasurveyor is a small collection of tools for exploratory data analysis. It leverages Pandas, but the tools are able to ingest either DataFrames or Series. The output is a tidy DataFrame for easy viewing of results. Currently, datasurveyor focuses on rapidly identifying data quality issues, but the scope will likely expand as the package becomes "battle tested".
Table of contents:
Installing datasurveyor:
Datasurveyor installation instructions
Using datasurveyor:
Contributing and Testing:
Installing datasurveyor:
Datasurveyor can be installed via pip. As always, using a project-level virtual environment is recommended. Note: datasurveyor requires Python >= 3.6.
$ pip install datasurveyor
Using Datasurveyor
To demonstrate the tools available in datasurveyor, let's use a Pandas DataFrame named df.
| | id | name | state | platform | app_inst | lylty | spend |
---|---|---|---|---|---|---|---|
0 | 1 | Nick | WA | ios | True | 0 | 0 |
1 | 2 | Gina | OR | android | True | 1 | nan |
2 | 3 | Rob | WA | ios | False | 0 | 10 |
3 | 4 | Adam | ID | web | True | 1 | 150 |
4 | 5 | Hanna | WA | ios | True | 1 | 12 |
5 | 6 | Susan | Null | android | False | 0 | 0 |
6 | 7 | Quentin | WA | ios | True | 1 | nan |
7 | 8 | Caitlyn | unknown | web | True | 0 | 8 |
8 | 9 | Matt | WA | web | True | 1 | 50 |
9 | 10 | Nick | WA | ios | True | 0 | -10 |
A data dictionary for df is below.
column | dtype | description |
---|---|---|
id | int64 | unique customer identifier |
name | object | customer name |
state | object | state of residence |
platform | object | system platform |
app_inst | bool | app installation flag |
lylty | int64 | loyalty program flag |
spend | float64 | total customer spend |
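For readers who want to follow along, the example table can be rebuilt by hand. The snippet below is a sketch derived from the table and data dictionary above, not code shipped with datasurveyor. It assumes the nan entries in spend are missing values and that Null and unknown in state are plain strings.

```python
import numpy as np
import pandas as pd

# Values copied row by row from the example table above.
df = pd.DataFrame(
    {
        "id": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        "name": ["Nick", "Gina", "Rob", "Adam", "Hanna",
                 "Susan", "Quentin", "Caitlyn", "Matt", "Nick"],
        "state": ["WA", "OR", "WA", "ID", "WA",
                  "Null", "WA", "unknown", "WA", "WA"],
        "platform": ["ios", "android", "ios", "web", "ios",
                     "android", "ios", "web", "web", "ios"],
        "app_inst": [True, True, False, True, True,
                     False, True, True, True, True],
        "lylty": [0, 1, 0, 1, 1, 0, 1, 0, 1, 0],
        # Assumption: "nan" in the table means a missing float value.
        "spend": [0, np.nan, 10, 150, 12, 0, np.nan, 8, 50, -10],
    }
)

# Inspect the inferred dtypes; string columns may show up as object or as a
# dedicated string dtype depending on the installed pandas version.
print(df.dtypes)
```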
Binary features
Description
The methods within BinaryFeatures are intended for use with binary data (data with two possible values). Datasurveyor expects binary features to be stored as bools or integers (with values of 0 or 1). In the example data, app_inst and lylty are binary features.
Importing BinaryFeatures
The binary feature tools can be imported with the command below.
from datasurveyor import BinaryFeatures as BF
Checking if all values are the same
The check_all_same method can be used to check if binary features contain exclusively the same value. This method can be applied to a single binary feature or a collection of binary features.
BF.check_all_same(df['app_inst'])
| | all_same |
---|---|
0 | False |
BF.check_all_same(df[['app_inst', 'lylty']])
| | column | all_same |
---|---|---|
0 | app_inst | False |
1 | lylty | False |
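To make the idea behind this check concrete, here is a minimal sketch in plain pandas. It is not datasurveyor's implementation; the helper name all_same is hypothetical, and it treats a feature as "all the same" when it holds exactly one distinct value.

```python
import pandas as pd

# The app_inst column from the example data.
app_inst = pd.Series([True, True, False, True, True,
                      False, True, True, True, True])


def all_same(feature: pd.Series) -> bool:
    """Hypothetical helper: True when the feature has one distinct value."""
    return bool(feature.nunique(dropna=False) == 1)


print(all_same(app_inst))              # False: both True and False occur
print(all_same(pd.Series([1, 1, 1])))  # True: only the value 1 occurs
```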
Checking if values are mostly the same
The check_mostly_same method can be used to check if binary features contain mostly the same value (default threshold 95%). This method can be applied to a single binary feature or a collection of binary features.
BF.check_mostly_same(df['app_inst'])
| | mostly_same | thresh | mean |
---|---|---|---|
0 | False | 0.95 | 0.8 |
BF.check_mostly_same(df[['app_inst', 'lylty']])
| | column | mostly_same | thresh | mean |
---|---|---|---|---|
0 | app_inst | False | 0.95 | 0.8 |
1 | lylty | False | 0.95 | 0.5 |
The user can specify whatever threshold is appropriate for their use case. If thresh=0.7 is applied, the method will flag features in which at least 70% of the values are the same.
BF.check_mostly_same(df['app_inst'], thresh=0.7)
| | mostly_same | thresh | mean |
---|---|---|---|
0 | True | 0.7 | 0.8 |
BF.check_mostly_same(df[['app_inst', 'lylty']], thresh=0.7)
| | column | mostly_same | thresh | mean |
---|---|---|---|---|
0 | app_inst | True | 0.7 | 0.8 |
1 | lylty | False | 0.7 | 0.5 |
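The threshold logic can be sketched with the feature mean, which for 0/1 data equals the share of ones. This is an illustration only, not the package's code: the helper name mostly_same is hypothetical, and treating a feature that is mostly zeros the same way as one that is mostly ones is an assumption.

```python
import pandas as pd

app_inst = pd.Series([True, True, False, True, True,
                      False, True, True, True, True])
lylty = pd.Series([0, 1, 0, 1, 1, 0, 1, 0, 1, 0])


def mostly_same(feature: pd.Series, thresh: float = 0.95) -> bool:
    """Hypothetical helper: flag a 0/1 feature dominated by one value."""
    mean = float(feature.mean())  # share of ones (or True values)
    # Assumption: the dominant value may be either 1 or 0.
    return max(mean, 1 - mean) >= thresh


print(mostly_same(app_inst))              # False: 0.8 is below 0.95
print(mostly_same(app_inst, thresh=0.7))  # True: 0.8 is at least 0.7
print(mostly_same(lylty, thresh=0.7))     # False: values are split 50/50
```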
Checking the range
The check_outside_range method can be used to detect features with data outside the expected range of 0 and 1. Note that the out-of-range condition is only possible for binary features stored as an integer data type.
BF.check_outside_range(df['app_inst'])
| | outside_range |
---|---|
0 | False |
BF.check_outside_range(df[['app_inst', 'lylty']])
| | column | outside_range |
---|---|---|
0 | app_inst | False |
1 | lylty | False |
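As a rough illustration of what an out-of-range value looks like, the sketch below flags any integer other than 0 or 1. The helper name outside_range and the bad_flag series are hypothetical and exist only for this example; this is not how datasurveyor is necessarily implemented.

```python
import pandas as pd

lylty = pd.Series([0, 1, 0, 1, 1, 0, 1, 0, 1, 0])
# Hypothetical column with a stray 2, as might result from a data entry error.
bad_flag = pd.Series([0, 1, 2, 1, 0])


def outside_range(feature: pd.Series) -> bool:
    """Hypothetical helper: True when any value is neither 0 nor 1."""
    return bool((~feature.isin([0, 1])).any())


print(outside_range(lylty))     # False: only 0 and 1 occur
print(outside_range(bad_flag))  # True: the value 2 is out of range
```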
Categorical features
Description
The methods within CategoricalFeatures are intended for use with categorical data (data denoting categories). Datasurveyor expects categorical features to be stored as object (string) or integer type. In the example data, state and platform are categorical features.
Importing CategoricalFeatures
The categorical feature tools can be imported with the command below.
from datasurveyor import CategoricalFeatures as CF
Checking if values are mostly the same
The check_mostly_same method can be used to check if categorical features contain mostly the same value (default threshold 95%). This method can be applied to a single categorical feature or a collection of categorical features.
CF.check_mostly_same(df['state'])
| | mostly_same | thresh | most_common | count | prop |
---|---|---|---|---|---|
0 | False | 0.95 | WA | 6 | 0.6 |
CF.check_mostly_same(df[['state', 'platform']])
| | column | mostly_same | thresh | most_common | count | prop |
---|---|---|---|---|---|---|
0 | state | False | 0.95 | WA | 6 | 0.6 |
1 | platform | False | 0.95 | ios | 5 | 0.5 |
The user can specify whatever threshold is appropriate for their use case. If thresh=0.6 is applied, the method will flag features in which at least 60% of the values are the same.
CF.check_mostly_same(df['state'], thresh=0.6)
| | mostly_same | thresh | most_common | count | prop |
---|---|---|---|---|---|
0 | True | 0.6 | WA | 6 | 0.6 |
CF.check_mostly_same(df[['state', 'platform']], thresh=0.6)
| | column | mostly_same | thresh | most_common | count | prop |
---|---|---|---|---|---|---|
0 | state | True | 0.6 | WA | 6 | 0.6 |
1 | platform | False | 0.6 | ios | 5 | 0.5 |
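The most_common, count, and prop columns can be reproduced with pandas value counts. The following is a sketch under the assumption that prop is the count of the most frequent value divided by the number of rows; the helper name mostly_same_summary is hypothetical and this is not the package's own code.

```python
import pandas as pd

state = pd.Series(["WA", "OR", "WA", "ID", "WA",
                   "Null", "WA", "unknown", "WA", "WA"])


def mostly_same_summary(feature: pd.Series, thresh: float = 0.95) -> dict:
    """Hypothetical helper: summarize the most frequent category."""
    counts = feature.value_counts()  # sorted with the most frequent first
    count = int(counts.iloc[0])
    prop = count / len(feature)
    return {
        "mostly_same": prop >= thresh,
        "thresh": thresh,
        "most_common": counts.index[0],
        "count": count,
        "prop": prop,
    }


print(mostly_same_summary(state))              # WA, 6 of 10 rows, not flagged
print(mostly_same_summary(state, thresh=0.6))  # flagged at the 0.6 threshold
```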
Checking number of categories
The check_n_categories method can be used to count the number of categories. This method can be applied to a single categorical feature or a collection of categorical features.
CF.check_n_categories(df['state'])
| | n_categories |
---|---|
0 | 4 |
CF.check_n_categories(df[['state', 'platform']])
| | column | n_categories |
---|---|---|
0 | state | 4 |
1 | platform | 3 |
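For comparison, a distinct-value count in plain pandas looks like the sketch below, shown for the platform column. It only illustrates the idea of counting categories and is not datasurveyor's implementation.

```python
import pandas as pd

platform = pd.Series(["ios", "android", "ios", "web", "ios",
                      "android", "ios", "web", "web", "ios"])

# Count the distinct values in the column: ios, android, and web.
n_categories = platform.nunique()
print(n_categories)  # 3
```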
General features
Description
The methods within GeneralFeatures are intended for use with any data. Datasurveyor expects inputs to be of type Pandas Series or DataFrame, but has no type expectations for the data within those structures.
Importing GeneralFeatures
The general feature tools can be imported with the command below.
from datasurveyor import GeneralFeatures as GF
Checking for nulls
The check_nulls method can be used to check for nulls. This method can be applied to a single feature or a collection of features.
GF.check_nulls(df['spend'])
| | nulls_present | null_count | prop_null |
---|---|---|---|
0 | True | 2 | 0.2 |
GF.check_nulls(df)
| | column | nulls_present | null_count | prop_null |
---|---|---|---|---|
0 | id | False | 0 | 0 |
1 | name | False | 0 | 0 |
2 | state | False | 0 | 0 |
3 | platform | False | 0 | 0 |
4 | app_inst | False | 0 | 0 |
5 | lylty | False | 0 | 0 |
6 | spend | True | 2 | 0.2 |
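The three reported columns correspond to simple pandas operations. The sketch below recomputes them for spend by hand, assuming prop_null is the null count divided by the number of rows; it is an illustration, not the package's code.

```python
import numpy as np
import pandas as pd

spend = pd.Series([0, np.nan, 10, 150, 12, 0, np.nan, 8, 50, -10])

null_count = int(spend.isna().sum())   # number of missing values
nulls_present = null_count > 0
prop_null = null_count / len(spend)    # share of rows that are missing

print(nulls_present, null_count, prop_null)  # True 2 0.2
```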
Checking for fuzzy nulls
The check_fuzzy_nulls method can be used to check for values that commonly denote nulls. This method can be applied to a single feature or a collection of features.
GF.check_fuzzy_nulls(df['state'])
| | fuzzy_nulls_present | fuzzy_null_count | prop_fuzzy_null |
---|---|---|---|
0 | True | 1 | 0.1 |
GF.check_fuzzy_nulls(df)
| | column | fuzzy_nulls_present | fuzzy_null_count | prop_fuzzy_null |
---|---|---|---|---|
0 | id | False | 0 | 0 |
1 | name | False | 0 | 0 |
2 | state | True | 1 | 0.1 |
3 | platform | False | 0 | 0 |
4 | app_inst | False | 0 | 0 |
5 | lylty | False | 0 | 0 |
6 | spend | False | 0 | 0 |
The default items checked for are: 'null', 'Null', 'NULL', '' (empty string), and ' ' (single space). The user can specify additional items to check for using the add_fuzzy_nulls argument.
GF.check_fuzzy_nulls(df['state'], add_fuzzy_nulls=['unknown'])
| | fuzzy_nulls_present | fuzzy_null_count | prop_fuzzy_null |
---|---|---|---|
0 | True | 2 | 0.2 |
GF.check_fuzzy_nulls(df, add_fuzzy_nulls=['unknown'])
| | column | fuzzy_nulls_present | fuzzy_null_count | prop_fuzzy_null |
---|---|---|---|---|
0 | id | False | 0 | 0 |
1 | name | False | 0 | 0 |
2 | state | True | 2 | 0.2 |
3 | platform | False | 0 | 0 |
4 | app_inst | False | 0 | 0 |
5 | lylty | False | 0 | 0 |
6 | spend | False | 0 | 0 |
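Conceptually, a fuzzy-null check is a membership test against a list of placeholder strings. The sketch below mirrors the default list stated above and the effect of adding 'unknown'. The helper name count_fuzzy_nulls and the constant DEFAULT_FUZZY_NULLS are hypothetical, and the exact matching rules used by datasurveyor may differ.

```python
import pandas as pd

state = pd.Series(["WA", "OR", "WA", "ID", "WA",
                   "Null", "WA", "unknown", "WA", "WA"])

# Default placeholders as listed in the text above.
DEFAULT_FUZZY_NULLS = ["null", "Null", "NULL", "", " "]


def count_fuzzy_nulls(feature: pd.Series, add_fuzzy_nulls=None) -> int:
    """Hypothetical helper: count values that match a placeholder string."""
    targets = DEFAULT_FUZZY_NULLS + list(add_fuzzy_nulls or [])
    return int(feature.isin(targets).sum())


print(count_fuzzy_nulls(state))                               # 1 ("Null")
print(count_fuzzy_nulls(state, add_fuzzy_nulls=["unknown"]))  # 2
```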
Unique features
Description
The methods within UniqueFeatures are intended for use with data where each observation has a unique value. Datasurveyor expects unique features to be stored as datetime, object (string), or integer type. In the example data, id is a unique feature.
Importing UniqueFeatures
The unique feature tools can be imported with the command below.
from datasurveyor import UniqueFeatures as UF
Checking uniqueness
The check_uniqueness method can be used to check if potentially unique features contain unique values. This method can be applied to a single unique feature or a collection of unique features.
UF.check_uniqueness(df['id'])
| | dupes_present | dupe_count | prop_dupe |
---|---|---|---|
0 | False | 0 | 0 |
UF.check_uniqueness(df[['id', 'name']])
| | column | dupes_present | dupe_count | prop_dupe |
---|---|---|---|---|
0 | id | False | 0 | 0 |
1 | name | True | 1 | 0.1 |
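The duplicate counts in these tables are consistent with counting every repeat after the first occurrence of a value. A minimal pandas sketch of that reading follows; the helper name dupe_summary is hypothetical and this is not necessarily how datasurveyor computes the result.

```python
import pandas as pd

ids = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
names = pd.Series(["Nick", "Gina", "Rob", "Adam", "Hanna",
                   "Susan", "Quentin", "Caitlyn", "Matt", "Nick"])


def dupe_summary(feature: pd.Series) -> dict:
    """Hypothetical helper: count repeats after each value's first occurrence."""
    dupe_count = int(feature.duplicated().sum())
    return {
        "dupes_present": dupe_count > 0,
        "dupe_count": dupe_count,
        "prop_dupe": dupe_count / len(feature),
    }


print(dupe_summary(ids))    # no duplicates
print(dupe_summary(names))  # the second "Nick" counts as one duplicate
```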
Contributing to datasurveyor
If you are interested in contributing to this project:
- Fork the datasurveyor repo.
- Clone the forked repository to your machine.
- Create a git branch.
- Make changes and push them to GitHub.
- Submit your changes for review by creating a pull request. In order to be approved, changes should include:
  - Appropriate updates to the README.md
  - Google-style docstrings
  - Tests providing proper coverage of new code
Testing
For those interested in contributing to datasurveyor, or in forking and editing the project, pytest is the testing framework used. To run the tests, create a virtual environment, install the contents of dev_requirements.txt, and run the following command from the root directory of the project. The testing scripts can be found in the tests/ directory.
$ pytest
To run the tests and view coverage, use the command below:
$ pytest --cov=datasurveyor