diver is a series of tools to speed up common feature-set investigation, conditioning and encoding for common ML algorithms

DIVER

Diver is the Dataset Inspector, Visualiser and Encoder library, automating and codifying common data science project steps as standardised and reusable methods.

See example-notebooks/house-price-demo.ipynb for a full walkthrough.

dataset_inspector

A set of functions that check for common dataset issues which can degrade machine learning model performance.
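As a rough illustration of the kind of checks meant here, the sketch below uses plain pandas to report missing values, constant columns and duplicate rows. The function name `inspect_dataset` and the choice of checks are assumptions made for this example; they are not taken from diver's API.

```python
import numpy as np
import pandas as pd


def inspect_dataset(df: pd.DataFrame) -> dict:
    """Hypothetical helper (not diver's API): report a few common dataset issues."""
    return {
        # Fraction of missing values, listed only for columns that have any
        "missing_fraction": {
            col: frac for col, frac in df.isna().mean().items() if frac > 0
        },
        # Columns with at most one distinct non-null value carry no signal
        "constant_columns": [
            col for col in df.columns if df[col].nunique(dropna=True) <= 1
        ],
        # Number of rows that exactly repeat an earlier row
        "duplicate_rows": int(df.duplicated().sum()),
    }


df = pd.DataFrame(
    {
        "area": [50.0, 60.0, np.nan, 50.0],
        "city": ["A", "B", "B", "A"],
        "country": ["X", "X", "X", "X"],
    }
)
report = inspect_dataset(df)
print(report)
```

Here `area` has one missing value in four rows, `country` never varies, and the last row repeats the first, so each of the three checks reports something.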

[Figure: inspector flow]

dataset_conditioner

A scikit-learn-style module which performs various data-type encodings in a single pass, and saves the attributes learned from a train-set encoding for reuse on a test set:

  • The .fit_transform method learns the encoding parameters (feature means and variances; the categories of each categorical feature - yellow in the flow chart below) and then applies the encodings to the train-set features
  • The .transform method applies the encodings learned from the train set to a test set
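The fit/transform split described above can be sketched with a small stand-in class. This is a minimal illustration of the pattern under stated assumptions, not diver's implementation: the class name `SimpleConditioner`, its attributes, and the use of standard deviations (rather than variances) for scaling are all choices made for this example.

```python
import pandas as pd


class SimpleConditioner:
    """Hypothetical illustration of the fit/transform pattern; not diver's code."""

    def fit_transform(self, df: pd.DataFrame) -> pd.DataFrame:
        self.numeric_cols_ = df.select_dtypes(include="number").columns.tolist()
        self.categorical_cols_ = [c for c in df.columns if c not in self.numeric_cols_]
        # Learn per-feature means and standard deviations from the train set
        self.means_ = df[self.numeric_cols_].mean()
        self.stds_ = df[self.numeric_cols_].std(ddof=0).replace(0, 1.0)
        # Learn which categories each categorical feature takes in the train set
        self.categories_ = {
            c: sorted(df[c].dropna().unique()) for c in self.categorical_cols_
        }
        return self.transform(df)

    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        # Scale with the statistics learned from the train set, never the test set
        scaled = (df[self.numeric_cols_] - self.means_) / self.stds_
        encoded = []
        for col, cats in self.categories_.items():
            # Fixing the categories keeps one-hot columns identical for train and test
            fixed = pd.Series(pd.Categorical(df[col], categories=cats), index=df.index)
            encoded.append(pd.get_dummies(fixed, prefix=col, dtype=float))
        return pd.concat([scaled, *encoded], axis=1)


train = pd.DataFrame({"area": [50.0, 70.0, 90.0], "city": ["A", "B", "A"]})
test = pd.DataFrame({"area": [70.0, 110.0], "city": ["B", "C"]})

conditioner = SimpleConditioner()
train_enc = conditioner.fit_transform(train)
test_enc = conditioner.transform(test)
print(test_enc)
```

Because the categories are fixed at fit time, the test set gets the same columns as the train set, and the unseen category `"C"` is encoded as all zeros instead of creating a new column.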

[Figure: fit_transform flow]

dataset_visualiser

Functions for visualising aspects of the dataset

Correlation analysis

  • Display the correlation matrix for the top n features (n specified by the user) most correlated with the dependent variable, which appears in the bottom row of the matrix
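The selection step behind such a plot can be sketched in pandas. The helper name `top_n_correlation_matrix` and the ranking by absolute Pearson correlation are assumptions for this example, and plotting is left out so the snippet only needs pandas.

```python
import pandas as pd


def top_n_correlation_matrix(df: pd.DataFrame, target: str, n: int) -> pd.DataFrame:
    """Hypothetical helper: correlations among the n features most correlated with target."""
    corr = df.select_dtypes(include="number").corr()
    # Rank features by absolute correlation with the target, excluding the target itself
    ranked = corr[target].drop(target).abs().sort_values(ascending=False)
    top = ranked.index[:n].tolist()
    # Put the target last so that it forms the bottom row of the matrix
    order = top + [target]
    return corr.loc[order, order]


df = pd.DataFrame(
    {
        "area": [50, 60, 70, 80, 90],
        "rooms": [2, 2, 3, 3, 5],
        "noise": [3, 1, 4, 1, 5],
        "price": [100, 120, 145, 160, 185],
    }
)
matrix = top_n_correlation_matrix(df, target="price", n=2)
print(matrix.round(2))
```

The returned frame could then be passed to a heatmap function of the reader's choice.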

[Figure: correlation]

Future Work

categorical_excess_cardinality_flagger_and_reducer

  • Option for instances where there are no categorical features

missing_value_conditioner

ordinal_encoder

  • Create a function to do this

timestamp_encoder

  • is_public_holiday : bool
  • Update above diagram

Remove warnings

Make robust to non-consecutive indices in input df
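One common way non-consecutive indices cause trouble in pandas code is index alignment during concatenation. Whether this is the cause in diver is not stated, so the snippet below is only a sketch of the general problem and one possible guard.

```python
import pandas as pd

# A frame whose index has gaps, e.g. after filtering rows or a train/test split
df = pd.DataFrame({"area": [50.0, 70.0, 90.0]}, index=[0, 5, 9])

# A column derived from raw values gets a fresh 0..n-1 index
derived = pd.DataFrame({"area_sq": df["area"].to_numpy() ** 2})

# Index-based alignment mismatches the rows and introduces NaNs
misaligned = pd.concat([df, derived], axis=1)

# One possible guard: normalise the index up front and restore it afterwards
original_index = df.index
clean = df.reset_index(drop=True)
aligned = pd.concat([clean, derived], axis=1)
aligned.index = original_index

print(misaligned)
print(aligned)
```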

Unit test all functions

Extreme values

PCA option?

Label balanced class checker (for classification problems)
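A minimal sketch of what such a checker might look like, assuming the labels arrive as a pandas Series. The name `check_class_balance` and the `min_fraction` threshold are hypothetical and only illustrate the idea.

```python
import pandas as pd


def check_class_balance(labels: pd.Series, min_fraction: float = 0.1) -> pd.Series:
    """Hypothetical helper: return the classes whose share of labels is below min_fraction."""
    shares = labels.value_counts(normalize=True)
    return shares[shares < min_fraction]


labels = pd.Series(["no"] * 95 + ["yes"] * 5)
rare = check_class_balance(labels, min_fraction=0.1)
print(rare)
```

An empty result means no class falls below the threshold.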

Distribution and correlation analysis

  • Display correlation matrix for top n correlates alongside target at the bottom
  • Display pairplot for top n correlates alongside target at the bottom
  • Or, instead of the top n correlates, select by a threshold of cumulative variance
  • Option to DROP lower correlates (lower than threshold) if desired

Useful reading
