Diver is a set of tools that speed up feature-set investigation, conditioning and encoding for common ML algorithms.

DIVER

Diver is the Dataset Inspector, Visualiser and Encoder library, automating and codifying common data science project steps as standardised and reusable methods.

See example-notebooks/house-price-demo.ipynb for a full walkthrough.

dataset_inspector

A set of functions that check for common dataset issues which can degrade machine learning model performance.

inspector flow
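The kind of checks described above can be sketched with plain pandas. This is only an illustration under stated assumptions: `inspect_dataset`, its `max_cardinality` parameter and the particular checks chosen are hypothetical and are not taken from diver's API.

```python
import pandas as pd


def inspect_dataset(df: pd.DataFrame, max_cardinality: int = 10) -> dict:
    """Hypothetical helper (not part of diver): report a few common dataset issues."""
    # Treat every non-numeric column as categorical for this sketch
    categorical = df.select_dtypes(exclude="number")
    return {
        # Fraction of missing values, listed only for columns that have any
        "missing_fraction": {
            col: frac for col, frac in df.isna().mean().items() if frac > 0
        },
        # Columns holding a single value carry no information for a model
        "constant_columns": [
            col for col in df.columns if df[col].nunique(dropna=False) <= 1
        ],
        # Categorical columns with many distinct levels inflate one-hot encodings
        "high_cardinality": [
            col for col in categorical.columns
            if categorical[col].nunique() > max_cardinality
        ],
        # Number of rows that exactly repeat an earlier row
        "duplicate_rows": int(df.duplicated().sum()),
    }


example = pd.DataFrame(
    {
        "area": [50.0, None, 72.0, 72.0],
        "city": ["A", "B", "C", "C"],
        "country": ["UK", "UK", "UK", "UK"],
    }
)
report = inspect_dataset(example, max_cardinality=2)
```

Returning a plain dictionary keeps the report easy to log or assert on before any encoding is attempted.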

dataset_conditioner

A scikit-learn-style module which performs various data-type encodings in a single pass, and saves the attributes learned from a train-set encoding for reuse on a test-set encoding:

  • The .fit_transform method learns the encodings from the feature train set (feature means and variances; categorical feature elements; shown in yellow in the flow chart below) and then applies them to that train set
  • The .transform method applies the stored train-set encodings to a test set

fit_transform flow
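The fit_transform / transform split described above follows the usual scikit-learn pattern, which can be sketched as follows. `SimpleConditioner` is a hypothetical stand-in written for illustration, not diver's actual class; it assumes standardisation for numeric features and one-hot encoding for categorical ones.

```python
import pandas as pd


class SimpleConditioner:
    """Hypothetical illustration of the fit_transform / transform pattern (not diver's API)."""

    def fit_transform(self, X: pd.DataFrame) -> pd.DataFrame:
        # Learn and store the train-set attributes
        self.numeric_cols_ = list(X.select_dtypes(include="number").columns)
        self.categorical_cols_ = [c for c in X.columns if c not in self.numeric_cols_]
        self.means_ = X[self.numeric_cols_].mean()
        # Guard against division by zero for constant columns
        self.stds_ = X[self.numeric_cols_].std(ddof=0).replace(0, 1.0)
        self.categories_ = {
            c: sorted(X[c].dropna().unique()) for c in self.categorical_cols_
        }
        return self.transform(X)

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        # Standardise numeric features with the stored train-set statistics
        out = (X[self.numeric_cols_] - self.means_) / self.stds_
        # One-hot encode using only the categories seen during fitting
        for col, cats in self.categories_.items():
            for cat in cats:
                out[f"{col}_{cat}"] = (X[col] == cat).astype(int)
        return out


train = pd.DataFrame({"area": [50.0, 70.0, 90.0], "type": ["flat", "house", "flat"]})
test = pd.DataFrame({"area": [70.0, 110.0], "type": ["house", "bungalow"]})

conditioner = SimpleConditioner()
train_encoded = conditioner.fit_transform(train)
test_encoded = conditioner.transform(test)
```

Because the categories are stored at fit time, the unseen test-set value "bungalow" produces all-zero indicator columns instead of a new column, so train and test keep the same feature layout.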

dataset_visualiser

Functions for visualising aspects of the dataset.

Correlation analysis

  • Display the correlation matrix for the n features that correlate most strongly with the dependent variable (n specified by the user), with the dependent variable in the bottom row of the matrix

correlation
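The selection behind such a plot can be sketched in pandas. `top_n_correlation_matrix` is a hypothetical helper, not diver's function, and ranking by absolute Pearson correlation is an assumption; the resulting matrix could then be passed to any heatmap routine.

```python
import pandas as pd


def top_n_correlation_matrix(df: pd.DataFrame, target: str, n: int) -> pd.DataFrame:
    """Hypothetical helper: correlations among the n features most correlated with target."""
    corr = df.corr(numeric_only=True)
    # Rank features by absolute correlation with the target, excluding the target itself
    ranked = corr[target].drop(target).abs().sort_values(ascending=False)
    # Put the target last so it appears as the bottom row of the matrix
    selected = list(ranked.index[:n]) + [target]
    return corr.loc[selected, selected]


houses = pd.DataFrame(
    {
        "area": [1, 2, 3, 4, 5, 6],
        "rooms": [1, 1, 2, 2, 3, 4],
        "noise": [2, 9, 1, 8, 2, 7],
        "price": [10, 20, 31, 39, 52, 60],
    }
)
matrix = top_n_correlation_matrix(houses, target="price", n=2)
```

Here the weakly correlated "noise" column is left out, and "price" occupies the last row and column.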

Future Work

categorical_excess_cardinality_flagger_and_reducer

  • Option for instances where there are no categorical features

missing_value_conditioner

ordinal_encoder

  • Create a function to do this

timestamp_encoder

  • is_public_holiday : bool
  • Update above diagram

Remove warnings

Make robust to non-consecutive indices in input df
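One way such a problem can arise, assuming it stems from pandas index alignment (the source does not say), is when an encoded frame with a fresh 0..n-1 index is joined back onto an input frame whose index is not consecutive. The frames below are made up for illustration.

```python
import pandas as pd

# Input frame with a non-consecutive index, e.g. after filtering or shuffling
df = pd.DataFrame({"area": [50.0, 70.0, 90.0]}, index=[10, 42, 7])
# Encoded columns built from raw values get a fresh 0..n-1 index
encoded = pd.DataFrame({"type_flat": [1, 0, 1]})

# Column-wise concat aligns on index labels, so nothing lines up here
misaligned = pd.concat([df, encoded], axis=1)

# Resetting the input index first restores row-by-row alignment
aligned = pd.concat([df.reset_index(drop=True), encoded], axis=1)
```

If the original index has to be preserved, assigning `encoded.index = df.index` before concatenating is an alternative to resetting it.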

Unit test all functions

Extreme values

PCA option?

Label balanced class checker (for classification problems)
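A minimal sketch of what such a checker might look like follows. `check_class_balance` and its `min_fraction` threshold are hypothetical and only illustrate the idea.

```python
import pandas as pd


def check_class_balance(y: pd.Series, min_fraction: float = 0.2) -> dict:
    """Hypothetical helper: flag classes whose share of the labels is below min_fraction."""
    # Share of each class among the labels
    fractions = y.value_counts(normalize=True)
    under = fractions[fractions < min_fraction]
    return {
        "fractions": fractions.to_dict(),
        "under_represented": sorted(under.index),
    }


labels = pd.Series(["no"] * 9 + ["yes"])
balance = check_class_balance(labels, min_fraction=0.2)
```

In this example "yes" makes up a tenth of the labels and is reported as under-represented.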

Distribution and correlation analysis

  • Display the correlation matrix for the top n correlates, with the target in the bottom row
  • Display a pairplot for the top n correlates, with the target in the bottom row
  • Alternatively, instead of the top n correlates, select features by a threshold on cumulative variance
  • Option to drop lower correlates (those below the threshold) if desired
