diver is a series of tools to speed up common feature-set investigation, conditioning and encoding for common ML algorithms
DIVER
Diver is the Dataset Inspector, Visualiser and Encoder library, automating and codifying common data science project steps as standardised and reusable methods.
See example-notebooks/house-price-demo.ipynb for a full walkthrough.
dataset_inspector
A set of functions that check for common dataset issues which can impact machine learning model performance.
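As an illustration of the kind of checks meant here, the following is a minimal pandas sketch. The function name `inspect_dataset`, its `max_cardinality` parameter and the report keys are assumptions made for this example; they are not taken from diver's actual API.

```python
import numpy as np
import pandas as pd


def inspect_dataset(df: pd.DataFrame, max_cardinality: int = 10) -> dict:
    """Hypothetical helper: report a few common dataset issues."""
    # Treat every non-numeric column as categorical for this sketch
    categorical = df.select_dtypes(exclude=["number"])
    return {
        # Fraction of missing values, only for columns that have any
        "missing_fraction": {
            col: frac for col, frac in df.isna().mean().items() if frac > 0
        },
        # Columns holding a single distinct value carry no signal
        "constant_columns": [
            col for col in df.columns if df[col].nunique(dropna=True) <= 1
        ],
        # Categorical columns with more distinct levels than max_cardinality
        "high_cardinality": [
            col for col in categorical.columns
            if categorical[col].nunique() > max_cardinality
        ],
        # Number of rows that exactly repeat an earlier row
        "duplicate_rows": int(df.duplicated().sum()),
    }


demo = pd.DataFrame({
    "area": [50.0, 60.0, np.nan, 50.0],
    "city": ["a", "b", "c", "a"],
    "country": ["x", "x", "x", "x"],
})
report = inspect_dataset(demo, max_cardinality=2)
print(report)
```

Returning a plain dictionary keeps the report easy to log or assert on before any encoding step runs.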
dataset_conditioner
A scikit-learn-formatted module which can perform various data-type encodings in a single go, and save the associated attributes from a train-set encoding to reuse on a test-set encoding:
- The .fit_transform method learns various encodings (feature means and variances; categorical feature elements - yellow in the flow chart below) and then performs the various encodings on the feature train set
- The .transform method applies train-set encodings to a test set
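The fit_transform / transform split can be sketched with a small pandas-only class. `TinyConditioner` and its attributes are hypothetical stand-ins written for this example, not diver's implementation; the sketch only shows how statistics learned on the train set are stored and reused on the test set.

```python
import pandas as pd


class TinyConditioner:
    """Hypothetical, pandas-only stand-in for a train/test encoder."""

    def fit_transform(self, train: pd.DataFrame) -> pd.DataFrame:
        self.numeric_ = list(train.select_dtypes(include="number").columns)
        self.categorical_ = [c for c in train.columns if c not in self.numeric_]
        # Learn attributes from the train set only
        self.means_ = train[self.numeric_].mean()
        # Guard against zero variance so the division below stays finite
        self.stds_ = train[self.numeric_].std(ddof=0).replace(0, 1.0)
        self.levels_ = {
            c: sorted(train[c].dropna().unique()) for c in self.categorical_
        }
        return self.transform(train)

    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        # Standardise with the stored train-set means and deviations
        out = (df[self.numeric_] - self.means_) / self.stds_
        # One-hot encode against the stored train-set levels;
        # levels unseen during training become all-zero rows
        for col in self.categorical_:
            for level in self.levels_[col]:
                out[f"{col}_{level}"] = (df[col] == level).astype(int)
        return out


train = pd.DataFrame({"area": [50.0, 70.0], "city": ["a", "b"]})
test = pd.DataFrame({"area": [60.0], "city": ["c"]})

conditioner = TinyConditioner()
train_encoded = conditioner.fit_transform(train)
test_encoded = conditioner.transform(test)
```

Because the test set is encoded against the stored levels, both outputs share the same columns even when the test set contains a category that never appeared in training.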
dataset_visualiser
Functions for visualising aspects of the dataset
Correlation analysis
- Display the correlation matrix for the top n correlating features (n specified by the user) against the dependent variable (at the bottom row of the matrix)
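One way to build such a matrix with pandas is sketched below. The helper name `top_n_correlation_matrix` is an assumption for illustration, and ranking by absolute Pearson correlation is one reasonable reading of "top n correlating features", not necessarily the library's choice.

```python
import pandas as pd


def top_n_correlation_matrix(df: pd.DataFrame, target: str, n: int) -> pd.DataFrame:
    """Hypothetical helper: correlation matrix of the n features most
    correlated with the target, with the target as the last row and column."""
    corr = df.corr(numeric_only=True)
    # Rank features by absolute correlation with the target
    ranked = corr[target].drop(target).abs().sort_values(ascending=False)
    selected = list(ranked.index[:n]) + [target]
    return corr.loc[selected, selected]


houses = pd.DataFrame({
    "area": [2, 4, 6, 8, 10],
    "rooms": [1, 1, 2, 2, 3],
    "noise": [3, 1, 4, 1, 5],
    "price": [1, 2, 3, 4, 5],
})
matrix = top_n_correlation_matrix(houses, target="price", n=2)
print(matrix)
```

The resulting DataFrame can be handed to a heatmap function such as `seaborn.heatmap` for display.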
Future Work
categorical_excess_cardinality_flagger_and_reducer
- Option for instances where there are no categorical features
missing_value_conditioner
- Choose between either {use means from train set (default), calculate means for test set}
- Missing values for categorical features
- Implement missing value imputation: https://measuringu.com/handle-missing-data/
- GOOD READING: https://towardsdatascience.com/6-different-ways-to-compensate-for-missing-values-data-imputation-with-examples-6022d9ca0779
ordinal_encoder
- Create a function to do this
timestamp_encoder
- is_public_holiday : bool
- Update above diagram
Remove warnings
Make robust to non-consecutive indices in input df
Unit test all functions
Extreme values
PCA option?
Label balanced class checker (for classification problems)
Distribution and correlation analysis
- Display correlation matrix for top n correlates alongside target at the bottom
- Display pairplot for top n correlates alongside target at the bottom
- Or, instead of top n correlates, use a threshold of cumulative variance
- Option to drop lower correlates (lower than threshold) if desired
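The two options listed under missing_value_conditioner (train-set means versus test-set means) could be sketched as follows. The function `impute_with_means` and its `use_train_means` flag are hypothetical names chosen for this example, and only numeric columns are handled.

```python
import numpy as np
import pandas as pd


def impute_with_means(
    train: pd.DataFrame, test: pd.DataFrame, use_train_means: bool = True
) -> pd.DataFrame:
    """Hypothetical helper: fill numeric NaNs in test with column means."""
    numeric = train.select_dtypes(include="number").columns
    # Pick which set the means come from; train is the default
    source = train if use_train_means else test
    means = source[numeric].mean()
    filled = test.copy()
    filled[numeric] = filled[numeric].fillna(means)
    return filled


train = pd.DataFrame({"area": [50.0, 70.0]})
test = pd.DataFrame({"area": [100.0, np.nan]})

from_train = impute_with_means(train, test)
from_test = impute_with_means(train, test, use_train_means=False)
```

Defaulting to train-set means keeps test-set statistics out of the preprocessing step, which matches the default named in the list above.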
Useful reading