diver is a series of tools to speed up common feature-set investigation, conditioning and encoding for common ML algorithms
DIVER
Diver is the Dataset Inspector, Visualiser and Encoder library, automating and codifying common data science project steps as standardised and reusable methods.
See example-notebooks/house-price-demo.ipynb for a full walkthrough.
dataset_inspector
A set of functions which help perform checks for common dataset issues which can impact machine learning model performance.
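As a rough illustration of the kind of checks meant here, the sketch below uses plain pandas to flag missing values, constant columns and duplicated rows. The function name `inspect_dataset` and the report keys are hypothetical and are not part of diver's API.

```python
import pandas as pd


def inspect_dataset(df: pd.DataFrame) -> dict:
    """Collect a few common dataset issues (illustrative; not diver's API)."""
    return {
        # Fraction of missing values per column, only for columns that have any
        "missing_fraction": {
            col: frac for col, frac in df.isna().mean().items() if frac > 0
        },
        # Columns holding a single distinct value carry no signal for a model
        "constant_columns": [
            col for col in df.columns if df[col].nunique(dropna=False) <= 1
        ],
        # Number of rows that exactly repeat an earlier row
        "duplicate_rows": int(df.duplicated().sum()),
    }


# Small made-up example frame
example = pd.DataFrame(
    {
        "area": [50.0, 72.0, None, 72.0],
        "city": ["Leeds", "York", "York", "York"],
        "country": ["UK", "UK", "UK", "UK"],
    }
)
report = inspect_dataset(example)
print(report)
```

Returning a plain dictionary keeps the report easy to log or assert on before any modelling starts.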
dataset_conditioner
A scikit-learn-formatted module which can perform various data-type encodings in a single go, and save the associated attributes from a train-set encoding to reuse on a test-set encoding:
- The `.fit_transform` method learns various encodings (feature means and variances; categorical feature elements - yellow in the flow chart below) and then performs the various encodings on the feature train set
- The `.transform` method applies train-set encodings to a test set
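The fit-on-train, reuse-on-test pattern can be sketched as follows. This is a minimal illustration under stated assumptions, not diver's implementation: the class name `SimpleConditioner`, its attributes, and the choice of standardisation plus one-hot encoding are all hypothetical.

```python
import pandas as pd


class SimpleConditioner:
    """Learn encodings on a train set and reuse them on a test set.

    Hypothetical class for illustration; it is not diver's implementation.
    """

    def fit_transform(self, train: pd.DataFrame) -> pd.DataFrame:
        numeric = train.select_dtypes(include="number").columns
        categorical = train.columns.difference(numeric)
        # Learned attributes: means and standard deviations of numeric features
        self.means_ = train[numeric].mean()
        self.stds_ = train[numeric].std(ddof=0).replace(0, 1.0)
        # Learned attributes: category levels seen in the train set
        self.categories_ = {
            col: sorted(train[col].dropna().unique()) for col in categorical
        }
        return self.transform(train)

    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        # Standardise numeric features using the train-set statistics
        out = (df[self.means_.index] - self.means_) / self.stds_
        # One-hot encode with the train-set levels; unseen levels become all zeros
        for col, levels in self.categories_.items():
            for level in levels:
                out[f"{col}_{level}"] = (df[col] == level).astype(int)
        return out


train = pd.DataFrame({"area": [50.0, 70.0, 90.0], "city": ["Leeds", "York", "York"]})
test = pd.DataFrame({"area": [70.0], "city": ["Bath"]})

conditioner = SimpleConditioner()
train_enc = conditioner.fit_transform(train)
test_enc = conditioner.transform(test)
print(test_enc)
```

Because the test set is encoded with the stored train-set statistics and category levels, both frames end up with the same columns even when the test set contains a category the train set never saw.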
dataset_visualiser
Functions for visualising aspects of the dataset
Correlation analysis
- Display the correlation matrix for the top `n` correlating features (`n` specified by the user) against the dependent variable (at the bottom row of the matrix)
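One way to build such a matrix with pandas is sketched below. The helper name `top_n_correlation_matrix` is hypothetical, and ranking by absolute Pearson correlation is an assumption rather than a description of what diver does. The result could then be passed to any heatmap plotting function.

```python
import pandas as pd


def top_n_correlation_matrix(df: pd.DataFrame, target: str, n: int) -> pd.DataFrame:
    """Correlation matrix of the n features most correlated with `target`.

    Hypothetical helper for illustration; the target is placed last so that
    it forms the bottom row of the matrix.
    """
    corr = df.select_dtypes(include="number").corr()
    # Rank features by absolute correlation with the target, excluding the target
    ranked = corr[target].drop(target).abs().sort_values(ascending=False)
    selected = list(ranked.index[:n]) + [target]
    return corr.loc[selected, selected]


# Small made-up example frame
houses = pd.DataFrame(
    {
        "rooms": [1, 2, 3, 4, 5],
        "age": [5, 3, 6, 2, 4],
        "distance": [9, 8, 6, 5, 1],
        "price": [10, 20, 30, 40, 50],
    }
)
matrix = top_n_correlation_matrix(houses, target="price", n=2)
print(matrix)
```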
Future Work
categorical_excess_cardinality_flagger_and_reducer
- Option for instances where there are no categorical features
missing_value_conditioner
- Choose between either {use means from train set (default), calculate means for test set}
- Missing values for categorical features
- Implement missing value imputation: https://measuringu.com/handle-missing-data/
- Good reading: https://towardsdatascience.com/6-different-ways-to-compensate-for-missing-values-data-imputation-with-examples-6022d9ca0779
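The choice between train-set means and test-set means could look roughly like the sketch below. Since this is listed as future work, the function name `impute_numeric_means` and the `use_train_means` flag are hypothetical and only illustrate the idea for numeric features.

```python
import pandas as pd


def impute_numeric_means(train, test, use_train_means=True):
    """Fill missing numeric values with column means (illustrative sketch)."""
    numeric = train.select_dtypes(include="number").columns
    train_means = train[numeric].mean()
    # Default: reuse train-set means on the test set; otherwise use the test set's own means
    test_fill = train_means if use_train_means else test[numeric].mean()
    train_out, test_out = train.copy(), test.copy()
    train_out[numeric] = train[numeric].fillna(train_means)
    test_out[numeric] = test[numeric].fillna(test_fill)
    return train_out, test_out


train_df = pd.DataFrame({"area": [50.0, None, 70.0]})
test_df = pd.DataFrame({"area": [None, 100.0]})

_, test_from_train = impute_numeric_means(train_df, test_df)
_, test_from_test = impute_numeric_means(train_df, test_df, use_train_means=False)
print(test_from_train["area"].tolist(), test_from_test["area"].tolist())
```

Reusing train-set means keeps the test set from influencing its own preprocessing, which is presumably why it is the proposed default.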
ordinal_encoder
- Create a function to do this
timestamp_encoder
- is_public_holiday : bool
- Update above diagram
Remove warnings
Make robust to non-consecutive indices in input df
Unit test all functions
Extreme values
PCA option?
Label balanced class checker (for classification problems)
Distribution and correlation analysis
- Display correlation matrix for top `n` correlates alongside target at the bottom
- Display pairplot for top `n` correlates alongside target at the bottom
- Or, instead of top `n` correlates, use a threshold of cumulative variance
- Option to drop lower correlates (lower than threshold) if desired
Useful reading