Kesh Utils for Data science/EDA/Data preparation

Chart + Util = Chartil

Data visualization: a simple, single unified API for plotting and charting

During EDA and data preparation we rely on a small, fixed set of chart types to analyse the relationships among features. Some are simple, such as univariate charts; others are more complex, such as 3D plots or plots of more than three features.

This API is a simple, single entry point for plotting various types of relationships. It hides the technical and code details from the data science task and approach, which removes the difficulty of maintaining several APIs or libraries and avoids repeated code.

With this approach we need just one API call; the library decides the rest.

from KUtils.eda import chartil

chartil.plot(dataframe, [list of columns])
or
chartil.plot(dataframe, [list of columns], optional_settings={...})

Demo code:

Load the UCI heart disease dataset (download from here).

import pandas as pd

heart_disease_df = pd.read_csv('../input/uci/heart.csv')

Quick data preparation

column_to_convert_to_categorical = ['target', 'cp', 'fbs', 'exang', 'restecg', 'slope', 'ca', 'thal']
for col in column_to_convert_to_categorical:
    heart_disease_df[col] = heart_disease_df[col].astype('category')

heart_disease_df['age_bin'] = pd.cut(heart_disease_df['age'], [0, 32, 40, 50, 60, 70, 100], labels=['<32', '33-40', '41-50', '51-60', '61-70', '71+'])

heart_disease_df['sex'] = heart_disease_df['sex'].map({1:'Male', 0:'Female'})

heart_disease_df.info()

Heatmap

chartil.plot(heart_disease_df, heart_disease_df.columns) # Send all column names 

Heatmap Numerical

chartil.plot(heart_disease_df, heart_disease_df.columns, optional_settings={'include_categorical':True} ) 

Heatmap With categorical

chartil.plot(heart_disease_df, heart_disease_df.columns, optional_settings={'include_categorical':True, 'sort_by_column':'trestbps'} ) 

Heatmap With categorical and ordered by a column

# Force a heatmap when there are only a few columns; otherwise the tool picks a different chart
chartil.plot(heart_disease_df, ['chol', 'thalach', 'trestbps'], chart_type='heatmap') 

forced_heatmap

Uni-categorical

chartil.plot(heart_disease_df, ['target']) # Barchart as count plot 

Uni Categorical

Uni-Continuous

chartil.plot(heart_disease_df, ['age'])

Uni boxplot

chartil.plot(heart_disease_df, ['age'], chart_type='barchart') # Force a barchart on a continuous column by automatically creating 10 equal bins

Uni barchart_forced

chartil.plot(heart_disease_df, ['age'], chart_type='barchart', optional_settings={'no_of_bins':5}) # Create custom number of bins 

Uni uni_barchart_forced_custom_bin_size

chartil.plot(heart_disease_df, ['age'], chart_type='distplot') 

Uni distplot

Uni-categorical with optional_settings

chartil.plot(heart_disease_df, ['age_bin']) # Barchart as count plot

Uni categorical barchart

chartil.plot(heart_disease_df, ['age_bin'], optional_settings={'sort_by_value':True})

Uni categorical barchart sorted by value

chartil.plot(heart_disease_df, ['age_bin'], optional_settings={'sort_by_value':True, 'limit_bars_count_to':5})

Uni categorical barchart sorted by value, limited to 5 bars

Bi Category vs Category (& Univariate Segmented)

chartil.plot(heart_disease_df, ['sex', 'target'])

Bi Category

chartil.plot(heart_disease_df, ['sex', 'target'], chart_type='crosstab')

Bi Category

chartil.plot(heart_disease_df, ['sex', 'target'], chart_type='stacked_barchart')

Bi Category

Bi Continuous vs Continuous

chartil.plot(heart_disease_df, ['chol', 'thalach']) # Scatter plot

Bi Continuous scatter

Bi Continuous vs Category

chartil.plot(heart_disease_df, ['thalach', 'sex']) # Grouped box plot (Segmented univariate)

Bi continuous_catergory_box

chartil.plot(heart_disease_df, ['thalach', 'sex'], chart_type='distplot') # Distplot

Bi continuous_catergory_distplot

Multi 3 Continuous

chartil.plot(heart_disease_df, ['chol', 'thalach', 'trestbps']) # Colored 3D scatter plot

3 Continuous 3D

Multi 3 Categorical

chartil.plot(heart_disease_df, ['sex', 'age_bin', 'target']) # Paired barchart

3 paired_3d_grouped_barchart

Multi 2 Continuous, 1 Category

chartil.plot(heart_disease_df, ['chol', 'thalach', 'target']) # Scatter plot with colored groups 

Grouped Scatter plot

Multi 1 Continuous, 2 Category

chartil.plot(heart_disease_df, ['thalach', 'sex', 'target']) # Grouped boxplot

Grouped 1continuous_2category_boxplot

chartil.plot(heart_disease_df, ['thalach', 'sex', 'target'], chart_type='violinplot') # Grouped violin plot

Grouped 1continuous_2category_violinplot

Multi 3 Continuous, 1 category

chartil.plot(heart_disease_df, ['chol', 'thalach', 'trestbps', 'target']) # Group Color highlighted 3D plot

Grouped 3d_scatter

Multi 3 category, 2 Continuous

chartil.plot(heart_disease_df, ['sex','cp','target','thalach','trestbps']) # Paired scatter plot

Grouped Paired_3d_grouped_scatter
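
The defaults seen in the demo above can be summarised as a small lookup keyed on how many continuous and categorical columns are passed. The sketch below is only a reading aid built from the demo comments and captions; the names DEFAULT_CHART and guess_default_chart are hypothetical and this is not how chartil is implemented internally.

```python
# Hypothetical lookup: (continuous count, categorical count) -> default chart.
# Entries are read off the demo comments/captions above, not from chartil's source.
DEFAULT_CHART = {
    (0, 1): "barchart (count plot)",
    (1, 0): "boxplot",
    (2, 0): "scatter plot",
    (1, 1): "grouped boxplot",
    (3, 0): "3D scatter plot",
    (0, 3): "paired barchart",
    (2, 1): "grouped scatter plot",
    (1, 2): "grouped boxplot",
    (3, 1): "grouped 3D scatter plot",
    (2, 3): "paired scatter plot",
}


def guess_default_chart(column_kinds):
    """column_kinds: one 'continuous' or 'categorical' string per column."""
    n_continuous = column_kinds.count("continuous")
    n_categorical = column_kinds.count("categorical")
    # Combinations the demo does not cover are reported as such, not guessed.
    return DEFAULT_CHART.get((n_continuous, n_categorical), "not shown in the demo")


print(guess_default_chart(["continuous", "categorical", "categorical"]))
```

Passing chart_type explicitly, as in the forced heatmap and violinplot calls, overrides whatever default the library would pick.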

Full working demo available on kaggle here

Auto Linear Regression

AutoML tools such as H2O take a black-box approach to generating models.

During model building we usually try brute force, trial and error, and many feature combinations to arrive at the best model. Trying these possibilities manually is laborious. To overcome this, or at least to get a base model automatically, I developed this auto linear regression, which uses a backward feature elimination technique.

The library/package can be found here and source code here

How does Auto LR work?

We pass the cleaned dataset to autolr.fit(<>). The method will:

  • Treat categorical variables where applicable (dummy creation/one-hot encoding)
  • First model: run RFE (recursive feature elimination) on the dataset
  • Eliminate the remaining features by backward elimination, one feature at a time, based on:
    • a combination of VIF and coefficient p-values (eliminate the feature with the highest VIF and p-value combination)
    • VIF only (eliminate the feature with the highest VIF)
    • p-values only (eliminate the feature with the highest p-value)
  • Each time a feature is identified for elimination, build a new model and repeat the process
  • On every iteration, if adjusted R2 is affected significantly, re-add/retain the feature and select the next possible feature to eliminate
  • Repeat until the program can't proceed further with the above logic
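
The elimination loop described above can be sketched in a few lines. This is a minimal illustration, not the package's actual code: backward_eliminate, rank_candidates and adjusted_r2 are hypothetical names, and model fitting and VIF/p-value ranking are left to caller-supplied callbacks so the control flow stays visible.

```python
def backward_eliminate(features, rank_candidates, adjusted_r2, acceptable_r2_change=0.02):
    """Drop one feature at a time while adjusted R2 does not degrade too much.

    rank_candidates(features): hypothetical callback returning the features that
        qualify for removal, most removable first (e.g. by VIF and/or p-value).
    adjusted_r2(features): hypothetical callback returning the adjusted R2 of a
        model fitted on exactly those features.
    """
    selected = list(features)
    current_score = adjusted_r2(selected)
    while len(selected) > 1:
        for candidate in rank_candidates(selected):
            trial = [f for f in selected if f != candidate]
            trial_score = adjusted_r2(trial)
            if current_score - trial_score <= acceptable_r2_change:
                # Removal is acceptable: keep the smaller model and re-rank.
                selected, current_score = trial, trial_score
                break
            # Otherwise retain the candidate and try the next one.
        else:
            # No candidate could be removed without hurting the fit: stop.
            break
    return selected


# Toy callbacks: only x1 and x2 contribute to the (made-up) score.
contribution = {"x1": 0.5, "x2": 0.3}
toy_r2 = lambda feats: sum(contribution.get(f, 0.0) for f in feats)
toy_rank = lambda feats: sorted(feats)

print(backward_eliminate(["x1", "x2", "noise1", "noise2"], toy_rank, toy_r2))
# prints ['x1', 'x2']
```

In the toy run the two noise columns are dropped first; removing x1 or x2 would lower the score by more than acceptable_r2_change, so both are retained and the loop ends.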

Auto Linear Regression Package/Function details

The method autolr.fit() takes the following parameters:

  • df, (The full dataframe)
  • dependent_column, (Target column)
  • p_value_cutoff = 0.01, (Threshold p-value used to filter features during the backward elimination step, default 0.01)
  • vif_cutoff = 5, (Threshold VIF value used to filter features during the backward elimination step, default 5)
  • acceptable_r2_change = 0.02, (Limits degradation of model performance by capping the acceptable loss in R2, default 0.02)
  • scale_numerical = False, (Flag to scale numerical features using StandardScaler)
  • include_target_column_from_scaling = True, (Flag to indicate whether to include the target column in scaling)
  • dummies_creation_drop_column_preference='dropFirst', (Available options: dropFirst, dropMax, dropMin. Which column to drop when creating dummies for one-hot encoding)
  • train_split_size = 0.7, (Train/test split ratio to be used)
  • max_features_to_select = 0, (Number of features to be selected by RFE before entering auto backward elimination)
  • random_state_to_use=100, (Self-explanatory)
  • include_data_in_return = False, (Include the data generated/used in Auto LR, which might have gone through scaling, dummy creation, etc.)
  • verbose=False (Enable to print detailed debug messages)

The method returns a 'model_info' dictionary containing all the details used while performing the auto fit.
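
For reference, the quantities behind vif_cutoff and acceptable_r2_change are usually defined as below. These are the standard textbook definitions; the source does not state exactly how the package computes them, so treat the correspondence as an assumption. Here R_j^2 is the R2 obtained by regressing feature j on all the other features, n is the number of observations and p is the number of predictors.

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^2},
\qquad
R^2_{\mathrm{adj}} = 1 - \left(1 - R^2\right)\frac{n - 1}{n - p - 1}
```

Under this definition the default vif_cutoff of 5 corresponds to R_j^2 = 0.8, that is, a feature for which 80% of the variance is explained by the other features.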

Full working demo available on kaggle here

Clustered Linear Regression

In linear regression we normally try to fit one best model on the entire dataset. Often, however, the data behaves very differently depending on a particular feature, and a single model is not the best solution. Multiple models, each applied to a different subset or filtered slice of the data, can do better.

How do we find the feature that splits the dataset into multiple sub-datasets (and thereafter build and apply different models)?

There is no easy solution; instead, use trial and error or brute force to subset the data on different features and build multiple models. Clustered (or grouped) Linear Regression does exactly this. You send the entire dataset and specify the list of columns on which to split it; the method returns KPI measures such as RMSE or R2 for each split, and you then decide which way to go.

How does "Clustered Linear Regression" work?

  • First, it lists the possible combinations
  • For each combination, it splits the data into subsets
  • For each subset, it runs Auto Linear Regression (see the previous Kaggle post on this)
  • It returns summary or consolidated KPI measures at group level
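
The first two steps, listing combinations and splitting the data, can be sketched as follows. This is an illustration under stated assumptions rather than the package's code: enumerate_subsets is a hypothetical name, rows are plain dictionaries instead of a dataframe, combinations of every size from 1 up to max_level are assumed to be tried, and a subset is assumed to qualify when it has at least min_leaf rows. Binning of continuous grouping features is left out.

```python
from collections import defaultdict
from itertools import combinations


def enumerate_subsets(rows, feature_group_list, max_level=2, min_leaf=1000):
    """Yield (feature_combo, group_key, subset_rows) for each large-enough subset.

    rows: list of dicts, one per data point (stand-in for a dataframe).
    """
    for level in range(1, max_level + 1):
        for combo in combinations(feature_group_list, level):
            groups = defaultdict(list)
            for row in rows:
                # Group rows by the values they take on the chosen features.
                groups[tuple(row[f] for f in combo)].append(row)
            for key, subset in groups.items():
                if len(subset) >= min_leaf:
                    # A real run would fit a model on `subset` here.
                    yield combo, key, subset


# Toy data: 2 sexes x 2 chest-pain types, 3 rows each (12 rows in total).
toy_rows = [{"sex": s, "cp": c} for s in ("Male", "Female") for c in (0, 1) for _ in range(3)]
for combo, key, subset in enumerate_subsets(toy_rows, ["sex", "cp"], min_leaf=3):
    print(combo, key, len(subset))
```

Each yielded subset is where a per-group model would be fitted and its KPI recorded; raising min_leaf prunes the small two-feature groups first.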

The API clustlr.fit() takes the following parameters:

  • data_df (Full dataset)
  • feature_group_list (List of columns on which to filter and group the data)
  • dependent_column (The target column)
  • max_level = 2 (When it is 2, two-feature combinations are used to filter)
  • min_leaf_in_filtered_dataset=1000 (Minimum number of data points in a subgroup, below which autolr will not be executed)
  • no_of_bins_for_continuous_feature=10 (Number of bins to create when a continuous variable is used for grouping)
  • verbose (Use True if you want detailed debug/log messages)

Full working demo available on kaggle here

Auto Logistic Regression
