EDA on sparse data for classification problems
Sparse profile - EDA on sparse data
A module for performing EDA tasks on classification problems with sparse data.
Currently accepts only numeric values.
Sample usage
import pandas as pd
import numpy as np
from sparse_profile import sparse_profile
df = pd.DataFrame({
    'target': [1, 1, 1, 1, 0, 0, 0, 0, 1, 0],
    'col_1': [1, 0, 0, 0, 0, 0, 0, 0, 0, 9],
    'col_2': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
})
sProfile = sparse_profile(df, 'target')
print(sProfile.top_gain)
Output: maximum gain obtained from each column
col_2 0.422810
col_1 0.074882
dtype: float64
print(sProfile.report_sparsity)
Output: fraction of zeros in each column (0.8 means 80% of the values are zero)
col_1 0.8
col_2 0.1
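To make the sparsity figures above concrete, here is a minimal sketch that derives the same zero fractions with plain pandas. It does not call sparse_profile and is not taken from the package's source; the variable names `features` and `zero_fraction` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'target': [1, 1, 1, 1, 0, 0, 0, 0, 1, 0],
    'col_1': [1, 0, 0, 0, 0, 0, 0, 0, 0, 9],
    'col_2': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
})

# Leave the target out so only feature columns are profiled.
features = df.drop(columns='target')

# Comparing to zero gives a boolean frame; the column mean of booleans
# is the fraction of entries that are zero.
zero_fraction = (features == 0).mean()
print(zero_fraction)
```

col_1 has eight zeros out of ten values and col_2 has one, which is where 0.8 and 0.1 come from.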
Various sparse_profile reports can be accessed as attributes of the sparse_profile class object. List of all available attributes:
- report_sparsity: pandas DataFrame, fraction of zeros in each column
- report_distinct: pandas DataFrame, count of distinct non-zero values in each column
- report_overall: pandas DataFrame, overall summary of each column (similar to pandas describe())
- report_non_zero: pandas DataFrame, summary of each column after removing zeros
- gain_df: pandas DataFrame, relative information gain at decile cutoffs for each column with respect to the target column
- auc_df: pandas DataFrame, AUC of each column with respect to the target column
- top_gain: pandas DataFrame, columns sorted by the maximum gain obtained from gain_df
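The quantities behind these reports can be approximated with plain pandas and NumPy. The sketch below is not the package's implementation. The helper names (`entropy`, `relative_gain`, `rank_auc`) are hypothetical. The gain formula, (H(y) - H(y | x <= cutoff)) / H(y), is an assumed definition that sparse_profile may not share, so its values need not match the sample output above. AUC is computed from the Mann-Whitney U statistic, which is the standard rank formulation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'target': [1, 1, 1, 1, 0, 0, 0, 0, 1, 0],
    'col_1': [1, 0, 0, 0, 0, 0, 0, 0, 0, 9],
    'col_2': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
})
features = df.drop(columns='target')
y = df['target'].to_numpy()

# Count of distinct non-zero values per column.
distinct_non_zero = features.apply(lambda s: s[s != 0].nunique())

# describe()-style summary after dropping zeros, one row per column.
non_zero_summary = features.apply(lambda s: s[s != 0].describe()).T


def entropy(labels):
    """Shannon entropy in bits of an array of non-negative integer labels."""
    p = np.bincount(labels) / len(labels)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())


def relative_gain(x, labels, cutoff):
    """Assumed definition: (H(y) - H(y | x <= cutoff)) / H(y).

    Requires a target with more than one class, otherwise H(y) is zero.
    """
    left = x <= cutoff
    h_cond = sum(m.mean() * entropy(labels[m]) for m in (left, ~left) if m.any())
    return (entropy(labels) - h_cond) / entropy(labels)


def rank_auc(x, labels):
    """AUC from the Mann-Whitney U statistic; ties receive average ranks."""
    ranks = pd.Series(x).rank().to_numpy()
    n_pos = int((labels == 1).sum())
    n_neg = int((labels == 0).sum())
    u = ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)


# Gain at the 10th..90th percentile cutoffs of each column.
deciles = np.linspace(0.1, 0.9, 9)
gain_table = pd.DataFrame({
    col: [relative_gain(features[col].to_numpy(), y, c)
          for c in np.quantile(features[col], deciles)]
    for col in features
}, index=np.round(deciles, 1))

auc = pd.Series({col: rank_auc(features[col].to_numpy(), y) for col in features})

print(distinct_non_zero, non_zero_summary, gain_table.max(), auc, sep='\n\n')
```

Taking the column-wise maximum of `gain_table` and sorting it mirrors the idea of top_gain: each column is ranked by the best split found among its decile cutoffs.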