SPD's collection of EDA tools
For use on pandas dataframes
DataframeStructure
Use the DataframeStructure class for quick structural analysis on a pandas dataframe.
from spd_eda import DataframeStructure
DataframeStructure(df)
Arguments:
- df: dataframe to be reviewed
Class Methods:
- thread_summary(thread): pass a "thread" (i.e. list of columns) to see the distribution of row counts within each thread
- thread_examples(thread, recs_per_thread, num_examples): return example threads with the specified number of records per thread, for further investigation
Sample usage:
from spd_eda import DataframeStructure
# define the object
df_struct = DataframeStructure(df)
# see record count distribution by THREAD
THREAD = ['Policy']
df_struct.thread_summary(THREAD)
# see examples for specified distributions
df_struct.thread_examples(THREAD, recs_per_thread=3, num_examples=5)
Binning Methods
Use the return_binned_series function to return a binned version of a pandas series.
Function arguments:
- s: pandas series
- bin_info: specifies how the binning should be done (see the binning options below)
- fallback_text (optional): for categorical binning, specify the "fallback" values
Bin-info options:
- n-tiles: Roughly evenly spaced: bin_info = ('num_ntiles', 5) would try to create 5 bins of equal size (e.g. similar to pd.cut())
- percentiles: Specify the bin percentages: bin_info = ('pctile_list', [0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0]) would try to create bins with 10/15/25/25/15/10 volumes (e.g. similar to pd.qcut() with specified percentiles)
- fixed cutpoints: Specify the LEFT end-points of intervals (inclusive): bin_info = ('fixed_cutpoints', [0, 39, 40, 43]) would try to create bins beginning at 0, 39, 40 and 43 (e.g. similar to pd.cut() with specified bins & right=False)
- categorical values: Specify the stand-alone values: bin_info = ('cat_stand_alone_values', ['A', 'B', 'C']) would keep A-C as stand-alone values & all other values would be combined as "everything else"
Return Value: A new Series using the binned values. Note that the new Series has the same name as the original series. Users will need to be careful when using the results to avoid ambiguity with the original (un-binned) series.
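Sample usage (a minimal sketch: it assumes return_binned_series is imported from spd_eda like the classes above and that the series and bin_info are passed positionally, as described):
from spd_eda import return_binned_series
import pandas as pd
# series of ages to bin
s = pd.Series([1, 5, 12, 40, 41, 55], name='age')
# fixed cutpoints: bins begin at 0, 39, 40 and 43 (left end-points, inclusive)
binned = return_binned_series(s, ('fixed_cutpoints', [0, 39, 40, 43]))
# the binned series keeps the original name 'age', so rename it to avoid ambiguity
binned = binned.rename('age_binned')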
Peril Analysis
Use the PerilAnalysis class to interactively explore data using custom aggregation functions.
Arguments:
- df: dataframe
- column_list_to_consider: list of columns to analyze (typically a subset of the dataframe's columns)
- peril_abbr: abbreviation for the peril
- peril_cc: name of claim count column
- peril_il: name of loss amount column
- stat_function: name of aggregation function to use in analysis
- stat_function_base_cols: list of columns that the aggregation function requires that aren't already passed to stat_function as arguments
- signal_metric_col: column produced by stat_function that is the primary metric of interest (used for signal calculation and y-axis charting)
- signal_volume_col: column produced by stat_function that is the weighting value (used for signal calculation and y-axis charting)
Methods:
- export_peril_exhibits(name, col_type_dict, col_cond_format_list): class method to export results to excel file
- inspect_ftr(ftr_name): class method to inspect results of a particular column in the analysis
- update_ftr(ftr_name): class method to UPDATE the results of a particular column, usually by modifying the binning. This will update the underlying stats/charts & would be reflected in subsequent exports to excel.
Sample usage:
from spd_eda import PerilAnalysis
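# NOTE: stat_function must be user-defined; the sketch below is only a hypothetical
# illustration -- its signature and column names are assumptions, not part of spd_eda.
import pandas as pd

def my_stats_fcn(df):
    # return one row of aggregated stats, including the metric/volume columns
    # referenced by signal_metric_col / signal_volume_col further down
    return pd.Series({
        'house_years': df['Exposure'].sum(),
        'Frequency_Test': df['CLM_ClaimCount_Test'].sum() / df['Exposure'].sum(),
        'Avg_Premium': df['Premium'].mean(),
    })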
# 1) define the object
analysis = PerilAnalysis(
df = df,
column_list_to_consider = ['C1', 'C2', 'C3'],
# columns related to the specific peril
peril_abbr = 'Test',
peril_cc = 'CLM_ClaimCount_Test', # claim count column for the peril
peril_il = 'CLM_Incurred_Test', # loss column for the peril
stat_function = my_stats_fcn, # aggregation function
stat_function_base_cols = ['Policy', 'Exposure', 'Premium'], # columns that aggregation function refers to
signal_metric_col = 'Frequency_Test', # desired column (output of aggregation function) for signal calculations (also used for plotting line chart)
signal_volume_col='house_years', # desired column (output of aggregation function) for signal weighting (i.e. volume). Also used for bar charts.
)
# 2) write results to excel
# optional number formatting for the excel file
col_type_dict = {
'house_years': '#,##0',
'Frequency_Test': '0.00%',
'Avg_Premium': '#,##0'
}
# conditional formatting
col_cond_format_list = ['house_years', 'Frequency_Test']
# export it
analysis.export_peril_exhibits("TestExport.xlsx", col_type_dict=col_type_dict, col_cond_format_list=col_cond_format_list)
# 3) inspect individual variable results
analysis.inspect_ftr('C1')
# 4) If desired, UPDATE variable results by changing binning logic
analysis.update_ftr('C1', ftr_bin_info=('fixed_cutpoints', [0,1,2,3,4,5]))
# 5) re-write results to excel to incorporate the update
analysis.export_peril_exhibits("TestExport_updated.xlsx", col_type_dict=col_type_dict, col_cond_format_list=col_cond_format_list)
DataframeEda
Use the DataframeEda class for quick analysis on a pandas dataframe.
from spd_eda import DataframeEda
DataframeEda(
df, agg_fcn,
control_var_list=[], ord_bin_threshold=100, ord_bin_count=5, cat_value_limit=30
)
Arguments:
- df: dataframe to be reviewed
- agg_fcn: user-defined aggregation function that can be applied to the dataframe
- control_var_list (optional): list of additional one-way variables for control totals (defaults to empty list)
- ord_bin_threshold (optional): maximum number of distinct numeric values before bucketing will occur (default=100)
- ord_bin_count (optional): if ordinal bucketing occurs, determines the number of pd.cut() buckets (default=5)
- cat_value_limit (optional): maximum number of distinct categorical values before bucketing will occur (default=30)
- calculate_signals (optional): boolean for whether to calculate signals (based on agg_fcn) for each variable
- signal_weight_col (optional): if signals are calculated, this is the column to use for weighting. If omitted, will use the first column from the agg_fcn
- calculate_cramers_v (optional): boolean for whether to calculate Cramer's V against each of the control variables (from control_var_list)
- start_with_user_provided_summary_df (optional): boolean to specify if the user will provide the summary dataframe. If False, will use pd.describe()
- user_provided_summary_df (optional): if the user provides the summary dataframe, the column names MUST be in the index (see the sketch below)
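For illustration, a minimal sketch of a user-provided summary dataframe (the summary columns shown are hypothetical; the key point is that the column names of the reviewed dataframe sit in the index; df and my_agg_fcn are as in the sample usage further below):
import pandas as pd
# one row per column of the reviewed dataframe; column names in the index
user_summary = pd.DataFrame(
    {'dtype': ['int64', 'float64'], 'pct_null': [0.0, 0.02]},
    index=['fid', 'building_a']
)
eda_obj = DataframeEda(
    df, my_agg_fcn,
    start_with_user_provided_summary_df=True,
    user_provided_summary_df=user_summary
)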
ExcelExport
Use the ExcelExport class to write the results of the DataframeEda into excel.
from spd_eda import ExcelExport
ExcelExport(
filename, control_df_list, col_summary_df, var_info_dict,
col_type_dict={}, col_cond_format_list=[]
)
Arguments:
- filename: name of excel file to be created (e.g. 'sample_output.xlsx')
- control_df_list: a list of dataframes to include on the 'control' tab of the output
- col_summary_df: single dataframe with one record per column... the first column should be the variable name
- var_info_dict: dictionary where keys are column names & values are one-way stats for the variable (using the agg function)
- col_type_dict (optional): dictionary to control number formatting in excel
- col_cond_format_list (optional): list of columns to conditionally format in the excel
Sample usage:
from spd_eda import DataframeEda, ExcelExport
# define an aggregation function
def my_agg_fcn(df):
    return df.agg({'fid': 'count', 'building_a': 'mean'})
# create EDA object
eda_obj = DataframeEda(df, my_agg_fcn, control_var_list=['city'])
# formatting options
col_type_dict = {
'fid': '#,##0',
'building_a': '#,##0',
}
col_cond_format_list = ['fid']
# export it
ExcelExport('EDA_station_16_addresses.xlsx',
eda_obj.control_df_list, eda_obj.var_summary, eda_obj.var_info_dict,
col_type_dict=col_type_dict, col_cond_format_list=col_cond_format_list)
For use on PA/Predict entities
Must be running from a machine with database access.
DataViewEda
Use the DataViewEda class for quick analysis on a data view in the Predict software.
DataViewEda(
db_name, dv_name,
secondary_var_list=[], bin_override_dict={}
)
Arguments:
- db_name: database name
- dv_name: data view name
- secondary_var_list (optional): list of secondary fields that will be part of two-way comparisons for all data elements
- bin_override_dict (optional): dictionary of binning overrides (default={})
Once the object is instantiated, use the export_summary() method to create formatted output in excel.
export_summary(
filename, obj_to_format,
stats_formatting=INS_STATS_FORMATTING_DICT
)
Arguments:
- filename: name of excel file to be created (e.g. 'dv_summary.xlsx')
- obj_to_format: list of columns to conditionally format (e.g. ['EExp', 'LR'])
- stats_formatting (optional): dictionary of column formats for the output. Has sensible defaults already, but the format is like this: {'Freq': '0.00', 'Sev': '#,##0', 'LR': '0.00%'}
Sample usage:
from spd_eda import DataViewEda
test_dv_eda = DataViewEda(
'GuidewireResearchCM_Insight',
'eda_testing',
secondary_var_list=['ax_year', 'LOSSSTATE'],
)
test_dv_eda.export_summary("DataViewEDA_eda_testing.xlsx", obj_to_format=['LR'])
AnalysisSummary
The AnalysisSummary class supplements the analysis that comes from the Predict software, producing a file of the form ModelSummary_<analysis_id>.xlsx.
AnalysisSummary(
db_name, analysis_id,
geo_col='LOSSSTATE', naics_col='LOB',
objectives_to_format=[], stats_formatting=INS_STATS_FORMATTING_DICT
)
Arguments:
- db_name: database name
- analysis_id: analysis ID
- geo_col: column name already included in the data view to use for geography (will be used in two-way exhibits)
- naics_col: column name already included in the data view to use for naics (can be anything... will just be used for two-ways)
- objectives_to_format (optional): list of columns to conditionally format (e.g. ['EExp', 'LR'])
- stats_formatting (optional): dictionary of column formats for the output. Has sensible defaults already, but the format is like this: {'Freq': '0.00', 'Sev': '#,##0', 'LR': '0.00%'}
Sample usage:
from spd_eda import AnalysisSummary
AnalysisSummary(
db_name='GuidewireCooperativeModelsClaimsHO_Insight',
analysis_id='711b0a41-a49b-46df-8dff-68dd04776520',
geo_col='LOSSSTATE', naics_col='LOB', objectives_to_format=['Freq']
)
Plinko
The Plinko class can be used to trace an individual policy through a Boosting model from Predict.
Note that the Plinko class references the Analysis class, which is briefly defined below.
Analysis(db_name, analysis_id)
Arguments:
- db_name: database name
- analysis_id: analysis ID
Plinko(model, policyno)
Arguments:
- model: an Analysis object corresponding to a Boosting model
- policyno: a specific POLICYNO value from the original modeling dataset (i.e. needs the model variable for the model)
Properties:
- plinko_df: dataframe that traces the specific example through all the iterations of the boosting model
Sample usage:
from spd_eda import Analysis, Plinko
# create Analysis object
model = Analysis(
db_name='GuidewireCooperativeModelsClaimsHO_Insight',
analysis_id='711b0a41-a49b-46df-8dff-68dd04776520'
)
# create specific plinko object
plinko_obj = Plinko(model, '63_155334')
# review the trace file
plinko_obj.plinko_df
For use on Athena tables
Must have appropriate AWS credentials to connect.
AthenaTable
Use the AthenaTable class for doing quick analysis on tables in AWS Athena.
AthenaTable(db_name, tbl_name)
Arguments:
- db_name: database name
- tbl_name: table name
Available properties:
- row_count: integer value with number of rows in the table
- information_schema: dataframe with metadata on the columns in the table
- col_info: dictionary of key: value pairs, where key is the column name & value is a dataframe of (binned) record counts
Available methods:
- get_sample_records(num_rows=10, filter_string="1=1", col_subset_list=[]) - returns dataframe
- get_row_count(filter_string="1=1") - returns row count, note the optional filter
- get_custom_counts(custom_expression) - returns record counts based on the provided expression
- get_records_per_thread_summary(thread_list, filter_string="1=1") - defines a "thread" as a set of columns and provides the value distribution of # records that exist within each thread (useful for finding keys)
- get_thread_examples_with_specified_num_records(thread_list, records_per_thread, num_thread_examples=1, filter_string="1=1") - returns dataframe with examples of "threads" with the desired number of "records-per-thread". Useful in conjunction with get_records_per_thread_summary()
- get_column_info() - populates the col_info attribute... can take a long time to run.
- write_summary(filename) - creates excel file summarizing the table
Sample Usage:
from spd_eda import AthenaTable
EDA_osha_accident_injury_raw = AthenaTable('assess_db', 'osha_accident_injury_raw')
# distributions by column & expression
EDA_osha_accident_injury_raw.get_custom_counts("degree_of_inj")
EDA_osha_accident_injury_raw.get_custom_counts("SUBSTRING (load_dt, 1, 4)")
# thread hunting
EDA_osha_accident_injury_raw.get_records_per_thread_summary(['summary_nr'])
EDA_osha_accident_injury_raw.get_records_per_thread_summary(['summary_nr', 'injury_line_nr'])
EDA_osha_accident_injury_raw.get_thread_examples_with_specified_num_records(['summary_nr', 'injury_line_nr'], 2, num_thread_examples=3)
# generating excel summary
EDA_osha_accident_injury_raw.get_column_info()
EDA_osha_accident_injury_raw.write_summary('EDA_osha_accident_injury_raw.xlsx')