
SPD's collection of EDA tools

For use on pandas dataframes


DataframeStructure

Use the DataframeStructure class for quick structural analysis on a pandas dataframe.

from spd_eda import DataframeStructure

DataframeStructure(df)

Arguments:

  • df: dataframe to be reviewed

Class Methods:

  • thread_summary(thread): pass a "thread" (i.e. a list of columns) to see the distribution of row counts within each thread
  • thread_examples(thread, recs_per_thread, num_examples): return example threads that have the specified number of records per thread, for further investigation

Sample usage:

from spd_eda import DataframeStructure

# define the object
df_struct = DataframeStructure(df)

# see record count distribution by THREAD
THREAD = ['Policy']
df_struct.thread_summary(THREAD)

# see examples for specified distributions
df_struct.thread_examples(THREAD, recs_per_thread=3, num_examples=5)

Binning Methods

Use the return_binned_series function to return a binned version of a pandas series

Function arguments:

  • s: pandas series
  • bin_info: dictionary specifying how the binning should be done. See below for binning options
  • fallback_text (optional): for categorical binning, the label applied to values that are not kept stand-alone (e.g. "everything else")

Bin-info options:

  • n-tiles: roughly equal-volume bins: bin_info = ('num_ntiles', 5) would try to create 5 bins of roughly equal volume (e.g. similar to pd.qcut() with q=5)
  • percentiles: Specify the bin percentages: bin_info = ('pctile_list', [0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0]) would try to create bins with 10/15/25/25/15/10 volumes (e.g. similar to pd.qcut() with specified percentiles)
  • fixed cutpoints: Specify the LEFT end-points of intervals (inclusive): bin_info = ('fixed_cutpoints', [0, 39, 40, 43]) would try to create bins beginning at 0, 39, 40 and 43 (e.g. similar to pd.cut() with specified bins & right=False)
  • categorical values: specify the stand-alone values: bin_info = ('cat_stand_alone_values', ['A', 'B', 'C']) would keep A-C as stand-alone values & all other values would be combined as "everything else"

Return Value: A new Series using the binned values. Note that the new Series has the same name as the original series. Users will need to be careful when using the results to avoid ambiguity with the original (un-binned) series.
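
Sample usage (a minimal sketch; assumes return_binned_series is importable from spd_eda, and the series values are purely illustrative):

import pandas as pd
from spd_eda import return_binned_series

s = pd.Series([8, 12, 35, 39, 40, 41, 44, 50], name='age')

# quantile-style binning into 5 roughly equal-volume bins
binned = return_binned_series(s, bin_info=('num_ntiles', 5))

# fixed cutpoints: bins begin at 0, 39, 40 and 43 (LEFT end-points, inclusive)
binned = return_binned_series(s, bin_info=('fixed_cutpoints', [0, 39, 40, 43]))

# categorical: keep 'A' & 'B' stand-alone; everything else gets the fallback label
cat = pd.Series(['A', 'B', 'C', 'D', 'A'], name='grade')
binned_cat = return_binned_series(cat, bin_info=('cat_stand_alone_values', ['A', 'B']), fallback_text='everything else')

# the result keeps the original series name, so rename before joining back
binned_cat = binned_cat.rename('grade_binned')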


PerilAnalysis

Use the PerilAnalysis class to interactively explore data using a custom aggregation function.

Arguments:

  • df: dataframe
  • column_list_to_consider: list of columns to analyze (typically a subset of the dataframe's columns)
  • peril_abbr: abbreviation for the peril
  • peril_cc: name of claim count column
  • peril_il: name of loss amount column
  • stat_function: aggregation function to use in the analysis
  • stat_function_base_cols: list of columns that aggregation function requires that aren't already passed to stat_function as arguments
  • signal_metric_col: column produced by stat_function that is the primary metric of interest (used for signal calculation and y-axis charting)
  • signal_volume_col: column produced by stat_function that is the weighting value (used for signal weighting and bar-chart volume)

Methods:

  • export_peril_exhibits(name, col_type_dict, col_cond_format_list): method to export results to an excel file
  • inspect_ftr(ftr_name): method to inspect the results for a particular column in the analysis
  • update_ftr(ftr_name, ftr_bin_info): method to UPDATE the results of a particular column, usually by modifying the binning (ftr_bin_info accepts the same bin_info options described under Binning Methods). This will update the underlying stats/charts & will be reflected in subsequent exports to excel.

Sample usage:

from spd_eda import PerilAnalysis
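
# Example aggregation function (illustrative only; any user-defined function
# works, provided it produces the columns referenced below as
# signal_metric_col / signal_volume_col):
import pandas as pd

def my_stats_fcn(df):
    house_years = df['Exposure'].sum()
    return pd.Series({
        'house_years': house_years,
        'Frequency_Test': df['CLM_ClaimCount_Test'].sum() / house_years,
        'Avg_Premium': df['Premium'].mean(),
    })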

# 1) define the object
analysis = PerilAnalysis(
    df=df,
    column_list_to_consider=['C1', 'C2', 'C3'],

    # columns related to the specific peril
    peril_abbr='Test',
    peril_cc='CLM_ClaimCount_Test',  # claim count column for the peril
    peril_il='CLM_Incurred_Test',  # loss column for the peril
    stat_function=my_stats_fcn,  # aggregation function
    stat_function_base_cols=['Policy', 'Exposure', 'Premium'],  # columns that aggregation function refers to
    signal_metric_col='Frequency_Test',  # output column used for signal calculations & the line chart
    signal_volume_col='house_years',  # output column used for signal weighting (i.e. volume) & the bar chart
)

# 2) write results to excel
# optional number formatting for the excel file
col_type_dict = {
    'house_years': '#,##0',
    'Frequency_Test': '0.00%',
    'Avg_Premium': '#,##0'
}

# conditional formatting
col_cond_format_list = ['house_years', 'Frequency_Test']

# export it
analysis.export_peril_exhibits("TestExport.xlsx", col_type_dict=col_type_dict, col_cond_format_list=col_cond_format_list)

# 3) inspect individual variable results
analysis.inspect_ftr('C1')

# 4) If desired, UPDATE variable results by changing binning logic
analysis.update_ftr('C1', ftr_bin_info=('fixed_cutpoints', [0,1,2,3,4,5]))

# 5) re-write results to excel to incorporate the update
analysis.export_peril_exhibits("TestExport_updated.xlsx", col_type_dict=col_type_dict, col_cond_format_list=col_cond_format_list)




DataframeEda

Use the DataFrameEda class for quick analysis on a pandas dataframe.

from spd_eda import DataframeEda

DataframeEda(
    df, agg_fcn,
    control_var_list=[], ord_bin_threshold=100, ord_bin_count=5, cat_value_limit=30
    )

Arguments:

  • df: dataframe to be reviewed
  • agg_fcn: user-defined aggregation function that can be applied to the dataframe
  • control_var_list (optional): list of additional one-way variables for control totals (defaults to empty list)
  • ord_bin_threshold (optional): maximum number of distinct numeric values before bucketing will occur (default=100)
  • ord_bin_count (optional): If ordinal bucketing occurs, determines number of pd.cut() buckets (default=5)
  • cat_value_limit (optional): maximum number of distinct categorical values before bucketing will occur (default=30)
  • calculate_signals (optional): boolean for whether to calculate signals (based on agg_fcn) for each variable
  • signal_weight_col (optional): if signals are calculated, the column to use for weighting. If omitted, the first column of the agg_fcn output is used
  • calculate_cramers_v (optional): boolean for whether to calculate Cramér's V against each of the control variables (from control_var_list)
  • start_with_user_provided_summary_df (optional): boolean to specify whether the user will provide the summary dataframe. If False, pd.describe() is used
  • user_provided_summary_df (optional): if the user provides a summary dataframe, column names MUST be in the index
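
A sketch of construction with the optional signal & Cramér's V arguments (illustrative; my_agg_fcn and the column names follow the sample usage below):

eda_obj = DataframeEda(
    df, my_agg_fcn,
    control_var_list=['city'],
    calculate_signals=True,
    signal_weight_col='fid',  # assumption: a column produced by my_agg_fcn
    calculate_cramers_v=True,
)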

ExcelExport

Use the ExcelExport class to write the results of the DataframeEda into excel.

from spd_eda import ExcelExport

ExcelExport(
    filename, control_df_list, col_summary_df, var_info_dict,
    col_type_dict={}, col_cond_format_list=[]
    )

Arguments:

  • filename: name of excel file to be created (e.g. 'sample_output.xlsx')
  • control_df_list: a list of dataframes to include on the 'control' tab of the output
  • col_summary_df: single dataframe with one record per column... first column should be the variable name
  • var_info_dict: dictionary where keys are column names & values are one-way stats for the variable (using the agg function)
  • col_type_dict (optional): dictionary to control number formatting in excel
  • col_cond_format_list (optional): list of columns to conditionally format in the excel

Sample usage:

from spd_eda import DataframeEda, ExcelExport

# define an aggregation function
def my_agg_fcn(df):
    return df.agg({'fid': 'count', 'building_a': 'mean',})

# create EDA object
eda_obj = DataframeEda(df, my_agg_fcn, control_var_list=['city'])

# formatting options
col_type_dict = {
    'fid': '#,##0',
    'building_a': '#,##0',
}
col_cond_format_list = ['fid']

# export it
ExcelExport('EDA_station_16_addresses.xlsx',
            eda_obj.control_df_list, eda_obj.var_summary, eda_obj.var_info_dict,
            col_type_dict=col_type_dict, col_cond_format_list=col_cond_format_list)

For use on PA/Predict entities

Must be running from a machine with database access.

DataViewEda

Use the DataViewEda class for quick analysis on a data view in the Predict software.

DataViewEda(
    db_name, dv_name,
    secondary_var_list=[], bin_override_dict={}
    )

Arguments:

  • db_name: database name
  • dv_name: data view name
  • secondary_var_list (optional): list of secondary fields that will be part of two-way comparisons for all data elements
  • bin_override_dict (optional): dictionary of binning overrides (default={})

Once the object is instantiated, use the export_summary() method to create formatted output in excel.

export_summary(
    filename, obj_to_format,
    stats_formatting=INS_STATS_FORMATTING_DICT
    )

Arguments:

  • filename: name of excel file to be created (e.g. 'dv_summary.xlsx')
  • obj_to_format: list of columns to conditionally format (e.g. ['EExp', 'LR'])
  • stats_formatting (optional): dictionary of column formats for the output. Has sensible defaults already, but format is like this: {'Freq': '0.00', 'Sev': '#,##0', 'LR': '0.00%'}

Sample usage:

from spd_eda import DataViewEda

test_dv_eda = DataViewEda(
        'GuidewireResearchCM_Insight',
        'eda_testing',
        secondary_var_list=['ax_year', 'LOSSSTATE'],
        )

test_dv_eda.export_summary("DataViewEDA_eda_testing.xlsx", obj_to_format=['LR'])       

AnalysisSummary

The AnalysisSummary class supplements the analysis that comes from the Predict software, producing a file of the form ModelSummary_<analysis_id>.xlsx.

AnalysisSummary(
    db_name, analysis_id,
    geo_col='LOSSSTATE', naics_col='LOB',
    objectives_to_format=[], stats_formatting=INS_STATS_FORMATTING_DICT
    )

Arguments:

  • db_name: database name
  • analysis_id: analysis ID
  • geo_col: column name already included in the data view to use for geography (will be used in two-way exhibits)
  • naics_col: column name already included in the data view to use for naics (can be anything... will just be used for two-ways)
  • objectives_to_format (optional): list of columns to conditionally format (e.g. ['EExp', 'LR'])
  • stats_formatting (optional): dictionary of column formats for the output. Has sensible defaults already, but format is like this: {'Freq': '0.00', 'Sev': '#,##0', 'LR': '0.00%'}

Sample usage:

from spd_eda import AnalysisSummary

AnalysisSummary(
    db_name='GuidewireCooperativeModelsClaimsHO_Insight',
    analysis_id='711b0a41-a49b-46df-8dff-68dd04776520',
    geo_col='LOSSSTATE', naics_col='LOB', objectives_to_format=['Freq']
    )      

Plinko

The Plinko class can be used to trace an individual policy through a Boosting model from Predict.

Note that the Plinko class references the Analysis class, which is briefly defined below.

Analysis(db_name, analysis_id)

Arguments:

  • db_name: database name
  • analysis_id: analysis ID

Plinko(model, policyno)

Arguments:

  • model: an Analysis object corresponding to a Boosting model
  • policyno: a specific POLICYNO value from the original modeling dataset (i.e. the policy must appear in the dataset used to build the model)

Properties:

  • plinko_df: dataframe that traces the specific example through all the iterations of the boosting model

Sample usage:

from spd_eda import Analysis, Plinko

# create Analysis object
model = Analysis(
    db_name='GuidewireCooperativeModelsClaimsHO_Insight',
    analysis_id='711b0a41-a49b-46df-8dff-68dd04776520'
    )

# create specific plinko object
plinko_obj = Plinko(model, '63_155334')

# review the trace file
plinko_obj.plinko_df      

For use on Athena tables

Must have appropriate AWS credentials to connect.

AthenaTable

Use the AthenaTable class for quick analysis on tables in AWS Athena.

AthenaTable(db_name, tbl_name)

Arguments:

  • db_name: database name
  • tbl_name: table name

Available properties:

  • row_count: integer value with number of rows in the table
  • information_schema: dataframe with metadata on the columns in the table
  • col_info: dictionary of key: value pairs, where the key is a column name & the value is a dataframe of (binned) record counts

Available methods:

  • get_sample_records(num_rows=10, filter_string="1=1", col_subset_list=[]) - returns dataframe
  • get_row_count(filter_string="1=1") - returns row count, note the optional filter
  • get_custom_counts(custom_expression) - returns record counts based on the provided expression
  • get_records_per_thread_summary(thread_list, filter_string="1=1") - defining a "thread" as a set of columns, returns the distribution of the number of records within each thread (useful for finding keys)
  • get_thread_examples_with_specified_num_records(thread_list, records_per_thread, num_thread_examples=1, filter_string="1=1") - returns dataframe with examples of "threads" with the desired number of "records-per-thread". Useful in conjunction with get_records_per_thread_summary()
  • get_column_info() - This populates the col_info attribute... can take a long time to run.
  • write_summary(filename) - creates excel file summarizing the table

Sample Usage:

from spd_eda import AthenaTable

EDA_osha_accident_injury_raw = AthenaTable('assess_db', 'osha_accident_injury_raw')
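
# sample records & filtered row counts (the filter expression is illustrative SQL)
EDA_osha_accident_injury_raw.get_sample_records(num_rows=5, col_subset_list=['summary_nr', 'degree_of_inj'])
EDA_osha_accident_injury_raw.get_row_count(filter_string="degree_of_inj IS NOT NULL")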

# distributions by column & expression
EDA_osha_accident_injury_raw.get_custom_counts("degree_of_inj")
EDA_osha_accident_injury_raw.get_custom_counts("SUBSTRING (load_dt, 1, 4)")

# thread hunting
EDA_osha_accident_injury_raw.get_records_per_thread_summary(['summary_nr'])
EDA_osha_accident_injury_raw.get_records_per_thread_summary(['summary_nr', 'injury_line_nr'])
EDA_osha_accident_injury_raw.get_thread_examples_with_specified_num_records(['summary_nr', 'injury_line_nr'], 2, num_thread_examples=3)

# generating excel summary
EDA_osha_accident_injury_raw.get_column_info()
EDA_osha_accident_injury_raw.write_summary('EDA_osha_accident_injury_raw.xlsx')
