SPD's collection of EDA tools
For use on pandas dataframes
DataframeStructure
Use the DataframeStructure class for quick structural analysis on a pandas dataframe.
from spd_eda import DataframeStructure
DataframeStructure(df)
Arguments:
- df: dataframe to be reviewed
Class Methods:
- thread_summary(thread): pass a "thread" (i.e. list of columns) to see the distribution of row counts within each thread
- thread_examples(thread, recs_per_thread, num_examples): return example threads with the specified number of records per thread, for further investigation
Sample usage:
from spd_eda import DataframeStructure
# define the object
df_struct = DataframeStructure(df)
# see record count distribution by THREAD
THREAD = ['Policy']
df_struct.thread_summary(THREAD)
# see examples for specified distributions
df_struct.thread_examples(THREAD, recs_per_thread=3, num_examples=5)
Binning Methods
Use the return_binned_series function to return a binned version of a pandas series.
Function arguments:
- s: pandas series
- bin_info: specifies how the binning should be done (see the binning options below)
- fallback_text (optional): for categorical binning, specify the "fallback" values
Bin-info options:
- n-tiles: Roughly evenly spaced: bin_info = ('num_ntiles', 5) would try to create 5 bins of equal size (e.g. similar to pd.cut())
- percentiles: Specify the bin percentages: bin_info = ('pctile_list', [0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0]) would try to create bins with 10/15/25/25/15/10 volumes (e.g. similar to pd.qcut() with specified percentiles)
- fixed cutpoints: Specify the LEFT end-points of intervals (inclusive): bin_info = ('fixed_cutpoints', [0, 39, 40, 43]) would try to create bins beginning at 0, 39, 40 and 43 (e.g. similar to pd.cut() with specified bins & right=False)
- categorical values: Specify the stand-alone values: bin_info = ('cat_stand_alone_values', ['A', 'B', 'C']) would keep A-C as stand-alone values & all other values would be combined as "everything else"
Return Value: A new Series using the binned values. Note that the new Series has the same name as the original series. Users will need to be careful when using the results to avoid ambiguity with the original (un-binned) series.
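Sample usage (a minimal sketch: it assumes return_binned_series is imported from spd_eda like the classes above and that the series and bin_info are passed positionally, as described):
from spd_eda import return_binned_series
import pandas as pd
# series of ages to bin
s = pd.Series([1, 5, 12, 40, 41, 55], name='age')
# fixed cutpoints: bins begin at 0, 39, 40 and 43 (left end-points, inclusive)
binned = return_binned_series(s, ('fixed_cutpoints', [0, 39, 40, 43]))
# the binned series keeps the original name 'age', so rename it to avoid ambiguity
binned = binned.rename('age_binned')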
Peril Analysis
Use the PerilAnalysis class to interactively explore data using custom aggregation functions.
Arguments:
- df: dataframe
- column_list_to_consider: list of columns to analyze (typically a subset of the dataframe's columns)
- peril_abbr: abbreviation for the peril
- peril_cc: name of claim count column
- peril_il: name of loss amount column
- stat_function: name of aggregation function to use in analysis
- stat_function_base_cols: list of columns that the aggregation function requires that aren't already passed to stat_function as arguments
- signal_metric_col: column produced by stat_function that is the primary metric of interest (used for signal calculation and y-axis charting)
- signal_volume_col: column produced by stat_function that is the weighting value (used for signal calculation and y-axis charting)
Methods:
- export_peril_exhibits(name, col_type_dict, col_cond_format_list): class method to export results to excel file
- inspect_ftr(ftr_name): class method to inspect results of a particular column in the analysis
- update_ftr(ftr_name): class method to UPDATE the results of a particular column, usually by modifying the binning. This will update the underlying stats/charts & would be reflected in subsequent exports to excel.
Sample usage:
from spd_eda import PerilAnalysis
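# NOTE: stat_function must be user-defined; the sketch below is only a hypothetical
# illustration -- its signature and column names are assumptions, not part of spd_eda.
import pandas as pd

def my_stats_fcn(df):
    # return one row of aggregated stats, including the metric/volume columns
    # referenced by signal_metric_col / signal_volume_col further down
    return pd.Series({
        'house_years': df['Exposure'].sum(),
        'Frequency_Test': df['CLM_ClaimCount_Test'].sum() / df['Exposure'].sum(),
        'Avg_Premium': df['Premium'].mean(),
    })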
# 1) define the object
analysis = PerilAnalysis(
df = df,
column_list_to_consider = ['C1', 'C2', 'C3'],
# columns related to the specific peril
peril_abbr = 'Test',
peril_cc = 'CLM_ClaimCount_Test', # claim count column for the peril
peril_il = 'CLM_Incurred_Test', # loss column for the peril
stat_function = my_stats_fcn, # aggregation function
stat_function_base_cols = ['Policy', 'Exposure', 'Premium'], # columns that aggregation function refers to
signal_metric_col = 'Frequency_Test', # desired column (output of aggregation function) for signal calculations (also used for plotting line chart)
signal_volume_col='house_years', # desired column (output of aggregation function) for signal weighting (i.e. volume). Also used for bar charts.
)
# 2) write results to excel
# optional number formatting for the excel file
col_type_dict = {
'house_years': '#,##0',
'Frequency_Test': '0.00%',
'Avg_Premium': '#,##0'
}
# conditional formatting
col_cond_format_list = ['house_years', 'Frequency_Test']
# export it
analysis.export_peril_exhibits("TestExport.xlsx", col_type_dict=col_type_dict, col_cond_format_list=col_cond_format_list)
# 3) inspect individual variable results
analysis.inspect_ftr('C1')
# 4) If desired, UPDATE variable results by changing binning logic
analysis.update_ftr('C1', ftr_bin_info=('fixed_cutpoints', [0,1,2,3,4,5]))
# 5) re-write results to excel to incorporate the update
analysis.export_peril_exhibits("TestExport_updated.xlsx", col_type_dict=col_type_dict, col_cond_format_list=col_cond_format_list)
DataframeEda
Use the DataframeEda class for quick analysis on a pandas dataframe.
from spd_eda import DataframeEda
DataframeEda(
df, agg_fcn,
control_var_list=[], ord_bin_threshold=100, ord_bin_count=5, cat_value_limit=30
)
Arguments:
- df: dataframe to be reviewed
- agg_fcn: user-defined aggregation function that can be applied to the dataframe
- control_var_list (optional): list of additional one-way variables for control totals (defaults to empty list)
- ord_bin_threshold (optional): maximum number of distinct numeric values before bucketing will occur (default=100)
- ord_bin_count (optional): if ordinal bucketing occurs, determines the number of pd.cut() buckets (default=5)
- cat_value_limit (optional): maximum number of distinct categorical values before bucketing will occur (default=30)
- calculate_signals (optional): boolean for whether to calculate signals (based on agg_fcn) for each variable
- signal_weight_col (optional): if signals are calculated, this is the column to use for weighting. If omitted, will use the first column from the agg_fcn
- calculate_cramers_v (optional): boolean for whether to calculate Cramer's V against each of the control variables (from control_var_list)
- start_with_user_provided_summary_df (optional): boolean to specify if the user will provide the summary dataframe. If False, will use pd.describe()
- user_provided_summary_df (optional): if the user provides the summary dataframe, the column names MUST be in the index (see the sketch below)
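For illustration, a minimal sketch of a user-provided summary dataframe (the summary columns shown are hypothetical; the key point is that the column names of the reviewed dataframe sit in the index; df and my_agg_fcn are as in the sample usage further below):
import pandas as pd
# one row per column of the reviewed dataframe; column names in the index
user_summary = pd.DataFrame(
    {'dtype': ['int64', 'float64'], 'pct_null': [0.0, 0.02]},
    index=['fid', 'building_a']
)
eda_obj = DataframeEda(
    df, my_agg_fcn,
    start_with_user_provided_summary_df=True,
    user_provided_summary_df=user_summary
)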
ExcelExport
Use the ExcelExport class to write the results of the DataframeEda into excel.
from spd_eda import ExcelExport
ExcelExport(
filename, control_df_list, col_summary_df, var_info_dict,
col_type_dict={}, col_cond_format_list=[]
)
Arguments:
- filename: name of excel file to be created (e.g. 'sample_output.xlsx')
- control_df_list: a list of dataframes to include on the 'control' tab of the output
- col_summary_df: single dataframe with one record per column... the first column should be the variable name
- var_info_dict: dictionary where keys are column names & values are one-way stats for the variable (using the agg function)
- col_type_dict (optional): dictionary to control number formatting in excel
- col_cond_format_list (optional): list of columns to conditionally format in the excel
Sample usage:
from spd_eda import DataframeEda, ExcelExport
# define an aggregation function
def my_agg_fcn(df):
    return df.agg({'fid': 'count', 'building_a': 'mean'})
# create EDA object
eda_obj = DataframeEda(df, my_agg_fcn, control_var_list=['city'])
# formatting options
col_type_dict = {
'fid': '#,##0',
'building_a': '#,##0',
}
col_cond_format_list = ['fid']
# export it
ExcelExport('EDA_station_16_addresses.xlsx',
eda_obj.control_df_list, eda_obj.var_summary, eda_obj.var_info_dict,
col_type_dict=col_type_dict, col_cond_format_list=col_cond_format_list)
For use on PA/Predict entities
Must be running from a machine with database access.
DataViewEda
Use the DataViewEda class for quick analysis on a data view in the Predict software.
DataViewEda(
db_name, dv_name,
secondary_var_list=[], bin_override_dict={}
)
Arguments:
- db_name: database name
- dv_name: data view name
- secondary_var_list (optional): list of secondary fields that will be part of two-way comparisons for all data elements
- bin_override_dict (optional): dictionary of binning overrides (default={})
Once the object is instantiated, use the export_summary() method to create formatted output in excel.
export_summary(
filename, obj_to_format,
stats_formatting=INS_STATS_FORMATTING_DICT
)
Arguments:
- filename: name of excel file to be created (e.g. 'dv_summary.xlsx')
- obj_to_format: list of columns to conditionally format (e.g. ['EExp', 'LR'])
- stats_formatting (optional): dictionary of column formats for the output. Has sensible defaults already, but the format is like this: {'Freq': '0.00', 'Sev': '#,##0', 'LR': '0.00%'}
Sample usage:
from spd_eda import DataViewEda
test_dv_eda = DataViewEda(
'GuidewireResearchCM_Insight',
'eda_testing',
secondary_var_list=['ax_year', 'LOSSSTATE'],
)
test_dv_eda.export_summary("DataViewEDA_eda_testing.xlsx", obj_to_format=['LR'])
AnalysisSummary
The AnalysisSummary class supplements the analysis that comes from the Predict software, producing a file of the form ModelSummary_<analysis_id>.xlsx.
AnalysisSummary(
db_name, analysis_id,
geo_col='LOSSSTATE', naics_col='LOB',
objectives_to_format=[], stats_formatting=INS_STATS_FORMATTING_DICT
)
Arguments:
- db_name: database name
- analysis_id: analysis ID
- geo_col: column name already included in the data view to use for geography (will be used in two-way exhibits)
- naics_col: column name already included in the data view to use for naics (can be anything... will just be used for two-ways)
- objectives_to_format (optional): list of columns to conditionally format (e.g. ['EExp', 'LR'])
- stats_formatting (optional): dictionary of column formats for the output. Has sensible defaults already, but the format is like this: {'Freq': '0.00', 'Sev': '#,##0', 'LR': '0.00%'}
Sample usage:
from spd_eda import AnalysisSummary
AnalysisSummary(
db_name='GuidewireCooperativeModelsClaimsHO_Insight',
analysis_id='711b0a41-a49b-46df-8dff-68dd04776520',
geo_col='LOSSSTATE', naics_col='LOB', objectives_to_format=['Freq']
)
Plinko
The Plinko class can be used to trace an individual policy through a Boosting model from Predict.
Note that the Plinko class references the Analysis class, which is briefly defined below.
Analysis(db_name, analysis_id)
Arguments:
- db_name: database name
- analysis_id: analysis ID
Plinko(model, policyno)
Arguments:
- model: an Analysis object corresponding to a Boosting model
- policyno: a specific POLICYNO value from the original modeling dataset (i.e. needs the model variable for the model)
Properties:
- plinko_df: dataframe that traces the specific example through all the iterations of the boosting model
Sample usage:
from spd_eda import Analysis, Plinko
# create Analysis object
model = Analysis(
db_name='GuidewireCooperativeModelsClaimsHO_Insight',
analysis_id='711b0a41-a49b-46df-8dff-68dd04776520'
)
# create specific plinko object
plinko_obj = Plinko(model, '63_155334')
# review the trace file
plinko_obj.plinko_df
For use on Athena tables
Must have appropriate AWS credentials to connect.
AthenaTable
Use the AthenaTable class for doing quick analysis on tables in AWS Athena.
AthenaTable(db_name, tbl_name)
Arguments:
- db_name: database name
- tbl_name: table name
Available properties:
- row_count: integer value with number of rows in the table
- information_schema: dataframe with metadata on the columns in the table
- col_info: dictionary of key: value pairs, where key is the column name & value is a dataframe of (binned) record counts
Available methods:
- get_sample_records(num_rows=10, filter_string="1=1", col_subset_list=[]) - returns dataframe
- get_row_count(filter_string="1=1") - returns row count, note the optional filter
- get_custom_counts(custom_expression) - returns record counts based on the provided expression
- get_records_per_thread_summary(thread_list, filter_string="1=1") - defines a "thread" as a set of columns and provides the value distribution of # records that exist within each thread (useful for finding keys)
- get_thread_examples_with_specified_num_records(thread_list, records_per_thread, num_thread_examples=1, filter_string="1=1") - returns dataframe with examples of "threads" with the desired number of "records-per-thread". Useful in conjunction with get_records_per_thread_summary()
- get_column_info() - populates the col_info attribute... can take a long time to run.
- write_summary(filename) - creates excel file summarizing the table
Sample Usage:
from spd_eda import AthenaTable
EDA_osha_accident_injury_raw = AthenaTable('assess_db', 'osha_accident_injury_raw')
# distributions by column & expression
EDA_osha_accident_injury_raw.get_custom_counts("degree_of_inj")
EDA_osha_accident_injury_raw.get_custom_counts("SUBSTRING (load_dt, 1, 4)")
# thread hunting
EDA_osha_accident_injury_raw.get_records_per_thread_summary(['summary_nr'])
EDA_osha_accident_injury_raw.get_records_per_thread_summary(['summary_nr', 'injury_line_nr'])
EDA_osha_accident_injury_raw.get_thread_examples_with_specified_num_records(['summary_nr', 'injury_line_nr'], 2, num_thread_examples=3)
# generating excel summary
EDA_osha_accident_injury_raw.get_column_info()
EDA_osha_accident_injury_raw.write_summary('EDA_osha_accident_injury_raw.xlsx')