Compute metrics across combinations of features. Stop clicking around in Tableau.
Project description
Overview
tldr: Do you hate trying to break down which underlying trends or movements are driving topline metric movements? HSA can solve that.
Hot Spot Analysis (HSA) is an analytic reporting framework that removes statistical ambiguity. HSA is meant to enhance reporting, surface insights, and make it easy to dig into why metrics have shifted. It does this by automatically running all viable cuts of the data across the provided features for any metric.
Future updates plan to add the following functionality:
- multiprocessing to improve module calculation speed
- support for non-dataframe user functions (graphs, etc.)
Short Theoretical Demonstration:
If we have 3 columns [a, b, c] and want to cut our data using those columns, we have to group the data by every combination of them to see each interaction's impact on our metric of interest. This problem grows quickly as we increase the number of columns.
Using 3 columns: [a, b, c] -> 7 valid data cuts
- @ depth = 1: [a,b,c] <- 3 data cuts
- @ depth = 2: [ab,ac,bc] <- 3 data cuts
- @ depth = 3: [abc] <- 1 data cut
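The enumeration above can be reproduced with the standard library. This is a minimal sketch, not the package's own code; the helper name `enumerate_cuts` and its `depth_limit` argument are assumptions made for illustration.

```python
from itertools import combinations


def enumerate_cuts(columns, depth_limit=None):
    """Return every column combination up to depth_limit, keyed by depth.

    Hypothetical helper for illustration; not part of hot_spot_analysis.
    """
    max_depth = len(columns) if depth_limit is None else min(depth_limit, len(columns))
    return {
        depth: [list(combo) for combo in combinations(columns, depth)]
        for depth in range(1, max_depth + 1)
    }


cuts = enumerate_cuts(["a", "b", "c"])
for depth, combos in cuts.items():
    print(f"depth {depth}: {combos}")

# Total number of data cuts across all depths: 3 + 3 + 1 = 7
total = sum(len(combos) for combos in cuts.values())
print(total)
```

With no depth limit, n columns give 2**n - 1 cuts, so ten columns already produce 1,023 groupings.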
HSA Output Data Structure:
index | depth | data_cuts | data_content | data_cut_content | user function output |
---|---|---|---|---|---|
1 | 1 | [column a] | [row_value x] | ['a:x'] | [Int/float/etc.] |
2 | 1 | [column b] | [row_value y] | ['b:y'] | [Int/float/etc.] |
3 | 1 | [column c] | [row_value z] | ['c:z'] | [Int/float/etc.] |
4 | 2 | [Columns a, b] | [row_value x, y] | ['a:x', 'b:y'] | [Int/float/etc.] |
5 | 2 | [Columns a, c] | [row_value x, z] | ['a:x', 'c:z'] | [Int/float/etc.] |
6 | 2 | [Columns b, c] | [row_value y, z] | ['b:y', 'c:z'] | [Int/float/etc.] |
7 | 3 | [Columns a, b, c] | [row_value x, y, z] | ['a:x', 'b:y', 'c:z'] | [Int/float/etc.] |
Note: Each data cut yields one row per combination of unique values in its columns. Thus 'ab' would yield N * M rows in the output, where column a has N unique values and column b has M unique values.
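The row-count arithmetic can be checked on a small frame. The sketch below uses toy data invented for illustration and plain pandas only. Note that a plain pandas `groupby` on non-categorical columns returns only the combinations that actually occur, so N * M acts as an upper bound there; whether HSA pads missing combinations is not stated here.

```python
import pandas as pd

# Toy data invented for illustration only
toy = pd.DataFrame({
    "a": ["x1", "x1", "x2", "x2", "x3"],
    "b": ["y1", "y2", "y1", "y1", "y2"],
})

n_unique_a = toy["a"].nunique()  # N: unique values in column a
n_unique_b = toy["b"].nunique()  # M: unique values in column b
upper_bound = n_unique_a * n_unique_b  # N * M possible combinations

# Number of (a, b) combinations that actually appear in the data
observed = toy.groupby(["a", "b"]).ngroups

print(f"N={n_unique_a}, M={n_unique_b}, N*M={upper_bound}, observed={observed}")
```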
An Example:
Using the Titanic dataset from seaborn, we can look at a semi-practical example.
survived | class | adult_male | embark_town |
---|---|---|---|
0 | Third | True | Southampton |
1 | First | False | Cherbourg |
1 | First | False | Southampton |
0 | Third | True | Queenstown |
(one row for each of the 891 passengers on the Titanic)
A Simple Example Using hot_spot_analysis:
import numpy as np
import pandas as pd
import seaborn as sb
from hot_spot_analysis.hot_spot_analysis import HotSpotAnalysis
# Load our data
df = sb.load_dataset('titanic')
titanic = df[['survived', 'class', 'adult_male', 'embark_town']]
# Define our metric function
def survival_rate(data):
    temp = data.agg(survival_rate = pd.NamedAgg('survived', np.mean))
    return temp
# Input our data cuts, depth limit, and data
hsa = HotSpotAnalysis(
    data_cuts=['class', 'adult_male', 'embark_town'],
    depth_limit=3,
    data=titanic
)
# Run the hot spot analysis
hsa.run_hsa(survival_rate)
# Export the data
hsa_output = hsa.get_hsa_data() # export the analysis results
# Review some of the features
print(hsa_output.head())
print(hsa_output.tail())
# Or use some of the built in search features
hsa.search_hsa_data(
    target_var='data_content',
    search_terms='Southampton'
)
A (mostly) pandas example without hot_spot_analysis:
Does using hot_spot_analysis actually make life that much easier? YES.
Compare the following, which computes the same cuts by hand:
import numpy as np
import pandas as pd
import seaborn as sb
df = sb.load_dataset('titanic')
titanic = df[['survived', 'class', 'adult_male', 'embark_town']]
def survival_rate(data):
    temp = data.agg(survival_rate = pd.NamedAgg('survived', np.mean))
    return temp
titanic_by_class = survival_rate(titanic.groupby('class'))
titanic_by_adult_male = survival_rate(titanic.groupby('adult_male'))
titanic_by_embark_town = survival_rate(titanic.groupby('embark_town'))
titanic_by_class_adult_male = survival_rate(titanic.groupby(['class', 'adult_male']))
titanic_by_class_embark_town = survival_rate(titanic.groupby(['class', 'embark_town']))
titanic_by_adult_male_embark_town = survival_rate(titanic.groupby(['adult_male', 'embark_town']))
titanic_by_all = survival_rate(titanic.groupby(['class', 'adult_male', 'embark_town']))
# Combine the data frames
dfs = [
    titanic_by_class,
    titanic_by_adult_male,
    titanic_by_embark_town,
    titanic_by_class_adult_male,
    titanic_by_class_embark_town,
    titanic_by_adult_male_embark_town,
    titanic_by_all
]
# np.nan (lowercase) is used because the np.NaN alias was removed in NumPy 2.0
all_df = pd.concat(dfs, join='outer', axis=1).fillna(np.nan)
# Review some of the features
print(all_df.head())
print(all_df.tail())
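The seven hand-written groupbys can also be generated in a loop. The sketch below is a pandas-only approximation of the long output layout described earlier; it is not how hot_spot_analysis is implemented. The helper name `all_cuts_long` and the string formatting of its `data_cuts` and `data_cut_content` columns are assumptions, and the four-row frame is just the sample rows shown above so the snippet runs without downloading the dataset.

```python
from itertools import combinations

import pandas as pd


def survival_rate(grouped):
    # Same metric as above, using the string alias 'mean'
    return grouped.agg(survival_rate=pd.NamedAgg("survived", "mean"))


def all_cuts_long(data, cut_columns, metric, depth_limit=None):
    """Apply `metric` to every column combination and stack the results.

    Hypothetical helper for illustration; not part of hot_spot_analysis.
    """
    max_depth = len(cut_columns) if depth_limit is None else depth_limit
    frames = []
    for depth in range(1, max_depth + 1):
        for combo in combinations(cut_columns, depth):
            cols = list(combo)
            # observed=True keeps only category combinations present in the data
            result = metric(data.groupby(cols, observed=True)).reset_index()
            # Collapse the group keys into one 'col:value, col:value' label per row
            result["data_cut_content"] = [
                ", ".join(f"{c}:{v}" for c, v in zip(cols, row))
                for row in result[cols].itertuples(index=False, name=None)
            ]
            result["depth"] = depth
            result["data_cuts"] = ", ".join(cols)
            frames.append(result.drop(columns=cols))
    return pd.concat(frames, ignore_index=True)


# The four sample rows from the table above
sample = pd.DataFrame({
    "survived": [0, 1, 1, 0],
    "class": ["Third", "First", "First", "Third"],
    "adult_male": [True, False, False, True],
    "embark_town": ["Southampton", "Cherbourg", "Southampton", "Queenstown"],
})

out = all_cuts_long(sample, ["class", "adult_male", "embark_town"], survival_rate)
print(out[["depth", "data_cuts", "data_cut_content", "survival_rate"]])
```

Stacking the results in long form, one row per cut value, avoids aligning frames with different indexes column-wise, which is what makes the hand-written version awkward to read.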
Hashes for hot_spot_analysis-0.1.4-py3-none-any.whl
Algorithm | Hash digest |
---|---|
SHA256 | 80c2e1a6bd99a8aea740bb98911e042aebb6a084fee4faf9836e57587804a33d |
MD5 | d87640351f146e27e3d4912b31f7900a |
BLAKE2b-256 | f34d5a28fa0432e1618d466445f39d79fea0cc4aaae55bbb7bb3d3b8daeeeed0 |