Compute metrics across combinations of features. Stop clicking around in Tableau.

Project description

Overview

tldr: Do you hate trying to break down which underlying trends or movements are driving topline metric movements? HSA can solve that.

Hot Spot Analysis (HSA) is an analytic reporting framework that removes statistical ambiguity. HSA is meant to enhance reporting, surface insights, and make it easy to dig into why metrics have shifted. It does this by automatically running every viable cut of the data across the provided features for any metric.

Future updates plan to add the following functionality:

multiprocessing to improve module calculation speed
support for non-dataframe user functions (graphs, etc.)

Short Theoretical Demonstration:

If we have 3 columns [a, b, c] and want to cut our data using those columns, we have to group the data by every combination of them to know each interaction's impact on our metric of interest. The problem grows quickly as the number of columns increases.

Using 3 columns: [a, b, c] -> 7 valid data cuts

@ depth = 1: [a,b,c] <- 3 data cuts
@ depth = 2: [ab,ac,bc] <- 3 data cuts
@ depth = 3: [abc] <- 1 data cut
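
The enumeration above can be reproduced with the standard library. This is a minimal sketch for illustration, not the package's implementation; `enumerate_cuts` is a hypothetical helper name, and its `depth_limit` argument only mirrors the idea of the `depth_limit` parameter shown later.

```python
from itertools import combinations

def enumerate_cuts(columns, depth_limit=None):
    """Return every non-empty combination of columns, keyed by depth.

    Hypothetical helper for illustration; not part of hot_spot_analysis.
    """
    max_depth = len(columns) if depth_limit is None else min(depth_limit, len(columns))
    return {
        depth: list(combinations(columns, depth))
        for depth in range(1, max_depth + 1)
    }

cuts = enumerate_cuts(["a", "b", "c"])
for depth, combos in cuts.items():
    # Join the column names so the output matches the notation above
    print(depth, ["".join(c) for c in combos])

total = sum(len(combos) for combos in cuts.values())
print(total)  # 7, i.e. 2**3 - 1
```

With n columns and no depth limit there are 2**n - 1 cuts, so the count roughly doubles with every added column.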

HSA Output Data Structure:

index	depth	data_cuts	data_content	data_cut_content	user function output
1	1	[column a]	[row_value x]	['a:x']	[Int/float/etc.]
2	1	[column b]	[row_value y]	['b:y']	[Int/float/etc.]
3	1	[column c]	[row_value z]	['c:z']	[Int/float/etc.]
4	2	[columns a, b]	[row_value x, y]	['a:x', 'b:y']	[Int/float/etc.]
5	2	[columns a, c]	[row_value x, z]	['a:x', 'c:z']	[Int/float/etc.]
6	2	[columns b, c]	[row_value y, z]	['b:y', 'c:z']	[Int/float/etc.]
7	3	[columns a, b, c]	[row_value x, y, z]	['a:x', 'b:y', 'c:z']	[Int/float/etc.]

Note: Each data cut yields one row per combination of values in its columns. A single column with N unique values yields N rows. If column a has N unique values and column b has M unique values, the cut 'ab' yields up to N*M rows.
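
To make the output structure and the row counts concrete, here is a small pure-Python sketch that groups toy records by every column combination and applies a user function to each group. It is an illustration under stated assumptions, not HSA's internal code: the records, `mean_metric`, and `run_cuts` are made up for this example, and the `result` key stands in for the "user function output" column.

```python
from collections import defaultdict
from itertools import combinations

# Toy records; column names and values are invented for illustration.
records = [
    {"a": "x1", "b": "y1", "metric": 1},
    {"a": "x1", "b": "y2", "metric": 0},
    {"a": "x2", "b": "y1", "metric": 1},
    {"a": "x2", "b": "y2", "metric": 1},
    {"a": "x2", "b": "y2", "metric": 0},
]

def mean_metric(rows):
    """Example user function: average of the 'metric' field."""
    return sum(r["metric"] for r in rows) / len(rows)

def run_cuts(records, columns, func):
    """Group records by every column combination and apply func to each group."""
    output = []
    for depth in range(1, len(columns) + 1):
        for cut in combinations(columns, depth):
            # Bucket the records by their values in the cut's columns
            groups = defaultdict(list)
            for row in records:
                groups[tuple(row[c] for c in cut)].append(row)
            for values, rows in sorted(groups.items()):
                output.append({
                    "depth": depth,
                    "data_cuts": list(cut),
                    "data_content": list(values),
                    "data_cut_content": [f"{c}:{v}" for c, v in zip(cut, values)],
                    "result": func(rows),
                })
    return output

table = run_cuts(records, ["a", "b"], mean_metric)
for row in table:
    print(row)
```

Here columns a and b each have 2 unique values, so the depth-1 cuts contribute 2 + 2 rows and the 'ab' cut contributes 2 * 2 = 4 rows, for 8 rows in total.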

An Example:

Using the Titanic data from seaborn, we can look at a semi-practical example.

survived	class	adult_male	embark_town
0	Third	True	Southampton
1	First	False	Cherbourg
1	First	False	Southampton
0	Third	True	Queenstown
(one row for each of the 891 passengers on the Titanic)

A Simple Example Using hot_spot_analysis:

import numpy as np
import pandas as pd
import seaborn as sb
from hot_spot_analysis.hot_spot_analysis import HotSpotAnalysis

# Load our data
df = sb.load_dataset('titanic')
titanic = df[['survived', 'class',  'adult_male', 'embark_town']]

# Define our metric function
def survival_rate(data):
    temp = data.agg(survival_rate = pd.NamedAgg('survived', np.mean))
    return temp

# Input our data cuts, depth limit, and data
hsa = HotSpotAnalysis(
    data_cuts=['class',  'adult_male', 'embark_town'],
    depth_limit = 3,
    data = titanic
)

# Run the hot spot analysis
hsa.run_hsa(survival_rate)

# Export the data
hsa_output = hsa.get_hsa_data() # export the analysis results

# Review some of the features
print(hsa_output.head())
print(hsa_output.tail())

# Or use some of the built in search features
hsa.search_hsa_data(
    target_var = 'data_content', 
    search_terms = 'Southampton'
    )

A (mostly) pandas example without hot_spot_analysis:

Does using hot_spot_analysis actually make life that much easier? YES.

For comparison, the following computes the same cuts by hand:

import numpy as np
import pandas as pd
import seaborn as sb

df = sb.load_dataset('titanic')
titanic = df[['survived', 'class',  'adult_male', 'embark_town']]

def survival_rate(data):
    temp = data.agg(survival_rate = pd.NamedAgg('survived', np.mean))
    return temp

titanic_by_class = survival_rate(titanic.groupby('class'))
titanic_by_adult_male = survival_rate(titanic.groupby('adult_male'))
titanic_by_embark_town = survival_rate(titanic.groupby('embark_town'))
titanic_by_class_adult_male = survival_rate(titanic.groupby(['class', 'adult_male']))
titanic_by_class_embark_town = survival_rate(titanic.groupby(['class', 'embark_town']))
titanic_by_adult_male_embark_town = survival_rate(titanic.groupby(['adult_male', 'embark_town']))
titanic_by_all = survival_rate(titanic.groupby(['class', 'adult_male', 'embark_town']))

# Combine the data frames
dfs = [
    titanic_by_class,
    titanic_by_adult_male,
    titanic_by_embark_town,
    titanic_by_class_adult_male,
    titanic_by_class_embark_town,
    titanic_by_adult_male_embark_town,
    titanic_by_all
]

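# reset_index() turns each cut's group keys into columns so the results stack row-wise;
# keys that are not part of a given cut are left as NaN
all_df = pd.concat([d.reset_index() for d in dfs], ignore_index=True)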

# Review some of the features
print(all_df.head())
print(all_df.tail())

Release history

Version	Release date
1.0.4.1	Jul 14, 2024
1.0.4	Jul 7, 2024
1.0.4b0 (pre-release)	Jul 7, 2024
1.0.4a0 (pre-release)	May 28, 2024
1.0.2	May 19, 2024
0.1.4 (this version)	Feb 28, 2023

Download files

Source Distribution

hot_spot_analysis-0.1.4.tar.gz (8.9 kB)

Uploaded Feb 28, 2023 Source

Built Distribution

hot_spot_analysis-0.1.4-py3-none-any.whl (8.6 kB)

Uploaded Feb 28, 2023 Python 3

Hashes for hot_spot_analysis-0.1.4.tar.gz

Algorithm	Hash digest
SHA256	`b8f8e28451b695dbfd4edc22db8e92a7b27aa3a9f1a8630e3966e8f16618fab2`
MD5	`c7bf2708463d066f7aee834f7c093447`
BLAKE2b-256	`f07d7c5b059af9319295531a8231846704332199b02ed57c33d6d815e4cee859`

Hashes for hot_spot_analysis-0.1.4-py3-none-any.whl

Algorithm	Hash digest
SHA256	`80c2e1a6bd99a8aea740bb98911e042aebb6a084fee4faf9836e57587804a33d`
MD5	`d87640351f146e27e3d4912b31f7900a`
BLAKE2b-256	`f34d5a28fa0432e1618d466445f39d79fea0cc4aaae55bbb7bb3d3b8daeeeed0`