Dimtributor: a Python package for automatic dimension attribution

For a data analyst, explaining fluctuations in indicators is part of the daily work. Dimension drill-down is a common way to do it: it finds the dimension most likely responsible for the fluctuation.
Done by hand, different analysts may reach different conclusions, and the work takes a lot of time, especially when there are many dimensions or many indicators to deal with.
This Python package uses JS divergence (Jensen-Shannon divergence) with a decision tree to do the dimension drill-down.
Because the tree is built with a fixed algorithm, you get the same conclusion no matter how many times you run it. With just a few lines of code, it can save you a lot of time on this routine work of explaining indicator fluctuations.

BASIC IDEAS

Suppose your most important indicator, DAU, dropped 30%, and you want to find which specific value of which dimension explains most of the decline.

Say you have only two dimensions, city and channel. Every city dropped exactly 30%, but among channels, channel A decreased 80% while channels B and C increased 50%. In this case, we would blame the channel dimension for the drop in DAU.

Here we are actually comparing what really happened with what should have happened; a dimension deserves the blame if it is a big "surprise". We measure this "surprise" with JS divergence, since it tells how large the difference is between two distributions (the distribution that really happened and the one that should have happened).
In this example, we should split the tree on the channel dimension.
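
To make the "surprise" concrete, here is a minimal sketch in plain Python. It assumes the "should have happened" distribution is each value's share in the base period and the "really happened" distribution is its share in the current period; the package's actual computation may differ. The DAU figures and the helper names (`js_divergence`, `surprise`) are made up for illustration, chosen so that the total drops 30% as in the example above.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two non-negative vectors.

    Both vectors are normalized to sum to 1 before comparison.
    """
    total_p, total_q = sum(p), sum(q)
    p = [x / total_p for x in p]
    q = [x / total_q for x in q]
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Terms with a == 0 contribute nothing to the KL divergence.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def surprise(base, this):
    """Divergence between the base-period and current-period shares."""
    keys = sorted(base)
    return js_divergence([this[k] for k in keys], [base[k] for k in keys])


# Hypothetical DAU per dimension value; the total goes from 1300 to 910 (-30%).
base_city = {"X": 650, "Y": 650}
this_city = {"X": 455, "Y": 455}          # every city drops 30%
base_channel = {"A": 800, "B": 300, "C": 200}
this_channel = {"A": 160, "B": 450, "C": 300}  # A -80%, B and C +50%

city_surprise = surprise(base_city, this_city)
channel_surprise = surprise(base_channel, this_channel)

print("city:", city_surprise)
print("channel:", channel_surprise)
```

With these made-up numbers the city shares do not move at all, so the city divergence is zero, while the channel divergence is clearly positive. Channel is therefore the dimension to split on.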

But on which specific value should we split? For each value of channel (A, B, C), we calculate the surprise with JS divergence after excluding that value. Say we get a JS divergence of 0.1 when we exclude channel A, 0.3 for B, and 0.2 for C; then we should split the decision tree at channel A, because the surprise is smallest after A is excluded.
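
The value selection can be sketched in the same way. This is an illustrative reading of the rule described here, not the package's own code: for each channel value, drop it, recompute the JS divergence on the remaining values, and pick the value whose exclusion leaves the smallest divergence. The DAU numbers and function names are hypothetical.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) after normalizing both vectors."""
    total_p, total_q = sum(p), sum(q)
    p = [x / total_p for x in p]
    q = [x / total_q for x in q]
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical DAU per channel: A -80%, B and C +50%.
base = {"A": 800, "B": 300, "C": 200}
this = {"A": 160, "B": 450, "C": 300}

# Remaining surprise after excluding each value in turn.
scores = {}
for excluded in base:
    rest = [k for k in sorted(base) if k != excluded]
    scores[excluded] = js_divergence(
        [this[k] for k in rest], [base[k] for k in rest]
    )

# Split at the value whose exclusion leaves the smallest surprise.
best = min(scores, key=scores.get)
print(scores)
print("split at:", best)
```

Here, once A is excluded, B and C keep the same shares relative to each other, so no surprise remains and A is selected as the split value.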

That is basically all the theory behind this package. A rate indicator (numerator/denominator, such as a conversion rate, for example CTR = click/impression) needs slightly different handling, but it can be reduced to the same problem. For a dimension such as channel with two values, A and B, we can rewrite
CTR
= click / impression
= (click_A + click_B) / (impression_A + impression_B)
= click_A/impression_A * impression_A/(impression_A + impression_B) + click_B/impression_B * impression_B/(impression_A + impression_B)
= CTR_A * impression_share_A + CTR_B * impression_share_B.
In this way, a rate problem is converted into a quantity problem.
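
The identity can be checked numerically. The sketch below uses made-up click and impression counts for two channels and confirms that the overall CTR equals the sum of each channel's CTR weighted by its impression share.

```python
import math

# Hypothetical counts per channel.
clicks = {"A": 30, "B": 10}
impressions = {"A": 100, "B": 400}

total_impressions = sum(impressions.values())
overall_ctr = sum(clicks.values()) / total_impressions

# Sum of CTR_i * impression_share_i over the channels.
weighted_ctr = 0.0
for channel in clicks:
    ctr = clicks[channel] / impressions[channel]
    share = impressions[channel] / total_impressions
    weighted_ctr += ctr * share

print(overall_ctr, weighted_ctr)
assert math.isclose(overall_ctr, weighted_ctr)
```

Each term CTR_i * impression_share_i behaves like a quantity that adds up to the overall rate, which is what lets the quantity-style drill-down be reused.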

Install

pip install dimtributor

Example: for quantity indicator

import pandas as pd 
import numpy as np
from dimtributor.tree import DimtributorTree

df_quantity = pd.read_csv('data_quantity.csv')
df_quantity.sample(5)
       group_name  app_name  platform  country  language  create_datediff  orders
5718         this        ms   android        S         r          8-30day      28
14928        base        ms   android        R         v           1-7day       1
12203        base       ems   android        R         n          8-30day       3
2810         base        ms   android        s         h         31-90day     181
2557         base        ms   android        N         r           >90day     228

group = 'group_name'
# Name of the column that labels each row as "base" or "this".
# For example, if the DAU of 20220102 dropped 30% compared to 20220101,
# then the value of group for 20220101 is "base"
# and the value of group for 20220102 is "this".

value_type = 'quantity'  # you are dealing with a quantity indicator
dims = ["app_name","platform","country","language","create_datediff"]  # all the dimensions you are interested in
y = 'orders'  # the indicator you want to analyze
d = ''  # placeholder for quantity indicators
n = ''  # placeholder for quantity indicators
max_tree_depth = 3  # default value 3
min_root_weight = 0.3  # default value 0.3

dmat_quantity = DimtributorTree(df_quantity,value_type,group,dims,y,d,n,max_tree_depth,min_root_weight)
dmat_quantity.createtree()
print("outtree:",dmat_quantity.outtree)
print("png_path:",dmat_quantity.png_path)

[Image: attribution tree rendered as a PNG]

outtree: {'Decrease:2137268\n27318803->25181535,-7.82%\n account for root:100.0%': {'create_datediff:0day\n account for parent:69.62%': 'create_datediff:0day\nDecrease:1488059\n1823660->335601,-81.6%\n account for root:69.62%', 'create_datediff:not 0day\n account for parent:30.38%': 'create_datediff:not 0day\nDecrease:649209\n25495143->24845934,-2.55%\n account for root:30.38%'}}
png_path: /Users/kennyzhangchao/Desktop/starx_myself/python_code/dimtributor_copy/dim_attribution_1697851303.png
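
Judging from the printed output above, `outtree` looks like a nested dict whose keys are node or branch labels and whose leaves are plain strings. Assuming that structure holds in general (it is inferred from this one output, not from the package documentation), a small helper can flatten it into indented text. The function name `walk` is hypothetical, and the dict below is copied from the output above.

```python
def walk(tree, depth=0, lines=None):
    """Flatten a nested outtree-style dict into indented text lines."""
    if lines is None:
        lines = []

    def clean(label):
        # Labels contain embedded newlines; join the parts on one line.
        return " | ".join(part.strip() for part in label.split("\n"))

    if isinstance(tree, dict):
        for label, child in tree.items():
            lines.append("  " * depth + clean(label))
            walk(child, depth + 1, lines)
    else:
        # Leaves are plain strings describing the final node.
        lines.append("  " * depth + clean(tree))
    return lines


# Copied from the printed outtree of the quantity example.
outtree = {
    'Decrease:2137268\n27318803->25181535,-7.82%\n account for root:100.0%': {
        'create_datediff:0day\n account for parent:69.62%':
            'create_datediff:0day\nDecrease:1488059\n1823660->335601,-81.6%\n account for root:69.62%',
        'create_datediff:not 0day\n account for parent:30.38%':
            'create_datediff:not 0day\nDecrease:649209\n25495143->24845934,-2.55%\n account for root:30.38%',
    }
}

lines = walk(outtree)
print("\n".join(lines))
```

Reading the flattened tree: orders fell by 2137268 overall, and the create_datediff value 0day alone accounts for 1488059 of that, which is the 69.62% shown as its share of the root.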

Example: for rate indicator

import pandas as pd 
import numpy as np
from dimtributor.tree import DimtributorTree

df_rate = pd.read_csv('data_rate.csv')
df_rate.sample(5)
      group_name  app_name  platform  region  language  denominator  numerator  conversion_rate
2603        this       ems   android       W         h            1          0         0.000000
18          this        ms   android       R         r        41352      16315         0.394540
34          base        ms   android       U         u        23386      10042         0.429402
1043        base        ms   android       E         w            8          1         0.125000
356         this        tv   android       N         a          185         94         0.508108

group = 'group_name'
# Name of the column that labels each row as "base" or "this".
# For example, if the conversion_rate of 20220102 dropped 1% compared to 20220101,
# then the value of group for 20220101 is "base"
# and the value of group for 20220102 is "this".

value_type = 'rate'  # you are dealing with a rate (or conversion) indicator
dims = ["app_name","platform","region","language"]  # all the dimensions you are interested in
y = 'conversion_rate'  # the indicator you want to analyze
d = 'denominator'  # required for rate indicators
n = 'numerator'  # required for rate indicators
max_tree_depth = 3  # default value 3
min_root_weight = 0.3  # default value 0.3

dmat_rate = DimtributorTree(df_rate,value_type,group,dims,y,d,n,max_tree_depth,min_root_weight)
dmat_rate.createtree()
print("outtree:",dmat_rate.outtree)
print("png_path:",dmat_rate.png_path)

[Image: attribution tree rendered as a PNG]

outtree: {'Decrease:23.87%\n percentage:100.0%->100.0%=0.0%\n rate:33.88%->10.01%=-23.87%\n account for root:100.0%': {'region:S\n account for parent:74.38%': 'region:S\nDecrease:17.76%\n percentage:62.87%->74.94%=12.07%\n rate:30.79%->2.14%=-28.66%\n account for root:74.38%', 'region:not S\n account for parent:25.62%': 'region:not S\nDecrease:6.12%\n percentage:37.13%->25.06%=-12.07%\n rate:39.11%->33.55%=-5.57%\n account for root:25.62%'}}
png_path: /Users/kennyzhangchao/Desktop/starx_myself/python_code/dimtributor_copy/dim_attribution_1697851318.png
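
The numbers in this output appear consistent with the share-times-rate decomposition from BASIC IDEAS: each node's contribution to the overall rate is its denominator share ("percentage") multiplied by its own rate. The sketch below recomputes the changes from the printed figures only. This is an inference from the output, not a statement about the package internals, and small differences are expected because the printed values are rounded.

```python
# Shares ("percentage") and rates copied from the printed outtree,
# written as fractions in (base, this) order.
nodes = {
    "region:S": {"share": (0.6287, 0.7494), "rate": (0.3079, 0.0214)},
    "region:not S": {"share": (0.3713, 0.2506), "rate": (0.3911, 0.3355)},
}


def contribution_change(node):
    """Change in share * rate between the base and the current period."""
    share_base, share_this = node["share"]
    rate_base, rate_this = node["rate"]
    return share_this * rate_this - share_base * rate_base


changes = {name: contribution_change(node) for name, node in nodes.items()}
total_change = sum(changes.values())

for name, change in changes.items():
    print(f"{name}: change {change:+.2%}, "
          f"share of total {change / total_change:.2%}")
print(f"total: {total_change:+.2%}")
```

The recomputed total lands close to the printed 23.87% decrease, and region S accounts for roughly three quarters of it, in line with the "account for root" figure in the output.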

References

  1. I updated my code with reference to Adtributor.
  2. The code related to the decision tree is adapted from the book "Machine Learning in Action" by Peter Harrington.

GitHub: dimtributor

TO BE DONE:

1. Handle the display of Chinese characters in the PNG.
2. Use greater than/less than splits to handle continuous dimensions, such as create_datediff.



          
