A Python package for automatic dimension attribution
Dimtributor: a Python package for automatic dimension attribution
As data analysts, one of our daily tasks is to explain fluctuations in indicators. Dimension drill-down is a common way to do that: it finds the dimension most likely to be responsible for the fluctuation.
Done by hand, different analysts may reach different conclusions, and it takes a lot of time, especially when there are many dimensions or many indicators to deal with.
This Python package uses JS divergence (Jensen-Shannon divergence) together with a decision tree to do dimension drill-down.
Since it builds the tree with a deterministic algorithm, you get the same conclusion no matter how many times you run it.
With just a few lines of code, it can save you a lot of time on this routine work of explaining indicator fluctuations.
BASIC IDEAS
Suppose your most important indicator, DAU, dropped 30%, and you want to find which specific value of which dimension explains most of the decline.
Say you have only two dimensions, cities and channels. Every city dropped exactly 30%, but among channels, channel A decreased 80% while channels B and C each increased 50%. In this case, we would blame channels for the drop in DAU.
What we are actually doing here is comparing what really happened with what should have happened; a dimension deserves the blame if it is a big "surprise". We measure this "surprise" with JS divergence, since it tells us how different two distributions are (the distribution that really happened and the distribution that should have happened).
In this example, we should split our tree on the channels dimension.
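The comparison above can be sketched in a few lines of plain Python. This is a minimal illustration, not the package's internal code: the function name `js_divergence`, the DAU figures, and the choice to treat the base-period shares as the "should have happened" distribution are all assumptions made for the example.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two non-negative vectors.

    Both inputs are normalised to sum to 1 first, so raw counts are accepted.
    """
    p_sum, q_sum = sum(p), sum(q)
    p = [x / p_sum for x in p]
    q = [x / q_sum for x in q]
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Terms with a_i == 0 contribute nothing to the KL sum.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical DAU figures: the total falls from 260 to 182 (-30%).
cities_base, cities_this = [100, 160], [70, 112]            # every city -30%
channels_base, channels_this = [160, 50, 50], [32, 75, 75]  # A -80%, B and C +50%

js_cities = js_divergence(cities_base, cities_this)
js_channels = js_divergence(channels_base, channels_this)
print(f"cities: {js_cities:.4f}, channels: {js_channels:.4f}")
```

Because every city keeps the same share of the total, the cities score is effectively zero, while the channels score is clearly positive, so channels is the dimension to split on.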
But on which specific value should we split? For each value of channel (A, B, C), we calculate the surprise with JS divergence after excluding that value. Say we get a JS divergence of 0.1 when we exclude channel A, 0.3 for B, and 0.2 for C; then we should split the decision tree at channel A, because the surprise is smallest after A is excluded.
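A rough sketch of this value-selection step, again as an assumption-laden illustration rather than the package's actual implementation: `js_divergence`, `surprise_without`, and the channel figures are hypothetical, so the scores differ from the 0.1/0.3/0.2 used in the text.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) of two count vectors."""
    p_sum, q_sum = sum(p), sum(q)
    p = [x / p_sum for x in p]
    q = [x / q_sum for x in q]
    m = [(a + b) / 2 for a, b in zip(p, q)]
    kl = lambda a, b: sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical DAU per channel in the base period and in the current period.
base = {"A": 160, "B": 50, "C": 50}
this = {"A": 32, "B": 75, "C": 75}

def surprise_without(value):
    """JS divergence between the two periods after dropping one channel value."""
    rest = [k for k in base if k != value]
    return js_divergence([base[k] for k in rest], [this[k] for k in rest])

scores = {value: surprise_without(value) for value in base}
# Split on the value whose removal leaves the smallest remaining surprise.
split_value = min(scores, key=scores.get)
print(scores, split_value)
```

With these numbers, removing A leaves B and C with identical shares in both periods, so A is the value chosen for the split.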
That is basically all the theory behind this package.
Rate indicators (numerator/denominator, such as a conversion rate, for example CTR = click/impression) need slightly different handling, but the problem can be converted into the same question. For a dimension, say channel, with two values A and B, we can rewrite
CTR
= click / impression
= (click_A + click_B) / (impression_A + impression_B)
= (click_A / impression_A) * (impression_A / (impression_A + impression_B)) + (click_B / impression_B) * (impression_B / (impression_A + impression_B))
= CTR_A * impression_share_A + CTR_B * impression_share_B.
In this way a rate problem becomes a quantity problem.
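The identity is easy to check numerically. The click and impression counts below are made up for the check; only the decomposition itself comes from the derivation above.

```python
import math

# Hypothetical clicks and impressions for two channels.
clicks = {"A": 30, "B": 120}
impressions = {"A": 1000, "B": 2000}

total_impressions = sum(impressions.values())

# Left-hand side: overall CTR computed directly.
overall_ctr = sum(clicks.values()) / total_impressions

# Right-hand side: per-channel CTR weighted by each channel's impression share.
weighted_ctr = sum(
    (clicks[c] / impressions[c]) * (impressions[c] / total_impressions)
    for c in clicks
)

print(overall_ctr, weighted_ctr, math.isclose(overall_ctr, weighted_ctr))
```

Each term `CTR_c * impression_share_c` is an additive contribution to the overall rate, which is what lets the rate case be treated like a quantity.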
Install
pip install dimtributor
Example: for quantity indicator
import pandas as pd
import numpy as np
from dimtributor.tree import DimtributorTree
df_quantity = pd.read_csv('data_quantity.csv')
df_quantity.sample(5)
| | group_name | app_name | platform | country | language | create_datediff | orders |
|---|---|---|---|---|---|---|---|
| 5718 | this | ms | android | S | r | 8-30day | 28 |
| 14928 | base | ms | android | R | v | 1-7day | 1 |
| 12203 | base | ems | android | R | n | 8-30day | 3 |
| 2810 | base | ms | android | s | h | 31-90day | 181 |
| 2557 | base | ms | android | N | r | >90day | 228 |
group = 'group_name'
# name of the group column, which takes two values: "base" and "this"
# e.g. if the DAU of 20220102 dropped 30% compared to 20220101,
# then the rows for 20220101 have group "base"
# and the rows for 20220102 have group "this"
value_type = 'quantity'  # you are dealing with a quantity indicator
dims = ["app_name","platform","country","language","create_datediff"]  # all the dimensions you are interested in
y = 'orders'  # the indicator you want to analyze
d = ''  # placeholder, only used for rate indicators
n = ''  # placeholder, only used for rate indicators
max_tree_depth = 3  # default value 3
min_root_weight = 0.3  # default value 0.3
dmat_quantity = DimtributorTree(df_quantity,value_type,group,dims,y,d,n,max_tree_depth,min_root_weight)
dmat_quantity.createtree()
print("outtree:",dmat_quantity.outtree)
print("png_path:",dmat_quantity.png_path)
outtree: {'Decrease:2137268\n27318803->25181535,-7.82%\n account for root:100.0%': {'create_datediff:0day\n account for parent:69.62%': 'create_datediff:0day\nDecrease:1488059\n1823660->335601,-81.6%\n account for root:69.62%', 'create_datediff:not 0day\n account for parent:30.38%': 'create_datediff:not 0day\nDecrease:649209\n25495143->24845934,-2.55%\n account for root:30.38%'}}
png_path: /Users/kennyzhangchao/Desktop/starx_myself/python_code/dimtributor_copy/dim_attribution_1697851303.png
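To help read this output, the short check below recomputes the percentages from the raw counts printed in `outtree`. It is only an interpretation of the printed figures, not a description of how the package computes them; the variable names are made up for the check.

```python
# Counts copied from the printed outtree above.
root_base, root_this = 27318803, 25181535
child_base, child_this = 1823660, 335601  # node create_datediff:0day

root_drop = root_base - root_this     # absolute decrease at the root
child_drop = child_base - child_this  # absolute decrease in the 0day node

# Relative change of the root indicator, in percent.
root_change_pct = round(-root_drop / root_base * 100, 2)
# Share of the root's decrease that comes from the 0day node, in percent.
child_share_pct = round(child_drop / root_drop * 100, 2)

print(root_drop, child_drop, root_change_pct, child_share_pct)
```

The recomputed values match the printed `Decrease`, the `-7.82%` change, and the `account for root:69.62%` label, which suggests that "account for root" is the node's decrease divided by the root's decrease.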
Example: for rate indicator
import pandas as pd
import numpy as np
from dimtributor.tree import DimtributorTree
df_rate = pd.read_csv('data_rate.csv')
df_rate.sample(5)
| | group_name | app_name | platform | region | language | denominator | numerator | conversion_rate |
|---|---|---|---|---|---|---|---|---|
| 2603 | this | ems | android | W | h | 1 | 0 | 0.000000 |
| 18 | this | ms | android | R | r | 41352 | 16315 | 0.394540 |
| 34 | base | ms | android | U | u | 23386 | 10042 | 0.429402 |
| 1043 | base | ms | android | E | w | 8 | 1 | 0.125000 |
| 356 | this | tv | android | N | a | 185 | 94 | 0.508108 |
group = 'group_name'
# name of the group column, which takes two values: "base" and "this"
# e.g. if the conversion_rate of 20220102 dropped 1% compared to 20220101,
# then the rows for 20220101 have group "base"
# and the rows for 20220102 have group "this"
value_type = 'rate'  # you are dealing with a rate (or conversion) indicator
dims = ["app_name","platform","region","language"]  # all the dimensions you are interested in
y = 'conversion_rate'  # the indicator you want to analyze
d = 'denominator'  # required for rate indicators
n = 'numerator'  # required for rate indicators
max_tree_depth = 3  # default value 3
min_root_weight = 0.3  # default value 0.3
dmat_rate = DimtributorTree(df_rate,value_type,group,dims,y,d,n,max_tree_depth,min_root_weight)
dmat_rate.createtree()
print("outtree:",dmat_rate.outtree)
print("png_path:",dmat_rate.png_path)
outtree: {'Decrease:23.87%\n percentage:100.0%->100.0%=0.0%\n rate:33.88%->10.01%=-23.87%\n account for root:100.0%': {'region:S\n account for parent:74.38%': 'region:S\nDecrease:17.76%\n percentage:62.87%->74.94%=12.07%\n rate:30.79%->2.14%=-28.66%\n account for root:74.38%', 'region:not S\n account for parent:25.62%': 'region:not S\nDecrease:6.12%\n percentage:37.13%->25.06%=-12.07%\n rate:39.11%->33.55%=-5.57%\n account for root:25.62%'}}
png_path: /Users/kennyzhangchao/Desktop/starx_myself/python_code/dimtributor_copy/dim_attribution_1697851318.png
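The rate output can be read through the share-times-rate decomposition described under BASIC IDEAS. The sketch below recomputes each node's contribution from the rounded `percentage` and `rate` figures printed above. It is an interpretation that happens to agree with the printed numbers up to rounding, not a statement of the package's internals, and `contribution` is a hypothetical helper.

```python
# Figures (in percent) copied from the printed outtree above: (base, this).
nodes = {
    "region:S":     {"share": (62.87, 74.94), "rate": (30.79, 2.14)},
    "region:not S": {"share": (37.13, 25.06), "rate": (39.11, 33.55)},
}

def contribution(node):
    """Change in share * rate between base and this, in percentage points."""
    share_base, share_this = node["share"]
    rate_base, rate_this = node["rate"]
    # Both factors are percentages, so divide by 100 once to stay in percent.
    return (share_this * rate_this - share_base * rate_base) / 100

contributions = {name: contribution(node) for name, node in nodes.items()}
total = sum(contributions.values())

for name, value in contributions.items():
    print(f"{name}: {value:.2f} pp")
print(f"total: {total:.2f} pp")
```

Up to rounding of the printed inputs, the two contributions agree with the `Decrease` values of 17.76% and 6.12%, and their sum agrees with the root's 23.87% drop.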
References
- I updated my code with reference to Adtributor.
- The decision tree code is adapted from the book "Machine Learning in Action" by Peter Harrington.
GitHub: dimtributor
TO BE DONE:
1. handle Chinese character display in the PNG;
2. use greater-than/less-than splits to handle continuous dimensions, like create_datediff;
Hashes for dimtributor-0.0.3-py3-none-any.whl
Algorithm | Hash digest |
---|---|
SHA256 | 2e2ae275b383bb889885e40258c13dea502b268b64f00b460210123b2fb3b776 |
MD5 | fae526bcf7007bf9ed16fe1d5816c5a0 |
BLAKE2b-256 | 224258e682281c41adece1730e09481e0e8c14145965f3fd96741c62c2b5d52d |