Project description

Dimtributor: a Python package for automatic dimension attribution

As data analysts, one of our daily tasks is to explain fluctuations in indicators. Dimension drill-down is a common way to do that: it finds the dimension most likely to be responsible for the fluctuation.
Done by hand, different analysts may reach different conclusions, and it takes a lot of time, especially when there are many dimensions or many indicators to deal with.
In this Python package, I use JS divergence (Jensen-Shannon divergence) with a decision tree to do dimension drill-down.
Because the tree is built by a fixed algorithm, you get the same conclusion no matter how many times you run it. With just a few lines of code, it can save you a lot of time on this daily work of explaining indicator fluctuations.

BASIC IDEAS

Suppose your most important indicator, DAU, dropped 30%, and you want to find which specific value of which dimension explains most of the decline.

Let's say you have only two dimensions, cities and channels. Every city dropped exactly 30%, but among channels, channel A decreased 80% while channels B and C increased 50%. In this case, we would blame channels for the drop in DAU.

Here we are actually comparing what really happened with what should have happened; we should blame a dimension if it is a big "surprise". We measure this "surprise" with JS divergence, since it tells how different two distributions are (the distribution of what really happened and the distribution of what should have happened).
In this example, we should split our tree on the channels dimension.
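The comparison above can be sketched in a few lines. This is only an illustration, not the package's code: the DAU numbers are made up to match the story (a 30% overall drop), and it assumes the "should have happened" distribution is each value's share in the base period, while the "really happened" distribution is each value's share in the current period.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (base-2 logs) between two discrete distributions."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def shares(counts):
    """Turn raw counts into a distribution that sums to 1."""
    total = sum(counts)
    return [c / total for c in counts]

# Hypothetical DAU per dimension value: (base period, current period).
cities = {"X": (700, 490), "Y": (600, 420)}                      # every city -30%
channels = {"A": (800, 160), "B": (300, 450), "C": (200, 300)}   # A -80%, B and C +50%

def surprise(dimension):
    expected = shares([base for base, _ in dimension.values()])
    actual = shares([this for _, this in dimension.values()])
    return js_divergence(expected, actual)

surprise_by_dim = {"cities": surprise(cities), "channels": surprise(channels)}
blamed = max(surprise_by_dim, key=surprise_by_dim.get)
print(surprise_by_dim, "-> split on", blamed)
```

Because every city moves in line with the total, its shares do not change and its divergence is essentially zero, so the channels dimension is the one that gets picked.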

But on which specific value should we split? For each value of channel (A, B, C), we calculate the surprise with JS divergence after excluding that value. Say we get a JS divergence of 0.1 when we exclude channel A, 0.3 for B, and 0.2 for C; then we should split the decision tree at channel A, because the surprise is smallest after we exclude A.
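A minimal sketch of this value-selection step follows, under the same assumptions as before: the counts are hypothetical, the distributions are shares recomputed over the values that remain after an exclusion, and the helper names are illustrative rather than part of dimtributor.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (base-2 logs) between two discrete distributions."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical DAU per channel: (base period, current period).
channels = {"A": (800, 160), "B": (300, 450), "C": (200, 300)}

def surprise_without(excluded):
    """JS divergence between base and current shares once one value is left out."""
    rest = [v for k, v in channels.items() if k != excluded]
    base_total = sum(b for b, _ in rest)
    this_total = sum(t for _, t in rest)
    base_shares = [b / base_total for b, _ in rest]
    this_shares = [t / this_total for _, t in rest]
    return js_divergence(base_shares, this_shares)

remaining_surprise = {value: surprise_without(value) for value in channels}
split_value = min(remaining_surprise, key=remaining_surprise.get)
print(remaining_surprise, "-> split at", split_value)
```

With these numbers, removing channel A leaves B and C with unchanged shares relative to each other, so the remaining surprise is smallest for A and the split lands there.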

That is basically all the theory behind this package. Things differ slightly for a rate indicator (numerator/denominator, such as a conversion rate, for example CTR = click/impression), but we can reduce it to the same problem. For a dimension, say channel, with two values A and B, we can rewrite
CTR
= click/impression
= (click_A + click_B)/(impression_A + impression_B)
= click_A/impression_A * impression_A/(impression_A + impression_B) + click_B/impression_B * impression_B/(impression_A + impression_B)
= CTR_A * impression_share_A + CTR_B * impression_share_B.
In this way we can turn a rate problem into a quantity problem.
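The identity can be checked numerically. The click and impression counts below are invented for illustration only.

```python
import math

# Hypothetical clicks and impressions per channel.
clicks = {"A": 30, "B": 50}
impressions = {"A": 1000, "B": 500}

total_impressions = sum(impressions.values())
overall_ctr = sum(clicks.values()) / total_impressions

# Per-channel CTR and per-channel share of impressions.
ctr = {k: clicks[k] / impressions[k] for k in clicks}
impression_share = {k: impressions[k] / total_impressions for k in impressions}

# Each term CTR_k * impression_share_k behaves like a quantity that adds up to the overall CTR.
contribution = {k: ctr[k] * impression_share[k] for k in clicks}
weighted_ctr = sum(contribution.values())

print(overall_ctr, weighted_ctr, contribution)
```

Each per-channel term is additive, which is what lets a rate indicator be attributed in the same way as a quantity indicator.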

Install

pip install dimtributor

Example: for quantity indicator

import pandas as pd 
import numpy as np
from dimtributor.tree import DimtributorTree

df_quantity = pd.read_csv('data_quantity.csv')
df_quantity.sample(5)

	group_name	app_name	platform	country	language	create_datediff	orders
5718	this	ms	android	S	r	8-30day	28
14928	base	ms	android	R	v	1-7day	1
12203	base	ems	android	R	n	8-30day	3
2810	base	ms	android	s	h	31-90day	181
2557	base	ms	android	N	r	>90day	228

group = 'group_name'
# name of the group column, which takes two values, "base" and "this"
# e.g. if the DAU of 20220102 dropped 30% compared to 20220101,
# then the group value for 20220101 is "base"
# and the group value for 20220102 is "this"

value_type = 'quantity'  # you are dealing with a quantity indicator
dims = ["app_name","platform","country","language","create_datediff"]  # all the dimensions you are interested in
y = 'orders'  # the indicator you want to analyze
d = ''  # placeholder, not used for quantity indicators
n = ''  # placeholder, not used for quantity indicators
max_tree_depth = 3  # default value 3
min_root_weight = 0.3  # default value 0.3

dmat_quantity = DimtributorTree(df_quantity,value_type,group,dims,y,d,n,max_tree_depth,min_root_weight)
dmat_quantity.createtree()
print("outtree:",dmat_quantity.outtree)
print("png_path:",dmat_quantity.png_path)


outtree: {'Decrease:2137268\n27318803->25181535,-7.82%\n account for root:100.0%': {'create_datediff:0day\n account for parent:69.62%': 'create_datediff:0day\nDecrease:1488059\n1823660->335601,-81.6%\n account for root:69.62%', 'create_datediff:not 0day\n account for parent:30.38%': 'create_datediff:not 0day\nDecrease:649209\n25495143->24845934,-2.55%\n account for root:30.38%'}}
png_path: /Users/kennyzhangchao/Desktop/starx_myself/python_code/dimtributor_copy/dim_attribution_1697851303.png
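The printed outtree is a nested dict whose keys and leaves are multi-line labels. A small helper can make it easier to read in a terminal. The helper below is hypothetical (it is not part of dimtributor), and it assumes deeper trees nest in the same key-to-dict or key-to-string pattern as the output above.

```python
# Copied from the example output above.
outtree = {
    'Decrease:2137268\n27318803->25181535,-7.82%\n account for root:100.0%': {
        'create_datediff:0day\n account for parent:69.62%':
            'create_datediff:0day\nDecrease:1488059\n1823660->335601,-81.6%\n account for root:69.62%',
        'create_datediff:not 0day\n account for parent:30.38%':
            'create_datediff:not 0day\nDecrease:649209\n25495143->24845934,-2.55%\n account for root:30.38%',
    }
}

def walk(tree, depth=0):
    """Yield (depth, label) for every key and every leaf string in the nested dict."""
    for label, child in tree.items():
        yield depth, label
        if isinstance(child, dict):
            yield from walk(child, depth + 1)
        else:
            yield depth + 1, child

# Print one node per line, indented by depth, with the label's line breaks flattened.
for depth, label in walk(outtree):
    print("  " * depth + " | ".join(part.strip() for part in label.split("\n")))
```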

Example: for rate indicator

import pandas as pd 
import numpy as np
from dimtributor.tree import DimtributorTree

df_rate = pd.read_csv('data_rate.csv')
df_rate.sample(5)

	group_name	app_name	platform	region	language	denominator	numerator	conversion_rate
2603	this	ems	android	W	h	1	0	0.000000
18	this	ms	android	R	r	41352	16315	0.394540
34	base	ms	android	U	u	23386	10042	0.429402
1043	base	ms	android	E	w	8	1	0.125000
356	this	tv	android	N	a	185	94	0.508108

group = 'group_name'
# name of the group column, which takes two values, "base" and "this"
# e.g. if the conversion_rate of 20220102 dropped 1% compared to 20220101,
# then the group value for 20220101 is "base"
# and the group value for 20220102 is "this"

value_type = 'rate'  # you are dealing with a rate (or conversion) indicator
dims = ["app_name","platform","region","language"]  # all the dimensions you are interested in
y = 'conversion_rate'  # the indicator you want to analyze
d = 'denominator'  # required for rate indicators
n = 'numerator'  # required for rate indicators
max_tree_depth = 3  # default value 3
min_root_weight = 0.3  # default value 0.3

dmat_rate = DimtributorTree(df_rate,value_type,group,dims,y,d,n,max_tree_depth,min_root_weight)
dmat_rate.createtree()
print("outtree:",dmat_rate.outtree)
print("png_path:",dmat_rate.png_path)


outtree: {'Decrease:23.87%\n percentage:100.0%->100.0%=0.0%\n rate:33.88%->10.01%=-23.87%\n account for root:100.0%': {'region:S\n account for parent:74.38%': 'region:S\nDecrease:17.76%\n percentage:62.87%->74.94%=12.07%\n rate:30.79%->2.14%=-28.66%\n account for root:74.38%', 'region:not S\n account for parent:25.62%': 'region:not S\nDecrease:6.12%\n percentage:37.13%->25.06%=-12.07%\n rate:39.11%->33.55%=-5.57%\n account for root:25.62%'}}
png_path: /Users/kennyzhangchao/Desktop/starx_myself/python_code/dimtributor_copy/dim_attribution_1697851318.png
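The numbers in this output appear to follow the rate decomposition from BASIC IDEAS: each segment contributes share x rate to the overall rate. The check below is an inference from the printed figures, not taken from the package source; it assumes "percentage" is the segment's share of the denominator and "rate" is the segment's own conversion rate.

```python
# (share_base, rate_base, share_this, rate_this), copied from the printed outtree.
segments = {
    "region:S": (0.6287, 0.3079, 0.7494, 0.0214),
    "region:not S": (0.3713, 0.3911, 0.2506, 0.3355),
}

# A segment's contribution to the overall rate is share * rate,
# so its contribution to the decrease is the base term minus the current term.
decrease = {
    name: share_base * rate_base - share_this * rate_this
    for name, (share_base, rate_base, share_this, rate_this) in segments.items()
}
total_decrease = sum(decrease.values())

for name, dec in decrease.items():
    print(f"{name}: decrease {dec:.2%}, account for root {dec / total_decrease:.2%}")
print(f"total decrease {total_decrease:.2%}")
```

Up to rounding of the printed inputs, the results agree with the "Decrease" and "account for root" values shown in the tree.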

References

I wrote my code with reference to Adtributor.
The decision tree code is adapted from the book Machine Learning in Action by Peter Harrington.

GitHub: dimtributor

TO BE DONE:

1. Handle Chinese character display in the PNG;
2. Use greater than/less than comparisons to handle continuous dimensions, like create_datediff.

Release history

0.0.3 (this version): Oct 21, 2023
0.0.2: Oct 6, 2023
0.0.1: Oct 6, 2023

Download files

Source Distribution

dimtributor-0.0.3.tar.gz (345.5 kB)

Uploaded Oct 21, 2023 Source

Built Distribution

dimtributor-0.0.3-py3-none-any.whl (10.1 kB)

Uploaded Oct 21, 2023 Python 3

Hashes for dimtributor-0.0.3.tar.gz

Algorithm	Hash digest
SHA256	`53f58377db3a34593f30ca47efdf32bf0cf340bb29a758526d990d00afbb35a0`
MD5	`4445909fc66a94a7e002108f20b8f224`
BLAKE2b-256	`92723683c70e4c6996dbe32d01d6dfefeee5caf07f789abf04a7e1e93a9d2adc`

Hashes for dimtributor-0.0.3-py3-none-any.whl

Algorithm	Hash digest
SHA256	`2e2ae275b383bb889885e40258c13dea502b268b64f00b460210123b2fb3b776`
MD5	`fae526bcf7007bf9ed16fe1d5816c5a0`
BLAKE2b-256	`224258e682281c41adece1730e09481e0e8c14145965f3fd96741c62c2b5d52d`