PyDistances: A Statistical Distances Python Package
This is a package for computing distances among observations of statistical variables, such as: Euclidean, Minkowski, Canberra, Pearson, Mahalanobis, Robust Mahalanobis, Gower, Generalized Gower and Related Metric Scaling (RelMS). A total of 41 statistical distances can be computed.
Installation
pip install PyDistances
Example of use
import itertools
import numpy as np
import pandas as pd
import PyDistances
from PyDistances import Euclidean_Dist, Euclidean_Dist_Matrix, Minkowski_Dist, Minkowski_Dist_Matrix, Canberra_Dist, Canberra_Dist_Matrix, Pearson_Dist, Pearson_Dist_Matrix, Mahalanobis_Dist, Mahalanobis_Dist_Matrix, a_b_c_d_Matrix, Sokal_Similarity, Sokal_Dist, Sokal_Dist_Matrix, Jaccard_Similarity, Jaccard_Dist, Jaccard_Dist_Matrix, alpha, Matching_Similarity, Matching_Dist, Matching_Dist_Matrix, Gower_Similarity_Matrix, Gower_Dist_Matrix, Robust_Mahalanobis_Dist, Robust_Mahalanobis_Dist_Matrix, GeneralizedGowerDistance
Getting data
We load the data used throughout this tutorial. The data set is available at the following link: https://github.com/FabioScielzoOrtiz/Distances_Package/blob/master/Tests/House_Price.csv
Data = pd.read_csv('House_Price.csv')
Data = Data.loc[0:150, ['latitude', 'longitude', 'price', 'size_in_m_2', 'balcony_recode', 'private_garden_recode', 'private_gym_recode', 'quality_recode', 'no_of_bathrooms', 'no_of_bedrooms']]
Data_quant = Data.loc[:,['latitude', 'longitude', 'price', 'size_in_m_2']]
Data_binary = Data.loc[:,['balcony_recode', 'private_garden_recode', 'private_gym_recode']]
Data_multiclass = Data.loc[:,['quality_recode', 'no_of_bathrooms', 'no_of_bedrooms']]
Data.head() # p1=4, p2=3, p3=3
latitude | longitude | price | size_in_m_2 | balcony | private_garden | private_gym | quality | no_of_bathrooms | no_of_bedrooms |
---|---|---|---|---|---|---|---|---|---|
25.1132 | 55.1389 | 2.7e+06 | 100.242 | 1 | 0 | 0 | 2 | 2 | 1 |
25.1068 | 55.1512 | 2.85e+06 | 146.973 | 1 | 0 | 0 | 2 | 2 | 2 |
25.0633 | 55.1377 | 1.15e+06 | 181.254 | 1 | 0 | 0 | 2 | 5 | 3 |
25.2273 | 55.3418 | 2.85e+06 | 187.664 | 1 | 0 | 0 | 1 | 3 | 2 |
25.1143 | 55.1398 | 1.7292e+06 | 47.1018 | 0 | 0 | 0 | 2 | 1 | 0 |
Computing Euclidean distance
We compute the Euclidean distance between the observation at index 0 and itself.
Euclidean_Dist(Data_quant.iloc[0,:], Data_quant.iloc[0,:])
0.0
We compute the Euclidean distance between the observation at index 0 and the one at index 2.
Euclidean_Dist(Data_quant.iloc[0,:], Data_quant.iloc[2,:])
1550000.002117049
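As a cross-check of the value above, the Euclidean distance can be reproduced by hand. The sketch below assumes the usual definition (square root of the summed squared differences) and uses the rounded values of rows 0 and 2 from the table; the helper name `euclidean` is illustrative and is not part of PyDistances.

```python
import math

def euclidean(x, y):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Rows 0 and 2 of Data_quant, copied from the rounded table above:
# latitude, longitude, price, size_in_m_2.
x0 = [25.1132, 55.1389, 2.7e6, 100.242]
x2 = [25.0633, 55.1377, 1.15e6, 181.254]

d00 = euclidean(x0, x0)
d02 = euclidean(x0, x2)
print(d00)  # 0.0
print(d02)  # about 1550000.0021
```

Because the variables are not rescaled, the price difference (1,550,000) accounts for almost the entire distance.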
We compute the Euclidean distance matrix for the data set Data_quant.
Euclidean_Dist_Matrix(Data_quant)
array([[ 0. , 150000.00727904, 1550000.00211705, ...,
1500000.00009635, 2700000.01899102, 12100000.00553371],
[ 150000.00727904, 0. , 1700000.00034565, ...,
1650000.00026782, 2550000.0146678 , 11950000.00426352],
[ 1550000.00211705, 1700000.00034565, 0. , ...,
50000.040973 , 4250000.00673279, 13650000.00297389],
...,
[ 1500000.00009635, 1650000.00026782, 50000.040973 , ...,
0. , 4200000.01094663, 13600000.00447653],
[ 2700000.01899102, 2550000.0146678 , 4250000.00673279, ...,
4200000.01094663, 0. , 9400000.00011113],
[12100000.00553371, 11950000.00426352, 13650000.00297389, ...,
13600000.00447653, 9400000.00011113, 0. ]])
Now we repeat the same procedure with other distances available in PyDistances.
Computing Minkowski distance
Minkowski_Dist(Data_quant.iloc[0,:], Data_quant.iloc[0,:], q=1)
0.0
Minkowski_Dist(Data_quant.iloc[0,:], Data_quant.iloc[2,:], q=1)
1550081.062526
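The q=1 result above can be checked with a few lines of plain Python. This is a minimal sketch assuming the standard Minkowski definition, (sum of |x - y|^q)^(1/q); the function name `minkowski` is hypothetical and the inputs are the rounded table values, so the last decimals differ slightly from the package output.

```python
def minkowski(x, y, q=1):
    """Minkowski distance of order q (q=1: Manhattan, q=2: Euclidean)."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)

# Rows 0 and 2 of Data_quant, copied from the rounded table above.
x0 = [25.1132, 55.1389, 2.7e6, 100.242]
x2 = [25.0633, 55.1377, 1.15e6, 181.254]

d_q1 = minkowski(x0, x2, q=1)
d_q2 = minkowski(x0, x2, q=2)
print(d_q1)  # about 1550081.063
print(d_q2)  # about 1550000.002, the Euclidean value
```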
Minkowski_Dist_Matrix(Data_quant, q=1)
array([[ 0. , 150046.748877, 1550081.062526, ...,
1500017.050769, 2700320.266531, 12100365.997115],
[ 150046.748877, 0. , 1700034.338187, ...,
1650029.78435 , 2550273.554024, 11950319.272776],
[ 1550081.062526, 1700034.338187, 0. , ...,
50064.027555, 4250239.302851, 13650284.955165],
...,
[ 1500017.050769, 1650029.78435 , 50064.027555, ...,
0. , 4200303.29563 , 13600348.947944],
[ 2700320.266531, 2550273.554024, 4250239.302851, ...,
4200303.29563 , 0. , 9400045.764238],
[12100365.997115, 11950319.272776, 13650284.955165, ...,
13600348.947944, 9400045.764238, 0. ]])
Computing Canberra distance
Canberra_Dist(Data_quant.iloc[0,:], Data_quant.iloc[0,:])
0.0
Canberra_Dist(Data_quant.iloc[0,:], Data_quant.iloc[2,:])
0.6913917083019879
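The Canberra value above can also be reproduced by hand. The sketch assumes the usual definition, the sum over variables of |x - y| / (|x| + |y|), and skips terms whose denominator is zero, which is a convention chosen here and not necessarily what PyDistances does. The name `canberra` is illustrative.

```python
def canberra(x, y):
    """Sum of |a - b| / (|a| + |b|); terms with a zero denominator are skipped."""
    total = 0.0
    for a, b in zip(x, y):
        denom = abs(a) + abs(b)
        if denom > 0:
            total += abs(a - b) / denom
    return total

# Rows 0 and 2 of Data_quant, copied from the rounded table above.
x0 = [25.1132, 55.1389, 2.7e6, 100.242]
x2 = [25.0633, 55.1377, 1.15e6, 181.254]

d02 = canberra(x0, x2)
print(d02)  # about 0.6914
```

Each term lies between 0 and 1, so no single variable can dominate the result the way price dominates the Euclidean and Minkowski distances.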
Canberra_Dist_Matrix(Data_quant)
array([[0. , 0.21629237, 0.69139171, ..., 0.463675 , 0.9485963 ,
1.33838751],
[0.21629237, 0. , 0.53043317, ..., 0.52079671, 0.79157752,
1.19854721],
[0.69139171, 0.53043317, 0. , ..., 0.23597883, 1.04765637,
1.29619958],
...,
[0.463675 , 0.52079671, 0.23597883, ..., 0. , 1.20126891,
1.44813664],
[0.9485963 , 0.79157752, 1.04765637, ..., 1.20126891, 0. ,
0.51782969],
[1.33838751, 1.19854721, 1.29619958, ..., 1.44813664, 0.51782969,
0. ]])
Computing Pearson distance
Pearson_Dist(Data_quant.iloc[0,:], Data_quant.iloc[0,:], variance=Data_quant.var())
0.0
Pearson_Dist(Data_quant.iloc[0,:], Data_quant.iloc[2,:], variance=Data_quant.var())
1.5393297661160206
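To make the role of the `variance` argument concrete, here is a minimal sketch that assumes the Pearson distance is the Euclidean distance with each squared difference divided by that variable's variance. The toy data and the name `pearson_dist` are made up for illustration; `statistics.variance` is the sample variance, which matches the pandas `var()` default.

```python
import math
from statistics import variance

def pearson_dist(x, y, variances):
    """Euclidean distance with each squared difference scaled by a variance."""
    return math.sqrt(sum((a - b) ** 2 / v for a, b, v in zip(x, y, variances)))

# Toy data set: three observations of two variables on very different scales.
rows = [[1.0, 100.0], [2.0, 300.0], [4.0, 200.0]]
variances = [variance(col) for col in zip(*rows)]  # one variance per column

d01 = pearson_dist(rows[0], rows[1], variances)
print(variances)
print(d01)
```

Scaling by the variance puts the variables on a comparable footing, which is why the Pearson values above are so much smaller than the Euclidean ones.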
Pearson_Dist_Matrix(Data_quant)
array([[0. , 0.63961801, 1.53932977, ..., 1.03084131, 4.32943281,
7.47171915],
[0.63961801, 0. , 1.20505141, ..., 1.09780711, 3.76643257,
7.04893716],
[1.53932977, 1.20505141, 0. , ..., 0.84617436, 3.79891055,
7.4670243 ],
...,
[1.03084131, 1.09780711, 0.84617436, ..., 0. , 4.44143053,
7.87905955],
[4.32943281, 3.76643257, 3.79891055, ..., 4.44143053, 0. ,
4.57460318],
[7.47171915, 7.04893716, 7.4670243 , ..., 7.87905955, 4.57460318,
0. ]])
Computing Mahalanobis distance
Mahalanobis_Dist(Data_quant.iloc[0,:], Data_quant.iloc[0,:], S_inv=np.linalg.inv( np.cov(Data_quant , rowvar=False) ))
0.0
Mahalanobis_Dist(Data_quant.iloc[0,:], Data_quant.iloc[2,:], S_inv=np.linalg.inv( np.cov(Data_quant , rowvar=False) ))
2.7671855371187757
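The calls above pass the inverse covariance matrix as `S_inv`. The following sketch shows how such a matrix could enter the computation, assuming the standard definition sqrt((x - y)' S^-1 (x - y)). The toy matrix `X` and the function name `mahalanobis` are illustrative only.

```python
import numpy as np

def mahalanobis(x, y, S_inv):
    """sqrt((x - y)' S_inv (x - y)) for 1-D inputs."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(diff @ S_inv @ diff))

# Toy data: four observations (rows) of two correlated variables (columns).
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 4.0]])
S_inv = np.linalg.inv(np.cov(X, rowvar=False))

d_self = mahalanobis(X[0], X[0], S_inv)
d_02 = mahalanobis(X[0], X[2], S_inv)
d_20 = mahalanobis(X[2], X[0], S_inv)
print(d_self, d_02)
```

With the identity matrix in place of `S_inv` the function reduces to the Euclidean distance.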
Mahalanobis_Dist_Matrix(Data_quant)
array([[0. , 0.92801614, 2.76718554, ..., 1.52541554, 5.21105193,
6.45997793],
[0.92801614, 0. , 1.96135599, ..., 0.98693199, 4.43479282,
6.2920865 ],
[2.76718554, 1.96135599, 0. , ..., 1.3592188 , 3.4307313 ,
7.27986558],
...,
[1.52541554, 0.98693199, 1.3592188 , ..., 0. , 4.41360406,
7.01503103],
[5.21105193, 4.43479282, 3.4307313 , ..., 4.41360406, 0. ,
7.4691448 ],
[6.45997793, 6.2920865 , 7.27986558, ..., 7.01503103, 7.4691448 ,
0. ]])
Computing Sokal similarity
a,b,c,d,p = a_b_c_d_Matrix(Data_binary)
Sokal_Similarity(i=0, r=2, a=a, d=d, p=p)
1.0
Sokal_Dist(i=0, r=2, a=a, d=d, p=p)
0.0
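For a single pair of binary observations, the quantities used above can be sketched as follows. The sketch assumes that a counts the 1-1 matches, d counts the 0-0 matches, p is the number of binary variables, that the Sokal similarity is (a + d) / p, and that a similarity s is turned into a distance as sqrt(2(1 - s)). That last assumption is consistent with the 0.81649658 entries of the Sokal distance matrix in this section, which equal sqrt(2/3). The function names and the second test vector are hypothetical.

```python
import math

def binary_counts(x, y):
    """Return (a, d, p): 1-1 matches, 0-0 matches, number of variables."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    d = sum(1 for u, v in zip(x, y) if u == 0 and v == 0)
    return a, d, len(x)

def sokal_similarity(x, y):
    a, d, p = binary_counts(x, y)
    return (a + d) / p

def similarity_to_distance(s):
    return math.sqrt(2 * (1 - s))

row0 = [1, 0, 0]   # balcony, private_garden, private_gym for row 0 (from the table)
row2 = [1, 0, 0]   # same values for row 2 (from the table)
other = [1, 1, 0]  # hypothetical observation differing in one variable

s02 = sokal_similarity(row0, row2)
d02 = similarity_to_distance(s02)
d_other = similarity_to_distance(sokal_similarity(row0, other))
print(s02, d02)  # 1.0 0.0
print(d_other)   # about 0.8165
```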
Sokal_Dist_Matrix(Data_binary)
array([[0. , 0. , 0. , ..., 0. , 0. ,
0.81649658],
[0. , 0. , 0. , ..., 0. , 0. ,
0.81649658],
[0. , 0. , 0. , ..., 0. , 0. ,
0.81649658],
...,
[0. , 0. , 0. , ..., 0. , 0. ,
0.81649658],
[0. , 0. , 0. , ..., 0. , 0. ,
0.81649658],
[0.81649658, 0.81649658, 0.81649658, ..., 0.81649658, 0.81649658,
0. ]])
Computing Jaccard similarity
Jaccard_Similarity(i=0, r=2, a=a, d=d, p=p)
1.0
Jaccard_Dist(i=0, r=2, a=a, d=d, p=p)
0.0
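The Jaccard similarity differs from the Sokal one by ignoring 0-0 matches. Since the calls above take a, d and p, a natural reading is s = a / (p - d); the sketch below assumes that form together with the sqrt(2(1 - s)) conversion. Returning NaN when both observations are all zeros is a choice made here, not documented package behaviour, and all names are illustrative.

```python
import math

def jaccard_similarity(x, y):
    """a / (p - d): shared presences over variables that are not joint absences."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    d = sum(1 for u, v in zip(x, y) if u == 0 and v == 0)
    p = len(x)
    if p == d:
        return math.nan  # undefined when every variable is a 0-0 match
    return a / (p - d)

def similarity_to_distance(s):
    return math.sqrt(2 * (1 - s))

row0 = [1, 0, 0]   # row 0 of Data_binary (from the table)
row2 = [1, 0, 0]   # row 2 of Data_binary (from the table)
other = [1, 1, 0]  # hypothetical observation differing in one variable

s02 = jaccard_similarity(row0, row2)
s_other = jaccard_similarity(row0, other)
d_other = similarity_to_distance(s_other)
print(s02)               # 1.0
print(s_other, d_other)  # 0.5 1.0
```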
Jaccard_Dist_Matrix(Data_binary)
array([[0., 0., 0., ..., 0., 0., 1.],
[0., 0., 0., ..., 0., 0., 1.],
[0., 0., 0., ..., 0., 0., 1.],
...,
[0., 0., 0., ..., 0., 0., 1.],
[0., 0., 0., ..., 0., 0., 1.],
[1., 1., 1., ..., 1., 1., 0.]])
Computing Matching similarity
Matching_Similarity(x_i=Data_multiclass.iloc[0,:], x_r=Data_multiclass.iloc[2,:], Data=Data_multiclass)
0.3333333333333333
Matching_Dist(x_i=Data_multiclass.iloc[0,:], x_r=Data_multiclass.iloc[2,:], Data=Data_multiclass)
1.1547005383792517
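Both values above can be reproduced from the table. The sketch assumes the matching similarity is the fraction of multiclass variables on which the two observations agree, and that the distance is sqrt(2(1 - s)); the function names are hypothetical.

```python
import math

def matching_similarity(x, y):
    """Fraction of categorical variables with identical values."""
    return sum(1 for u, v in zip(x, y) if u == v) / len(x)

def matching_dist(x, y):
    return math.sqrt(2 * (1 - matching_similarity(x, y)))

# quality, no_of_bathrooms, no_of_bedrooms for rows 0 and 2 (from the table).
row0 = [2, 2, 1]
row2 = [2, 5, 3]

s02 = matching_similarity(row0, row2)
d02 = matching_dist(row0, row2)
print(s02)  # 0.333..., only the quality variable matches
print(d02)  # about 1.1547
```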
Matching_Dist_Matrix(Data_multiclass)
array([[0. , 0.81649658, 1.15470054, ..., 0.81649658, 1.15470054,
1.41421356],
[0.81649658, 0. , 1.15470054, ..., 0. , 1.15470054,
1.41421356],
[1.15470054, 1.15470054, 0. , ..., 1.15470054, 0.81649658,
1.15470054],
...,
[0.81649658, 0. , 1.15470054, ..., 0. , 1.15470054,
1.41421356],
[1.15470054, 1.15470054, 0.81649658, ..., 1.15470054, 0. ,
1.15470054],
[1.41421356, 1.41421356, 1.15470054, ..., 1.41421356, 1.15470054,
0. ]])
Computing Gower distance
The theoretical treatment follows Gower (1971).
Gower_Similarity_Matrix(Data, p1=4, p2=3, p3=3)
array([[1. , 0.85175283, 0.68485131, ..., 0.83008431, 0.62482353,
0.34709882],
[0.85175283, 1. , 0.69489168, ..., 0.94863663, 0.63064768,
0.35833279],
[0.68485131, 0.69489168, 1. , ..., 0.72293677, 0.73120218,
0.48172501],
...,
[0.83008431, 0.94863663, 0.72293677, ..., 1. , 0.59776459,
0.36311382],
[0.62482353, 0.63064768, 0.73120218, ..., 0.59776459, 1. ,
0.55654437],
[0.34709882, 0.35833279, 0.48172501, ..., 0.36311382, 0.55654437,
1. ]])
Gower_Dist_Matrix(Data, p1=4, p2=3, p3=3)
array([[0. , 0.38502879, 0.56138105, ..., 0.41220831, 0.61251651,
0.808023 ],
[0.38502879, 0. , 0.55236611, ..., 0.22663488, 0.60774363,
0.80104133],
[0.56138105, 0.55236611, 0. , ..., 0.52636796, 0.51845716,
0.71991318],
...,
[0.41220831, 0.22663488, 0.52636796, ..., 0. , 0.63422032,
0.79805149],
[0.61251651, 0.60774363, 0.51845716, ..., 0.63422032, 0. ,
0.66592464],
[0.808023 , 0.80104133, 0.71991318, ..., 0.79805149, 0.66592464,
0. ]])
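The two matrices above are linked by d = sqrt(1 - s), which can be checked on the printed entries. The sketch below also gives one reading of Gower's (1971) similarity for a single pair of mixed observations: quantitative variables contribute 1 - |x - y| / range, binary variables contribute their 1-1 matches and drop 0-0 matches from the denominator, and multiclass variables contribute their matches. This is an assumption about the formula, not code taken from PyDistances, and every name in it is illustrative.

```python
import math

def gower_similarity(x, y, ranges, p1, p2, p3):
    """Gower similarity for rows ordered as p1 quantitative, p2 binary, p3 multiclass."""
    xq, yq = x[:p1], y[:p1]
    xb, yb = x[p1:p1 + p2], y[p1:p1 + p2]
    xm, ym = x[p1 + p2:], y[p1 + p2:]
    quant = sum(1 - abs(u - v) / r for u, v, r in zip(xq, yq, ranges))
    a = sum(1 for u, v in zip(xb, yb) if u == 1 and v == 1)  # 1-1 matches
    d = sum(1 for u, v in zip(xb, yb) if u == 0 and v == 0)  # 0-0 matches
    matches = sum(1 for u, v in zip(xm, ym) if u == v)       # multiclass matches
    return (quant + a + matches) / (p1 + (p2 - d) + p3)

def gower_dist(s):
    return math.sqrt(1 - s)

# Toy pair: 2 quantitative, 2 binary and 1 multiclass variable.
x = [1.0, 10.0, 1, 0, 2]
y = [3.0, 10.0, 1, 0, 3]
s_xy = gower_similarity(x, y, ranges=[4.0, 5.0], p1=2, p2=2, p3=1)
print(s_xy, gower_dist(s_xy))  # 0.625 and about 0.6124

# Check of the conversion against the printed matrices: entry [0, 1].
d01 = gower_dist(0.85175283)
print(d01)  # about 0.38502879
```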
Computing Robust Mahalanobis distance
The theoretical treatment follows Gnanadesikan (1997) and Devlin et al. (1975).
Robust_Mahalanobis_Dist(x_i=Data_quant.iloc[0,:], x_r=Data_quant.iloc[2,:], Data=Data_quant, Method='MAD', epsilon=0.05, n_iters=20)
2.1448247626892223
Robust_Mahalanobis_Dist(x_i=Data_quant.iloc[0,:], x_r=Data_quant.iloc[2,:], Data=Data_quant, Method='trimmed', alpha=0.1, epsilon=0.05, n_iters=20)
2.7434709885399884
Robust_Mahalanobis_Dist(x_i=Data_quant.iloc[0,:], x_r=Data_quant.iloc[2,:], Data=Data_quant, Method='winsorized', alpha=0.1, epsilon=0.05, n_iters=20)
2.8446274140577943
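The three `Method` values name three common ways of making a spread estimate less sensitive to outliers. The sketch below illustrates them on a single variable only. It is not the PyDistances algorithm: it does not build a covariance matrix, and it says nothing about how `epsilon` or `n_iters` are used. The function names and the toy data are made up, and `alpha` is taken here to be the fraction cut from each tail, which is an assumption.

```python
from statistics import median, variance

def mad(values):
    """Median absolute deviation from the median."""
    m = median(values)
    return median([abs(v - m) for v in values])

def trimmed(values, alpha):
    """Drop the alpha fraction of smallest and of largest values."""
    s = sorted(values)
    k = int(alpha * len(s))
    return s[k:len(s) - k] if k else s

def winsorized(values, alpha):
    """Clamp the alpha fraction in each tail to the nearest retained value."""
    s = sorted(values)
    k = int(alpha * len(s))
    if k == 0:
        return s
    low, high = s[k], s[-k - 1]
    return [min(max(v, low), high) for v in s]

# Nine regular values and one gross outlier.
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 1000.0]

var_classic = variance(values)
var_trimmed = variance(trimmed(values, 0.1))
var_winsor = variance(winsorized(values, 0.1))
mad_value = mad(values)
print(var_classic, var_trimmed, var_winsor, mad_value)
```

The classical variance is inflated by the single outlier, while the trimmed and winsorized variances and the MAD stay close to the spread of the nine regular values.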
Robust_Mahalanobis_Dist_Matrix(Data=Data_quant, Method='trimmed', alpha=0.1, epsilon=0.05, n_iters=20)
array([[ 0. , 0.89250845, 2.74347099, ..., 1.48503889,
5.95276234, 8.49453068],
[ 0.89250845, 0. , 1.99959936, ..., 0.96839524,
5.33355737, 8.32070442],
[ 2.74347099, 1.99959936, 0. , ..., 1.36336733,
4.12306341, 9.38094479],
...,
[ 1.48503889, 0.96839524, 1.36336733, ..., 0. ,
5.1322854 , 9.00337923],
[ 5.95276234, 5.33355737, 4.12306341, ..., 5.1322854 ,
0. , 11.06785954],
[ 8.49453068, 8.32070442, 9.38094479, ..., 9.00337923,
11.06785954, 0. ]])
Computing Generalized Gower distance and Related Metric Scaling
To end this tutorial, we compute both the Generalized Gower distance matrix and the Related Metric Scaling matrix for the mixed data set Data, considering all possible combinations of the quantitative, binary and multiclass distances. We then save all the resulting matrices in a Python dictionary.
From a theoretical perspective we have followed Cuadras and Fortiana (1998), Albarrán et al. (2015) and Grané et al. (2021).
D_GG_list_maha_robust = []
D_RelMS_list_maha_robust = []
D_GG_list_not_maha_robust = []
D_RelMS_list_not_maha_robust = []
d1_list = ['Euclidean', 'Minkowski', 'Canberra', 'Pearson', 'Mahalanobis']
d2_list = ['Sokal', 'Jaccard']
d3_list = ['Matching']
for d in itertools.product(d1_list, d2_list, d3_list):
    Generalized_Gower_Distance_init = GeneralizedGowerDistance(Data=Data, p1=4, p2=3, p3=3, d1=d[0], d2=d[1], d3=d[2], q=1)
    D, D_2 = Generalized_Gower_Distance_init.compute(Related_Metric_Scaling=False)
    D_GG_list_not_maha_robust.append(D)
for d in itertools.product(['Robust_Mahalanobis'], d2_list, d3_list, ['trimmed', 'winsorized', 'MAD']):
    Generalized_Gower_Distance_init = GeneralizedGowerDistance(Data=Data, p1=4, p2=3, p3=3, d1=d[0], d2=d[1], d3=d[2], epsilon=0.05, Method=d[3], alpha=0.1)
    D, D_2 = Generalized_Gower_Distance_init.compute(Related_Metric_Scaling=False)
    D_GG_list_maha_robust.append(D)
for d in itertools.product(d1_list, d2_list, d3_list):
    Generalized_Gower_Distance_init = GeneralizedGowerDistance(Data=Data, p1=4, p2=3, p3=3, d1=d[0], d2=d[1], d3=d[2], q=1)
    D, D_2 = Generalized_Gower_Distance_init.compute(Related_Metric_Scaling=True, tol=0.009, d=2)
    D_RelMS_list_not_maha_robust.append(D)
for d in itertools.product(['Robust_Mahalanobis'], d2_list, d3_list, ['trimmed', 'winsorized', 'MAD']):
    Generalized_Gower_Distance_init = GeneralizedGowerDistance(Data=Data, p1=4, p2=3, p3=3, d1=d[0], d2=d[1], d3=d[2], epsilon=0.05, Method=d[3], alpha=0.1)
    D, D_2 = Generalized_Gower_Distance_init.compute(Related_Metric_Scaling=True, tol=0.009, d=2)
    D_RelMS_list_maha_robust.append(D)
D_GG_list = D_GG_list_not_maha_robust + D_GG_list_maha_robust
D_RelMS_list = D_RelMS_list_not_maha_robust + D_RelMS_list_maha_robust
search_space = D_GG_list + D_RelMS_list
# Build the names in the same order in which the matrices were appended above.
base_combos = list(itertools.product(d1_list, d2_list, d3_list))
robust_combos = list(itertools.product(['Robust_Mahalanobis'], d2_list, d3_list, ['trimmed', 'winsorized', 'MAD']))
distance_names = ['GG_' + '_'.join(x) for x in base_combos] + ['GG_' + '_'.join(x) for x in robust_combos] + ['RelMS_' + '_'.join(x) for x in base_combos] + ['RelMS_' + '_'.join(x) for x in robust_combos]
dic_distance_matrix = dict(zip(distance_names, search_space))
dic_distance_matrix
{'GG_Euclidean_Sokal_Matching': array([[0. , 1.01161446, 1.60800698, ..., 1.23798333, 1.92432848,
6.35838514],
[1.01161446, 0. , 1.64229596, ..., 0.7889253 , 1.87696727,
6.29319748],
[1.60800698, 1.64229596, 0. , ..., 1.42723912, 2.26882579,
6.96673669],
...,
[1.23798333, 0.7889253 , 1.42723912, ..., 0. , 2.4635748 ,
7.01727531],
[1.92432848, 1.87696727, 2.26882579, ..., 2.4635748 , 0. ,
5.11270638],
[6.35838514, 6.29319748, 6.96673669, ..., 7.01727531, 5.11270638,
0. ]]),
'GG_Euclidean_Jaccard_Matching': array([[0. , 1.01161446, 1.60800698, ..., 1.23798333, 1.92432848,
6.21923207],
[1.01161446, 0. , 1.64229596, ..., 0.7889253 , 1.87696727,
6.15257024],
[1.60800698, 1.64229596, 0. , ..., 1.42723912, 2.26882579,
6.83997121],
...,
[1.23798333, 0.7889253 , 1.42723912, ..., 0. , 2.4635748 ,
6.89143953],
[1.92432848, 1.87696727, 2.26882579, ..., 2.4635748 , 0. ,
4.93857798],
[6.21923207, 6.15257024, 6.83997121, ..., 6.89143953, 4.93857798,
0. ]]),
'GG_Minkowski_Sokal_Matching': array([[0. , 1.01161589, 1.60801451, ..., 1.23797549, 1.92440501,
6.35838512],
[1.01161589, 0. , 1.64229192, ..., 0.78891568, 1.87702827,
6.29317915],
[1.60801451, 1.64229192, 0. , ..., 1.42723962, 2.2688732 ,
6.96667937],
...,
[1.23797549, 0.78891568, 1.42723962, ..., 0. , 2.46364348,
7.01724763],
[1.92440501, 1.87702827, 2.2688732 , ..., 2.46364348, 0. ,
5.11260609],
[6.35838512, 6.29317915, 6.96667937, ..., 7.01724763, 5.11260609,
0. ]]),
'GG_Minkowski_Jaccard_Matching': array([[0. , 1.01161589, 1.60801451, ..., 1.23797549, 1.92440501,
6.21923205],
[1.01161589, 0. , 1.64229192, ..., 0.78891568, 1.87702827,
6.15255149],
[1.60801451, 1.64229192, 0. , ..., 1.42723962, 2.2688732 ,
6.83991282],
...,
[1.23797549, 0.78891568, 1.42723962, ..., 0. , 2.46364348,
6.89141134],
[1.92440501, 1.87702827, 2.2688732 , ..., 2.46364348, 0. ,
4.93847416],
[6.21923205, 6.15255149, 6.83991282, ..., 6.89141134, 4.93847416,
0. ]]),
'GG_Canberra_Sokal_Matching': array([[0. , 1.1089173 , 2.04873576, ..., 1.41070641, 2.47064802,
3.88007815],
[1.1089173 , 0. , 1.81887649, ..., 1.10728448, 2.20656591,
3.66760203],
[2.04873576, 1.81887649, 0. , ..., 1.51266848, 2.44536222,
3.67890583],
...,
[1.41070641, 1.10728448, 1.51266848, ..., 0. , 2.92569072,
4.05431191],
[2.47064802, 2.20656591, 2.44536222, ..., 2.92569072, 0. ,
2.67423498],
[3.88007815, 3.66760203, 3.67890583, ..., 4.05431191, 2.67423498,
0. ]]),
'GG_Canberra_Jaccard_Matching': array([[0. , 1.1089173 , 2.04873576, ..., 1.41070641, 2.47064802,
3.64757349],
[1.1089173 , 0. , 1.81887649, ..., 1.10728448, 2.20656591,
3.42068569],
[2.04873576, 1.81887649, 0. , ..., 1.51266848, 2.44536222,
3.43280265],
...,
[1.41070641, 1.10728448, 1.51266848, ..., 0. , 2.92569072,
3.83239234],
[2.47064802, 2.20656591, 2.44536222, ..., 2.92569072, 0. ,
2.32407372],
[3.64757349, 3.42068569, 3.43280265, ..., 3.83239234, 2.32407372,
0. ]]),
'GG_Pearson_Sokal_Matching': array([[0. , 1.0588577 , 1.62258227, ..., 1.13386485, 2.59878376,
4.5833716 ],
[1.0588577 , 0. , 1.54980561, ..., 0.55073019, 2.36782324,
4.41160916],
[1.62258227, 1.54980561, 0. , ..., 1.48883715, 2.15643298,
4.46893998],
...,
[1.13386485, 0.55073019, 1.48883715, ..., 0. , 2.64592015,
4.75194328],
[2.59878376, 2.36782324, 2.15643298, ..., 2.64592015, 0. ,
3.34753806],
[4.5833716 , 4.41160916, 4.46893998, ..., 4.75194328, 3.34753806,
0. ]]),
'GG_Pearson_Jaccard_Matching': array([[0. , 1.0588577 , 1.62258227, ..., 1.13386485, 2.59878376,
4.38828909],
[1.0588577 , 0. , 1.54980561, ..., 0.55073019, 2.36782324,
4.20857237],
[1.62258227, 1.54980561, 0. , ..., 1.48883715, 2.15643298,
4.26863098],
...,
[1.13386485, 0.55073019, 1.48883715, ..., 0. , 2.64592015,
4.56407174],
[2.59878376, 2.36782324, 2.15643298, ..., 2.64592015, 0. ,
3.07502796],
[4.38828909, 4.20857237, 4.26863098, ..., 4.56407174, 3.07502796,
0. ]]),
'GG_Mahalanobis_Sokal_Matching': array([[0. , 1.11128701, 1.9908619 , ..., 1.26642065, 2.97833241,
4.17851469],
[1.11128701, 0. , 1.73337267, ..., 0.49510815, 2.64311668,
4.11353573],
[1.9908619 , 1.73337267, 0. , ..., 1.5815777 , 1.99507289,
4.39053781],
...,
[1.26642065, 0.49510815, 1.5815777 , ..., 0. , 2.63417571,
4.3979867 ],
[2.97833241, 2.64311668, 1.99507289, ..., 2.63417571, 0. ,
4.4698317 ],
[4.17851469, 4.11353573, 4.39053781, ..., 4.3979867 , 4.4698317 ,
0. ]]),
'GG_Mahalanobis_Jaccard_Matching': array([[0. , 1.11128701, 1.9908619 , ..., 1.26642065, 2.97833241,
3.96355535],
[1.11128701, 0. , 1.73337267, ..., 0.49510815, 2.64311668,
3.89499193],
[1.9908619 , 1.73337267, 0. , ..., 1.5815777 , 1.99507289,
4.18647921],
...,
[1.26642065, 0.49510815, 1.5815777 , ..., 0. , 2.63417571,
4.19429052],
[2.97833241, 2.64311668, 1.99507289, ..., 2.63417571, 0. ,
4.26956454],
[3.96355535, 3.89499193, 4.18647921, ..., 4.19429052, 4.26956454,
0. ]]),
'GG_Robust_Mahalanobis_Sokal_Matching_trimmed': array([[0. , 1.0738818 , 1.81990287, ..., 1.17982158, 2.83584093,
4.38026385],
[1.0738818 , 0. , 1.64744788, ..., 0.39866732, 2.61869851,
4.3233478 ],
[1.81990287, 1.64744788, 0. , ..., 1.53344794, 1.97466567,
4.56660697],
...,
[1.17982158, 0.39866732, 1.53344794, ..., 0. , 2.54962302,
4.5492545 ],
[2.83584093, 2.61869851, 1.97466567, ..., 2.54962302, 0. ,
5.16721825],
[4.38026385, 4.3233478 , 4.56660697, ..., 4.5492545 , 5.16721825,
0. ]]),
'GG_Robust_Mahalanobis_Sokal_Matching_winsorized': array([[0. , 1.10035027, 1.96521318, ..., 1.24876507, 3.02193061,
4.2158267 ],
[1.10035027, 0. , 1.72244788, ..., 0.45786845, 2.71169847,
4.170886 ],
[1.96521318, 1.72244788, 0. , ..., 1.57396145, 2.01907767,
4.45138733],
...,
[1.24876507, 0.45786845, 1.57396145, ..., 0. , 2.6589383 ,
4.42575055],
[3.02193061, 2.71169847, 2.01907767, ..., 2.6589383 , 0. ,
4.74960743],
[4.2158267 , 4.170886 , 4.45138733, ..., 4.42575055, 4.74960743,
0. ]]),
'GG_Robust_Mahalanobis_Sokal_Matching_MAD': array([[0. , 1.09006233, 1.80375514, ..., 1.18201607, 2.67497233,
4.55678538],
[1.09006233, 0. , 1.62058379, ..., 0.44488228, 2.40606721,
4.40232615],
[1.80375514, 1.62058379, 0. , ..., 1.53278692, 1.93813141,
4.46679441],
...,
[1.18201607, 0.44488228, 1.53278692, ..., 0. , 2.48916367,
4.64371521],
[2.67497233, 2.40606721, 1.93813141, ..., 2.48916367, 0. ,
4.16671594],
[4.55678538, 4.40232615, 4.46679441, ..., 4.64371521, 4.16671594,
0. ]]),
'GG_Robust_Mahalanobis_Jaccard_Matching_trimmed': array([[0. , 1.0738818 , 1.81990287, ..., 1.17982158, 2.83584093,
4.17570322],
[1.0738818 , 0. , 1.64744788, ..., 0.39866732, 2.61869851,
4.11595944],
[1.81990287, 1.64744788, 0. , ..., 1.53344794, 1.97466567,
4.37077626],
...,
[1.17982158, 0.39866732, 1.53344794, ..., 0. , 2.54962302,
4.35264315],
[2.83584093, 2.61869851, 1.97466567, ..., 2.54962302, 0. ,
4.99499053],
[4.17570322, 4.11595944, 4.37077626, ..., 4.35264315, 4.99499053,
0. ]]),
'GG_Robust_Mahalanobis_Jaccard_Matching_winsorized': array([[0. , 1.10035027, 1.96521318, ..., 1.24876507, 3.02193061,
4.00287155],
[1.10035027, 0. , 1.72244788, ..., 0.45786845, 2.71169847,
3.95551209],
[1.96521318, 1.72244788, 0. , ..., 1.57396145, 2.01907767,
4.25025118],
...,
[1.24876507, 0.45786845, 1.57396145, ..., 0. , 2.6589383 ,
4.22339365],
[3.02193061, 2.71169847, 2.01907767, ..., 2.6589383 , 0. ,
4.5616397 ],
[4.00287155, 3.95551209, 4.25025118, ..., 4.22339365, 4.5616397 ,
0. ]]),
'GG_Robust_Mahalanobis_Jaccard_Matching_MAD': array([[0. , 1.09006233, 1.80375514, ..., 1.18201607, 2.67497233,
4.36051361],
[1.09006233, 0. , 1.62058379, ..., 0.44488228, 2.40606721,
4.19884049],
[1.80375514, 1.62058379, 0. , ..., 1.53278692, 1.93813141,
4.26638468],
...,
[1.18201607, 0.44488228, 1.53278692, ..., 0. , 2.48916367,
4.45127812],
[2.67497233, 2.40606721, 1.93813141, ..., 2.48916367, 0. ,
3.95111474],
[4.36051361, 4.19884049, 4.26638468, ..., 4.45127812, 3.95111474,
0. ]]),
'RelMS_Euclidean_Sokal_Matching': array([[0. , 1.01092438, 1.68587263, ..., 1.2435966 , 1.75479379,
5.76354972],
[1.01092436, 0. , 1.72123768, ..., 0.78892531, 1.71977376,
5.69924943],
[1.68587264, 1.7212377 , 0. , ..., 1.42997022, 2.20660915,
6.5504967 ],
...,
[1.24359658, 0.78892532, 1.42997021, ..., 0. , 2.26671431,
6.42377887],
[1.7547938 , 1.71977375, 2.20660914, ..., 2.26671431, 0. ,
4.781135 ],
[5.76354972, 5.69924943, 6.55049671, ..., 6.42377887, 4.78113499,
0. ]]),
'RelMS_Euclidean_Jaccard_Matching': array([[0. , 1.01092435, 1.68587263, ..., 1.24359659, 1.75479381,
5.73873464],
[1.01092437, 0. , 1.72123769, ..., 0.78892532, 1.71977378,
5.67208311],
[1.68587264, 1.72123769, 0. , ..., 1.42997021, 2.20660914,
6.53309456],
...,
[1.24359658, 0.78892529, 1.42997021, ..., 0. , 2.26671431,
6.41402297],
[1.7547938 , 1.71977375, 2.20660914, ..., 2.2667143 , 0. ,
4.6957284 ],
[5.73873463, 5.67208312, 6.53309457, ..., 6.41402297, 4.69572838,
0. ]]),
'RelMS_Minkowski_Sokal_Matching': array([[0. , 1.0104344 , 1.68473307, ..., 1.24302039, 1.75451827,
5.7636572 ],
[1.01043437, 0. , 1.72039524, ..., 0.78891568, 1.71978231,
5.69946617],
[1.68473308, 1.72039525, 0. , ..., 1.42922921, 2.20651554,
6.55109162],
...,
[1.24302037, 0.7889157 , 1.4292292 , ..., 0. , 2.2667207 ,
6.42402052],
[1.75451827, 1.71978229, 2.20651553, ..., 2.2667207 , 0. ,
4.78235997],
[5.7636572 , 5.69946616, 6.55109161, ..., 6.42402052, 4.78235997,
0. ]]),
'RelMS_Minkowski_Jaccard_Matching': array([[0. , 1.01043437, 1.68473307, ..., 1.24302038, 1.75451828,
5.73875343],
[1.01043439, 0. , 1.72039525, ..., 0.78891569, 1.71978232,
5.67221733],
[1.68473307, 1.72039524, 0. , ..., 1.4292292 , 2.20651553,
6.5336026 ],
...,
[1.24302038, 0.78891568, 1.4292292 , ..., 0. , 2.2667207 ,
6.41417732],
[1.75451828, 1.7197823 , 2.20651553, ..., 2.2667207 , 0. ,
4.6969009 ],
[5.73875342, 5.67221732, 6.5336026 , ..., 6.41417732, 4.6969009 ,
0. ]]),
'RelMS_Canberra_Sokal_Matching': array([[0. , 3.29475825, 3.63767326, ..., 3.42002989, 3.78234978,
4.28387746],
[3.29475817, 0. , 3.54627477, ..., 3.36365755, 3.64707779,
4.11290306],
[3.63767327, 3.5462748 , 0. , ..., 3.36371231, 3.88636668,
4.26421609],
...,
[3.42002989, 3.36365756, 3.36371231, ..., 0. , 4.08835735,
4.43146723],
[3.78234979, 3.64707779, 3.88636667, ..., 4.08835736, 0. ,
3.55682862],
[4.28387745, 4.11290305, 4.26421607, ..., 4.43146723, 3.55682862,
0. ]]),
'RelMS_Canberra_Jaccard_Matching': array([[0. , 3.29475816, 3.63767325, ..., 3.42002988, 3.7823498 ,
4.18398249],
[3.29475818, 0. , 3.54627479, ..., 3.36365756, 3.64707782,
4.00084943],
[3.63767326, 3.54627478, 0. , ..., 3.36371229, 3.88636666,
4.15092751],
...,
[3.42002988, 3.36365755, 3.36371228, ..., 0. , 4.08835736,
4.3378168 ],
[3.78234979, 3.64707778, 3.88636666, ..., 4.08835735, 0. ,
3.36218137],
[4.18398248, 4.00084941, 4.15092752, ..., 4.3378168 , 3.36218137,
0. ]]),
'RelMS_Pearson_Sokal_Matching': array([[0. , 1.04250916, 1.57029271, ..., 1.11835441, 2.35030151,
3.99961285],
[1.04250913, 0. , 1.55642417, ..., 0.55073019, 2.17276224,
3.83629275],
[1.5702927 , 1.55642418, 0. , ..., 1.44481248, 2.11094744,
4.05200057],
...,
[1.11835439, 0.55073021, 1.44481248, ..., 0. , 2.43447697,
4.16544183],
[2.35030151, 2.17276223, 2.11094745, ..., 2.43447697, 0. ,
3.00502738],
[3.99961283, 3.83629274, 4.05200056, ..., 4.16544183, 3.00502738,
0. ]]),
'RelMS_Pearson_Jaccard_Matching': array([[0. , 1.04250913, 1.57029271, ..., 1.11835441, 2.35030152,
3.89789603],
[1.04250915, 0. , 1.55642418, ..., 0.55073023, 2.17276226,
3.72479069],
[1.5702927 , 1.55642415, 0. , ..., 1.44481247, 2.11094744,
3.94329467],
...,
[1.11835439, 0.55073016, 1.44481248, ..., 0. , 2.43447698,
4.07654071],
[2.35030152, 2.17276223, 2.11094745, ..., 2.43447697, 0. ,
2.77842982],
[3.89789601, 3.72479067, 3.94329467, ..., 4.0765407 , 2.77842982,
0. ]]),
'RelMS_Mahalanobis_Sokal_Matching': array([[0. , 1.0872495 , 1.91566724, ..., 1.23718333, 2.78694322,
3.59368169],
[1.08724948, 0. , 1.72190382, ..., 0.49510814, 2.51013925,
3.52430362],
[1.91566725, 1.72190383, 0. , ..., 1.53860587, 1.97114821,
3.91897956],
...,
[1.23718333, 0.49510818, 1.53860586, ..., 0. , 2.47401146,
3.7944967 ],
[2.78694323, 2.51013924, 1.97114821, ..., 2.47401146, 0. ,
4.10401609],
[3.59368167, 3.52430361, 3.91897955, ..., 3.7944967 , 4.10401609,
0. ]]),
'RelMS_Mahalanobis_Jaccard_Matching': array([[0. , 1.08724947, 1.91566724, ..., 1.23718333, 2.78694323,
3.46907215],
[1.0872495 , 0. , 1.72190383, ..., 0.49510817, 2.51013926,
3.39550188],
[1.91566724, 1.72190381, 0. , ..., 1.53860586, 1.97114821,
3.80535063],
...,
[1.23718333, 0.49510812, 1.53860586, ..., 0. , 2.47401147,
3.68911387],
[2.78694323, 2.51013924, 1.97114821, ..., 2.47401147, 0. ,
3.96214705],
[3.46907213, 3.39550187, 3.80535063, ..., 3.68911387, 3.96214705,
0. ]]),
'RelMS_Robust_Mahalanobis_Sokal_Matching_trimmed': array([[0. , 1.05396495, 1.74951184, ..., 1.15390312, 2.67058462,
3.82780883],
[1.05396493, 0. , 1.63479812, ..., 0.39866731, 2.51224528,
3.76362714],
[1.74951185, 1.63479814, 0. , ..., 1.49657109, 1.961588 ,
4.09825745],
...,
[1.15390311, 0.39866735, 1.49657109, ..., 0. , 2.41854434,
3.97375586],
[2.67058463, 2.51224527, 1.961588 , ..., 2.41854434, 0. ,
4.81269468],
[3.82780882, 3.76362713, 4.09825744, ..., 3.97375586, 4.81269468,
0. ]]),
'RelMS_Robust_Mahalanobis_Sokal_Matching_winsorized': array([[0. , 1.07688717, 1.88851059, ..., 1.21940102, 2.83800382,
3.64003684],
[1.07688713, 0. , 1.70819251, ..., 0.45786842, 2.58662722,
3.59029333],
[1.8885106 , 1.70819253, 0. , ..., 1.53220354, 1.99808026,
3.97860895],
...,
[1.21940101, 0.45786849, 1.53220353, ..., 0. , 2.50787408,
3.829693 ],
[2.83800382, 2.58662721, 1.99808026, ..., 2.50787408, 0. ,
4.38739858],
[3.64003683, 3.59029333, 3.97860894, ..., 3.829693 , 4.38739858,
0. ]]),
'RelMS_Robust_Mahalanobis_Sokal_Matching_MAD': array([[0. , 1.06915308, 1.73228661, ..., 1.15789936, 2.45834684,
3.97049139],
[1.06915305, 0. , 1.61195487, ..., 0.44488227, 2.24973009,
3.81621214],
[1.73228661, 1.61195488, 0. , ..., 1.4894837 , 1.90536576,
4.00431571],
...,
[1.15789934, 0.44488231, 1.4894837 , ..., 0. , 2.30824179,
4.04102682],
[2.45834685, 2.24973009, 1.90536577, ..., 2.30824178, 0. ,
3.79967402],
[3.97049139, 3.81621213, 4.0043157 , ..., 4.04102682, 3.79967402,
0. ]]),
'RelMS_Robust_Mahalanobis_Jaccard_Matching_trimmed': array([[0. , 1.05396492, 1.74951184, ..., 1.15390312, 2.67058463,
3.7103996 ],
[1.05396495, 0. , 1.63479813, ..., 0.39866734, 2.51224529,
3.64245313],
[1.74951185, 1.63479812, 0. , ..., 1.49657109, 1.961588 ,
3.98729219],
...,
[1.15390311, 0.39866728, 1.49657109, ..., 0. , 2.41854435,
3.87035377],
[2.67058464, 2.51224527, 1.961588 , ..., 2.41854434, 0. ,
4.69932707],
[3.71039959, 3.64245311, 3.9872922 , ..., 3.87035377, 4.69932707,
0. ]]),
'RelMS_Robust_Mahalanobis_Jaccard_Matching_winsorized': array([[0. , 1.07688714, 1.88851059, ..., 1.21940102, 2.83800383,
3.51619033],
[1.07688715, 0. , 1.70819252, ..., 0.45786846, 2.58662723,
3.46347473],
[1.88851059, 1.70819251, 0. , ..., 1.53220354, 1.99808026,
3.86606614],
...,
[1.219401 , 0.45786843, 1.53220353, ..., 0. , 2.50787409,
3.72394257],
[2.83800382, 2.58662721, 1.99808026, ..., 2.50787408, 0. ,
4.25828147],
[3.51619032, 3.46347472, 3.86606614, ..., 3.72394256, 4.25828147,
0. ]]),
'RelMS_Robust_Mahalanobis_Jaccard_Matching_MAD': array([[0. , 1.06915304, 1.73228661, ..., 1.15789935, 2.45834686,
3.86694579],
[1.06915307, 0. , 1.61195488, ..., 0.4448823 , 2.24973011,
3.7045599 ],
[1.7322866 , 1.61195486, 0. , ..., 1.48948369, 1.90536575,
3.89571711],
...,
[1.15789934, 0.44488225, 1.48948369, ..., 0. , 2.30824179,
3.9478467 ],
[2.45834686, 2.24973009, 1.90536576, ..., 2.30824179, 0. ,
3.64285626],
[3.86694578, 3.70455988, 3.8957171 , ..., 3.9478467 , 3.64285626,
0. ]])}
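As a rough illustration of what the GG matrices above combine, the sketch below assumes the Generalized Gower construction described in the cited literature: each block of squared distances (quantitative, binary, multiclass) is divided by its geometric variability, the sum of all squared distances over 2n^2, and the standardized blocks are added. This is an assumption about the method, not the PyDistances source; the toy matrices and function names are invented, and Related Metric Scaling is not covered.

```python
import numpy as np

def geometric_variability(D2):
    """Sum of all squared distances divided by 2 * n^2."""
    n = D2.shape[0]
    return D2.sum() / (2 * n ** 2)

def generalized_gower(D2_quant, D2_binary, D2_multi):
    """Return (D, D2): distance and squared distance after standardizing each block."""
    blocks = (D2_quant, D2_binary, D2_multi)
    # Assumes every block has at least one non-zero distance (no division by zero).
    D2 = sum(block / geometric_variability(block) for block in blocks)
    return np.sqrt(D2), D2

# Toy squared-distance matrices for three observations.
D2_quant = np.array([[0.0, 4.0, 9.0], [4.0, 0.0, 1.0], [9.0, 1.0, 0.0]])
D2_binary = np.array([[0.0, 1.0, 2.0], [1.0, 0.0, 1.0], [2.0, 1.0, 0.0]])
D2_multi = np.array([[0.0, 2.0, 2.0], [2.0, 0.0, 0.0], [2.0, 0.0, 0.0]])

D, D2 = generalized_gower(D2_quant, D2_binary, D2_multi)
print(D)
```

After standardization each block has geometric variability 1, so no block dominates merely because of its scale.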
Computational Cost Testing
In this case, we use the entire House_Price.csv dataset, which has 1905 rows, to perform a computational cost test (in terms of time) of the new distance metrics included in PyDistances.
Data = pd.read_csv('House_Price.csv')
Data = Data.loc[:, ['latitude', 'longitude', 'price', 'size_in_m_2', 'balcony_recode', 'private_garden_recode', 'private_gym_recode', 'quality_recode', 'no_of_bathrooms', 'no_of_bedrooms']]
Data.shape
(1905, 10)
Generalized_Gower_Distance_init = GeneralizedGowerDistance(Data=Data, p1=4, p2=3, p3=3, d1='Robust_Mahalanobis', d2='Jaccard', d3='Matching', epsilon=0.05, Method='trimmed', alpha=0.1)
D, D_2 = Generalized_Gower_Distance_init.compute(Related_Metric_Scaling=False)
# Time: 1.11 minutes.
Generalized_Gower_Distance_init = GeneralizedGowerDistance(Data=Data, p1=4, p2=3, p3=3, d1='Robust_Mahalanobis', d2='Jaccard', d3='Matching', epsilon=0.05, Method='winsorized', alpha=0.1)
D, D_2 = Generalized_Gower_Distance_init.compute(Related_Metric_Scaling=False)
# Time: 1.15 minutes.
Generalized_Gower_Distance_init = GeneralizedGowerDistance(Data=Data, p1=4, p2=3, p3=3, d1='Robust_Mahalanobis', d2='Jaccard', d3='Matching', epsilon=0.05, Method='MAD', alpha=0.1)
D, D_2 = Generalized_Gower_Distance_init.compute(Related_Metric_Scaling=False)
# Time: 1.12 minutes.
Generalized_Gower_Distance_init = GeneralizedGowerDistance(Data=Data, p1=4, p2=3, p3=3, d1='Robust_Mahalanobis', d2='Jaccard', d3='Matching', epsilon=0.05, Method='trimmed', alpha=0.1)
D, D_2 = Generalized_Gower_Distance_init.compute(Related_Metric_Scaling=True)
# Time: 1.58 minutes.
Generalized_Gower_Distance_init = GeneralizedGowerDistance(Data=Data, p1=4, p2=3, p3=3, d1='Robust_Mahalanobis', d2='Jaccard', d3='Matching', epsilon=0.05, Method='winsorized', alpha=0.1)
D, D_2 = Generalized_Gower_Distance_init.compute(Related_Metric_Scaling=True)
# Time: 1.53 minutes.
Generalized_Gower_Distance_init = GeneralizedGowerDistance(Data=Data, p1=4, p2=3, p3=3, d1='Robust_Mahalanobis', d2='Jaccard', d3='Matching', epsilon=0.05, Method='MAD', alpha=0.1)
D, D_2 = Generalized_Gower_Distance_init.compute(Related_Metric_Scaling=True)
# Time: 1.55 minutes.
We can compare these times with the one obtained by (simple) Gower distance.
Gower_Dist_Matrix(Data, p1=4, p2=3, p3=3)
# Time: 38 seconds.
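The timings in the comments above could be collected with a small helper like the one below. It is a generic sketch based on `time.perf_counter`; the name `timed` is hypothetical and a cheap built-in call stands in for the distance computation.

```python
import time

def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Stand-in workload; replace with the distance computation to be measured.
total, seconds = timed(sum, range(1_000_000))
print(f"Time: {seconds / 60:.4f} minutes.")
```

Wall-clock timings depend on the machine and on whatever else is running, so the figures above are best read as relative comparisons.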
Bibliography
Albarrán, I., P. Alonso, and A. Grané. “Profile Identification via Weighted Related Metric Scaling: An Application to Dependent Spanish Children.” Journal of the Royal Statistical Society. Series A, Statistics in Society 178, no. 3 (2015): 593–618. https://doi.org/10.1111/rssa.12084.
Cuadras, C. M., and J. Fortiana. “Chapter 25 - Visualizing Categorical Data with Related Metric Scaling.” In Visualization of Categorical Data, 365–76. Academic Press, 1998. https://doi.org/10.1016/B978-012299045-8/50028-0.
Devlin, S. J., R. Gnanadesikan, and J. R. Kettenring. “Robust Estimation and Outlier Detection with Correlation Coefficients.” Biometrika 62, no. 3 (1975): 531–45. https://doi.org/10.1093/biomet/62.3.531.
Grané, A., G. Manzi, and S. Salini. “Smart Visualization of Mixed Data.” Stats 4, no. 2 (2021): 472–485. https://doi.org/10.3390/stats4020029.
Gower, J. C. “A General Coefficient of Similarity and Some of Its Properties.” Biometrics 27, no. 4 (1971): 857–71. https://doi.org/10.2307/2528823.
Gnanadesikan, R. Methods for Statistical Data Analysis of Multivariate Observations. 2nd ed. New York: John Wiley and Sons, 1997.