Python implementation of Gowers distance, pairwise between records in two data sets with multiprocessing support
Project description
Introduction
Gower's distance calculation in Python. Gower Distance is a distance measure that can be used to calculate distance between two entity whose attribute has a mixed of categorical and numerical values. Gower (1971) A general coefficient of similarity and some of its properties. Biometrics 27 857–874.
More details and examples can be found on my personal website here:(https://www.thinkdatascience.com/post/2019-12-16-introducing-python-package-gower/)
Core functions are wrote by Marcelo Beckmann.
Multiprocessing added by Szymon Bobek
Examples
Installation
pip install gower-multiprocessing
Generate some data
import numpy as np
import pandas as pd
import gower-multiprocessing as gower
Xd=pd.DataFrame({'age':[21,21,19, 30,21,21,19,30,None],
'gender':['M','M','N','M','F','F','F','F',None],
'civil_status':['MARRIED','SINGLE','SINGLE','SINGLE','MARRIED','SINGLE','WIDOW','DIVORCED',None],
'salary':[3000.0,1200.0 ,32000.0,1800.0 ,2900.0 ,1100.0 ,10000.0,1500.0,None],
'has_children':[1,0,1,1,1,0,0,1,None],
'available_credit':[2200,100,22000,1100,2000,100,6000,2200,None]})
Yd = Xd.iloc[1:3,:]
X = np.asarray(Xd)
Y = np.asarray(Yd)
Find the distance matrix
gower.gower_matrix(X)
array([[0. , 0.3590238 , 0.6707398 , 0.31787416, 0.16872811,
0.52622986, 0.59697855, 0.47778758, nan],
[0.3590238 , 0. , 0.6964303 , 0.3138769 , 0.523629 ,
0.16720603, 0.45600235, 0.6539635 , nan],
[0.6707398 , 0.6964303 , 0. , 0.6552807 , 0.6728013 ,
0.6969697 , 0.740428 , 0.8151941 , nan],
[0.31787416, 0.3138769 , 0.6552807 , 0. , 0.4824794 ,
0.48108295, 0.74818605, 0.34332284, nan],
[0.16872811, 0.523629 , 0.6728013 , 0.4824794 , 0. ,
0.35750175, 0.43237334, 0.3121036 , nan],
[0.52622986, 0.16720603, 0.6969697 , 0.48108295, 0.35750175,
0. , 0.2898751 , 0.4878362 , nan],
[0.59697855, 0.45600235, 0.740428 , 0.74818605, 0.43237334,
0.2898751 , 0. , 0.57476616, nan],
[0.47778758, 0.6539635 , 0.8151941 , 0.34332284, 0.3121036 ,
0.4878362 , 0.57476616, 0. , nan],
[ nan, nan, nan, nan, nan,
nan, nan, nan, nan]], dtype=float32)
Find Top n results
gower.gower_topn(Xd.iloc[0:2,:], Xd.iloc[:,], n = 5)
{'index': array([4, 3, 1, 7, 5]),
'values': array([0.16872811, 0.31787416, 0.3590238 , 0.47778758, 0.52622986],
dtype=float32)}
Performance comparison with single-process version
Single process (DS-size: 10000, time: 15.58 sec.) █
Multi process (DS-size: 10000, time: 2.93 sec.)
Single process (DS-size: 20000, time: 54.30 sec.) █████
Multi process (DS-size: 20000, time: 11.57 sec.) █
Single process (DS-size: 30000, time: 119.80 sec.) ███████████
Multi process (DS-size: 30000, time: 24.86 sec.) ██
Single process (DS-size: 40000, time: 202.65 sec.) ████████████████████
Multi process (DS-size: 40000, time: 41.77 sec.) ████
Single process (DS-size: 50000, time: 318.64 sec.) ███████████████████████████████
Multi process (DS-size: 50000, time: 68.36 sec.) ██████
Single process (DS-size: 60000, time: 469.64 sec.) ██████████████████████████████████████████████
Multi process (DS-size: 60000, time: 96.24 sec.) █████████
Single process (DS-size: 70000, time: 653.27 sec.) █████████████████████████████████████████████████████████████████
Multi process (DS-size: 70000, time: 143.31 sec.) ██████████████
Single process (DS-size: 80000, time: 857.04 sec.) █████████████████████████████████████████████████████████████████████████████████████
Multi process (DS-size: 80000, time: 181.60 sec.) ██████████████████
Single process (DS-size: 90000, time: 1129.21 sec.) ████████████████████████████████████████████████████████████████████████████████████████████████████████████████
Multi process (DS-size: 90000, time: 252.36 sec.) █████████████████████████
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for gower_multiprocessing-0.2.1.tar.gz
Algorithm | Hash digest | |
---|---|---|
SHA256 | 8baf79a37f7a6d73658a37ba46962936971f514d7955863ac244872a0e7e5a36 |
|
MD5 | 33e17ea3faadcb4c32d41a5374a64028 |
|
BLAKE2b-256 | 482c9b6654e83e5621e2c9c0add098c8d7583cb7779bd0bee6fcb66e26214a7c |
Hashes for gower_multiprocessing-0.2.1-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 76b17dad5ae9031bf5d51885b976946545d168f1e7d9d2a2c6e28a569e5716ff |
|
MD5 | fb1de622077efedff4d5179ca339c73b |
|
BLAKE2b-256 | 017a692d1f30c0ae3ca99185e5a976b8312e7ef4288fbd0702ee9c97ad23099e |