Python implementation of Gower's distance, computed pairwise between records in two data sets, with multiprocessing support
Introduction
Gower's distance calculation in Python. Gower distance measures the dissimilarity between two records whose attributes are a mix of categorical and numerical values. Reference: Gower (1971), A general coefficient of similarity and some of its properties. Biometrics 27, 857–874.
More details and examples can be found on my personal website: https://www.thinkdatascience.com/post/2019-12-16-introducing-python-package-gower/
Core functions were written by Marcelo Beckmann.
Multiprocessing support was added by Szymon Bobek.
Examples
Installation
pip install gower-multiprocessing
Generate some data
import numpy as np
import pandas as pd
# A hyphen is not valid in an import statement; the module name is assumed to use an underscore.
import gower_multiprocessing as gower
Xd = pd.DataFrame({'age': [21, 21, 19, 30, 21, 21, 19, 30, None],
                   'gender': ['M', 'M', 'N', 'M', 'F', 'F', 'F', 'F', None],
                   'civil_status': ['MARRIED', 'SINGLE', 'SINGLE', 'SINGLE', 'MARRIED', 'SINGLE', 'WIDOW', 'DIVORCED', None],
                   'salary': [3000.0, 1200.0, 32000.0, 1800.0, 2900.0, 1100.0, 10000.0, 1500.0, None],
                   'has_children': [1, 0, 1, 1, 1, 0, 0, 1, None],
                   'available_credit': [2200, 100, 22000, 1100, 2000, 100, 6000, 2200, None]})
Yd = Xd.iloc[1:3,:]
X = np.asarray(Xd)
Y = np.asarray(Yd)
Find the distance matrix
gower.gower_matrix(X)
array([[0. , 0.3590238 , 0.6707398 , 0.31787416, 0.16872811,
0.52622986, 0.59697855, 0.47778758, nan],
[0.3590238 , 0. , 0.6964303 , 0.3138769 , 0.523629 ,
0.16720603, 0.45600235, 0.6539635 , nan],
[0.6707398 , 0.6964303 , 0. , 0.6552807 , 0.6728013 ,
0.6969697 , 0.740428 , 0.8151941 , nan],
[0.31787416, 0.3138769 , 0.6552807 , 0. , 0.4824794 ,
0.48108295, 0.74818605, 0.34332284, nan],
[0.16872811, 0.523629 , 0.6728013 , 0.4824794 , 0. ,
0.35750175, 0.43237334, 0.3121036 , nan],
[0.52622986, 0.16720603, 0.6969697 , 0.48108295, 0.35750175,
0. , 0.2898751 , 0.4878362 , nan],
[0.59697855, 0.45600235, 0.740428 , 0.74818605, 0.43237334,
0.2898751 , 0. , 0.57476616, nan],
[0.47778758, 0.6539635 , 0.8151941 , 0.34332284, 0.3121036 ,
0.4878362 , 0.57476616, 0. , nan],
[ nan, nan, nan, nan, nan,
nan, nan, nan, nan]], dtype=float32)
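To see where these numbers come from, the sketch below recomputes the distance between records 0 and 1 by hand in plain Python. It is a textbook version of Gower's formula, not the package's own code: categorical features contribute 0 when equal and 1 otherwise, numerical features contribute the absolute difference divided by the column range, and the contributions are averaged with equal weights. The helper name `gower_distance` is hypothetical, `has_children` is treated as numerical, and the all-None last row is left out.

```python
# Columns of the example data set (the all-None last row is left out).
age = [21, 21, 19, 30, 21, 21, 19, 30]
gender = ["M", "M", "N", "M", "F", "F", "F", "F"]
civil_status = ["MARRIED", "SINGLE", "SINGLE", "SINGLE",
                "MARRIED", "SINGLE", "WIDOW", "DIVORCED"]
salary = [3000.0, 1200.0, 32000.0, 1800.0, 2900.0, 1100.0, 10000.0, 1500.0]
has_children = [1, 0, 1, 1, 1, 0, 0, 1]
available_credit = [2200, 100, 22000, 1100, 2000, 100, 6000, 2200]

columns = [age, gender, civil_status, salary, has_children, available_credit]


def gower_distance(i, j, columns):
    """Textbook Gower distance between records i and j (hypothetical helper)."""
    total = 0.0
    for col in columns:
        if isinstance(col[0], str):
            # Categorical feature: 0 if the values match, 1 otherwise.
            total += 0.0 if col[i] == col[j] else 1.0
        else:
            # Numerical feature: absolute difference scaled by the column range.
            col_range = max(col) - min(col)
            total += abs(col[i] - col[j]) / col_range if col_range else 0.0
    # Equal weights: average the per-feature contributions.
    return total / len(columns)


d01 = gower_distance(0, 1, columns)
print(round(d01, 4))  # 0.359
```

The result agrees with the entry at row 0, column 1 of the matrix (0.3590238) up to float32 rounding.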
Find Top n results
gower.gower_topn(Xd.iloc[0:2,:], Xd.iloc[:,], n = 5)
{'index': array([4, 3, 1, 7, 5]),
'values': array([0.16872811, 0.31787416, 0.3590238 , 0.47778758, 0.52622986],
dtype=float32)}
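The top-n result can be read directly off the first row of the distance matrix. The following sketch is an illustration only, not the package's implementation: it assumes the query record itself and NaN entries are skipped, which is what the output shown here suggests. The helper name `top_n` and its `query_index` parameter are hypothetical.

```python
import math

# Row 0 of the distance matrix (distances from record 0 to every record).
row0 = [0.0, 0.3590238, 0.6707398, 0.31787416, 0.16872811,
        0.52622986, 0.59697855, 0.47778758, float("nan")]


def top_n(distances, n, query_index):
    """Indices and values of the n smallest distances (hypothetical helper)."""
    # Skip the query record itself and any undefined (NaN) distances.
    candidates = [
        (d, i) for i, d in enumerate(distances)
        if i != query_index and not math.isnan(d)
    ]
    candidates.sort()  # ascending by distance
    nearest = candidates[:n]
    return {"index": [i for _, i in nearest],
            "values": [d for d, _ in nearest]}


result = top_n(row0, n=5, query_index=0)
print(result["index"])  # [4, 3, 1, 7, 5]
```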
Performance comparison with single-process version
Single process (DS-size: 10000, time: 15.58 sec.) █
Multi process (DS-size: 10000, time: 2.93 sec.)
Single process (DS-size: 20000, time: 54.30 sec.) █████
Multi process (DS-size: 20000, time: 11.57 sec.) █
Single process (DS-size: 30000, time: 119.80 sec.) ███████████
Multi process (DS-size: 30000, time: 24.86 sec.) ██
Single process (DS-size: 40000, time: 202.65 sec.) ████████████████████
Multi process (DS-size: 40000, time: 41.77 sec.) ████
Single process (DS-size: 50000, time: 318.64 sec.) ███████████████████████████████
Multi process (DS-size: 50000, time: 68.36 sec.) ██████
Single process (DS-size: 60000, time: 469.64 sec.) ██████████████████████████████████████████████
Multi process (DS-size: 60000, time: 96.24 sec.) █████████
Single process (DS-size: 70000, time: 653.27 sec.) █████████████████████████████████████████████████████████████████
Multi process (DS-size: 70000, time: 143.31 sec.) ██████████████
Single process (DS-size: 80000, time: 857.04 sec.) █████████████████████████████████████████████████████████████████████████████████████
Multi process (DS-size: 80000, time: 181.60 sec.) ██████████████████
Single process (DS-size: 90000, time: 1129.21 sec.) ████████████████████████████████████████████████████████████████████████████████████████████████████████████████
Multi process (DS-size: 90000, time: 252.36 sec.) █████████████████████████
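The speedup implied by these timings can be worked out with a few lines of Python. The numbers are copied from the comparison; the hardware and number of worker processes are not stated in this README, so the ratios should be read as indicative only.

```python
# (data set size, single-process seconds, multi-process seconds)
timings = [
    (10000, 15.58, 2.93),
    (20000, 54.30, 11.57),
    (30000, 119.80, 24.86),
    (40000, 202.65, 41.77),
    (50000, 318.64, 68.36),
    (60000, 469.64, 96.24),
    (70000, 653.27, 143.31),
    (80000, 857.04, 181.60),
    (90000, 1129.21, 252.36),
]

# Speedup = single-process time divided by multi-process time.
speedups = {size: single / multi for size, single, multi in timings}

for size, speedup in speedups.items():
    print(f"DS-size {size}: {speedup:.2f}x faster")
```

Across the measured sizes the multi-process version comes out roughly 4.5 to 5.3 times faster than the single-process one.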