Python implementation of Gower's distance, computed pairwise between the records of two data sets, with multiprocessing support

Introduction

Gower's distance calculation in Python. Gower distance is a distance measure that can be used to calculate the distance between two entities whose attributes are a mix of categorical and numerical values. Reference: Gower (1971), A general coefficient of similarity and some of its properties. Biometrics 27, 857–874.
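For readers who want the definition in symbols, the distance can be written as below. The notation is an assumption made here for illustration (p features, weights w_k, and R_k the observed range of numerical feature k); it is not taken from the package documentation.

```latex
d(i, j) = \frac{\sum_{k=1}^{p} w_k \, d_{ijk}}{\sum_{k=1}^{p} w_k},
\qquad
d_{ijk} =
\begin{cases}
\dfrac{|x_{ik} - x_{jk}|}{R_k} & \text{if feature } k \text{ is numerical},\\[6pt]
0 & \text{if feature } k \text{ is categorical and } x_{ik} = x_{jk},\\
1 & \text{if feature } k \text{ is categorical and } x_{ik} \neq x_{jk},
\end{cases}
\qquad
R_k = \max_i x_{ik} - \min_i x_{ik}.
```

With all weights equal to 1, this is simply the mean of the per-feature distances, so the result always lies between 0 and 1.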

More details and examples can be found on my personal website: https://www.thinkdatascience.com/post/2019-12-16-introducing-python-package-gower/

Core functions were written by Marcelo Beckmann.

Multiprocessing support was added by Szymon Bobek.

Examples

Installation

pip install gower-multiprocessing

Generate some data

import numpy as np
import pandas as pd
# A hyphen is not valid in a Python module name; the underscore form is assumed here
import gower_multiprocessing as gower

Xd = pd.DataFrame({
    'age': [21, 21, 19, 30, 21, 21, 19, 30, None],
    'gender': ['M', 'M', 'N', 'M', 'F', 'F', 'F', 'F', None],
    'civil_status': ['MARRIED', 'SINGLE', 'SINGLE', 'SINGLE', 'MARRIED', 'SINGLE', 'WIDOW', 'DIVORCED', None],
    'salary': [3000.0, 1200.0, 32000.0, 1800.0, 2900.0, 1100.0, 10000.0, 1500.0, None],
    'has_children': [1, 0, 1, 1, 1, 0, 0, 1, None],
    'available_credit': [2200, 100, 22000, 1100, 2000, 100, 6000, 2200, None],
})
Yd = Xd.iloc[1:3,:]
X = np.asarray(Xd)
Y = np.asarray(Yd)

Find the distance matrix

gower.gower_matrix(X)
array([[0.        , 0.3590238 , 0.6707398 , 0.31787416, 0.16872811,
        0.52622986, 0.59697855, 0.47778758,        nan],
       [0.3590238 , 0.        , 0.6964303 , 0.3138769 , 0.523629  ,
        0.16720603, 0.45600235, 0.6539635 ,        nan],
       [0.6707398 , 0.6964303 , 0.        , 0.6552807 , 0.6728013 ,
        0.6969697 , 0.740428  , 0.8151941 ,        nan],
       [0.31787416, 0.3138769 , 0.6552807 , 0.        , 0.4824794 ,
        0.48108295, 0.74818605, 0.34332284,        nan],
       [0.16872811, 0.523629  , 0.6728013 , 0.4824794 , 0.        ,
        0.35750175, 0.43237334, 0.3121036 ,        nan],
       [0.52622986, 0.16720603, 0.6969697 , 0.48108295, 0.35750175,
        0.        , 0.2898751 , 0.4878362 ,        nan],
       [0.59697855, 0.45600235, 0.740428  , 0.74818605, 0.43237334,
        0.2898751 , 0.        , 0.57476616,        nan],
       [0.47778758, 0.6539635 , 0.8151941 , 0.34332284, 0.3121036 ,
        0.4878362 , 0.57476616, 0.        ,        nan],
       [       nan,        nan,        nan,        nan,        nan,
               nan,        nan,        nan,        nan]], dtype=float32)
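To see where individual entries of this matrix come from, the following plain-Python sketch recomputes the distance between records 0 and 1 by hand. It is an illustration under stated assumptions, not the package's implementation: the helper names `feature_ranges` and `gower_distance` are hypothetical, all features are weighted equally, and the all-missing last record is left out, since it only yields `nan` entries.

```python
# Plain-Python sketch of an unweighted Gower distance (hypothetical helpers,
# not part of the gower package). The all-missing ninth record is omitted.
records = [
    (21, "M", "MARRIED", 3000.0, 1, 2200),
    (21, "M", "SINGLE", 1200.0, 0, 100),
    (19, "N", "SINGLE", 32000.0, 1, 22000),
    (30, "M", "SINGLE", 1800.0, 1, 1100),
    (21, "F", "MARRIED", 2900.0, 1, 2000),
    (21, "F", "SINGLE", 1100.0, 0, 100),
    (19, "F", "WIDOW", 10000.0, 0, 6000),
    (30, "F", "DIVORCED", 1500.0, 1, 2200),
]
# gender and civil_status are treated as categorical, the rest as numerical
is_categorical = (False, True, True, False, False, False)


def feature_ranges(rows, categorical):
    """Return max - min for each numerical column, None for categorical ones."""
    ranges = []
    for k, cat in enumerate(categorical):
        if cat:
            ranges.append(None)
        else:
            column = [row[k] for row in rows]
            ranges.append(max(column) - min(column))
    return ranges


def gower_distance(a, b, ranges):
    """Mean of per-feature distances between records a and b."""
    total = 0.0
    for x, y, r in zip(a, b, ranges):
        if r is None:
            total += 0.0 if x == y else 1.0   # categorical: simple mismatch
        elif r > 0:
            total += abs(x - y) / r           # numerical: range-normalised gap
        # r == 0 means a constant column, which contributes nothing
    return total / len(ranges)


ranges = feature_ranges(records, is_categorical)
d01 = gower_distance(records[0], records[1], ranges)
print(round(d01, 4))  # 0.359
```

The result agrees with the 0.3590238 shown in row 0, column 1 of the matrix above, up to float32 rounding.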

Find Top n results

gower.gower_topn(Xd.iloc[0:2,:], Xd.iloc[:,], n = 5)
{'index': array([4, 3, 1, 7, 5]),
 'values': array([0.16872811, 0.31787416, 0.3590238 , 0.47778758, 0.52622986],
       dtype=float32)}
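The selection step behind a top-n query can be sketched in plain Python. This is a minimal illustration, assuming (as the output above suggests) that the query record itself and `nan` entries are skipped; the function name `top_n` and its `skip` parameter are hypothetical and do not belong to the package.

```python
import heapq
import math


def top_n(distances, n, skip=()):
    """Return the n smallest distances and their positions.

    Positions listed in `skip` and nan distances are ignored.
    """
    candidates = [
        (d, i)
        for i, d in enumerate(distances)
        if i not in skip and not math.isnan(d)
    ]
    nearest = heapq.nsmallest(n, candidates)
    return {
        "index": [i for _, i in nearest],
        "values": [d for d, _ in nearest],
    }


# Row 0 of the distance matrix shown earlier
row0 = [0.0, 0.3590238, 0.6707398, 0.31787416, 0.16872811,
        0.52622986, 0.59697855, 0.47778758, float("nan")]

result = top_n(row0, n=5, skip={0})
print(result["index"])  # [4, 3, 1, 7, 5]
```

Applied to row 0 of the distance matrix, this reproduces the indices and values returned by `gower_topn` above.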

Performance comparison with single-process version

Single process (DS-size: 10000, time:   15.58 sec.)	█
Multi process  (DS-size: 10000, time:    2.93 sec.)	

Single process (DS-size: 20000, time:   54.30 sec.)	█████
Multi process  (DS-size: 20000, time:   11.57 sec.)	█

Single process (DS-size: 30000, time:  119.80 sec.)	███████████
Multi process  (DS-size: 30000, time:   24.86 sec.)	██

Single process (DS-size: 40000, time:  202.65 sec.)	████████████████████
Multi process  (DS-size: 40000, time:   41.77 sec.)	████

Single process (DS-size: 50000, time:  318.64 sec.)	███████████████████████████████
Multi process  (DS-size: 50000, time:   68.36 sec.)	██████

Single process (DS-size: 60000, time:  469.64 sec.)	██████████████████████████████████████████████
Multi process  (DS-size: 60000, time:   96.24 sec.)	█████████

Single process (DS-size: 70000, time:  653.27 sec.)	█████████████████████████████████████████████████████████████████
Multi process  (DS-size: 70000, time:  143.31 sec.)	██████████████

Single process (DS-size: 80000, time:  857.04 sec.)	█████████████████████████████████████████████████████████████████████████████████████
Multi process  (DS-size: 80000, time:  181.60 sec.)	██████████████████

Single process (DS-size: 90000, time: 1129.21 sec.)	████████████████████████████████████████████████████████████████████████████████████████████████████████████████
Multi process  (DS-size: 90000, time:  252.36 sec.)	█████████████████████████
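The speedup implied by these timings can be worked out with a few lines of Python. The numbers are copied from the comparison above; the variable names are illustrative only.

```python
# (single-process seconds, multi-process seconds) per data set size,
# copied from the comparison above
timings = {
    10000: (15.58, 2.93),
    20000: (54.30, 11.57),
    30000: (119.80, 24.86),
    40000: (202.65, 41.77),
    50000: (318.64, 68.36),
    60000: (469.64, 96.24),
    70000: (653.27, 143.31),
    80000: (857.04, 181.60),
    90000: (1129.21, 252.36),
}

# Speedup = single-process time divided by multi-process time
speedups = {size: single / multi for size, (single, multi) in timings.items()}

for size, factor in speedups.items():
    print(f"DS-size {size}: {factor:.2f}x faster")
```

On these figures the multi-process version is roughly 4.5 to 5.3 times faster across the tested sizes.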
