A Python package for quickly computing pairwise Euclidean distances on datasets with categorical features

Categorical Features Pairwise Euclidean Distances

PyPi Version MIT License

Motivation

When developing machine learning models, I often ran into datasets with categorical features. Most of the time, dealing with these features was fairly straightforward: I would use the pandas get_dummies() function to convert each feature into a one-hot-encoded representation.

But when the number of categories in these features became very large, computing the Euclidean distance between each sample and every other sample became extremely slow.

This is where this package comes in. In my own tests, this code ran significantly faster than scikit-learn's pairwise Euclidean distances function applied to one-hot-encoded categorical features.
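A speedup of this kind is plausible because one-hot encoding is not needed to obtain the same distances: for a single categorical feature, two one-hot vectors are at squared distance 0 when the categories match and 2 when they differ. The sketch below illustrates that identity with NumPy. It is not taken from the cfed source, and both function names are hypothetical.

```python
import numpy as np


def onehot_distances(a, b):
    """Baseline: one-hot encode one categorical column, then measure distances."""
    categories = np.unique(np.concatenate([a, b]))
    # Each value becomes a 0/1 indicator vector over the shared category set.
    a_hot = (a[:, None] == categories[None, :]).astype(float)
    b_hot = (b[:, None] == categories[None, :]).astype(float)
    diff = a_hot[:, None, :] - b_hot[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def mismatch_distances(a, b):
    """Same distances without encoding: sqrt(2) where categories differ, else 0."""
    return np.sqrt(2.0 * (a[:, None] != b[None, :]))


a = np.array(["c1", "c2", "c1"])
b = np.array(["c1", "c3", "c2"])
assert np.allclose(onehot_distances(a, b), mismatch_distances(a, b))
```

The mismatch version never materializes a column per category, so its cost does not grow with the number of distinct categories.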

Prerequisites

See requirements.txt for the full list of prerequisite libraries.

Installation

To start using this package, run this command in a terminal:

pip install cfed

Usage

import pandas as pd
from cfed.pairwise import euclidean_distances

df1 = pd.DataFrame.from_dict({
    'col1': [1, 2, 3],
    'col2': [4, 5, 6],
    'col3': [7, 8, 9]
})
df2 = pd.DataFrame.from_dict({
    'col1': [1, 4, 7],
    'col2': [2, 5, 8],
    'col3': [3, 6, 9],
})

# Name 'col1' explicitly as a categorical column
distances = euclidean_distances(df1, df2, categorical_columns=['col1'])

Or, without specifying categorical_columns:

import pandas as pd
from cfed.pairwise import euclidean_distances

df1 = pd.DataFrame.from_dict({
    'col1': ['c1', 'c2', 'c1'],
    'col2': [4, 5, 6],
    'col3': [7, 8, 9]
})
df2 = pd.DataFrame.from_dict({
    'col1': ['c1', 'c3', 'c2'],
    'col2': [2, 5, 8],
    'col3': [3, 6, 9],
})

distances = euclidean_distances(df1, df2)

Or, with the numerical and categorical features already split into separate DataFrames:

import pandas as pd
from cfed.pairwise import euclidean_distances_from_split

df1_numerical = pd.DataFrame.from_dict({
    'col1': [1, 2, 3],
    'col2': [4, 5, 6],
    'col3': [7, 8, 9]
})
df2_numerical = pd.DataFrame.from_dict({
    'col1': [1, 4, 7],
    'col2': [2, 5, 8],
    'col3': [3, 6, 9],
})

df1_categorical = pd.DataFrame.from_dict({
    'col4': ['c1', 'c1', 'c2'],
})
df2_categorical = pd.DataFrame.from_dict({
    'col4': ['c1', 'c2', 'c2'],
})

distances = euclidean_distances_from_split(df1_numerical, df1_categorical, df2_numerical, df2_categorical)
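To sanity-check results on small inputs, the one-hot-encoded baseline described in the Motivation can be written with pandas and NumPy alone. This is a sketch of that baseline, not of cfed itself. The helper name onehot_baseline is hypothetical, and whether cfed returns exactly these values (for example, with no feature scaling applied) is an assumption to verify against the package.

```python
import numpy as np
import pandas as pd


def onehot_baseline(df1, df2, categorical_columns):
    """Pairwise Euclidean distances after one-hot encoding (hypothetical helper)."""
    # Encode both frames together so they share the same dummy columns,
    # even when a category appears in only one of them.
    combined = pd.concat([df1, df2], keys=["left", "right"])
    encoded = pd.get_dummies(combined, columns=categorical_columns, dtype=float)
    x = encoded.loc["left"].to_numpy(dtype=float)
    y = encoded.loc["right"].to_numpy(dtype=float)
    # Broadcasting builds an (n1, n2, n_features) array: fine for small checks,
    # but n_features grows with the number of categories.
    diff = x[:, None, :] - y[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


df1 = pd.DataFrame({"col1": ["c1", "c2", "c1"], "col2": [4, 5, 6], "col3": [7, 8, 9]})
df2 = pd.DataFrame({"col1": ["c1", "c3", "c2"], "col2": [2, 5, 8], "col3": [3, 6, 9]})

baseline = onehot_baseline(df1, df2, categorical_columns=["col1"])
print(baseline.shape)  # (3, 3)
```

Encoding the two DataFrames separately would produce mismatched columns whenever their category sets differ, which is why the sketch concatenates them first.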
