A Python package that quickly computes pairwise Euclidean distances on datasets with categorical features
Project description
Categorical Features Pairwise Euclidean Distances
Motivation
When developing machine learning models, I often ran into datasets with categorical features. Most of the time, dealing with these features was fairly straightforward: I would use the pandas get_dummies() function to convert each feature into a one-hot-encoded representation.
But when the number of categories in these features became massive, the one-hot-encoded matrix grew very wide, and computing the Euclidean distance between each sample and every other sample became extremely slow.
This is where this package comes in. In my own tests, this code ran significantly faster than scikit-learn's pairwise Euclidean distances function applied to one-hot-encoded categorical features.
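To illustrate why a speedup is possible at all, here is a minimal sketch of one well-known shortcut. It is an assumption for illustration and not necessarily what cfed does internally. For a single categorical feature, two one-hot rows differ in exactly two positions when the categories differ and in none when they match, so the squared distance can be read off the category codes without building the one-hot matrix. The data and the helper names (one_hot, pairwise_sq_dists) are hypothetical.

```python
import numpy as np

def one_hot(codes, n_categories):
    """One-hot encode a 1-D array of integer category codes."""
    out = np.zeros((codes.size, n_categories))
    out[np.arange(codes.size), codes] = 1.0
    return out

def pairwise_sq_dists(a, b):
    """Dense squared Euclidean distances between rows of a and rows of b."""
    diff = a[:, None, :] - b[None, :, :]
    return (diff ** 2).sum(axis=-1)

# Hypothetical data: one categorical feature with 4 categories, coded 0..3.
x = np.array([0, 1, 3])
y = np.array([0, 2, 3, 1])

# Route 1: materialise the one-hot matrices, then compute distances.
dense = pairwise_sq_dists(one_hot(x, 4), one_hot(y, 4))

# Route 2: compare the codes directly; each mismatch contributes 1 + 1 = 2.
direct = 2.0 * (x[:, None] != y[None, :])

# Both routes give the same squared distances.
assert np.array_equal(dense, direct)
distances = np.sqrt(direct)
```

Route 2 never allocates a matrix whose width grows with the number of categories, which is where the dense approach becomes expensive.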
Prerequisites
See requirements.txt for the full list of prerequisite libraries.
Installation
To start using this package, run this command in a terminal:
pip install cfed
Usage
import pandas as pd
from cfed.pairwise import euclidean_distances

df1 = pd.DataFrame.from_dict({
    'col1': [1, 2, 3],
    'col2': [4, 5, 6],
    'col3': [7, 8, 9],
})
df2 = pd.DataFrame.from_dict({
    'col1': [1, 4, 7],
    'col2': [2, 5, 8],
    'col3': [3, 6, 9],
})

# Treat the integer-valued col1 as a categorical feature
distances = euclidean_distances(df1, df2, categorical_columns=['col1'])
Or, without specifying categorical_columns:
import pandas as pd
from cfed.pairwise import euclidean_distances

df1 = pd.DataFrame.from_dict({
    'col1': ['c1', 'c2', 'c1'],
    'col2': [4, 5, 6],
    'col3': [7, 8, 9],
})
df2 = pd.DataFrame.from_dict({
    'col1': ['c1', 'c3', 'c2'],
    'col2': [2, 5, 8],
    'col3': [3, 6, 9],
})

distances = euclidean_distances(df1, df2)
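For cross-checking, the one-hot baseline that the Motivation section describes can be written with pandas and NumPy alone. This is a sketch under assumptions: it treats col1 as the only categorical column and uses brute-force broadcasting. Whether cfed returns exactly this matrix (return type, ordering, scaling) is not stated above, so treat the comparison as something to verify rather than a guarantee. Note that the two frames contain different category sets ('c3' appears only in df2), so they are encoded together to get aligned dummy columns.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"col1": ["c1", "c2", "c1"], "col2": [4, 5, 6], "col3": [7, 8, 9]})
df2 = pd.DataFrame({"col1": ["c1", "c3", "c2"], "col2": [2, 5, 8], "col3": [3, 6, 9]})

# Encode both frames together so they share one set of dummy columns;
# encoding them separately would leave df1 without a column for 'c3'.
combined = pd.concat([df1, df2], keys=["a", "b"])
encoded = pd.get_dummies(combined, columns=["col1"]).astype(float)
a = encoded.loc["a"].to_numpy()
b = encoded.loc["b"].to_numpy()

# Brute-force pairwise Euclidean distances via broadcasting:
# baseline[i, j] is the distance between row i of df1 and row j of df2.
baseline = np.sqrt(((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1))
```

This baseline materialises every dummy column, so it is only practical for small inputs; it is meant as a reference for checking results, not as a replacement for the package.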
Or, when the numerical and categorical features are already split into separate DataFrames:
import pandas as pd
from cfed.pairwise import euclidean_distances_from_split

df1_numerical = pd.DataFrame.from_dict({
    'col1': [1, 2, 3],
    'col2': [4, 5, 6],
    'col3': [7, 8, 9],
})
df2_numerical = pd.DataFrame.from_dict({
    'col1': [1, 4, 7],
    'col2': [2, 5, 8],
    'col3': [3, 6, 9],
})
df1_categorical = pd.DataFrame.from_dict({
    'col4': ['c1', 'c1', 'c2'],
})
df2_categorical = pd.DataFrame.from_dict({
    'col4': ['c1', 'c2', 'c2'],
})

distances = euclidean_distances_from_split(df1_numerical, df1_categorical, df2_numerical, df2_categorical)
File details
Details for the file cfed-1.1.tar.gz.
File metadata
- Download URL: cfed-1.1.tar.gz
- Upload date:
- Size: 4.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/49.2.1 requests-toolbelt/0.9.1 tqdm/4.48.2 CPython/3.8.5
File hashes
Algorithm | Hash digest
---|---
SHA256 | 4b1793d7a2aad97dd135b9b900bdb4e603a71193325d830f86aa7d3bb822ba7b
MD5 | c0422eff93215d00cb8f7f888489e439
BLAKE2b-256 | b31046d0b995e243aff992fd4dcf1f877d4f75d7718e3aa150c80b63cc8048e9
File details
Details for the file cfed-1.1-py3-none-any.whl.
File metadata
- Download URL: cfed-1.1-py3-none-any.whl
- Upload date:
- Size: 5.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/49.2.1 requests-toolbelt/0.9.1 tqdm/4.48.2 CPython/3.8.5
File hashes
Algorithm | Hash digest
---|---
SHA256 | a313d8c506c1403037e4d65ce991c397f71ed3cef5f0db5e5cc1e70cfdfe9bc8
MD5 | d5c603e7d42251963f5bd8a8dbe8ffea
BLAKE2b-256 | 0ac5167fc26c99d62c90c2438a4b1b1e9e54b1a55ead821625387c87322e016a