
A library of sklearn compatible contextual categorical variable encoders

Contextual Encoders

code style: black license Python: >= 3.7 Documentation Status Python Tests PyPi

Contextual Encoders is a library of scikit-learn compatible contextual variable encoders.

The documentation can be found here: ReadTheDocs.

This package uses Poetry (documentation).

Installation

The library can be installed with pip:

pip install contextual-encoders

What are contextual variables?

Contextual variables are numerical or categorical variables whose values are related through an underlying context or relationship. One example is the days of the week, which have a hidden graph structure: each day is adjacent to the next, and Sunday connects back to Monday.

When these categorical variables are encoded with a simple strategy such as one-hot encoding, the hidden structure is lost. However, when the context can be specified, this additional information can be fed into the learning procedure to improve the performance of the model. This is where Contextual Encoders come into play.
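To make the loss of structure concrete, here is a minimal NumPy sketch that does not use this library. It compares the distances between one-hot vectors with a cyclic step distance on the week. The variable names and the choice of "steps along the shorter way around the week" as the distance are illustrative assumptions, not the library's implementation.

```python
import numpy as np

days = ["Mon", "Tue", "Wed", "Thur", "Fri", "Sat", "Sun"]
n = len(days)

# One-hot encoding: each day becomes a unit vector.
one_hot = np.eye(n)

# Pairwise Euclidean distances between the one-hot vectors.
diff = one_hot[:, None, :] - one_hot[None, :, :]
one_hot_dist = np.sqrt((diff ** 2).sum(axis=-1))

# Cyclic distance: number of day steps along the shorter way around the week.
idx = np.arange(n)
gap = np.abs(idx[:, None] - idx[None, :])
cyclic_dist = np.minimum(gap, n - gap)

mon, tue, thu = days.index("Mon"), days.index("Tue"), days.index("Thur")

# One-hot: Mon is equally far from Tue and from Thur (both sqrt(2)).
print(one_hot_dist[mon, tue], one_hot_dist[mon, thu])

# Cyclic: Mon is 1 step from Tue but 3 steps from Thur.
print(cyclic_dist[mon, tue], cyclic_dist[mon, thu])
```

With one-hot encoding every pair of distinct days is equally far apart, so a model cannot tell that Monday is closer to Tuesday than to Thursday. The cyclic distance keeps that information.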

Principle

The step of encoding contextual variables is split up into four sub-steps:

  1. Define the context
  2. Define the measure
  3. Calculate the (dis-)similarity matrix
  4. Map the dissimilarity matrix to Euclidean vectors
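Steps 1 to 3 can be sketched in plain Python without the library. The sketch assumes that the context is an undirected graph and that the measure is the number of edges on the shortest path; the helper `path_length` and the variable names are hypothetical and are not part of the Contextual Encoders API.

```python
from collections import deque

import numpy as np

# Step 1: the context, written here as an undirected adjacency mapping.
edges = [("Mon", "Tue"), ("Tue", "Wed"), ("Wed", "Thur"), ("Thur", "Fri"),
         ("Fri", "Sat"), ("Sat", "Sun"), ("Sun", "Mon")]
graph = {}
for a, b in edges:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)


# Step 2: the measure, here the number of edges on the shortest path (BFS).
def path_length(graph, start, goal):
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    raise ValueError(f"no path from {start!r} to {goal!r}")


# Step 3: the dissimilarity matrix over the values that occur in the data.
values = ["Mon", "Tue", "Wed", "Fri", "Sat"]
dissimilarity = np.array(
    [[path_length(graph, a, b) for b in values] for a in values]
)
print(dissimilarity)
```

The resulting matrix is symmetric with zeros on the diagonal. Whether the library normalizes these path lengths or converts them to similarities in the same way is not stated here, so treat the numbers as illustrative.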

Step 4 is optional and depends on the ML technique that consumes the encoding. For example, agglomerative clustering techniques do not require Euclidean vectors because they can use a dissimilarity matrix directly.
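As an illustration of step 4, the sketch below maps a dissimilarity matrix to Euclidean vectors with classical multidimensional scaling (MDS) in NumPy. Classical MDS is a standard technique chosen here for illustration; the method the library uses internally is not described in this text, and the function `classical_mds` is a hypothetical helper.

```python
import numpy as np


def classical_mds(dissimilarity, n_components=2):
    """Embed points so that Euclidean distances approximate the dissimilarities."""
    d = np.asarray(dissimilarity, dtype=float)
    n = d.shape[0]
    # Double-center the squared dissimilarities to obtain a Gram matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (d ** 2) @ centering
    # Keep the eigenvectors that belong to the largest eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_components]
    # Negative eigenvalues (non-Euclidean input) are clipped to zero.
    top_vals = np.clip(eigvals[order], 0.0, None)
    return eigvecs[:, order] * np.sqrt(top_vals)


# Cyclic step distances between the seven days of the week.
idx = np.arange(7)
gap = np.abs(idx[:, None] - idx[None, :])
week = np.minimum(gap, 7 - gap)

points = classical_mds(week)
print(points.shape)  # (7, 2)
```

For this cyclic input the seven embedded points lie on a circle around the origin, which reflects the graph structure of the week. An algorithm that accepts a precomputed dissimilarity matrix can skip this step entirely.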

Basic Usage

The code below demonstrates the basic usage of the library. Here, a simple dataset with 10 observations of a single day-of-week variable is used.

from contextual_encoders import ContextualEncoder, GraphContext, PathLengthMeasure
import numpy as np


# Create a sample dataset
x = np.array(["Fri", "Tue", "Fri", "Sat", "Mon", "Tue", "Wed", "Tue", "Fri", "Fri"])

# Step 1: Define the context
day = GraphContext("day")
day.add_concept("Mon", "Tue")
day.add_concept("Tue", "Wed")
day.add_concept("Wed", "Thur")
day.add_concept("Thur", "Fri")
day.add_concept("Fri", "Sat")
day.add_concept("Sat", "Sun")
day.add_concept("Sun", "Mon")

# Step 2: Define the measure
day_measure = PathLengthMeasure(day)

# Step 3+4: Calculate (Dis-) similarity Matrix
#           and map to euclidean vectors
encoder = ContextualEncoder(day_measure)
encoded_data = encoder.fit_transform(x)

similarity_matrix = encoder.get_similarity_matrix()
dissimilarity_matrix = encoder.get_dissimilarity_matrix()

The output of the code is visualized below. The graph-based structure is clearly visible when the Euclidean data points are plotted. Note that only five points are visible because the days "Thur" and "Sun" do not occur in the dataset.

Figures: Similarity Matrix, Dissimilarity Matrix, Euclidean Data Points
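The count of five visible points can be checked directly against the sample dataset. This small sketch uses only NumPy; the names `week`, `present`, and `missing` are introduced here for illustration.

```python
import numpy as np

x = np.array(["Fri", "Tue", "Fri", "Sat", "Mon", "Tue", "Wed", "Tue", "Fri", "Fri"])
week = ["Mon", "Tue", "Wed", "Thur", "Fri", "Sat", "Sun"]

# Distinct days that occur in the data, and the days of the week that do not.
present = set(np.unique(x).tolist())
missing = [day for day in week if day not in present]

print(len(present), missing)  # 5 ['Thur', 'Sun']
```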

More complicated examples can be found in the documentation.

Notice

The preprocessing module of scikit-learn offers multiple encoders for categorical variables. These encoders use simple techniques to turn categorical variables into numerical ones. In addition, the Category Encoders package offers more sophisticated encoders for the same purpose. This package is meant as an extension to those two packages for cases in which the context of a numerical or categorical variable can be specified.

This project is currently in the development stage.

