
Feature selection tools based on semantic textual similarity (STS) scores.


sts-select

Overview

Small datasets often require feature selection to ensure generalizability, but feature selection methods rely on the information in the dataset itself, which may not be enough to make good selections. An underutilized source of information is the similarity between the feature and target names, and we find that using feature names in the selection process can sometimes improve the performance of feature selection methods. This is likely because statistical measures of feature and target relationships are often noisy or incomplete, while the feature and target names can provide a more consistent measure of any relationship.
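To make the idea concrete, here is a minimal sketch of ranking features by how similar their names are to the target name. It is not part of sts-select: it uses word-token overlap (Jaccard similarity) as a crude stand-in for a fine-tuned STS model, and the function names and the example feature and target names are hypothetical.

```python
import re


def name_similarity(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word tokens; a crude stand-in for an STS model."""
    tokens_a = set(re.findall(r"[a-z0-9]+", a.lower()))
    tokens_b = set(re.findall(r"[a-z0-9]+", b.lower()))
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def rank_features_by_name(feature_names, target_name, k):
    """Return the k feature names most similar to the target name."""
    ranked = sorted(
        feature_names,
        key=lambda name: name_similarity(name, target_name),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical feature and target names, for illustration only.
features = ["pain score at rest", "patient height", "post-surgical pain level", "zip code"]
target = "persistent post-surgical pain"
top = rank_features_by_name(features, target, k=2)
print(top)
```

A real STS model would also capture similarity between names that share no words, which token overlap cannot do; the ranking step stays the same.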

This package provides Python tools for STS-based feature selection using fine-tuned models that you have trained with either sentence_transformers or Gensim. You can install it with pip install sts-select.

Usage

There are several steps to using this package. The first is to fine-tune a language model to produce semantic textual similarity (STS) scores. After that, you can use these scores to select features using either one of the selection methods provided or your own. From there you can apply this feature selection method to a dataset.

Fine-Tuning a Model

The first step is to fine-tune a model to produce STS scores. An example of how to obtain fine-tuning datasets for STS scores can be found in examples/ppsp/redcap_sts_scorers/data.py. An example of how to train these models can be found in examples/ppsp/redcap_sts_scorers/train.py. We fine-tune our models using the sentence_transformers package and Gensim, but other frameworks can be adapted to this library.

Scoring

Once you have a fine-tuned model, you can use it to score features, either by itself, or in combination with other scoring measures or models.
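As an illustration of combining an STS score with another scoring measure, the sketch below takes a weighted average of a name-based score and a statistical score (absolute Pearson correlation). This is an assumption about one possible combination, not a description of how sts-select combines scorers; `abs_pearson`, `combine_scores`, `alpha`, and the toy data are all hypothetical.

```python
import math


def abs_pearson(xs, ys):
    """Absolute Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # A constant column carries no linear signal.
    return abs(cov) / math.sqrt(var_x * var_y)


def combine_scores(sts_scores, stat_scores, alpha=0.5):
    """Weighted average of two per-feature score lists, both assumed to lie in [0, 1]."""
    if len(sts_scores) != len(stat_scores):
        raise ValueError("score lists must have the same length")
    return [alpha * s + (1 - alpha) * t for s, t in zip(sts_scores, stat_scores)]


# Toy data: two feature columns, a target, and made-up STS scores.
columns = [[1.0, 2.0, 3.0, 4.0], [2.0, 2.0, 2.0, 2.0]]
y = [2.0, 4.0, 6.0, 8.0]
sts_scores = [0.2, 0.9]  # Hypothetical name-similarity scores.
stat_scores = [abs_pearson(col, y) for col in columns]
combined = combine_scores(sts_scores, stat_scores, alpha=0.5)
print(combined)
```

Here `alpha` controls how much weight the name-based score gets; with `alpha=1.0` the statistical measure is ignored entirely.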

Here's a brief snippet of what this looks like. A comprehensive example can be found in examples/ppsp/redcap/train.py.

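from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline

# MRMRBase and SentenceTransformerScorer are provided by sts_select;
# see examples/ppsp/redcap/train.py for how they are imported.
# X and y are your data; X_names and y_names are the feature and target names.

pipe = Pipeline([
    ("selector", MRMRBase(
        SentenceTransformerScorer(
            X,
            y,
            X_names=X_names,
            y_names=y_names,
            model_path="your/model"  # Should already be fine-tuned.
        ),
        n_features=20
    )),
    ("classifier", MLPClassifier(activation="relu", alpha=1))
])

# Select features, then train the classifier on the selected subset.
pipe.fit(X, y)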
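The name MRMRBase suggests minimum-redundancy maximum-relevance (mRMR) selection. As a rough sketch of that general technique, and not of the package's own implementation, the code below greedily picks the feature with the highest relevance minus its mean redundancy with the features already chosen. The function name and the toy scores are hypothetical.

```python
def mrmr_select(relevance, redundancy, n_features):
    """Greedy mRMR using the difference criterion.

    relevance:  list of feature-target scores, one per feature.
    redundancy: square matrix (list of lists) of feature-feature scores.
    Returns the indices of the selected features, in selection order.
    """
    remaining = list(range(len(relevance)))
    selected = []
    while remaining and len(selected) < n_features:

        def criterion(i):
            if not selected:
                return relevance[i]  # First pick: relevance only.
            mean_red = sum(redundancy[i][j] for j in selected) / len(selected)
            return relevance[i] - mean_red

        best = max(remaining, key=criterion)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy scores: features 0 and 1 are both relevant but nearly redundant.
relevance = [0.9, 0.85, 0.3]
redundancy = [
    [1.0, 0.95, 0.1],
    [0.95, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
chosen = mrmr_select(relevance, redundancy, n_features=2)
print(chosen)
```

In this toy case the second pick is feature 2 rather than the more relevant feature 1, because feature 1 adds little beyond feature 0. In an STS-based setup, the relevance and redundancy values would come from name-similarity scores, statistical scores, or a mix of both.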

Example

The examples/ppsp directory contains a refactored version of the code behind our findings. It gives a comprehensive overview of how to use the package, and its README explains how to run it if you wish to start from there.

Citing

If you use this code in your research, please cite the following preprint:

[tbd]

