
A SMOTE variant called semantic-cosine SMOTE that generates synthetic textual data

Project description

SC-SMOTE: SEMANTIC-COSINE SYNTHETIC MINORITY OVERSAMPLING TECHNIQUE

SMOTE Variant for Textual Data

Semantic-Cosine Synthetic Minority Over-sampling Technique (SC-SMOTE) generates new, semantically meaningful words by combining SMOTE with cosine similarity. The method first converts words to vectors using GloVe embeddings and then creates new vectors through SMOTE oversampling. Each synthetic vector is mapped back to a word by finding its closest GloVe embedding. The key innovation lies in how the approach ensures the semantic relevance of the generated words: cosine similarity is computed between each newly generated vector and the embedding of a relevant domain-specific term.
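
To make the pipeline concrete, here is a minimal NumPy sketch of that idea (not the package's actual implementation): interpolate between minority-class word vectors as in SMOTE, map each synthetic vector back to the nearest GloVe word, and keep it only if its cosine similarity to a domain-specific anchor term clears a cutoff. The function name sc_smote_sketch, the min_sim threshold, and the retry cap are illustrative assumptions.

import numpy as np

def cosine(a, b):
    # Cosine similarity between two vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def sc_smote_sketch(words, domain_term, embeddings, n_samples=5, min_sim=0.3, seed=0):
    # Illustrative sketch only; names and the min_sim cutoff are assumptions.
    rng = np.random.default_rng(seed)
    vectors = [embeddings[w] for w in words]
    anchor = embeddings[domain_term]
    vocab = list(embeddings)
    matrix = np.array([embeddings[w] for w in vocab])
    norms = np.linalg.norm(matrix, axis=1)

    new_words = []
    for _ in range(100 * n_samples):  # bounded retries instead of an open loop
        if len(new_words) >= n_samples:
            break
        # SMOTE step: interpolate between one minority-class vector and another.
        i, j = rng.choice(len(vectors), size=2, replace=False)
        synthetic = vectors[i] + rng.random() * (vectors[j] - vectors[i])

        # Map the synthetic vector back to a real word via its nearest GloVe embedding.
        sims = matrix @ synthetic / (norms * np.linalg.norm(synthetic))
        candidate = vocab[int(np.argmax(sims))]

        # Semantic filter: keep the word only if it is close enough to the domain term.
        if candidate not in words and candidate not in new_words:
            if cosine(embeddings[candidate], anchor) >= min_sim:
                new_words.append(candidate)
    return new_words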

Installation

pip install scsmote-package

Features

  • Generates synthetic textual data that is semantically relevant and contextually appropriate
  • Improves classification accuracy on labelled datasets compared to the original SMOTE or no oversampling
  • Specifically designed for handling imbalanced textual data
  • Uses cosine similarity to ensure the relevance of generated words

Usage

Here's a basic example of how to use SC-SMOTE:

from scsmote_package import scsmote
import numpy as np

# Load pre-trained GloVe embeddings into a word -> vector dictionary.
data_dict = {}
with open("glove/glove.6B.200d.txt", encoding="utf8") as file:
    for line in file:
        word, *values = line.rstrip().split()
        data_dict[word] = np.array(values, dtype="float64")

# Generate synthetic words for the minority-class terms; 'emotions' is the
# domain-specific reference term used for the cosine-similarity check.
print(scsmote(['happy', 'sad', 'angry', 'mad', 'upset'], 'emotions', data_dict))
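
Since gensim is listed in the requirements below, the same word-to-vector dictionary can alternatively be built from gensim's pre-packaged GloVe vectors instead of parsing the raw text file. This is an optional convenience sketch (gensim 4.x API), not part of the package's documented usage:

import gensim.downloader as api

# Download (on first use) and load the 200-dimensional GloVe vectors.
glove = api.load("glove-wiki-gigaword-200")

# Build the same word -> NumPy vector dictionary used in the example above.
data_dict = {word: glove[word] for word in glove.key_to_index}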

Background

SC-SMOTE was developed to address the challenges of applying SMOTE to textual data, particularly in the context of classifying drug-related webpages on the dark web. It combines the SMOTE algorithm with cosine similarity to ensure that the generated synthetic samples are semantically meaningful and relevant to the minority class.

Advantages

  • Improves classification performance on imbalanced textual datasets
  • Reduces the time and effort required for manual data labeling
  • Generates contextually appropriate synthetic samples
  • Helps in avoiding overfitting that can occur with traditional SMOTE

GitHub Repository

For more detailed information, source code, and contributions, please visit our GitHub repository:

https://github.com/hetulmehta/SC-SMOTE

Requirements

  • Python 3.7+
  • numpy
  • pandas
  • scikit-learn
  • gensim

Contributing

We welcome contributions to the SC-SMOTE project. Please feel free to submit issues and pull requests on our GitHub repository.

License

This project is licensed under the MIT License. See the LICENSE file in the GitHub repository for details.

Contact

For any queries or suggestions, please open an issue on the GitHub repository.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

scsmote_package-0.0.2.tar.gz (4.1 kB)


Built Distribution

scsmote_package-0.0.2-py3-none-any.whl (4.4 kB)


File details

Details for the file scsmote_package-0.0.2.tar.gz.

File metadata

  • Download URL: scsmote_package-0.0.2.tar.gz
  • Upload date:
  • Size: 4.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.12.2

File hashes

Hashes for scsmote_package-0.0.2.tar.gz

  • SHA256: 2f1b7457b595770c15aa70491548e415b5c542bda8a4d507775128090a8a2aab
  • MD5: d815ac077ca76e53ac14fadd10c1d092
  • BLAKE2b-256: 840aadd30e28cac0ca2dcdb8094ee035b07c0480a746d48ca3da7df6bc194f69


File details

Details for the file scsmote_package-0.0.2-py3-none-any.whl.

File hashes

Hashes for scsmote_package-0.0.2-py3-none-any.whl

  • SHA256: ba93bac6deb59e045a59e055dbd9d86133234d7b34f6f209959e2f1cd2ba7dfb
  • MD5: 0183a2b663d917183cac220f339702d9
  • BLAKE2b-256: e2751815f5fecbbf190a9f6ac7725f6af6bbcd274139f755f4bf186412256b91

