A SMOTE variant called semantic-cosine SMOTE that generates synthetic textual data
Project description
SC-SMOTE: SEMANTIC-COSINE SYNTHETIC MINORITY OVERSAMPLING TECHNIQUE
SMOTE Variant for Textual Data
Semantic-Cosine-based Synthetic Minority Over-sampling Technique (SC-SMOTE) generates new, semantically meaningful words by combining SMOTE with cosine similarity. Words are first converted to vectors using GloVe embeddings, and SMOTE oversampling then interpolates new vectors between them. The GloVe embedding closest to each new vector is identified and taken as the generated word. The key innovation of this approach lies in how it ensures the semantic relevance of the generated words: the cosine similarity between each newly generated vector and the embedding of a relevant domain-specific term is computed, so that only words semantically related to the domain are produced.
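The pipeline described above (SMOTE interpolation in embedding space, nearest-embedding lookup, cosine check against a domain term) can be sketched roughly as follows. This is an illustrative reimplementation, not the package's actual code; the function name `sc_smote_candidate` and the 0.5 threshold are made up for the example.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def sc_smote_candidate(v1, v2, anchor, embeddings, threshold=0.5, rng=None):
    """One SC-SMOTE-style candidate: interpolate between two minority-class
    word vectors (the SMOTE step), snap the result to the closest known
    embedding, and keep it only if it is cosine-similar to a domain anchor."""
    rng = np.random.default_rng(rng)
    # SMOTE step: pick a random point on the segment between v1 and v2
    synthetic = v1 + rng.random() * (v2 - v1)
    # Snap back to the vocabulary: closest embedding by Euclidean distance
    best_word = min(embeddings, key=lambda w: np.linalg.norm(embeddings[w] - synthetic))
    # Semantic check against the domain-specific anchor vector
    if cosine_similarity(embeddings[best_word], anchor) >= threshold:
        return best_word
    return None
```

In the package itself this candidate step would presumably be repeated over pairs of minority-class words until enough synthetic samples are produced.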
Installation
pip install scsmote-package
Features
- Generates synthetic textual data that is semantically relevant and contextually appropriate
- Improves classification accuracy on labelled datasets compared to original SMOTE or no oversampling
- Specifically designed for handling imbalanced textual data
- Uses cosine similarity to ensure the relevance of generated words
Usage
Here's a basic example of how to use SC-SMOTE:
from scsmote_package import scsmote
import numpy as np

# Load GloVe embeddings into a {word: vector} dictionary
data_dict = dict()
with open("glove/glove.6B.200d.txt", encoding="utf8") as file:
    for line in file:
        split_data = line.split()
        data_dict[split_data[0]] = np.array(split_data[1:]).astype('float64')

# Generate synthetic words for the minority-class terms, anchored on 'emotions'
print(scsmote(['happy', 'sad', 'angry', 'mad', 'upset'], 'emotions', data_dict))
Background
SC-SMOTE was developed to address the challenges of applying SMOTE to textual data, particularly in the context of classifying drug-related webpages on the dark web. It combines the SMOTE algorithm with cosine similarity to ensure that the generated synthetic samples are semantically meaningful and relevant to the minority class.
Advantages
- Improves classification performance on imbalanced textual datasets
- Reduces the time and effort required for manual data labeling
- Generates contextually appropriate synthetic samples
- Helps in avoiding overfitting that can occur with traditional SMOTE
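As a rough illustration of how the generated words could feed an imbalanced-classification workflow: the sketch below pads the minority class with synthetic samples before training a simple scikit-learn classifier. Everything here (the documents, labels, and the list of synthetic words) is invented for the example, and the actual return value of `scsmote` may differ.

```python
# Hypothetical augmentation workflow: pad the minority class with
# SC-SMOTE-style synthetic words before training a simple classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

majority = [
    "buy cheap products online",
    "great deals on electronics",
    "free shipping on all orders",
    "discount offers available now",
]
minority = ["feeling happy today", "so sad and upset"]

# Pretend these came back from scsmote(...); they are made-up stand-ins.
synthetic_words = ["cheerful", "gloomy", "furious"]
minority += [f"feeling {w} today" for w in synthetic_words]

texts = majority + minority
labels = [0] * len(majority) + [1] * len(minority)

vectorizer = CountVectorizer()
features = vectorizer.fit_transform(texts)
classifier = LogisticRegression().fit(features, labels)
```

The extra minority-class documents give the classifier more signal for the under-represented class, which is the effect the advantages above describe.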
GitHub Repository
For more detailed information, source code, and contributions, please visit our GitHub repository:
https://github.com/hetulmehta/SC-SMOTE
Requirements
- Python 3.7+
- numpy
- pandas
- scikit-learn
- gensim
Contributing
We welcome contributions to the SC-SMOTE project. Please feel free to submit issues and pull requests on our GitHub repository.
License
This project is licensed under the MIT License. See the LICENSE file in the GitHub repository for details.
Contact
For any queries or suggestions, please open an issue on the GitHub repository.
File details
Details for the file scsmote_package-0.0.2.tar.gz
File metadata
- Download URL: scsmote_package-0.0.2.tar.gz
- Upload date:
- Size: 4.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.12.2
File hashes

Algorithm | Hash digest
---|---
SHA256 | 2f1b7457b595770c15aa70491548e415b5c542bda8a4d507775128090a8a2aab
MD5 | d815ac077ca76e53ac14fadd10c1d092
BLAKE2b-256 | 840aadd30e28cac0ca2dcdb8094ee035b07c0480a746d48ca3da7df6bc194f69
File details
Details for the file scsmote_package-0.0.2-py3-none-any.whl
File metadata
- Download URL: scsmote_package-0.0.2-py3-none-any.whl
- Upload date:
- Size: 4.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.12.2
File hashes

Algorithm | Hash digest
---|---
SHA256 | ba93bac6deb59e045a59e055dbd9d86133234d7b34f6f209959e2f1cd2ba7dfb
MD5 | 0183a2b663d917183cac220f339702d9
BLAKE2b-256 | e2751815f5fecbbf190a9f6ac7725f6af6bbcd274139f755f4bf186412256b91