This project helps create synthetic window titles for model training.
Synthetic Data Generation for Window Titles
Generating synthetic window title data using NLP augmentation techniques.
Overview
This project provides tools to generate synthetic data from window title strings collected from various applications. It supports three methods:
- Substitution: Uses a BERT-based model to substitute words in context.
- Substitution With N Variants: Uses a BERT-based model to substitute words in context, creating N variants of each window title.
- Random Augmentation: Applies random swap, delete, or insert operations, with a fallback to contextual substitution when necessary.
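The random operations listed above can be illustrated with a minimal, dependency-free sketch. This is not the package's implementation (which builds on nlpaug); the function names `random_swap`, `random_delete`, and `random_augment` are hypothetical, tokenization by whitespace is an assumption, and insertion is left out because it needs a vocabulary or model to draw new words from.

```python
import random


def random_swap(tokens, rng):
    """Swap one randomly chosen pair of adjacent tokens."""
    if len(tokens) < 2:
        return list(tokens)
    i = rng.randrange(len(tokens) - 1)
    out = list(tokens)
    out[i], out[i + 1] = out[i + 1], out[i]
    return out


def random_delete(tokens, rng):
    """Delete one randomly chosen token, never emptying the title."""
    if len(tokens) < 2:
        return list(tokens)
    i = rng.randrange(len(tokens))
    return list(tokens[:i]) + list(tokens[i + 1:])


def random_augment(title, rng):
    """Apply one randomly chosen operation to a whitespace-tokenized title."""
    tokens = title.split()
    operation = rng.choice([random_swap, random_delete])
    return " ".join(operation(tokens, rng))


rng = random.Random(0)  # fixed seed so repeated runs give the same result
print(random_augment("Quarterly report - Microsoft Word", rng))
```

Passing an explicit `random.Random` instance keeps the sketch reproducible, which is useful when comparing synthetic datasets across runs.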
Requirements
- Python 3.10 or higher
- nlpaug
- torch and transformers
📦 Installation
To install the package without cloning the repository, run
pip install synthetic-window-titles@git+https://github.com/paxray/synthetic-window-titles.git
If the repository is already cloned, run the following in its root folder instead:
pip install .
Configuration
All configurable parameters live in constants.py. Here’s a full list of fields and example values:
- PRESERVE_WORDS: List of substrings to keep intact during augmentation (optional).
  PRESERVE_WORDS = [" Google Chrome", " Microsoft Edge", " Word ", " NAME ", "YEAR", "Explorer", "Outlook", " PKF "]
- METHOD: Integer selector for the active augmentation strategy (mandatory):
  1 ⇒ single contextual substitution
  2 ⇒ multiple contextual variants (requires N_VARIANTS)
  3 ⇒ random augmentation
  METHOD = 1
- AUG_PERCENTAGE: Tunes how many tokens are augmented (mandatory if METHOD is 1 or 2).
  AUG_PERCENTAGE = 0.3  # used by substitution methods
- N_VARIANTS: Tunes how many variants to generate (mandatory if METHOD is 2).
  N_VARIANTS = 3  # used when METHOD == 2
- INPUT_FILE_PATH, OUTPUT_FILE_PATH: Source and destination JSON files.
  INPUT_FILE_PATH = r"src\data\input\windowTitlesTranslated.json"
  OUTPUT_FILE_PATH = r"src\data\output\syntheticDataUsingRandomAugmentation.json"
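The mandatory/optional rules above can be checked before a run. The following is a minimal sketch, not part of the package: `validate_config` is a hypothetical helper, and the accepted range for AUG_PERCENTAGE (a fraction in (0, 1]) is an assumption inferred from the example value 0.3.

```python
def validate_config(method, aug_percentage=None, n_variants=None):
    """Return a list of problems; an empty list means the settings are consistent."""
    problems = []
    if method not in (1, 2, 3):
        problems.append("METHOD must be 1, 2, or 3")
    if method in (1, 2):
        if aug_percentage is None:
            problems.append("AUG_PERCENTAGE is required when METHOD is 1 or 2")
        elif not 0 < aug_percentage <= 1:
            # Assumed range: the README only shows 0.3 as an example value.
            problems.append("AUG_PERCENTAGE should be a fraction in (0, 1]")
    if method == 2:
        if n_variants is None:
            problems.append("N_VARIANTS is required when METHOD is 2")
        elif not isinstance(n_variants, int) or n_variants < 1:
            problems.append("N_VARIANTS should be a positive integer")
    return problems


# Example: METHOD 2 without N_VARIANTS reports a single problem.
for problem in validate_config(method=2, aug_percentage=0.3):
    print(problem)
```

Returning a list rather than raising on the first error lets all misconfigurations be reported in one pass.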
Usage
- Select an augmentation method by setting the METHOD constant in constants.py:
  - 1 for a single substitution pass
  - 2 to generate multiple contextual variants (controlled by N_VARIANTS)
  - 3 for random augmentation
- Configure augmentation parameters in the same file:
  - AUG_PERCENTAGE determines the probability of applying contextual substitution
  - N_VARIANTS (used when METHOD is 2) specifies how many variants to create per input
- Run the main script from the project root:
  python main.py
- Inspect your results at the location defined by OUTPUT_FILE_PATH. The script loads window titles from INPUT_FILE_PATH, applies the chosen augmentation strategy, and writes the synthetic dataset to that location.
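The load, augment, and write flow described above could look roughly like the sketch below. It is an illustration under stated assumptions, not the code in main.py: the input file is assumed to be a JSON list of title strings, `run_pipeline` and `reverse_words` are hypothetical names, and `reverse_words` is only a stand-in for the real augmenters.

```python
import json
from pathlib import Path


def reverse_words(title):
    """Placeholder augmenter: stands in for a real substitution or random strategy."""
    return " ".join(reversed(title.split()))


def run_pipeline(input_path, output_path, method, augmenters, n_variants=1):
    """Load titles, apply the augmenter selected by `method`, and write the result."""
    # Assumption: the input JSON is a flat list of window title strings.
    titles = json.loads(Path(input_path).read_text(encoding="utf-8"))
    augment = augmenters[method]
    # Only METHOD 2 produces several variants per input title.
    copies = n_variants if method == 2 else 1
    synthetic = [augment(title) for title in titles for _ in range(copies)]
    out = Path(output_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(synthetic, ensure_ascii=False, indent=2), encoding="utf-8")
    return synthetic


# Map each METHOD value to an augmenter; all three use the placeholder here.
AUGMENTERS = {1: reverse_words, 2: reverse_words, 3: reverse_words}
```

Keeping the METHOD-to-function mapping in a dictionary makes the dispatch explicit and keeps the file handling independent of any particular augmentation strategy.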
Project Structure
.
└── src
    ├── common.py                                  # Shared utility functions
    ├── constants.py                               # Configuration constants
    ├── main.py                                    # Entry point script
    ├── syntheticData.py                           # Data preparation logic
    ├── syntheticDataUsingRandomAugmentation.py    # Random augmentation implementation
    ├── syntheticDataUsingSubstitution.py          # Contextual substitution implementation
    └── data
        ├── input
        │   └── windowTitlesTranslated.json        # Example input file
        └── output
            └── syntheticDataUsingRandomAugmentation.json    # Example output file
Download files
File details
Details for the file synthetic_window_titles-0.1.0.tar.gz.
File metadata
- Download URL: synthetic_window_titles-0.1.0.tar.gz
- Upload date:
- Size: 4.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.10.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9273760cb4e0125239389723cb7557902a43738a1fad2c3cb97878a1a78a5760 |
| MD5 | dacbe962ee534530cf3f33eaab52a4e1 |
| BLAKE2b-256 | 1159906a8e180cadf8fc086d8efa811f3dc3193e17808c93e5c6315579f5409d |
File details
Details for the file synthetic_window_titles-0.1.0-py3-none-any.whl.
File metadata
- Download URL: synthetic_window_titles-0.1.0-py3-none-any.whl
- Upload date:
- Size: 6.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.10.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 482b7c70f978753341c9620802ca9b90fdcb3583220283d613d37a7a1cda9eca |
| MD5 | 348f08107fce077769dfe2836e63d229 |
| BLAKE2b-256 | e8af7f291d2e6808d473832d62c58a56ec518b377a4aad35074c13097d7cd4f3 |
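A downloaded file can be compared against the published SHA256 digests with the standard library. This is a minimal sketch; `sha256_of_file` and `matches_published_digest` are hypothetical helper names, and the local file path in the usage comment is an assumption about where the download was saved.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, expected_hex):
    """Return True when the file's SHA256 equals the published digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Usage (assumes the wheel was saved to the current directory):
# matches_published_digest(
#     "synthetic_window_titles-0.1.0-py3-none-any.whl",
#     "482b7c70f978753341c9620802ca9b90fdcb3583220283d613d37a7a1cda9eca",
# )
```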