This project helps create synthetic window titles for model training.

Synthetic Data Generation for Window Titles

Generate synthetic window title data using NLP augmentation techniques.

Overview

This project provides tools to generate synthetic data from window title strings collected from various applications. It supports three primary methods:

  • Substitution: Uses a BERT-based model to substitute words in context.
  • Substitution With N Variants: Uses a BERT-based model to substitute words in context, creating N variants of each window title.
  • Random Augmentation: Applies random swap, delete, or insert operations, with a fallback to contextual substitution when necessary.
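
As a rough illustration of the random augmentation idea, the sketch below applies one random swap, delete, or insert to a whitespace-tokenized title. It is not the package's implementation: all function names are hypothetical, the insert step simply duplicates an existing token, and the fallback to contextual substitution is left out because it would need a BERT model.

```python
import random

def random_swap(tokens, rng):
    """Swap two randomly chosen positions; no-op for fewer than two tokens."""
    if len(tokens) < 2:
        return list(tokens)
    i, j = rng.sample(range(len(tokens)), 2)
    out = list(tokens)
    out[i], out[j] = out[j], out[i]
    return out

def random_delete(tokens, rng):
    """Drop one randomly chosen token, always keeping at least one."""
    if len(tokens) < 2:
        return list(tokens)
    k = rng.randrange(len(tokens))
    return tokens[:k] + tokens[k + 1:]

def random_insert(tokens, rng):
    """Insert a copy of an existing token at a random position (simplification)."""
    if not tokens:
        return []
    word = rng.choice(tokens)
    k = rng.randrange(len(tokens) + 1)
    return tokens[:k] + [word] + tokens[k:]

def random_augment(title, rng=None):
    """Apply one randomly selected operation to a window title."""
    rng = rng or random.Random()
    tokens = title.split()
    op = rng.choice([random_swap, random_delete, random_insert])
    return " ".join(op(tokens, rng))

if __name__ == "__main__":
    # Seeded generator so repeated runs give the same variant.
    print(random_augment("Quarterly report - Microsoft Word", random.Random(0)))
```

Passing a seeded `random.Random` keeps the output reproducible, which helps when comparing generated datasets across runs.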

Requirements

📦 Installation

To install the package without cloning the repository, run:

pip install synthetic-window-titles@git+https://github.com/paxray/synthetic-window-titles.git

If the repository is already cloned, run the following in its root folder instead:

pip install .

Configuration

All configurable parameters live in constants.py. Here’s a full list of fields and example values:

  • PRESERVE_WORDS: List of substrings to keep intact during augmentation. (optional)

    PRESERVE_WORDS = [
        " Google Chrome", " Microsoft Edge", " Word ",
        " NAME ", "YEAR", "Explorer", "Outlook", " PKF "
    ]
    
  • METHOD: Integer selector for the active augmentation strategy (mandatory):

    • 1 ⇒ single contextual substitution
    • 2 ⇒ multiple contextual variants (requires N_VARIANTS)
    • 3 ⇒ random augmentation

    METHOD = 1
    
  • AUG_PERCENTAGE: Controls how many tokens are augmented (mandatory if METHOD is 1 or 2):

    AUG_PERCENTAGE = 0.3   # used by substitution methods

  • N_VARIANTS: Sets how many variants to generate per title (mandatory if METHOD is 2):

    N_VARIANTS = 3        # used when METHOD == 2

  • INPUT_FILE_PATH, OUTPUT_FILE_PATH: Source and destination JSON files.

    INPUT_FILE_PATH = r"src\data\input\windowTitlesTranslated.json"
    OUTPUT_FILE_PATH = r"src\data\output\syntheticDataUsingRandomAugmentation.json"
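
One way to picture what PRESERVE_WORDS does is to split each title into protected and free segments and augment only the free ones. The sketch below shows that idea under stated assumptions: `split_preserved` and `augment_preserving` are hypothetical helpers, `augment` is any callable standing in for the real augmenter, and the package itself may protect these substrings differently.

```python
import re

PRESERVE_WORDS = [" Google Chrome", "Outlook"]  # shortened example list

def split_preserved(title, preserve_words):
    """Split a title into segments, isolating each preserved substring."""
    # Longest entries first so a longer phrase wins over a shorter overlap.
    ordered = sorted(preserve_words, key=len, reverse=True)
    pattern = "|".join(re.escape(word) for word in ordered)
    # The capturing group makes re.split keep the matched substrings.
    return [seg for seg in re.split(f"({pattern})", title) if seg]

def augment_preserving(title, preserve_words, augment):
    """Apply `augment` only to segments that are not preserved substrings."""
    if not preserve_words:
        return augment(title)
    protected = set(preserve_words)
    segments = split_preserved(title, preserve_words)
    return "".join(seg if seg in protected else augment(seg) for seg in segments)

if __name__ == "__main__":
    # str.upper stands in for a real augmenter to make the effect visible.
    print(augment_preserving("Budget 2024 - Google Chrome", PRESERVE_WORDS, str.upper))
```

Augmenting segments separately keeps the sketch short, but a contextual model would see less surrounding text that way, so a real implementation might mask the preserved spans in place instead.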
    

Usage

  1. Select an augmentation method by setting the METHOD constant in constants.py:

    • 1 for a single substitution pass
    • 2 to generate multiple contextual variants (controlled by N_VARIANTS)
    • 3 for random augmentation
  2. Configure augmentation parameters in the same file:

    • AUG_PERCENTAGE (used when METHOD is 1 or 2) sets how large a share of each title's tokens is augmented by contextual substitution
    • N_VARIANTS (used when METHOD is 2) specifies how many variants to create per input
  3. Run the main script from the project root:

    • python src/main.py
  4. Inspect your results at the location defined by OUTPUT_FILE_PATH. The script will load window titles from INPUT_FILE_PATH, apply the chosen augmentation strategy, and write the synthetic dataset accordingly.
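
The flow described in these steps can be sketched as follows. This is an illustration, not the code in main.py: `validate_config` and `run` are hypothetical names, `augment` is a stand-in callable for the selected strategy, the input JSON is assumed to be a flat list of title strings, and the (0, 1] range for AUG_PERCENTAGE is an assumption.

```python
import json
from pathlib import Path

def validate_config(method, aug_percentage=None, n_variants=None):
    """Check the 'mandatory if' rules from the Configuration section."""
    if method not in (1, 2, 3):
        raise ValueError("METHOD must be 1, 2, or 3")
    if method in (1, 2) and (aug_percentage is None or not 0 < aug_percentage <= 1):
        # The (0, 1] range is an assumption, not stated by the project.
        raise ValueError("AUG_PERCENTAGE must be in (0, 1] when METHOD is 1 or 2")
    if method == 2 and (n_variants is None or n_variants < 1):
        raise ValueError("N_VARIANTS must be a positive integer when METHOD is 2")

def run(input_path, output_path, augment, method, aug_percentage=None, n_variants=None):
    """Load titles, augment each one, and write the synthetic dataset."""
    validate_config(method, aug_percentage, n_variants)
    # Assumes the input file holds a JSON array of strings.
    titles = json.loads(Path(input_path).read_text(encoding="utf-8"))
    copies = n_variants if method == 2 else 1
    synthetic = [augment(title) for title in titles for _ in range(copies)]
    out = Path(output_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(synthetic, ensure_ascii=False, indent=2), encoding="utf-8")
    return synthetic
```

Validating the configuration before loading any data means a missing N_VARIANTS or AUG_PERCENTAGE fails fast instead of partway through a long augmentation run.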

Project Structure

.
└── src
    ├── common.py                                 # Shared utility functions
    ├── constants.py                              # Configuration constants
    ├── main.py                                   # Entry point script
    ├── syntheticData.py                          # Data preparation logic
    ├── syntheticDataUsingRandomAugmentation.py   # Random augmentation implementation
    ├── syntheticDataUsingSubstitution.py         # Contextual substitution implementation
    └── data
        ├── input
        │   └── windowTitlesTranslated.json   # Example input file
        └── output
            └── syntheticDataUsingRandomAugmentation.json  # Example output file
