Text augmentation.

augtxt – Text Augmentation

Yet another text augmentation Python package.

Table of Contents

  • Usage

    • augtxt.augmenters - Pipelines

      • sentaugm - Sentence Augmentation

      • wordtypo - Word Typos

      • senttypo - Word typos for a sentence

    • augtxt.typo - Typographical Errors

    • augtxt.punct - Punctuation Errors

    • augtxt.order - Word Order Errors

    • augtxt.wordsubs - Word substitutions

  • Appendix

Usage

import augtxt
import numpy as np

Pipelines

Sentence Augmentations

Check the demo notebook for a usage example.

Word typos

The function augtxt.augmenters.wordtypo randomly applies different augmentations to one word. The result is a simulated distribution of possible word augmentations, e.g. how possible typographical errors are distributed for a specific original word. The procedure does not guarantee that the original word will be augmented.

Check the demo notebook for a usage example.

Word typos for a sentence

The function augtxt.augmenters.senttypo randomly applies different augmentations to a) at least one word in a sentence, or b) not more than a certain percentage of words in a sentence. The procedure guarantees that the sentence is augmented.

The function also allows excluding specific strings from augmentation (e.g. exclude=("[MASK]", "[UNK]")). However, these strings cannot include the special characters .,;:!? (incl. whitespace).
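The selection rule can be sketched as follows. This is not the augtxt implementation: the function name `pick_tokens_to_augment`, the parameter `pct_words`, and the choice to combine "at least one" with "at most a percentage" into one rule are assumptions made for illustration.

```python
import math
import random


def pick_tokens_to_augment(sentence, pct_words=0.2, exclude=(), seed=0):
    """Return the tokens and the indices of tokens chosen for augmentation."""
    rng = random.Random(seed)
    tokens = sentence.split()
    # A token is a candidate unless its core (without .,;:!?) is excluded.
    candidates = [i for i, tok in enumerate(tokens)
                  if tok.strip(".,;:!?") not in exclude]
    if not candidates:
        return tokens, []
    # At least one token, at most pct_words of all tokens (rounded down).
    num = max(1, math.floor(pct_words * len(tokens)))
    num = min(num, len(candidates))
    return tokens, sorted(rng.sample(candidates, num))


tokens, chosen = pick_tokens_to_augment(
    "Die Lehrerin [MASK] einen Roman.", pct_words=0.4, exclude=("[MASK]",))
print([tokens[i] for i in chosen])
```

In this sketch, punctuation is stripped from a token before it is compared with the excluded strings, so an excluded string that itself contained one of these characters could never match.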

Check the demo notebook for a usage example.

Typographical Errors (Tippfehler)

The augtxt.typo module augments characters to mimic the errors humans make when typing on a keyboard.

Swap two consecutive characters (Vertauscher)

A user mixes up two consecutive characters.

  • Swap 1st and 2nd characters: augtxt.typo.swap_consecutive("Kinder", loc=0) (Result: iKnder)

  • Swap 1st and 2nd characters, and keep the letter case at each position: augtxt.typo.swap_consecutive("Kinder", loc=0, keep_case=True) (Result: Iknder)

  • Swap the i-th and (i+1)-th characters at a random position i, with positions near the end of the word being more likely: np.random.seed(seed=123); augtxt.typo.swap_consecutive("Kinder", loc='end')
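The effect of keep_case can be reproduced with a small stand-alone sketch. The function `swap_consecutive_sketch` is a hypothetical re-implementation for illustration, not the code used by augtxt; it only handles integer positions.

```python
def swap_consecutive_sketch(word, loc=0, keep_case=False):
    """Swap the characters at positions loc and loc+1.

    With keep_case=True the upper/lower case pattern stays at its
    position, i.e. only the letters move, not their case.
    """
    if len(word) < 2:
        return word
    loc = max(0, min(loc, len(word) - 2))  # clamp to a valid pair
    first, second = word[loc], word[loc + 1]
    if keep_case:
        new_first = second.upper() if first.isupper() else second.lower()
        new_second = first.upper() if second.isupper() else first.lower()
    else:
        new_first, new_second = second, first
    return word[:loc] + new_first + new_second + word[loc + 2:]


print(swap_consecutive_sketch("Kinder", loc=0))                  # iKnder
print(swap_consecutive_sketch("Kinder", loc=0, keep_case=True))  # Iknder
```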

Add double letter (Einfüger)

The user accidentally presses a key twice.

  • Make 5th letter a double letter: augtxt.typo.pressed_twice("Eltern", loc=4) (Result: Elterrn)

Drop character (Auslasser)

The user does not press the key hard enough (Lisbach, 2011, p.72), the key is broken, or the finger motion fails.

  • Drop the 3rd letter: augtxt.typo.drop_char("Straße", loc=2) (Result: Staße)

Drop character followed by double letter (Vertipper)

A letter is left out, but the following letter is typed twice. It’s a combination of augtxt.typo.pressed_twice and augtxt.typo.drop_char.

from augtxt.typo import drop_n_next_twice
augm = drop_n_next_twice("Tante", loc=2)
# Tatte
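How the three character-level operations relate can be shown with a plain-Python sketch. The `*_sketch` functions below are hypothetical stand-ins written for this illustration; they are not the augtxt functions and only accept integer positions.

```python
def pressed_twice_sketch(word, loc):
    """Repeat the character at position loc."""
    return word[:loc + 1] + word[loc] + word[loc + 1:]


def drop_char_sketch(word, loc):
    """Remove the character at position loc."""
    return word[:loc] + word[loc + 1:]


def drop_n_next_twice_sketch(word, loc):
    """Drop the character at loc, then double the character that followed it."""
    if not 0 <= loc < len(word) - 1:
        return word  # there is no following character to double
    shortened = drop_char_sketch(word, loc)
    # After the drop, the former next character sits at position loc.
    return pressed_twice_sketch(shortened, loc)


print(pressed_twice_sketch("Eltern", 4))      # Elterrn
print(drop_char_sketch("Straße", 2))          # Staße
print(drop_n_next_twice_sketch("Tante", 2))   # Tatte
```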

Pressed SHIFT, ALT, or SHIFT+ALT

Usually SHIFT is used to type a capital letter, and ALT or ALT+SHIFT for less common characters. A typo might occur because these special keys are, or are not, pressed in combination with a normal key. The function augtxt.typo.pressed_shiftalt simulates such errors randomly.

from augtxt.typo import pressed_shiftalt
augm = pressed_shiftalt("Onkel", loc=2)
# OnKel, On˚el, Onel

The keymap can differ depending on the language and the keyboard layout.

from augtxt.typo import pressed_shiftalt
import augtxt.keyboard_layouts as kbl
augm = pressed_shiftalt("Onkel", loc=2, keymap=kbl.macbook_us)
# OnKel, On˚el, Onel

Further, the transition probabilities in case of a typo can be specified:

from augtxt.typo import pressed_shiftalt
import augtxt.keyboard_layouts as kbl

keyboard_transprob = {
    "keys": [.0, .75, .2, .05],
    "shift": [.9, 0, .05, .05],
    "alt": [.9, .05, .0, .05],
    "shift+alt": [.3, .35, .35, .0]
}

augm = pressed_shiftalt("Onkel", loc=2, keymap=kbl.macbook_us, trans=keyboard_transprob)
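One way to read such a table is as a row-stochastic transition matrix. The sketch below assumes (this is not confirmed by the source) that each list gives the probabilities of ending up in the states keys, shift, alt, shift+alt, in that order, given the state named by the dictionary key. `STATES`, `check_rows`, and `sample_next_state` are hypothetical names.

```python
import math
import random

# Assumed column order of each probability list.
STATES = ["keys", "shift", "alt", "shift+alt"]

keyboard_transprob = {
    "keys": [.0, .75, .2, .05],
    "shift": [.9, 0, .05, .05],
    "alt": [.9, .05, .0, .05],
    "shift+alt": [.3, .35, .35, .0],
}


def check_rows(trans):
    """Raise ValueError unless every row is a probability distribution."""
    for state, row in trans.items():
        if len(row) != len(STATES) or not math.isclose(sum(row), 1.0):
            raise ValueError(f"row {state!r} is not a valid distribution")


def sample_next_state(trans, current, rng):
    """Draw the erroneous key state given the intended one."""
    return rng.choices(STATES, weights=trans[current], k=1)[0]


check_rows(keyboard_transprob)
rng = random.Random(42)
samples = [sample_next_state(keyboard_transprob, "keys", rng)
           for _ in range(1000)]
print({s: samples.count(s) for s in STATES})
```

Under this reading, the zero on the diagonal means a typo never maps a state to itself, so a typo always changes which modifier keys are pressed.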

References

Punctuation Errors (Zeichensetzungsfehler)

Remove PUNCT and COMMA tokens

The PUNCT (.?!;:) and COMMA (,) tokens carry syntactic information. A use case:

import augtxt.punct
text = ("Die Lehrerin [MASK] einen Roman. "
        "Die Schülerin [MASK] ein Aufsatz, der sehr [MASK] war.")
augmented = augtxt.punct.remove_syntaxinfo(text)
# 'Die Lehrerin [MASK] einen Roman Die Schülerin [MASK] ein Aufsatz der sehr [MASK] war'
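The same effect can be approximated with a regular expression. This is a sketch under the assumption that only the characters .?!;:, are removed and whitespace is normalized afterwards; `remove_syntaxinfo_sketch` is a hypothetical name, not the augtxt function.

```python
import re


def remove_syntaxinfo_sketch(text):
    """Delete the characters .?!;:, and collapse repeated whitespace."""
    stripped = re.sub(r"[.?!;:,]", "", text)
    return " ".join(stripped.split())


text = ("Die Lehrerin [MASK] einen Roman. "
        "Die Schülerin [MASK] ein Aufsatz, der sehr [MASK] war.")
print(remove_syntaxinfo_sketch(text))
```

Note that placeholders such as [MASK] survive because square brackets are not in the removed character set.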

Merge two consecutive words

The function augtxt.punct.merge_words randomly removes whitespace or hyphens between words and transforms the second word to lower case.

import numpy as np
import augtxt.punct

text = "Die Bindestrich-Wörter sind da."

np.random.seed(seed=23)
augmented = augtxt.punct.merge_words(text, num_aug=1)
assert augmented == 'Die Bindestrich-Wörter sindda.'

np.random.seed(seed=1)
augmented = augtxt.punct.merge_words(text, num_aug=1)
assert augmented == 'Die Bindestrichwörter sind da.'
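A deterministic sketch of the merge step, with the separator chosen explicitly instead of at random, may help to see what happens. `merge_words_sketch` and its `sep_index` parameter are hypothetical and only cover single spaces and hyphens as separators.

```python
import re


def merge_words_sketch(text, sep_index):
    """Remove the sep_index-th separator (space or hyphen, 0-based)
    and lower-case the word that follows it."""
    # The capturing group keeps the separators in the result list:
    # [word, sep, word, sep, ..., word]
    parts = re.split(r"([ -])", text)
    pos = 2 * sep_index + 1
    if not 0 < pos < len(parts) - 1:
        raise IndexError("no separator at this index")
    merged = parts[pos - 1] + parts[pos + 1].lower()
    return "".join(parts[:pos - 1] + [merged] + parts[pos + 2:])


text = "Die Bindestrich-Wörter sind da."
print(merge_words_sketch(text, 1))  # Die Bindestrichwörter sind da.
print(merge_words_sketch(text, 3))  # Die Bindestrich-Wörter sindda.
```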

Word Order Errors (Wortstellungsfehler)

The augtxt.order module simulates errors at the word-token level.

Swap words

np.random.seed(seed=42)
text = "Tausche die Wörter, lasse sie weg, oder [MASK] was."
print(augtxt.order.swap_consecutive(text, exclude=["[MASK]"], num_aug=1))
# die Tausche Wörter, lasse sie weg, oder [MASK] was.

Write twice

np.random.seed(seed=42)
text = "Tausche die Wörter, lasse sie weg, oder [MASK] was."
print(augtxt.order.write_twice(text, exclude=["[MASK]"], num_aug=1))
# Tausche die die Wörter, lasse sie weg, oder [MASK] was.

Drop word

np.random.seed(seed=42)
text = "Tausche die Wörter, lasse sie weg, oder [MASK] was."
print(augtxt.order.drop_word(text, exclude=["[MASK]"], num_aug=1))
# Tausche Wörter, lasse sie weg, oder [MASK] was.

Drop word followed by a double word

np.random.seed(seed=42)
text = "Tausche die Wörter, lasse sie weg, oder [MASK] was."
print(augtxt.order.drop_n_next_twice(text, exclude=["[MASK]"], num_aug=1))
# die die Wörter, lasse sie weg, oder [MASK] was.
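The four word-level operations above can be sketched on a whitespace-tokenized list with an explicit position instead of a random one. The `*_sketch` functions and `eligible_positions` are hypothetical helpers for illustration; in particular, the rule that a position is skipped when either affected token is excluded is an assumption.

```python
def eligible_positions(tokens, exclude=()):
    """Positions i where neither tokens[i] nor tokens[i+1] is excluded."""
    return [i for i in range(len(tokens) - 1)
            if tokens[i] not in exclude and tokens[i + 1] not in exclude]


def swap_words_sketch(tokens, i):
    """Swap the tokens at positions i and i+1."""
    return tokens[:i] + [tokens[i + 1], tokens[i]] + tokens[i + 2:]


def write_twice_sketch(tokens, i):
    """Repeat the token at position i."""
    return tokens[:i + 1] + [tokens[i]] + tokens[i + 1:]


def drop_word_sketch(tokens, i):
    """Remove the token at position i."""
    return tokens[:i] + tokens[i + 1:]


def drop_n_next_twice_sketch(tokens, i):
    """Drop the token at i, then repeat the token that followed it."""
    return write_twice_sketch(drop_word_sketch(tokens, i), i)


text = "Tausche die Wörter, lasse sie weg, oder [MASK] was."
tokens = text.split()
print(eligible_positions(tokens, exclude=["[MASK]"]))
print(" ".join(swap_words_sketch(tokens, 0)))
print(" ".join(write_twice_sketch(tokens, 1)))
print(" ".join(drop_word_sketch(tokens, 1)))
print(" ".join(drop_n_next_twice_sketch(tokens, 0)))
```

Because the sketch splits on whitespace only, punctuation stays attached to its word, as in "Wörter,".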

Word substitutions (Deprecated)

Deprecation Notice: augtxt.wordsubs will be deleted in 0.6.0 and replaced. Synonym replacement in particular is not trivial in German. Please check https://github.com/ulf1/flexion for further information.

The augtxt.wordsubs module is about replacing specific strings, e.g. words, morphemes, named entities, abbreviations, etc.

Using pseudo-synonym dictionaries to augment tokenized sequences

It is recommended to filter vocab further. For example, PoS tag the sequences and only augment VERB and NOUN tokens.

import itertools
import augtxt.wordsubs
import numpy as np

original_seqs = [["Das", "ist", "ein", "Satz", "."], ["Dies", "ist", "ein", "anderer", "Satz", "."]]
vocab = set([s.lower() for s in itertools.chain(*original_seqs) if len(s) > 1])

synonyms = {
    'anderer': ['verschiedener', 'einiger', 'vieler', 'diverser', 'sonstiger',
                'etlicher', 'einzelner', 'bestimmter', 'ähnlicher'],
    'satz': ['sätze', 'anfangssatz', 'schlussatz', 'eingangssatz', 'einleitungssatzes',
             'einleitungsssatz', 'einleitungssatz', 'behauptungssatz', 'beispielsatz',
             'schlusssatz', 'anfangssatzes', 'einzelsatz', '#einleitungssatz',
             'minimalsatz', 'inhaltssatz', 'aufforderungssatz', 'ausgangssatz'],
    '.': [',', '🎅'],
    'das': ['welches', 'solches'],
    'ein': ['weiteres'],
    'dies': ['was', 'umstand', 'dass']
}

np.random.seed(42)
augmented_seqs = augtxt.wordsubs.synonym_replacement(
    original_seqs, synonyms, num_aug=10, keep_case=True)

# check results for 1st sentence
for s in augmented_seqs[0]:
    print(s)
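The recommended PoS filtering could look like the following sketch. The tags in `pos_tags` are assigned by hand for this example; in practice they would come from a PoS tagger of your choice. `filter_synonyms_by_pos` is a hypothetical helper and not part of augtxt.

```python
def filter_synonyms_by_pos(synonyms, pos_tags, keep=("NOUN", "VERB")):
    """Keep only synonym entries whose key is tagged with a PoS in keep."""
    return {word: subs for word, subs in synonyms.items()
            if pos_tags.get(word) in keep}


# Shortened version of the synonym dictionary, for illustration.
synonyms_small = {
    "satz": ["beispielsatz", "einzelsatz"],
    "ein": ["weiteres"],
    ".": [","],
}

# Hand-assigned tags (assumption); unknown words are dropped.
pos_tags = {"satz": "NOUN", "ein": "DET", ".": "PUNCT"}

filtered = filter_synonyms_by_pos(synonyms_small, pos_tags)
print(filtered)
```

The filtered dictionary can then be passed on in place of the full one, so that determiners and punctuation are left untouched.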

Appendix

Installation

The augtxt git repo is available as a PyPI package:

pip install "augtxt>=0.5.0"
pip install git+ssh://git@github.com/ulf1/augtxt.git

Commands

Install a virtual environment

python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
pip install -r requirements-dev.txt
pip install -r requirements-demo.txt

(If your git repo is stored in a folder whose path contains whitespace, don’t use the subfolder .venv. Use an absolute path without whitespace.)

Python commands

  • Check syntax: flake8 --ignore=F401 --exclude=$(grep -v '^#' .gitignore | xargs | sed -e 's/ /,/g')

  • Run Unit Tests: pytest

Publish

pandoc README.md --from markdown --to rst -s -o README.rst
python setup.py sdist
twine upload -r pypi dist/*

Clean up

find . -type f -name "*.pyc" | xargs rm
find . -type d -name "__pycache__" | xargs rm -r
rm -r .pytest_cache
rm -r .venv

Support

Please open an issue for support.

Contributing

Please contribute using GitHub Flow. Create a branch, add commits, and open a pull request.
