Augmentex — a library for augmenting texts with errors

Augmentex provides rule-based and statistics-based (powered by the KartaSlov project) approaches for inserting errors into text. It is described in full in the Paper and in this 🗣️Talk.

Installation

pip install augmentex

Implemented functionality

We collected statistics from different languages and from different input sources. This table shows what functionality the library currently supports.

Russian English
PC keyboard
Mobile kb

We plan to extend the functionality to new languages and other input sources.

Usage

🖇️ Augmentex lets you corrupt text at two levels of granularity, each with its own set of methods:

  • Word level:
    • replace - replace a random word with its incorrect counterpart;
    • delete - delete a random word;
    • swap - swap two random words;
    • stopword - add random words from a stop-list;
    • split - insert spaces between the letters of a word;
    • reverse - change the case of the first letter of a random word;
    • text2emoji - change a word to the corresponding emoji;
    • ngram - replace an n-gram in a word with an erroneous one.
  • Character level:
    • shift - randomly swap upper / lower case in a string;
    • orfo - substitute correct characters with their common incorrect counterparts;
    • typo - substitute correct characters as if they are mistyped on a keyboard;
    • delete - delete random character;
    • insert - insert random character;
    • multiply - multiply random character;
    • swap - swap two adjacent characters.
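
To make two of the character-level definitions concrete, here is a minimal standalone sketch. It does not use Augmentex and is not the library's implementation; `swap_adjacent` and `multiply_char` are hypothetical helpers written only to illustrate what "swap" and "multiply" mean at a fixed position.

```python
def swap_adjacent(text, index):
    """Swap the characters at positions index and index + 1."""
    chars = list(text)
    chars[index], chars[index + 1] = chars[index + 1], chars[index]
    return "".join(chars)


def multiply_char(text, index, times):
    """Repeat the character at index so that it occurs `times` times in a row."""
    return text[:index] + text[index] * times + text[index + 1:]


print(swap_adjacent("home", 1))     # hmoe
print(multiply_char("guys", 3, 2))  # guyss
```

In the library itself the positions are chosen at random; the sketch fixes them so the effect is easy to follow.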

Word level

from augmentex import WordAug

word_aug = WordAug(
    unit_prob=0.4, # Percentage of the phrase to which augmentations will be applied
    min_aug=1, # Minimum number of augmentations
    max_aug=5, # Maximum number of augmentations
    lang="eng", # supports: "rus", "eng"
    platform="pc", # supports: "pc", "mobile"
    random_seed=42,
    )
  1. Replace a random word with its incorrect counterpart;
text = "Screw you guys, I am going home. (c)"
word_aug.augment(text=text, action="replace")
# Screw to guys, I to going com. (c)
  2. Delete a random word;
text = "Screw you guys, I am going home. (c)"
word_aug.augment(text=text, action="delete")
# you I am home. (c)
  3. Swap two random words;
text = "Screw you guys, I am going home. (c)"
word_aug.augment(text=text, action="swap")
# Screw I guys, am home. going you (c)
  4. Add random words from a stop-list;
text = "Screw you guys, I am going home. (c)"
word_aug.augment(text=text, action="stopword")
# like Screw you guys, I am going completely home. by the way (c)
  5. Insert spaces between the letters of a word;
text = "Screw you guys, I am going home. (c)"
word_aug.augment(text=text, action="split")
# Screw y o u guys, I am going h o m e . (c)
  6. Change the case of the first letter of a random word;
text = "Screw you guys, I am going home. (c)"
word_aug.augment(text=text, action="reverse")
# Screw You guys, i Am going home. (c)
  7. Change a word to the corresponding emoji;
text = "Screw you guys, I am going home. (c)"
word_aug.augment(text=text, action="text2emoji")
# Screw you guys, I am going home. (c)
  8. Replace an n-gram in a word with an erroneous one.
text = "Screw you guys, I am going home. (c)"
word_aug.augment(text=text, action="ngram")
# Scren you guys, I am going home. (c)
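
The sketch below illustrates one plausible reading of how unit_prob, min_aug and max_aug interact. It is an assumption, not taken from the Augmentex source: the number of corrupted words is taken as unit_prob times the number of words, clamped to the range [min_aug, max_aug]. The helpers `pick_word_indices` and `delete_words` are hypothetical and use only the standard library.

```python
import random


def pick_word_indices(text, unit_prob=0.4, min_aug=1, max_aug=5, seed=42):
    """Pick which word positions to corrupt (hypothetical helper, not part of Augmentex)."""
    words = text.split()
    # Assumption: target count = unit_prob * number of words,
    # clamped to [min_aug, max_aug] and never above the number of words.
    target = int(len(words) * unit_prob)
    count = max(min_aug, min(max_aug, target))
    count = min(count, len(words))
    rng = random.Random(seed)
    return sorted(rng.sample(range(len(words)), count))


def delete_words(text, **kwargs):
    """Drop the selected words, mimicking the idea behind action="delete"."""
    drop = set(pick_word_indices(text, **kwargs))
    return " ".join(w for i, w in enumerate(text.split()) if i not in drop)


text = "Screw you guys, I am going home. (c)"
print(delete_words(text))
```

With the eight-word example phrase and unit_prob=0.4 this removes three words, which matches the number of words affected in the delete example above; which words are removed depends on the random seed.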

Character level

from augmentex import CharAug

char_aug = CharAug(
    unit_prob=0.3, # Percentage of the phrase to which augmentations will be applied
    min_aug=1, # Minimum number of augmentations
    max_aug=5, # Maximum number of augmentations
    mult_num=3, # Maximum number of repetitions of characters (only for the multiply method)
    lang="eng", # supports: "rus", "eng"
    platform="pc", # supports: "pc", "mobile"
    random_seed=42,
    )
  1. Randomly swap upper / lower case in a string;
text = "Screw you guys, I am going home. (c)"
char_aug.augment(text=text, action="shift")
# Screw YoU guys, I am going Home. (C)
  2. Substitute correct characters with their common incorrect counterparts;
text = "Screw you guys, I am going home. (c)"
char_aug.augment(text=text, action="orfo")
# Sedew you guya, I am going home. (c)
  3. Substitute correct characters as if they are mistyped on a keyboard;
text = "Screw you guys, I am going home. (c)"
char_aug.augment(text=text, action="typo")
# Sxrew you gugs, I am going home. (x)
  4. Delete a random character;
text = "Screw you guys, I am going home. (c)"
char_aug.augment(text=text, action="delete")
# crew you guys Iam goinghme. (c)
  5. Insert a random character;
text = "Screw you guys, I am going home. (c)"
char_aug.augment(text=text, action="insert")
# Screw you ughuys, I vam gcoing hxome. (c)
  6. Multiply a random character;
text = "Screw you guys, I am going home. (c)"
char_aug.augment(text=text, action="multiply")
# Screw yyou guyss, I am ggoinng home. (c)
  7. Swap two adjacent characters.
text = "Screw you guys, I am going home. (c)"
char_aug.augment(text=text, action="swap")
# Srcewy ou guys,I  am oging hmoe. (c)
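
The idea behind the typo action can be sketched without the library as a lookup of neighbouring keys. The map below is a tiny hand-written fragment of a QWERTY layout, and `typo_sketch` with its per-character `prob` parameter is a hypothetical helper; Augmentex uses its own keyboard statistics and its own selection logic, neither of which is reproduced here.

```python
import random

# Hand-written neighbour map for a few QWERTY keys (illustrative only).
QWERTY_NEIGHBOURS = {
    "a": "qwsz",
    "c": "xdfv",
    "e": "wsdr",
    "g": "ftyhbv",
    "o": "iklp",
    "s": "awedxz",
    "y": "tghu",
}


def typo_sketch(text, prob=0.3, seed=42):
    """Replace some characters with a neighbouring key (hypothetical helper)."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        neighbours = QWERTY_NEIGHBOURS.get(ch.lower())
        # Characters without an entry in the map are always kept as they are.
        if neighbours and rng.random() < prob:
            new = rng.choice(neighbours)
            out.append(new.upper() if ch.isupper() else new)
        else:
            out.append(ch)
    return "".join(out)


text = "Screw you guys, I am going home. (c)"
print(typo_sketch(text))
```

The output always has the same length as the input, and every changed character is a key adjacent to the original one, which is the property that distinguishes typo from insert or orfo.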

Batch processing

📁 For batch processing, call the aug_batch method instead of augment and pass it a list of strings.

from augmentex import WordAug

word_aug = WordAug(
    unit_prob=0.4, # Percentage of the phrase to which augmentations will be applied
    min_aug=1, # Minimum number of augmentations
    max_aug=5, # Maximum number of augmentations
    lang="eng", # supports: "rus", "eng"
    platform="pc", # supports: "pc", "mobile"
    random_seed=42,
    )

text_list = ["Screw you guys, I am going home. (c)"] * 10
word_aug.aug_batch(text_list, batch_prob=0.5) # without action

text_list = ["Screw you guys, I am going home. (c)"] * 10
word_aug.aug_batch(text_list, batch_prob=0.5, action="replace") # with action
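
The following standalone sketch shows one way to think about batch_prob, assuming it is the share of texts in the batch that get augmented while the rest pass through unchanged. That reading is an assumption and `aug_batch_sketch` is a hypothetical helper; check the library source for the exact behaviour.

```python
import random


def aug_batch_sketch(texts, augment_fn, batch_prob=0.5, seed=42):
    """Apply augment_fn to a random batch_prob share of texts (hypothetical helper)."""
    rng = random.Random(seed)
    k = int(len(texts) * batch_prob)
    chosen = set(rng.sample(range(len(texts)), k))
    return [augment_fn(t) if i in chosen else t for i, t in enumerate(texts)]


texts = ["Screw you guys, I am going home. (c)"] * 10
# str.upper stands in for a real augmentation so the effect is easy to see.
result = aug_batch_sketch(texts, str.upper, batch_prob=0.5)
print(sum(r != t for r, t in zip(result, texts)))  # 5
```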

Compute your own statistics

📊 To use your own statistics for the replace and orfo methods, pass two paths to parallel corpora: one file with error-free texts and one with the same texts containing errors.

Example of txt files:

texts_without_errors.txt:

some text without errors 1
some text without errors 2
some text without errors 3
...

texts_with_errors.txt:

some text with errors 1
some text with errors 2
some text with errors 3
...

from augmentex import WordAug

word_aug = WordAug(
    unit_prob=0.4, # Percentage of the phrase to which augmentations will be applied
    min_aug=1, # Minimum number of augmentations
    max_aug=5, # Maximum number of augmentations
    lang="eng", # supports: "rus", "eng"
    platform="pc", # supports: "pc", "mobile"
    random_seed=42,
    correct_texts_path="correct_texts.txt",
    error_texts_path="error_texts.txt",
    )
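
To give an intuition for what statistics can be derived from such parallel corpora, the sketch below counts word-level substitutions between aligned lines. It is a simplified illustration, not the procedure Augmentex runs internally: `collect_replacements` is a hypothetical helper, and it only compares lines that contain the same number of words.

```python
from collections import Counter, defaultdict


def collect_replacements(correct_lines, error_lines):
    """Count word-level substitutions between aligned lines (hypothetical helper)."""
    stats = defaultdict(Counter)
    for correct, wrong in zip(correct_lines, error_lines):
        c_words, w_words = correct.split(), wrong.split()
        # Simplification: skip line pairs whose word counts differ,
        # since they cannot be compared position by position.
        if len(c_words) != len(w_words):
            continue
        for c, w in zip(c_words, w_words):
            if c != w:
                stats[c][w] += 1
    return stats


correct = ["I am going home", "see you there"]
errors = ["I am goin home", "see yu their"]
stats = collect_replacements(correct, errors)
print(dict(stats["going"]))  # {'goin': 1}
```

The line-by-line pairing is why the two files must be parallel: line N of the error file has to be the corrupted version of line N of the correct file.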

Google Colab example

You can explore the usage in the Colab example: Try In Colab!

Contributing

Issue

  • If you see an open issue that you are willing to work on, assign yourself to it and comment with an estimate of how long the fix will take. See the Pull request section below.
  • If you want to add something new or if you find a bug, you should start by creating a new issue and describing the problem/feature. Don't forget to include the appropriate labels.

Pull request

How to make a pull request.

  1. Clone the repository;
  2. Create a new branch, for example git checkout -b issue-id-short-name;
  3. Make changes to the code (make sure you are definitely working in the new branch);
  4. git push;
  5. Create a pull request to the develop branch;
  6. Add a brief description of the work done;
  7. Expect comments from the authors.

References

  • SAGE: a superlib developed by the AGI NLP team jointly with our friends; it provides advanced spelling corruption and spell checking techniques, including ones that use Augmentex.
