MrS SpELliNgS

a micro utility to procedurally generate plausible misspellings



Install

from PyPI

pip install mrs-spellings

from source

python -m pip install git+https://github.com/CircArgs/mrs_spellings.git

Use Cases

  • Generate misspellings, with low overhead, to look up and replace during text cleaning
  • Replace words with plausible misspellings as a data augmentation
    • during training, to make your model less susceptible to misspellings
    • at test time, as part of test-time augmentation (TTA)
  • Supplement an existing solution for out-of-vocabulary words, i.e. words that do not appear in an existing replacement dictionary
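
As a rough illustration of the text-cleaning use case, the sketch below builds a reverse lookup from misspelling to intended word. The names `build_correction_table` and `clean` are hypothetical helpers, not part of the package, and the `SAMPLE` sets are copied from the outputs in the Usage section so the block runs without the package installed. With the package available, something like `lambda w: MrsWord(w).swap()` could presumably be passed as `misspell` instead.

```python
def build_correction_table(vocabulary, misspell):
    """Map each generated misspelling back to the word it came from.

    `misspell` is any callable that returns a set of candidate strings.
    """
    vocab = set(vocabulary)
    table = {}
    ambiguous = set()
    for word in sorted(vocab):
        for candidate in misspell(word):
            if candidate in vocab:
                continue  # never "correct" something that is already a valid word
            if candidate in table and table[candidate] != word:
                ambiguous.add(candidate)  # two words share this misspelling
            table.setdefault(candidate, word)
    for candidate in ambiguous:
        del table[candidate]  # drop entries that cannot be resolved safely
    return table


def clean(text, table):
    """Replace known misspellings token by token (whitespace tokenization)."""
    return " ".join(table.get(token, token) for token in text.split())


# Stand-in data taken from the Usage outputs; hypothetical replacement for the package calls.
SAMPLE = {
    "hello": {"ehllo", "hello", "helol", "hlelo"},
    "world": {"orld", "wold", "word", "worl", "wrld"},
}

table = build_correction_table(
    ["hello", "world", "word"], lambda w: SAMPLE.get(w, set())
)
print(clean("ehllo wrld word", table))
```

Skipping candidates that are themselves in the vocabulary matters here: "word" is a one-deletion misspelling of "world", but it is also a valid word and should be left alone.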

Usage

There are 3 primary methods currently supported:

In [1]: from mrs_spellings import MrsWord, MrsSpellings
# methods return MrsSpellings
In [2]: MrsWord("hello").swap()
Out[2]: {'ehllo', 'hello', 'helol', 'hlelo'}

In [3]: MrsWord("hello").delete(number_deletes=1)                                                                                                                                                    
Out[3]: {'ello', 'hell', 'helo', 'hllo'}

In [4]: MrsWord("hello").qwerty_swap(max_distance=1)                                                                                                                                                 
Out[4]: 
{'gello',
 'h3llo',
 'hdllo',
 'he,lo',
 'he:lo',
  ...
 'jello',
 'nello',
 'yello'}
# simply chain methods
In [5]: MrsWord("hello").swap().delete()                                                                                                                                                             
Out[5]: 
{'ehll',
 'ehlo',
 'ello',
  ...
 'hllo',
 'hlol',
 'lelo'}
 
# MrsWord is a string
In [6]: MrsWord("Hello") + " " + MrsWord("World")                                                                                                                                                        
Out[6]: 'Hello World'

In [7]: MrsWord("Hello {}").format("world")                                                                                                                                                      
Out[7]: 'Hello world'

# MrsSpellings work as sets
In [8]: MrsWord("hello").swap().union(MrsWord("world").delete())                                                                                                                        
Out[8]: {'ehllo', 'hello', 'helol', 'hlelo', 'orld', 'wold', 'word', 'worl', 'wrld'}

In [9]: MrsWord("hello").delete(1)-MrsWord("hello").delete(1)                                                                                                                                        
Out[9]: set()

In [10]: " ".join(MrsWord("Hello").qwerty_swap())                                                                                                                                                     
Out[10]: 'Helko Hdllo Yello He,lo Helll Hellp Hel,o Nello Heklo Hrllo H3llo Gello Heolo He:lo Helli Hell9 Heloo Hel:o Jello Hwllo'

Methods

deletion

Signature: MrsWord.delete(number_deletes=1)
Docstring:
delete `number_deletes` characters from this word

Args:
    number_deletes (int): number of deletions to perform

Returns:
    MrsSpellings (set): all possible misspellings that form as a result of `number_deletes` deletions
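
To make the behaviour concrete, here is a minimal pure-Python sketch of the same idea. It is not the package's implementation; `delete_chars` is a hypothetical name, and returning an empty set when more deletions are requested than there are characters is a choice made here, not documented behaviour.

```python
from itertools import combinations


def delete_chars(word, number_deletes=1):
    """Return every spelling obtained by removing `number_deletes` characters."""
    keep = len(word) - number_deletes
    if number_deletes < 0 or keep < 0:
        return set()  # assumption: nothing can be produced in this case
    # Every length-`keep` subsequence of `word` is `word` with
    # `number_deletes` characters removed, in the original order.
    return {"".join(chars) for chars in combinations(word, keep)}


print(sorted(delete_chars("hello")))
```

Because the result is a set, deleting either "l" from "hello" yields a single "helo" entry.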

swapping

Signature: MrsWord.swap()
Docstring:
swap a pair of consecutive characters

Args:

Returns:
    MrsSpellings (set): all possible misspellings that form as a result of swapping consecutive characters
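
A minimal sketch of adjacent swapping in plain Python follows. It is an illustration only, with the hypothetical name `swap_adjacent`, and it assumes exactly one pair is swapped per result, which is what the output in the Usage section suggests.

```python
def swap_adjacent(word):
    """Return every spelling obtained by swapping one pair of neighbouring characters."""
    return {
        word[:i] + word[i + 1] + word[i] + word[i + 2:]
        for i in range(len(word) - 1)
    }


print(sorted(swap_adjacent("hello")))
```

Swapping the two identical "l" characters reproduces the original word, which is why "hello" itself appears in the result.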

qwerty distance (taxi-cab) based swapping

Signature: MrsWord.qwerty_swap(max_distance=1)
Docstring:

swap characters with their qwerty neighbors

Args:
    max_distance (int): the max distance (taxi-cab) of keys on the keyboard to swap
                        e.g. with `max_distance=1`, "g" could become one of ['f', 'h'];
                            with `max_distance=2`, "g" could become one of ['f', 'h', 't', 'y', 'v', 'b']
                            Note: the number of possible swaps grows with distance, but not uniformly.
                            For example, the third group of keys from "g" is ['6', 'd', 'j'], while the second is ['t', 'y', 'v', 'b']
Returns:
    MrsSpellings (set): all possible misspellings that form as a result of swapping characters with qwerty neighbors
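
The substitution step itself can be sketched independently of how neighbours are found. In the block below, `substitute_neighbors` and the `NEIGHBORS` map are hypothetical; the map only contains the `max_distance=1` neighbours of "g" quoted in the docstring, whereas the package derives neighbours for every key.

```python
def substitute_neighbors(word, neighbors):
    """Replace one character at a time with each of its listed keyboard neighbours."""
    results = set()
    for i, char in enumerate(word):
        for replacement in neighbors.get(char, ()):
            results.add(word[:i] + replacement + word[i + 1:])
    return results


# Hypothetical neighbour map, limited to the example given in the docstring.
NEIGHBORS = {"g": ["f", "h"]}

print(sorted(substitute_neighbors("egg", NEIGHBORS)))
```

Characters without an entry in the map are left untouched, so only the two "g" positions produce variants here.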

what is qwerty distance?

Qwerty distance is the distance between keys on a typical keyboard. For the purposes of this package, the following assumptions are made:

  • each row is offset by half a key width relative to the row above it
  • the L1 (taxi-cab) distance is a good estimate of the natural travel distance between keys on the keyboard
  • characters that need the shift key are treated as farther away, because shift has to be held down

Here is an example of the results of these assumptions. The keys around the g key, grouped by equal distance and listed from nearest to farthest, are:

[['f', 'h'],
 ['t', 'y', 'v', 'b'],
 ['6', 'd', 'j'],
 ['r', 'u', 'c', 'n'],
 ['^', '5', '7', 's', 'k'],
 ['e', 'i', 'x', 'm'],
 ['%', '&', '4', '8', 'a', 'l'],
 ['w', 'o', 'z', '<'],
 ['$', '*', '3', '9', ':'],
 ['q', 'p', ','],
 ['#', '(', '2', '0', ';'],
 ['[', '>'],
 ['@', ')', '1', '-', '"'],
 [']', '.'],
 ['!', '_', '`', '=', "'"],
 ['\\', '?'],
 ['~', '+', '{'],
 ['/'],
 ['}'],
 ['|']]
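
The three assumptions above can be turned into a small distance model. The sketch below is an approximation, not the package's code: the US key layout in `ROWS`, the shift penalty of one key width, and the names `qwerty_distance` and `neighbors_by_distance` are all assumptions made for illustration. For the g key, the nearest groups it produces agree with the start of the list above; groups farther out depend on the exact key table the package uses and may differ.

```python
from collections import defaultdict

# Assumed US QWERTY rows: (unshifted keys, shifted keys, x position of the first key).
# 'q' sits half a key to the right of '1', 'a' half a key right of 'q', and so on.
ROWS = [
    ("`1234567890-=", "~!@#$%^&*()_+", 0.0),
    ("qwertyuiop[]\\", "QWERTYUIOP{}|", 1.5),
    ("asdfghjkl;'", 'ASDFGHJKL:"', 2.0),
    ("zxcvbnm,./", "ZXCVBNM<>?", 2.5),
]
SHIFT_PENALTY = 1.0  # assumed extra distance for characters that need shift


def build_key_table():
    """Map each character to (x, row, shift penalty)."""
    table = {}
    for row, (plain, shifted, start) in enumerate(ROWS):
        for col, (p, s) in enumerate(zip(plain, shifted)):
            table[p] = (start + col, row, 0.0)
            if not p.isalpha():  # uppercase letters are left out of this sketch
                table[s] = (start + col, row, SHIFT_PENALTY)
    return table


KEYS = build_key_table()


def qwerty_distance(a, b):
    """L1 distance between two keys, plus a penalty if only one of them needs shift."""
    xa, ya, sa = KEYS[a]
    xb, yb, sb = KEYS[b]
    return abs(xa - xb) + abs(ya - yb) + abs(sa - sb)


def neighbors_by_distance(key):
    """Group all other characters by distance from `key`, nearest group first."""
    groups = defaultdict(list)
    for other in KEYS:
        if other != key:
            groups[qwerty_distance(key, other)].append(other)
    return [groups[d] for d in sorted(groups)]


for group in neighbors_by_distance("g")[:3]:
    print(group)
```

The half-key row offset is what places "t", "y", "v" and "b" at distance 1.5 from "g", between the same-row neighbours at distance 1 and the keys at distance 2.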
