MrS SpELliNgS

a micro utility to procedurally generate plausible misspellings


Install

from pypi

pip install mrs-spellings

from source

python -m pip install git+https://github.com/CircArgs/mrs_spellings.git

Use Cases

  • Generate misspellings to replace during text cleaning, with low overhead
  • Replace words with plausible misspellings as an augmentation
    • during training, to make your model less susceptible to misspellings
    • at test time, as part of test-time augmentation (TTA)
  • Supplement an existing solution for out-of-vocabulary words, i.e. words that do not appear in an existing replacement dictionary
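
As a rough illustration of the text-cleaning use case, the sketch below builds a lookup table from misspelling to intended word. `build_correction_table` and `README_DELETES` are hypothetical names used only for this example and are not part of the package. The stand-in data is copied from the `delete()` outputs shown in this README, so the block runs without the package installed.

```python
# Stand-in for a misspelling generator. The sets are the delete() results
# shown in this README. With the package installed, something like
# `lambda w: MrsWord(w).delete()` could be passed instead.
README_DELETES = {
    "hello": {"ello", "hell", "helo", "hllo"},
    "world": {"orld", "wold", "word", "worl", "wrld"},
}


def build_correction_table(vocabulary, misspell):
    """Map each generated misspelling back to the word it came from.

    Variants that are themselves in the vocabulary are skipped so that
    valid words are never "corrected".
    """
    table = {}
    for word in sorted(vocabulary):  # sorted for deterministic results
        for variant in misspell(word):
            if variant not in vocabulary:
                # keep the first mapping if two words share a variant
                table.setdefault(variant, word)
    return table


vocabulary = {"hello", "world", "word"}
table = build_correction_table(
    vocabulary, lambda w: README_DELETES.get(w, set())
)
print(table["helo"])    # hello
print("word" in table)  # False, because "word" is a valid vocabulary entry
```

Skipping variants that are already valid words is a design choice for this sketch. A real cleaning pipeline may also need a policy for variants shared by several words.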

Usage

There are 3 primary methods currently supported:

In [1]: from mrs_spellings import MrsWord, MrsSpellings
# every method returns a MrsSpellings
In [2]: MrsWord("hello").swap()                                                                                                                                                                      
Out[2]: {'ehllo', 'hello', 'helol', 'hlelo'}

In [3]: MrsWord("hello").delete(number_deletes=1)                                                                                                                                                    
Out[3]: {'ello', 'hell', 'helo', 'hllo'}

In [4]: MrsWord("hello").qwerty_swap(max_distance=1)                                                                                                                                                 
Out[4]: 
{'gello',
 'h3llo',
 'hdllo',
 'he,lo',
 'he:lo',
  ...
 'jello',
 'nello',
 'yello'}
# simply chain methods
In [5]: MrsWord("hello").swap().delete()                                                                                                                                                             
Out[5]: 
{'ehll',
 'ehlo',
 'ello',
  ...
 'hllo',
 'hlol',
 'lelo'}
 
# MrsWord is a string
In [6]: MrsWord("Hello") + " " + MrsWord("World")                                                                                                                                                        
Out[6]: 'Hello World'

In [7]: MrsWord("Hello {}").format("world")                                                                                                                                                      
Out[7]: 'Hello world'

# MrsSpellings work as sets
In [8]: MrsWord("hello").swap().union(MrsWord("world").delete())                                                                                                                        
Out[8]: {'ehllo', 'hello', 'helol', 'hlelo', 'orld', 'wold', 'word', 'worl', 'wrld'}

In [9]: MrsWord("hello").delete(1)-MrsWord("hello").delete(1)                                                                                                                                        
Out[9]: set()

In [10]: " ".join(MrsWord("Hello").qwerty_swap())                                                                                                                                                     
Out[10]: 'Helko Hdllo Yello He,lo Helll Hellp Hel,o Nello Heklo Hrllo H3llo Gello Heolo He:lo Helli Hell9 Heloo Hel:o Jello Hwllo'

Methods

deletion

Signature: MrsWord.delete(number_deletes=1)
Docstring:
delete `number_deletes` characters from this word

Args:
    number_deletes (int): number of deletions to perform

Returns:
    MrsSpellings (set): all possible misspellings that form as a result of `number_deletes` deletions
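
To make the behaviour concrete, here is a minimal standalone sketch of deletion-based misspellings. It is not the package's implementation. `delete_variants` is a hypothetical helper that relies on `itertools.combinations` to enumerate every way of keeping all but `number_deletes` characters in their original order.

```python
from itertools import combinations


def delete_variants(word, number_deletes=1):
    """Return every string formed by removing `number_deletes` characters."""
    keep = len(word) - number_deletes
    if keep < 0:
        return set()  # cannot delete more characters than the word has
    # combinations() preserves the original character order, so each tuple
    # is the word with exactly `number_deletes` positions dropped.
    return {"".join(chars) for chars in combinations(word, keep)}


print(sorted(delete_variants("hello")))
# ['ello', 'hell', 'helo', 'hllo']
```

The result agrees with the `MrsWord("hello").delete(number_deletes=1)` output shown in the usage section. Duplicates such as the two ways of producing 'helo' collapse because the result is a set.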

swapping

Signature: MrsWord.swap()
Docstring:
swap a pair of consecutive characters

Args:

Returns:
    MrsSpellings (set): all possible misspellings that form as a result of swapping consecutive characters
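
The following standalone sketch shows one way adjacent-character swapping could be written. `swap_variants` is a hypothetical helper, not the package's code, and it assumes a single swap of one adjacent pair per variant, which is what the usage output for "hello" suggests.

```python
def swap_variants(word):
    """Return every string formed by swapping one pair of adjacent characters."""
    variants = set()
    for i in range(len(word) - 1):
        chars = list(word)
        # exchange the characters at positions i and i + 1
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
        variants.add("".join(chars))
    return variants


print(sorted(swap_variants("hello")))
# ['ehllo', 'hello', 'helol', 'hlelo']
```

Swapping the two identical letters in "ll" reproduces the original word, which is why 'hello' appears in its own result set, just as in the usage example.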

qwerty distance (taxi-cab) based swapping

Signature: MrsWord.qwerty_swap(max_distance=1)
Docstring:

swap characters with their qwerty neighbors

Args:
    max_distance (int): the maximum taxi-cab distance between keys on the keyboard to swap
                        e.g. with `max_distance=1`, "g" could become one of ['f', 'h']
                             with `max_distance=2`, "g" could become one of ['f', 'h', 't', 'y', 'v', 'b']
                        Note: the number of possible swaps increases with distance, but the increase is not always uniform.
                        For example, the third set of keys from "g" is ['6', 'd', 'j'], while the second is ['t', 'y', 'v', 'b']
Returns:
    MrsSpellings (set): all possible misspellings that form as a result of swapping characters with qwerty neighbors
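
A neighbor-substitution step can be sketched as below, assuming each variant replaces exactly one character. This matches the usage outputs, where every variant differs from the input in a single position. `substitute_variants` and the small `neighbors` map are hypothetical and only for illustration. The example map uses the `max_distance=1` neighbors of "g" quoted in the docstring above.

```python
def substitute_variants(word, neighbors):
    """Replace one character at a time with each of its listed neighbors.

    `neighbors` maps a character to an iterable of replacement characters.
    Characters without an entry are left untouched.
    """
    variants = set()
    for i, char in enumerate(word):
        for replacement in neighbors.get(char, ()):
            variants.add(word[:i] + replacement + word[i + 1:])
    return variants


# Neighbors of "g" at max_distance=1, as quoted in the docstring above.
neighbors = {"g": ["f", "h"]}
print(sorted(substitute_variants("egg", neighbors)))
# ['efg', 'egf', 'egh', 'ehg']
```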

what is qwerty distance?

Qwerty distance is the distance between keys on a typical keyboard. For the purposes of this package, the following assumptions are made:

  • each row is offset by half a key relative to its neighboring rows
  • the L1 (taxi-cab) distance is a good estimate of the natural travel distance between keys on the keyboard
  • the shift key can add distance, because it requires a hold-down

Here is an example of the results of these assumptions. The keys closest to the g key, grouped by equal distance and listed from nearest to furthest, are:

[['f', 'h'],
 ['t', 'y', 'v', 'b'],
 ['6', 'd', 'j'],
 ['r', 'u', 'c', 'n'],
 ['^', '5', '7', 's', 'k'],
 ['e', 'i', 'x', 'm'],
 ['%', '&', '4', '8', 'a', 'l'],
 ['w', 'o', 'z', '<'],
 ['$', '*', '3', '9', ':'],
 ['q', 'p', ','],
 ['#', '(', '2', '0', ';'],
 ['[', '>'],
 ['@', ')', '1', '-', '"'],
 [']', '.'],
 ['!', '_', '`', '=', "'"],
 ['\\', '?'],
 ['~', '+', '{'],
 ['/'],
 ['}'],
 ['|']]
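
The first few groups above can be reproduced with a small standalone sketch. It is not the package's implementation. It assumes a simplified layout containing only the unshifted digit and letter keys, shifts each row half a key to the right of the row above it, and ignores the shift-key penalty and punctuation keys entirely. `ROWS`, `KEY_POSITIONS`, `qwerty_distance` and `neighbor_groups` are hypothetical names.

```python
from collections import defaultdict

# Simplified layout: digits and letters only (an assumption of this sketch).
ROWS = ["1234567890", "qwertyuiop", "asdfghjkl", "zxcvbnm"]

# Each row sits half a key to the right of the row above it.
KEY_POSITIONS = {
    key: (col + 0.5 * row, float(row))
    for row, keys in enumerate(ROWS)
    for col, key in enumerate(keys)
}


def qwerty_distance(a, b):
    """L1 (taxi-cab) distance between two keys in the simplified layout."""
    (ax, ay), (bx, by) = KEY_POSITIONS[a], KEY_POSITIONS[b]
    return abs(ax - bx) + abs(ay - by)


def neighbor_groups(key):
    """Group all other keys by distance from `key`, nearest group first."""
    groups = defaultdict(list)
    for other in KEY_POSITIONS:
        if other != key:
            groups[qwerty_distance(key, other)].append(other)
    return [sorted(groups[d]) for d in sorted(groups)]


for group in neighbor_groups("g")[:4]:
    print(group)
# ['f', 'h']
# ['b', 't', 'v', 'y']
# ['6', 'd', 'j']
# ['c', 'n', 'r', 'u']
```

Under these assumptions the first four groups contain the same keys as the listing above, ordered alphabetically within each group. Later groups differ because shifted characters and punctuation are left out of the sketch.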
