a micro utility for generating plausible misspellings
MrS SpELliNgS
a micro utility to procedurally generate plausible misspellings
Install
from pypi
pip install mrs-spellings
from source
python -m pip install git+https://github.com/CircArgs/mrs_spellings.git
Use Cases
- Generate misspellings to replace during text cleaning, with low overhead
- Replace words with their potential misspellings as an augmentation
  - during training, to make your model less susceptible to misspellings
  - at test time, as part of test-time augmentation (TTA)
- Supplement an existing solution for out-of-vocabulary words (words that do not appear in an existing replacement dictionary)
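The text-cleaning use case can be sketched as a reverse lookup from misspelling to intended word. This is a minimal sketch, not part of the package: `single_deletions`, `build_correction_table`, and `clean` are hypothetical helpers, and `single_deletions` is a stand-in for whichever generator you actually use (for example the set returned by `MrsWord(word).delete()`).

```python
def single_deletions(word):
    """Stand-in generator (assumption): every variant with one character removed."""
    return {word[:i] + word[i + 1:] for i in range(len(word))}


def build_correction_table(vocabulary, generate=single_deletions):
    """Map each generated misspelling back to the word it came from."""
    table = {}
    for word in sorted(vocabulary):  # sorted so ties resolve deterministically
        for variant in generate(word):
            # Never "correct" a token that is itself a valid vocabulary word.
            if variant not in vocabulary:
                table.setdefault(variant, word)
    return table


def clean(text, table):
    """Replace known misspellings token by token; unknown tokens pass through."""
    return " ".join(table.get(token, token) for token in text.split())


table = build_correction_table({"hello", "world"})
print(clean("helo wrld again", table))  # hello world again
```

Because the table is built once up front, cleaning is a dictionary lookup per token. Ambiguous variants (one misspelling reachable from two words) are resolved here by first-come order, which is a simplification.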
Usage
There are 3 primary methods currently supported:
In [1]: from mrs_spellings import MrsWord, MrsSpellings
#methods return MrsSpellings
In [2]: MrsWord("hello").swap()
Out[2]: {'ehllo', 'hello', 'helol', 'hlelo'}
In [3]: MrsWord("hello").delete(number_deletes=1)
Out[3]: {'ello', 'hell', 'helo', 'hllo'}
In [4]: MrsWord("hello").qwerty_swap(max_distance=1)
Out[4]:
{'gello',
'h3llo',
'hdllo',
'he,lo',
'he:lo',
...
'jello',
'nello',
'yello'}
# simply chain methods
In [5]: MrsWord("hello").swap().delete()
Out[5]:
{'ehll',
'ehlo',
'ello',
...
'hllo',
'hlol',
'lelo'}
# MrsWord is a string
In [6]: MrsWord("Hello") + " " + MrsWord("World")
Out[6]: 'Hello World'
In [7]: MrsWord("Hello {}").format("world")
Out[7]: 'Hello world'
# MrsSpellings work as sets
In [8]: MrsWord("hello").swap().union(MrsWord("world").delete())
Out[8]: {'ehllo', 'hello', 'helol', 'hlelo', 'orld', 'wold', 'word', 'worl', 'wrld'}
In [9]: MrsWord("hello").delete(1)-MrsWord("hello").delete(1)
Out[9]: set()
In [10]: " ".join(MrsWord("Hello").qwerty_swap())
Out[10]: 'Helko Hdllo Yello He,lo Helll Hellp Hel,o Nello Heklo Hrllo H3llo Gello Heolo He:lo Helli Hell9 Heloo Hel:o Jello Hwllo'
Methods
deletion
Signature: MrsWord.delete(number_deletes=1)
Docstring:
delete `number_deletes` characters from this word
Args:
number_deletes (int): number of deletions to perform
Returns:
MrsSpellings (set): all possible misspellings that form as a result of `number_deletes` deletions
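To make the behavior concrete, here is a minimal sketch of how such deletions could be enumerated. It is an illustration under assumptions, not the package's implementation: `deletions` is a hypothetical function that keeps every in-order selection of `len(word) - number_deletes` characters.

```python
from itertools import combinations


def deletions(word, number_deletes=1):
    """All distinct strings obtained by removing `number_deletes` characters."""
    keep = len(word) - number_deletes
    if number_deletes < 0 or keep < 0:
        return set()  # nothing sensible to return for out-of-range requests
    # combinations() preserves the original character order, so each tuple
    # is the word with some positions dropped.
    return {"".join(chars) for chars in combinations(word, keep)}


print(sorted(deletions("hello")))  # ['ello', 'hell', 'helo', 'hllo']
```

Collecting into a set is what removes duplicates such as the two ways of deleting an `l` from "hello".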
swapping
Signature: MrsWord.swap()
Docstring:
swap some consecutive characters
Args:
Returns:
MrsSpellings (set): all possible misspellings that form as a result of swapping consecutive characters
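A minimal sketch of adjacent-character swapping follows. It is an illustration only, assuming one swap per variant; `adjacent_swaps` is a hypothetical name, not a function from the package.

```python
def adjacent_swaps(word):
    """All strings formed by swapping one pair of neighboring characters."""
    return {
        word[:i] + word[i + 1] + word[i] + word[i + 2:]
        for i in range(len(word) - 1)
    }


print(sorted(adjacent_swaps("hello")))  # ['ehllo', 'hello', 'helol', 'hlelo']
```

The original word shows up in the result whenever two identical characters sit next to each other, as with the double `l` in "hello".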
qwerty distance (taxi-cab) based swapping
Signature: MrsWord.qwerty_swap(max_distance=1)
Docstring:
swap characters with their qwerty neighbors
Args:
max_distance (int): the max distance (taxi-cab) of keys on the keyboard to swap
e.g. `max_distance=1` then "g" could become one of ["f", "h"]
`max_distance=2` then "g" could become one of ['f', 'h', 't', 'y', 'v', 'b']
Note: The number of possible swaps increases with distance; however, the increase is not always uniform.
For example, the 3rd set of keys from g is ['6', 'd', 'j'] while the 2nd is ['t', 'y', 'v', 'b']
Returns:
MrsSpellings (set): all possible misspellings that form as a result of swapping characters with qwerty neighbors
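The substitution step itself can be sketched independently of how neighbors are found. The code below is a hedged illustration: `neighbor_substitutions` and `HYPOTHETICAL_NEIGHBORS` are made-up names, and the tiny neighbor table only encodes the `"g"` example from the docstring above rather than the package's real key map.

```python
# Assumption: a hand-written neighbor table covering a single key.
HYPOTHETICAL_NEIGHBORS = {"g": ["f", "h"]}


def neighbor_substitutions(word, neighbors):
    """Replace one character at a time with each of its listed neighbors."""
    variants = set()
    for i, char in enumerate(word):
        # Characters without an entry in the table are left untouched.
        for replacement in neighbors.get(char, ()):
            variants.add(word[:i] + replacement + word[i + 1:])
    return variants


print(sorted(neighbor_substitutions("big", HYPOTHETICAL_NEIGHBORS)))  # ['bif', 'bih']
```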
what is qwerty distance?
Qwerty distance is the distance between keys on the typical keyboard. For the purposes of this package, the following assumptions are made:
- each row is offset by half a key width relative to the row above it
- the L1 (taxi-cab) distance is a good estimate of the natural travel distance between keys on the keyboard
- the shift key can add distance by virtue of requiring a hold-down
Here is an example of the results of these assumptions. The keys closest to the g key, grouped by equal distance (groups in ascending order of distance), are:
[['f', 'h'],
['t', 'y', 'v', 'b'],
['6', 'd', 'j'],
['r', 'u', 'c', 'n'],
['^', '5', '7', 's', 'k'],
['e', 'i', 'x', 'm'],
['%', '&', '4', '8', 'a', 'l'],
['w', 'o', 'z', '<'],
['$', '*', '3', '9', ':'],
['q', 'p', ','],
['#', '(', '2', '0', ';'],
['[', '>'],
['@', ')', '1', '-', '"'],
[']', '.'],
['!', '_', '`', '=', "'"],
['\\', '?'],
['~', '+', '{'],
['/'],
['}'],
['|']]
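The first two assumptions can be sketched for the letter keys alone. This is a simplified model, not the package's code: `ROWS`, `ROW_OFFSET`, `key_positions`, and `neighbors_by_distance` are hypothetical names, the number row and punctuation are left out, and the shift-key cost is not modeled at all.

```python
from collections import defaultdict

# Assumption: letters only, three rows, each shifted half a key to the right
# of the row above it.
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
ROW_OFFSET = 0.5


def key_positions():
    """Return {key: (x, y)} with x in key widths and y as the row index."""
    positions = {}
    for row_index, row in enumerate(ROWS):
        for col_index, key in enumerate(row):
            positions[key] = (col_index + row_index * ROW_OFFSET, float(row_index))
    return positions


def neighbors_by_distance(key):
    """Group all other keys by L1 (taxi-cab) distance, nearest group first."""
    positions = key_positions()
    key_x, key_y = positions[key]
    groups = defaultdict(list)
    for other, (x, y) in positions.items():
        if other != key:
            groups[abs(x - key_x) + abs(y - key_y)].append(other)
    return [groups[distance] for distance in sorted(groups)]


for group in neighbors_by_distance("g")[:4]:
    print(group)
# ['f', 'h']
# ['t', 'y', 'v', 'b']
# ['d', 'j']
# ['r', 'u', 'c', 'n']
```

Restricted to letters, these groups line up with the letter entries of the list above, which suggests the half-key offset and L1 distance are being read correctly; the digits and shifted symbols would need the extra rows and a shift penalty that this sketch omits.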