Contextual spell correction using BERT (bidirectional representations)
spellCheck
Contextual word checker for better suggestions
Types of spelling mistakes:
Identifying which words are candidates for correction is a significant task in itself. The following quote from a research paper describes the two kinds of error:
Spelling errors are broadly classified as non-word errors (NWE) and real word errors (RWE). If the misspelt string is a valid word in the language, then it is called an RWE, else it is an NWE.
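As a minimal sketch of that definition (not code from this package), the distinction can be expressed as a vocabulary lookup. The `VOCAB` set and the `classify_error` helper are hypothetical names introduced only for illustration, and the sketch assumes the intended word is known, which a real checker would have to infer.

```python
# Toy vocabulary; a real checker would use a full lexicon (assumption).
VOCAB = {"income", "was", "million", "their", "there", "the", "year"}


def classify_error(typed, intended, vocab=VOCAB):
    """Label a typed word relative to the word the writer intended."""
    typed, intended = typed.lower(), intended.lower()
    if typed == intended:
        return "correct"
    # A misspelling that is still a valid word is a real word error (RWE);
    # anything outside the vocabulary is a non-word error (NWE).
    return "RWE" if typed in vocab else "NWE"


print(classify_error("milion", "million"))  # NWE
print(classify_error("there", "their"))     # RWE
```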
This package currently focuses on correcting Out of Vocabulary (OOV) words, i.e. non-word errors (NWE), using a BERT model. The idea behind using BERT is to take the surrounding context into account when correcting an OOV word. If the package gets traction, I would like to focus on RWE in the future.
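One way to picture how context enters the correction is to replace each OOV token with BERT's `[MASK]` token and let a masked language model propose words for that slot. The sketch below covers only the masking step and is an assumption about the general approach, not the package's actual implementation. The helper `masked_contexts` and the toy vocabulary are hypothetical.

```python
MASK = "[MASK]"  # the mask token used by BERT vocabularies


def masked_contexts(tokens, vocab):
    """Yield (index, token, masked sentence) for every out-of-vocabulary token."""
    for i, tok in enumerate(tokens):
        if tok.lower() not in vocab:
            # Hide the suspicious token so a masked LM can predict it from context.
            masked = tokens[:i] + [MASK] + tokens[i + 1:]
            yield i, tok, " ".join(masked)


tokens = ["Income", "was", "nine", "milion", "this", "year"]
vocab = {"income", "was", "nine", "million", "this", "year"}

for index, token, sentence in masked_contexts(tokens, vocab):
    print(index, token, "->", sentence)
# 3 milion -> Income was nine [MASK] this year
```

Each masked sentence could then be passed to a masked language model to obtain ranked candidates for the hidden position.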
Install
The package can be installed using pip:
pip install contextualSpellCheck
Usage
How to load the package into a spaCy pipeline
>>> import contextualSpellCheck
>>> import spacy
>>>
>>> # NER is required to identify whether a token is a PERSON entity
>>> nlp = spacy.load("en_core_web_sm", disable=["tagger", "parser"])
>>>
>>> contextualSpellCheck.add_to_pipe(nlp)
<spacy.lang.en.English object at 0x12839a2d0>
>>> nlp.pipe_names
['ner', 'contextual spellchecker']
After adding the contextual spell checker to the pipeline, you use the pipeline as usual. The spell check suggestions and other data can be accessed through extension attributes on the Doc, Token and Span objects.
Using the pipeline
>>> doc = nlp(u'Income was $9.4 milion compared to the prior year of $2.7 milion.')
>>>
>>> # Doc Extension
>>> print(doc._.contextual_spellCheck)
True
>>> print(doc._.performed_spellCheck)
True
>>> print(doc._.suggestions_spellCheck)
{milion: ['million', 'billion', ',', 'trillion', 'Million', '%', '##M', 'annually', '##B', 'USD'], milion: ['billion', 'million', 'trillion', '##M', 'Million', '##B', 'USD', '##b', 'millions', '%']}
>>> print(doc._.outcome_spellCheck)
Income was $9.4 million compared to the prior year of $2.7 million.
>>> print(doc._.score_spellCheck)
{milion: [('million', 0.59422), ('billion', 0.24349), (',', 0.08809), ('trillion', 0.01835), ('Million', 0.00826), ('%', 0.00672), ('##M', 0.00591), ('annually', 0.0038), ('##B', 0.00205), ('USD', 0.00113)], milion: [('billion', 0.65934), ('million', 0.26185), ('trillion', 0.05391), ('##M', 0.0051), ('Million', 0.00425), ('##B', 0.00268), ('USD', 0.00153), ('##b', 0.00077), ('millions', 0.00059), ('%', 0.00041)]}
>>>
>>> # Token Extension
>>> print(doc[4]._.get_require_spellCheck)
True
>>> print(doc[4]._.get_suggestion_spellCheck)
['million', 'billion', ',', 'trillion', 'Million', '%', '##M', 'annually', '##B', 'USD']
>>> print(doc[4]._.score_spellCheck)
[('million', 0.59422), ('billion', 0.24349), (',', 0.08809), ('trillion', 0.01835), ('Million', 0.00826), ('%', 0.00672), ('##M', 0.00591), ('annually', 0.0038), ('##B', 0.00205), ('USD', 0.00113)]
>>>
>>> # Span Extension
>>> print(doc[2:6]._.get_has_spellCheck)
True
>>> print(doc[2:6]._.score_spellCheck)
[{$: []}, {9.4: []}, {milion: [('million', 0.59422), ('billion', 0.24349), (',', 0.08809), ('trillion', 0.01835), ('Million', 0.00826), ('%', 0.00672), ('##M', 0.00591), ('annually', 0.0038), ('##B', 0.00205), ('USD', 0.00113)]}, {compared: []}]
>>>
>>>
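The suggestion lists above contain WordPiece continuation pieces (such as `##M`) and punctuation alongside whole words. If only whole-word candidates are wanted, the scored list can be filtered after the fact. The helper below is a hypothetical post-processing sketch, not part of the package, and works on plain `(word, score)` tuples copied from the token output shown above.

```python
def clean_candidates(scored):
    """Keep only whole-word, alphabetic candidates from (word, score) pairs."""
    return [
        (word, score)
        for word, score in scored
        # "##" marks a WordPiece continuation piece; isalpha() drops punctuation.
        if not word.startswith("##") and word.isalpha()
    ]


scored = [
    ("million", 0.59422), ("billion", 0.24349), (",", 0.08809),
    ("trillion", 0.01835), ("Million", 0.00826), ("%", 0.00672),
    ("##M", 0.00591), ("annually", 0.0038), ("##B", 0.00205),
    ("USD", 0.00113),
]

print(clean_candidates(scored))
# [('million', 0.59422), ('billion', 0.24349), ('trillion', 0.01835),
#  ('Million', 0.00826), ('annually', 0.0038), ('USD', 0.00113)]
```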
API
At present, there is a GET API in a Flask app. You can run the app and expect the following output from the API:
{
  "success": true,
  "input": "Income was $9.4 milion compared to the prior year of $2.7 milion.",
  "corrected": "Income was $9.4 million compared to the prior year of $2.7 million.",
  "suggestion_score": {
    "milion": [
      [
        "million",
        0.59422
      ],
      [
        "billion",
        0.24349
      ],
      ...
    ],
    "milion:1": [
      [
        "billion",
        0.65934
      ],
      [
        "million",
        0.26185
      ],
      ...
    ]
  }
}
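A client consuming this response might pick the highest-scoring candidate for each flagged token. The sketch below parses a copy of the response trimmed to the entries shown above; `top_suggestions` is a hypothetical helper, and no endpoint URL is assumed. Note that the second occurrence of the token appears under the key "milion:1".

```python
import json

# Response body trimmed to the entries shown in the example output.
response_text = """
{
  "success": true,
  "input": "Income was $9.4 milion compared to the prior year of $2.7 milion.",
  "corrected": "Income was $9.4 million compared to the prior year of $2.7 million.",
  "suggestion_score": {
    "milion": [["million", 0.59422], ["billion", 0.24349]],
    "milion:1": [["billion", 0.65934], ["million", 0.26185]]
  }
}
"""


def top_suggestions(payload):
    """Map each flagged token key to its highest-scoring (word, score) pair."""
    return {
        key: tuple(max(pairs, key=lambda pair: pair[1]))
        for key, pairs in payload["suggestion_score"].items()
        if pairs  # skip keys with no candidates
    }


payload = json.loads(response_text)
print(top_suggestions(payload))
# {'milion': ('million', 0.59422), 'milion:1': ('billion', 0.65934)}
```

In this example the top-scoring candidate for "milion:1" is "billion", while the "corrected" field contains "million", so the highest raw score is not necessarily the word the package selects. Reading the final text from "corrected" is the safer choice.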
Reference
Below are some of the projects and papers I referred to while developing this package:
- spaCy documentation and custom attributes
- HuggingFace's Transformers
- Norvig's blog
- BERT paper: https://arxiv.org/abs/1810.04805
- Denoising words: https://arxiv.org/pdf/1910.14080.pdf
- Context Based Spelling Correction (1990)
- How Difficult is it to Develop a Perfect Spell-checker? A Cross-linguistic Analysis through Complex Network Approach
- HuggingFace's neuralcoref, which inspired the package design and some of the functions (like add_to_pipe, which is an amazing idea!)