Contextual spell correction using BERT (bidirectional representations)

spellCheck

Contextual word checker for better suggestions

Types of spelling mistakes

Identifying which words are candidates for correction is a substantial task in itself. The following quote from a research paper describes the two broad classes of spelling error:

Spelling errors are broadly classified as non-word errors (NWE) and real word errors (RWE). If the misspelt string is a valid word in the language, then it is called an RWE, else it is an NWE.

-- Monojit Choudhury et al. (2007)

This package currently focuses on correcting out-of-vocabulary (OOV) words, that is, non-word errors (NWE), using a BERT model. BERT was chosen so that the surrounding context informs the correction of an OOV word. If the package gains traction, I would like to add support for RWE in the future.
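
To make the NWE/RWE distinction from the quote concrete, here is a minimal sketch. It is not part of this package: the function name `classify_spelling_error` and the toy `vocabulary` set are assumptions made only for illustration, and a real vocabulary would be far larger.

```python
def classify_spelling_error(observed, intended, vocabulary):
    """Label a misspelling as 'NWE' or 'RWE'; return None if nothing is wrong.

    observed   -- the word as it was typed
    intended   -- the word the writer meant
    vocabulary -- set of lowercase valid words (toy data here)
    """
    if observed == intended:
        return None
    # A misspelling that is still a valid word is a real word error (RWE);
    # otherwise it is a non-word error (NWE).
    return "RWE" if observed.lower() in vocabulary else "NWE"


# Hypothetical, tiny vocabulary used only for this illustration.
vocabulary = {"income", "was", "million", "piece", "peace"}

print(classify_spelling_error("milion", "million", vocabulary))  # NWE
print(classify_spelling_error("peace", "piece", vocabulary))     # RWE
```

An NWE can be detected with a vocabulary lookup alone, whereas an RWE needs context even to be noticed, which is one reason the package starts with NWE.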

Install

The package can be installed using pip. It requires Python 3.6 or later.

pip install contextualSpellCheck

Also install the dependencies listed in requirements.txt.

Usage

How to load the package in a spaCy pipeline

>>> import contextualSpellCheck
>>> import spacy
>>> 
>>> # NER is required to identify whether a token is a PERSON entity
>>> nlp = spacy.load("en_core_web_sm", disable=["tagger", "parser"])
>>> 
>>> contextualSpellCheck.add_to_pipe(nlp)
<spacy.lang.en.English object at 0x12839a2d0>
>>> nlp.pipe_names
['ner', 'contextual spellchecker']

After adding the contextual spell checker to the pipeline, use the pipeline as usual. The spell-check suggestions and related data can be accessed through extensions.

Using the pipeline

>>> doc = nlp(u'Income was $9.4 milion compared to the prior year of $2.7 milion.')
>>> 
>>> # Doc Extension
>>> print(doc._.contextual_spellCheck)
True
>>> print(doc._.performed_spellCheck)
True
>>> print(doc._.suggestions_spellCheck)
{milion: ['million', 'billion', ',', 'trillion', 'Million', '%', '##M', 'annually', '##B', 'USD'], milion: ['billion', 'million', 'trillion', '##M', 'Million', '##B', 'USD', '##b', 'millions', '%']}
>>> print(doc._.outcome_spellCheck)
Income was $9.4 million compared to the prior year of $2.7 million.
>>> print(doc._.score_spellCheck)
{milion: [('million', 0.59422), ('billion', 0.24349), (',', 0.08809), ('trillion', 0.01835), ('Million', 0.00826), ('%', 0.00672), ('##M', 0.00591), ('annually', 0.0038), ('##B', 0.00205), ('USD', 0.00113)], milion: [('billion', 0.65934), ('million', 0.26185), ('trillion', 0.05391), ('##M', 0.0051), ('Million', 0.00425), ('##B', 0.00268), ('USD', 0.00153), ('##b', 0.00077), ('millions', 0.00059), ('%', 0.00041)]}
>>> 
>>> # Token Extension
>>> print(doc[4]._.get_require_spellCheck)
True
>>> print(doc[4]._.get_suggestion_spellCheck)
['million', 'billion', ',', 'trillion', 'Million', '%', '##M', 'annually', '##B', 'USD']
>>> print(doc[4]._.score_spellCheck)
[('million', 0.59422), ('billion', 0.24349), (',', 0.08809), ('trillion', 0.01835), ('Million', 0.00826), ('%', 0.00672), ('##M', 0.00591), ('annually', 0.0038), ('##B', 0.00205), ('USD', 0.00113)]
>>> 
>>> # Span Extension
>>> print(doc[2:6]._.get_has_spellCheck)
True
>>> print(doc[2:6]._.score_spellCheck)
[{$: []}, {9.4: []}, {milion: [('million', 0.59422), ('billion', 0.24349), (',', 0.08809), ('trillion', 0.01835), ('Million', 0.00826), ('%', 0.00672), ('##M', 0.00591), ('annually', 0.0038), ('##B', 0.00205), ('USD', 0.00113)]}, {compared: []}]
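
The scored suggestions above include punctuation (',', '%') and entries starting with '##', which is how BERT's WordPiece vocabulary marks word fragments. If you post-process the scores yourself, you may want to drop those entries. The helper below is a sketch, not part of this package: `clean_candidates` is a hypothetical name, and the input is copied from the `doc[4]._.score_spellCheck` output shown above.

```python
import string


def clean_candidates(scored):
    """Drop word-piece fragments ('##' prefix) and punctuation-only entries."""
    return [
        (word, score)
        for word, score in scored
        if not word.startswith("##")
        and not all(ch in string.punctuation for ch in word)
    ]


# Copied from the doc[4]._.score_spellCheck output above.
scored = [('million', 0.59422), ('billion', 0.24349), (',', 0.08809),
          ('trillion', 0.01835), ('Million', 0.00826), ('%', 0.00672),
          ('##M', 0.00591), ('annually', 0.0038), ('##B', 0.00205),
          ('USD', 0.00113)]

cleaned = clean_candidates(scored)
print([word for word, _ in cleaned])
# ['million', 'billion', 'trillion', 'Million', 'annually', 'USD']
```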

API

At present, a Flask app exposes a GET API. Run the app and expect output like the following from the API.

{
    "success": true,
    "input": "Income was $9.4 milion compared to the prior year of $2.7 milion.",
    "corrected": "Income was $9.4 million compared to the prior year of $2.7 million.",
    "suggestion_score": {
        "milion": [
            [
                "million",
                0.59422
            ],
            [
                "billion",
                0.24349
            ],
            ...
        ],
        "milion:1": [
            [
                "billion",
                0.65934
            ],
            [
                "million",
                0.26185
            ],
            ...
        ]
    }
}
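
A client could read this response roughly as follows. This is a sketch under assumptions: the sample string contains only the entries shown above (the elided ones are left out), the helper name `split_occurrence` is hypothetical, and the reading of the ":1" suffix as "second occurrence of the same token" is inferred from the sample output, not from documented behaviour.

```python
import json

# Trimmed copy of the response above; elided entries are omitted, not guessed.
response_text = """
{
    "success": true,
    "input": "Income was $9.4 milion compared to the prior year of $2.7 milion.",
    "corrected": "Income was $9.4 million compared to the prior year of $2.7 million.",
    "suggestion_score": {
        "milion": [["million", 0.59422], ["billion", 0.24349]],
        "milion:1": [["billion", 0.65934], ["million", 0.26185]]
    }
}
"""


def split_occurrence(key):
    """Split 'milion:1' into ('milion', 1); a plain 'milion' maps to index 0."""
    word, sep, index = key.rpartition(":")
    if sep and index.isdigit():
        return word, int(index)
    return key, 0


payload = json.loads(response_text)
top = {}
for key, pairs in payload["suggestion_score"].items():
    # Each pair is [candidate, score]; keep the highest-scoring candidate.
    best_word, _ = max(pairs, key=lambda pair: pair[1])
    top[split_occurrence(key)] = best_word

print(top)
# {('milion', 0): 'million', ('milion', 1): 'billion'}
```

Note that the highest raw score for the second occurrence is 'billion', while the "corrected" field contains 'million', so the final choice is evidently not the raw top score alone.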

Task List

  • Add support for Real Word Error (RWE) correction (big task)
  • Specify maximum edit distance for candidateRanking
  • Allow user to specify BERT model
  • Optimise edit distance code
  • Add multi-mask-out capability
  • Better candidate generation (maybe by fine-tuning the model?)
  • Add metrics by testing on datasets
  • Improve documentation
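
As an illustration of the edit-distance item in the list above, the sketch below re-ranks model candidates by Levenshtein distance to the misspelt token, with a configurable cut-off. It is an assumption about how such a ranking could look, not the package's actual candidateRanking code; `edit_distance`, `rank_candidates` and `max_distance` are hypothetical names. The input is the score list for the second 'milion' in the usage example.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row DP)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def rank_candidates(misspelled, scored, max_distance=2):
    """Keep candidates within max_distance; sort by distance, then by score."""
    keyed = [(edit_distance(misspelled, word), -score, word)
             for word, score in scored]
    return [word for dist, _, word in sorted(keyed) if dist <= max_distance]


# Scores for the second 'milion' in the usage example.
scored = [('billion', 0.65934), ('million', 0.26185), ('trillion', 0.05391),
          ('##M', 0.0051), ('Million', 0.00425), ('##B', 0.00268),
          ('USD', 0.00153), ('##b', 0.00077), ('millions', 0.00059),
          ('%', 0.00041)]

print(rank_candidates("milion", scored))
# ['million', 'billion', 'Million', 'millions']
```

With this ordering 'million' (distance 1) is preferred over 'billion' (distance 2) even though the model scored 'billion' higher, which matches the corrected sentence in the usage example.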

Reference

Below are some of the projects and papers I referred to while developing this package:

  1. spaCy documentation and custom attributes
  2. HuggingFace's Transformers
  3. Norvig's blog
  4. BERT paper: https://arxiv.org/abs/1810.04805
  5. Denoising words: https://arxiv.org/pdf/1910.14080.pdf
  6. Context Based Spelling Correction (1990)
  7. How Difficult is it to Develop a Perfect Spell-checker? A Cross-linguistic Analysis through Complex Network Approach
  8. HuggingFace's neuralcoref, which inspired the package design and some of the functions (like add_to_pipe, which is an amazing idea!)
