Contextual spell correction using BERT (bidirectional representations)
spellCheck
Contextual word checker for better suggestions
Types of spelling mistakes
It is essential to understand that identifying whether a candidate is a spelling error is a hard task in itself. Consider this quote from a research paper:
Spelling errors are broadly classified as non-word errors (NWE) and real word errors (RWE). If the misspelt string is a valid word in the language, then it is called an RWE, else it is an NWE.
This package currently focuses on Out of Vocabulary (OOV) word, or non-word error (NWE), correction using a BERT model. The idea behind using BERT is to exploit the sentence context when correcting an OOV word. Going forward, I would like to focus on RWE and on optimising the package by implementing it in Cython.
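The NWE/RWE distinction above boils down to a vocabulary lookup. A minimal sketch with a toy vocabulary (`VOCAB` and `error_type` are hypothetical names for illustration; the package itself detects OOV tokens via spaCy's vocabulary):

```python
# Toy vocabulary standing in for a real lexicon.
VOCAB = {"income", "was", "million", "compared", "to", "the", "prior", "year", "of"}

def error_type(word: str) -> str:
    """Classify a misspelt string: RWE if it is a valid word, else NWE."""
    return "RWE" if word.lower() in VOCAB else "NWE"

print(error_type("milion"))  # "milion" is not in the vocabulary -> NWE
print(error_type("prior"))   # a valid word used in the wrong place -> RWE
```

Detecting RWEs is the hard part: the word itself is valid, so only the context can reveal the error, which is why RWE support is still on the task list below.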
Install
The package can be installed using pip. Python 3.6+ is required:
pip install contextualSpellCheck
Also, please install the dependencies from requirements.txt
Usage
Note: For examples in other languages, check the examples folder.
How to load the package in the spaCy pipeline
>>> import contextualSpellCheck
>>> import spacy
>>>
>>> ## We require NER to identify if a token is a PERSON
>>> ## We also require the parser because we use Token.sent for context
>>> nlp = spacy.load("en_core_web_sm")
>>>
>>> contextualSpellCheck.add_to_pipe(nlp)
<spacy.lang.en.English object at 0x12839a2d0>
>>> nlp.pipe_names
['tagger', 'parser', 'ner', 'contextual spellchecker']
>>>
>>> doc = nlp('Income was $9.4 milion compared to the prior year of $2.7 milion.')
>>> doc._.outcome_spellCheck
'Income was $9.4 million compared to the prior year of $2.7 million.'
Or you can add it to the spaCy pipeline manually:
>>> import spacy
>>> import contextualSpellCheck
>>>
>>> nlp = spacy.load('en_core_web_sm')
>>> checker = contextualSpellCheck.contextualSpellCheck.ContextualSpellCheck()
>>> nlp.add_pipe(checker)
>>>
>>> doc = nlp("Income was $9.4 milion compared to the prior year of $2.7 milion.")
>>> print(doc._.performed_spellCheck)
True
>>> print(doc._.outcome_spellCheck)
Income was $9.4 million compared to the prior year of $2.7 million.
After adding the contextual spell checker to the pipeline, you use the pipeline as usual. The spell-check suggestions and other data can be accessed using extensions.
Using the pipeline
>>> doc = nlp(u'Income was $9.4 milion compared to the prior year of $2.7 milion.')
>>>
>>> # Doc Extension
>>> print(doc._.contextual_spellCheck)
True
>>> print(doc._.performed_spellCheck)
True
>>> print(doc._.suggestions_spellCheck)
{milion: 'million', milion: 'million'}
>>> print(doc._.outcome_spellCheck)
Income was $9.4 million compared to the prior year of $2.7 million.
>>> print(doc._.score_spellCheck)
{milion: [('million', 0.59422), ('billion', 0.24349), (',', 0.08809), ('trillion', 0.01835), ('Million', 0.00826), ('%', 0.00672), ('##M', 0.00591), ('annually', 0.0038), ('##B', 0.00205), ('USD', 0.00113)], milion: [('billion', 0.65934), ('million', 0.26185), ('trillion', 0.05391), ('##M', 0.0051), ('Million', 0.00425), ('##B', 0.00268), ('USD', 0.00153), ('##b', 0.00077), ('millions', 0.00059), ('%', 0.00041)]}
>>>
>>> # Token Extension
>>> print(doc[4]._.get_require_spellCheck)
True
>>> print(doc[4]._.get_suggestion_spellCheck)
'million'
>>> print(doc[4]._.score_spellCheck)
[('million', 0.59422), ('billion', 0.24349), (',', 0.08809), ('trillion', 0.01835), ('Million', 0.00826), ('%', 0.00672), ('##M', 0.00591), ('annually', 0.0038), ('##B', 0.00205), ('USD', 0.00113)]
>>>
>>> # Span Extension
>>> print(doc[2:6]._.get_has_spellCheck)
True
>>> print(doc[2:6]._.score_spellCheck)
{$: [], 9.4: [], milion: [('million', 0.59422), ('billion', 0.24349), (',', 0.08809), ('trillion', 0.01835), ('Million', 0.00826), ('%', 0.00672), ('##M', 0.00591), ('annually', 0.0038), ('##B', 0.00205), ('USD', 0.00113)], compared: []}
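The `score_spellCheck` mapping can be post-processed to pick the most probable replacement per misspelt token. A minimal sketch, using plain strings as stand-ins for the `spaCy.Token` keys (`best_suggestions` is a hypothetical helper, not part of the package):

```python
# Scores mirror the doc._.score_spellCheck output shown above;
# string keys stand in for spaCy Token objects.
scores = {
    "milion_1": [("million", 0.59422), ("billion", 0.24349), ("trillion", 0.01835)],
    "milion_2": [("billion", 0.65934), ("million", 0.26185), ("trillion", 0.05391)],
}

def best_suggestions(score_map):
    """Return {token: top-scoring suggestion}, skipping tokens with no candidates."""
    return {
        tok: max(cands, key=lambda c: c[1])[0]
        for tok, cands in score_map.items()
        if cands
    }

print(best_suggestions(scores))  # {'milion_1': 'million', 'milion_2': 'billion'}
```

Note that the top candidate depends entirely on context: the same misspelling `milion` resolves to `million` in one position and `billion` in the other.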
Extensions
To make usage simpler, spaCy provides custom extensions which a library can use. This makes it easier for the user to get the desired data. contextualSpellCheck provides extensions at the doc, span and token level. The tables below summarise these extensions.
spaCy.Doc level extensions

Extension | Type | Description | Default |
---|---|---|---|
`doc._.contextual_spellCheck` | `Boolean` | To check whether contextualSpellCheck is added as an extension | `True` |
`doc._.performed_spellCheck` | `Boolean` | To check whether contextualSpellCheck identified any misspells and performed correction | `False` |
`doc._.suggestions_spellCheck` | `{spaCy.Token: str}` | If corrections are performed, returns the mapping of misspelt tokens (`spaCy.Token`) to suggested words (`str`) | `{}` |
`doc._.outcome_spellCheck` | `str` | Corrected sentence (`str`) as output | `""` |
`doc._.score_spellCheck` | `{spaCy.Token: List(str, float)}` | If corrections are identified, returns the mapping of misspelt tokens (`spaCy.Token`) to suggested words (`str`) and the probability of each correction | `None` |
spaCy.Span level extensions

Extension | Type | Description | Default |
---|---|---|---|
`span._.get_has_spellCheck` | `Boolean` | To check whether contextualSpellCheck identified any misspells and performed correction in this span | `False` |
`span._.score_spellCheck` | `{spaCy.Token: List(str, float)}` | If corrections are identified, returns the mapping of misspelt tokens (`spaCy.Token`) to suggested words (`str`) and the probability of each correction, for tokens in this span | `{spaCy.Token: []}` |
spaCy.Token level extensions

Extension | Type | Description | Default |
---|---|---|---|
`token._.get_require_spellCheck` | `Boolean` | To check whether contextualSpellCheck identified any misspells and performed correction on this token | `False` |
`token._.get_suggestion_spellCheck` | `str` | If corrections are performed, returns the suggested word (`str`) | `""` |
`token._.score_spellCheck` | `[(str, float)]` | If corrections are identified, returns suggested words (`str`) and the probability (`float`) of each correction | `[]` |
API
At present, there is a simple GET API to get you started. You can run the app locally and play with it.

Query: use the endpoint http://127.0.0.1:5000/?query=YOUR-QUERY. Note: your browser can handle the text encoding.
http://localhost:5000/?query=Income%20was%20$9.4%20milion%20compared%20to%20the%20prior%20year%20of%20$2.7%20milion.
Response:
{
"success": true,
"input": "Income was $9.4 milion compared to the prior year of $2.7 milion.",
"corrected": "Income was $9.4 million compared to the prior year of $2.7 million.",
"suggestion_score": {
"milion": [
[
"million",
0.59422
],
[
"billion",
0.24349
],
...
],
"milion:1": [
[
"billion",
0.65934
],
[
"million",
0.26185
],
...
]
}
}
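Programmatically, the query text must be URL-encoded before it is appended to the endpoint. A minimal sketch using the standard library (the endpoint and `query` parameter come from the example above; the request itself is left out, since it needs the app running locally):

```python
from urllib.parse import quote

BASE = "http://127.0.0.1:5000/?query="
text = "Income was $9.4 milion compared to the prior year of $2.7 milion."

# quote() percent-encodes spaces and other unsafe characters.
url = BASE + quote(text)
print(url)
# Fetch with e.g. urllib.request.urlopen(url) while the app is running.
```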
Task List
- Add support for Real Word Error (RWE) (Big Task)
- edit distance code optimisation
- add multi mask out capability
- better candidate generation (maybe by fine-tuning the model?)
- add metric by testing on datasets
- Improve documentation
- Add examples for other languages
- use piece wise tokeniser when identifying the misspell
- Improve logging in code
- Update the logic of misspell identification (OOV) (#30)
Completed Task
- specify maximum edit distance for candidateRanking
- allow user to specify bert model
- Include transformers deTokenizer to get better suggestions
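Candidate ranking with a maximum edit distance (mentioned in the tasks above) is commonly based on Levenshtein distance. A minimal pure-Python sketch, not the package's optimised implementation:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

print(levenshtein("milion", "million"))  # 1: a single insertion fixes it

# Candidates beyond a maximum edit distance can simply be filtered out:
candidates = ["million", "billion", "trillion"]
print([c for c in candidates if levenshtein("milion", c) <= 2])
# ['million', 'billion']
```

Capping the edit distance keeps the BERT suggestions that are plausible *as corrections* of the original string, rather than merely plausible in context.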
Support and contribution
If you like the project, please ⭑ it and show your support! If you feel the current behaviour is not as expected, please feel free to raise an issue. If you can help with any of the above tasks, please open a PR with the necessary changes to documentation and tests.
Reference
Below are some of the projects/works I referred to while developing this package:
- spaCy documentation and custom attributes
- HuggingFace's Transformers
- Norvig's blog
- BERT paper: https://arxiv.org/abs/1810.04805
- Denoising words: https://arxiv.org/pdf/1910.14080.pdf
- Context Based Spelling Correction (1990)
- How Difficult is it to Develop a Perfect Spell-checker? A Cross-linguistic Analysis through Complex Network Approach
- HuggingFace's neuralcoref, for package design; some functions are inspired by it (like add_to_pipe, which is an amazing idea!)