
Language model powered proof reader for correcting contextual errors in natural language.


lmproof - Language Model Proof Reader

A library for proofreading text with pre-trained language models. It corrects grammatical errors, spelling errors, commonly confused words, and other errors.

Usage

Install the spaCy English model with python -m spacy download en, then try this snippet.

import lmproof
proof_reader = lmproof.load("en")
source = "The foxes living on the Shire is brown."
corrected = proof_reader.proofread(source)  # "The foxes living in the Shire are brown."

How it works

We use the language-model-based scoring approach described by Christopher Bryant and Ted Briscoe (2018), with a few changes.

Unlike many approaches to grammatical error correction (GEC), this approach does NOT require annotated training data and depends mainly on a monolingual language model. The program works by iteratively comparing certain words in a text against alternative candidates and applying a correction if one of these candidates is more probable than the original word. The correction candidates are either generated by a word inflection library or defined manually. Currently, the system corrects only:

Non-words (e.g. freind and informations)
Morphology (e.g. eat, ate, eaten, eating, etc.)
Common Determiners and Prepositions (e.g. the, a, in, at, to, etc.)
Commonly Confused Words (e.g. bear/bare, lose/loose, etc.)
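The iterative compare-and-replace loop described above can be sketched roughly as follows. This is not lmproof's implementation: the toy bigram model stands in for a pre-trained language model, and the names CORPUS, CONFUSION_SETS, score, proofread and the threshold parameter are assumptions made up for illustration.

```python
import math
from collections import Counter

# Toy corpus standing in for a pre-trained language model (assumption).
CORPUS = (
    "the fox is brown . the foxes are brown . "
    "the fox lives in the shire . the foxes live in the shire ."
).split()

UNIGRAMS = Counter(CORPUS)
BIGRAMS = Counter(zip(CORPUS, CORPUS[1:]))
VOCAB_SIZE = len(UNIGRAMS)

# Hypothetical, manually defined candidate sets (morphology, prepositions).
CONFUSION_SETS = [
    {"live", "lives", "lived", "living"},
    {"on", "in", "at"},
]
ALTERNATIVES = {
    word: sorted(group - {word}) for group in CONFUSION_SETS for word in group
}


def score(tokens):
    """Add-one smoothed bigram log-probability of a token sequence."""
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        total += math.log((BIGRAMS[(prev, cur)] + 1) / (UNIGRAMS[prev] + VOCAB_SIZE))
    return total


def proofread(tokens, threshold=0.0):
    """Apply the single best-scoring substitution until none improves the score."""
    tokens = list(tokens)
    while True:
        base = score(tokens)
        best_gain, best_edit = threshold, None
        for i, token in enumerate(tokens):
            for candidate in ALTERNATIVES.get(token, ()):
                trial = tokens[:i] + [candidate] + tokens[i + 1:]
                gain = score(trial) - base
                if gain > best_gain:
                    best_gain, best_edit = gain, (i, candidate)
        if best_edit is None:
            return tokens  # no candidate beats the current sentence
        position, candidate = best_edit
        tokens[position] = candidate


print(" ".join(proofread("the fox live on the shire .".split())))
```

Applying one edit per pass and rescoring lets a later correction take an earlier one into account. A real scorer would be a much stronger model than this bigram counter.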

This work builds upon https://github.com/chrisjbryant/lmgec-lite/

Components

Language Models

Inflection generators

  • LemmInflect is used to lemmatize and generate inflections for candidate proposals to the language model.
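As a rough illustration of how morphology candidates might be produced, the sketch below lemmatizes a word with LemmInflect and collects every inflected form of the lemma. The helper name inflection_candidates is hypothetical and is not part of lmproof. The import is guarded so the snippet still runs, returning only the input word, when LemmInflect is not installed.

```python
try:
    from lemminflect import getAllInflections, getLemma
except ImportError:  # LemmInflect not installed: fall back to the word itself
    getAllInflections = getLemma = None


def inflection_candidates(word, upos="VERB"):
    """Return the word plus every inflected form of its lemma(s)."""
    forms = {word}
    if getLemma is None:
        return forms
    # getLemma returns a tuple of possible lemmas for the given universal POS tag.
    for lemma in getLemma(word, upos=upos):
        # getAllInflections maps Penn Treebank tags to tuples of surface forms.
        for variants in getAllInflections(lemma, upos=upos).values():
            forms.update(variants)
    return forms


print(sorted(inflection_candidates("ate")))
```

Each returned form can then be substituted into the sentence and scored by the language model.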

Spell Checker

  • symspellpy is used for obtaining spell check candidates.
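To show what spell check candidates look like for the non-word examples above, here is a standard-library stand-in. It does not use symspellpy or its symmetric-delete algorithm. Instead it enumerates all strings one edit away and keeps those found in a small vocabulary. VOCABULARY, edits1 and spell_candidates are hypothetical names.

```python
import string

# Hypothetical vocabulary; a real system would load a frequency dictionary.
VOCABULARY = {"friend", "fried", "find", "information", "brown"}


def edits1(word):
    """All strings one delete, transpose, replace or insert away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    ]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def spell_candidates(word, vocabulary=VOCABULARY):
    """Known words within one edit of word, or the word itself if already known."""
    if word in vocabulary:
        return [word]
    return sorted(edits1(word) & vocabulary)


print(spell_candidates("freind"))
print(spell_candidates("informations"))
```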

The components are highly modularised to make it easy to experiment with newer scorers. Support for another language can be added by supplying a pre-trained language model, an inflector, and common error patterns for that language.

TODOs

  • Use edits in existing GEC corpus to generate candidates.
  • Add tests.
  • Publish benchmarks of the model.
  • Think of simple ways to generate insertion candidates.
  • Add more languages.

