Speliuk is a spell checker for the Ukrainian language based on SymSpell and Language Models.

Speliuk

A more accurate spelling correction for the Ukrainian language.

Motivation

When a spell checker is used in a system that corrects spelling automatically, without human verification, the following questions arise:

  • How to avoid false corrections, i.e. cases where a real word that is missing from the vocabulary gets "corrected"? This is especially relevant for fusional languages such as Ukrainian, where a single word has many inflected forms.
  • How to find the single best correction for a misspelled word? Many spell checkers rank candidates only by frequency and edit distance, ignoring the surrounding context.

To address these issues, we propose a system that is compatible with any spell checker but focuses on precision over recall.
We improve the accuracy of a spell checker by combining these complementary models:

  • KenLM. The model is used for fast perplexity calculation to find the best candidate for a misspelled word.
  • Transformer-based NER pipeline to detect misspelled words.
  • SymSpell. As of now, this is the only supported spell checker.
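
The way these three components fit together can be sketched as follows. This is a minimal illustration and not Speliuk's actual code: `correct_text` and the `toy_*` functions are hypothetical stand-ins for the NER detector, the SymSpell lookup and the KenLM scorer.

```python
from typing import Callable, List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets


def correct_text(
    text: str,
    detect: Callable[[str], List[Span]],
    candidates: Callable[[str], List[str]],
    score: Callable[[str], float],
) -> str:
    """Replace each detected span with the candidate whose sentence scores lowest."""
    # Work right to left so earlier offsets stay valid after a replacement.
    for start, end in sorted(detect(text), reverse=True):
        options = candidates(text[start:end])
        if not options:
            continue  # precision first: leave the word alone if nothing is suggested
        best = min(options, key=lambda c: score(text[:start] + c + text[end:]))
        text = text[:start] + best + text[end:]
    return text


# Toy stand-ins, purely illustrative.
def toy_detect(text: str) -> List[Span]:
    return [(7, 11)]  # pretend the detector flagged "моее"


def toy_candidates(word: str) -> List[str]:
    return ["моє", "може", "море"]


def toy_score(sentence: str) -> float:
    # Lower is better; pretend the language model prefers "може" in this context.
    return 1.0 if "він може це" in sentence else 5.0


result = correct_text("то він моее це зробити?", toy_detect, toy_candidates, toy_score)
print(result)  # то він може це зробити?
```

Words the detector does not flag are never sent to the spell checker, which is how out-of-vocabulary but correct words avoid being rewritten.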

Installation

pip install speliuk

Usage

By default, Speliuk will use pre-trained models stored on Hugging Face.

>>> from speliuk.correct import Speliuk
>>> speliuk = Speliuk()
>>> speliuk.load()
>>> speliuk.correct("то він моее це зраабити для меніе?")
Correction(corrected_text='то він може це зробити для мене?', annotations=[Annotation(start=7, end=11, source_text='моее', suggestions=['може'], meta={}), Annotation(start=15, end=23, source_text='зраабити', suggestions=['зробити'], meta={}), Annotation(start=28, end=33, source_text='меніе', suggestions=['мене'], meta={})])
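
The `start` and `end` values are character offsets into the input string. The sketch below shows how such annotations map back onto the text. It uses a stand-in `Annotation` dataclass that only mirrors the fields printed above, and `apply_annotations` is a hypothetical helper rather than part of the package.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Annotation:  # stand-in mirroring the fields shown in the output above
    start: int
    end: int
    source_text: str
    suggestions: List[str]
    meta: dict = field(default_factory=dict)


def apply_annotations(text: str, annotations: List[Annotation]) -> str:
    """Apply the first suggestion of every annotation to the text."""
    # Replace from the end of the string so earlier offsets are not shifted.
    for ann in sorted(annotations, key=lambda a: a.start, reverse=True):
        if text[ann.start:ann.end] != ann.source_text:
            raise ValueError(f"offsets do not match {ann.source_text!r}")
        if ann.suggestions:
            text = text[:ann.start] + ann.suggestions[0] + text[ann.end:]
    return text


text = "то він моее це зраабити для меніе?"
annotations = [
    Annotation(7, 11, "моее", ["може"]),
    Annotation(15, 23, "зраабити", ["зробити"]),
    Annotation(28, 33, "меніе", ["мене"]),
]
fixed = apply_annotations(text, annotations)
print(fixed)  # то він може це зробити для мене?
```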

Speliuk can also be used directly from a spaCy model:

>>> import spacy
>>> from speliuk.correct import CorrectionPipe
>>> nlp = spacy.blank('uk')
>>> nlp.add_pipe('speliuk', config=dict(spacy_spelling_model_path='/my/custom/model'))
>>> doc = nlp("то він моее це зраабити для меніе?")
>>> doc._.speliuk_corrected
'то він може це зробити для мене?'
>>> doc.spans["speliuk_errors"]
[моее, зраабити, меніе]

Training Details

Spelling Error Detection

To detect spelling errors, a spaCy NER model is used.

It was trained on a combination of synthetic and gold-standard data:

  • For synthetic data, we used UberText as the source of base texts and nlpaug to generate errors. In total, 10k samples from different categories were used.
  • For gold-standard data, we used spelling errors from the UA-GEC corpus.
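
To illustrate what a synthetic training sample could look like, here is a minimal sketch that injects character-level errors into a clean sentence and records the character span of every corrupted word, which is the kind of label an NER model needs. It uses only the standard library instead of nlpaug, and `corrupt_word`, `make_sample` and the `error_rate` value are assumptions for illustration, not the settings used for the released model.

```python
import random
from typing import List, Tuple

ALPHABET = "абвгґдеєжзиіїйклмнопрстуфхцчшщьюя"


def corrupt_word(word: str, rng: random.Random) -> str:
    """Apply one random character-level edit: insert, delete, substitute or swap."""
    i = rng.randrange(len(word))
    op = rng.choice(["insert", "delete", "substitute", "swap"])
    if op == "insert":
        return word[:i] + rng.choice(ALPHABET) + word[i:]
    if op == "delete":
        return word[:i] + word[i + 1:]
    if op == "substitute":
        return word[:i] + rng.choice(ALPHABET) + word[i + 1:]
    # Swap with the next character, or the previous one at the end of the word.
    j = i + 1 if i + 1 < len(word) else i - 1
    chars = list(word)
    chars[i], chars[j] = chars[j], chars[i]
    return "".join(chars)


def make_sample(
    sentence: str, rng: random.Random, error_rate: float = 0.3
) -> Tuple[str, List[Tuple[int, int]]]:
    """Return (noisy_text, spans); spans are (start, end) offsets of corrupted words."""
    pieces, spans, offset = [], [], 0
    for word in sentence.split(" "):
        noisy = word
        if len(word) >= 3 and word.isalpha() and rng.random() < error_rate:
            noisy = corrupt_word(word, rng)
        if noisy != word:  # an edit can be a no-op, e.g. swapping equal letters
            spans.append((offset, offset + len(noisy)))
        pieces.append(noisy)
        offset += len(noisy) + 1  # +1 for the joining space
    return " ".join(pieces), spans


rng = random.Random(0)  # fixed seed so the sample is reproducible
noisy_text, error_spans = make_sample("він може це зробити для мене", rng, error_rate=0.5)
print(noisy_text, error_spans)
```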

Perplexity Calculation

We used KenLM for fast perplexity calculation. Rather than training a new language model, we reused the existing Yehor/kenlm-uk model, which was trained on UberText.
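
As a reminder of what is being compared, perplexity is the inverse geometric mean of the token probabilities, so a lower value means the sentence looks more natural to the model. KenLM reports log10 probabilities, which gives the formula used in this sketch. The per-token scores below are made-up numbers for illustration, not real model output. The `kenlm` Python bindings expose a comparable `perplexity` method on a loaded model, which is not used here so that the example stays dependency-free.

```python
from typing import Dict, List, Sequence


def perplexity(log10_probs: Sequence[float]) -> float:
    """Perplexity from per-token log10 probabilities: 10 ** (-sum / N)."""
    if not log10_probs:
        raise ValueError("need at least one token probability")
    return 10.0 ** (-sum(log10_probs) / len(log10_probs))


# Made-up per-token scores for two candidate sentences (illustrative only).
candidate_scores: Dict[str, List[float]] = {
    "то він може це зробити": [-1.0, -1.0, -1.0, -1.0, -1.0],
    "то він моє це зробити": [-1.0, -1.0, -3.0, -2.0, -1.0],
}

# The candidate with the lowest perplexity wins.
ranked = sorted(candidate_scores, key=lambda s: perplexity(candidate_scores[s]))
best = ranked[0]
print(best)  # то він може це зробити
```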

Spell Checker

We used SymSpell for error correction. The dictionary consists of the 500k most frequent words from the UberText corpus.
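
For readers unfamiliar with SymSpell, its core idea is the symmetric delete index: instead of generating every possible edit of the input, both dictionary words and the input are reduced by character deletions and matched on the results. The toy class below sketches that idea for edit distance 1. `TinySymSpell` and the frequency counts are hypothetical, and the real library is considerably more optimized.

```python
from collections import defaultdict
from typing import Dict, List, Set


def deletes(word: str) -> Set[str]:
    """The word itself plus every string obtained by deleting one character."""
    return {word} | {word[:i] + word[i + 1:] for i in range(len(word))}


def osa_distance(a: str, b: str) -> int:
    """Edit distance counting insert, delete, substitute and adjacent swap."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]


class TinySymSpell:
    def __init__(self, frequencies: Dict[str, int]):
        self.frequencies = frequencies
        self.index: Dict[str, Set[str]] = defaultdict(set)
        for word in frequencies:
            for key in deletes(word):
                self.index[key].add(word)

    def lookup(self, word: str) -> List[str]:
        """Dictionary words within one edit, closest first, then most frequent."""
        found: Set[str] = set()
        for key in deletes(word):
            found |= self.index.get(key, set())
        # Shared deletes can also match words two edits away, so verify the distance.
        scored = [(osa_distance(word, c), -self.frequencies[c], c) for c in found]
        return [c for dist, _, c in sorted(scored) if dist <= 1]


# Made-up frequency counts for illustration.
checker = TinySymSpell({"може": 900, "мене": 800, "зробити": 700, "моє": 500, "море": 300})
suggestions = checker.lookup("моее")
print(suggestions)  # ['може', 'море']
```

Note that "мене" shares the delete "мее" with the input but is two edits away, so the distance check filters it out. Ranking by frequency alone would still pick "може" here, but in general the context-aware language model is what separates candidates like "може" and "море".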
