Speliuk is a spell checker for the Ukrainian language based on SymSpell and Language Models.

Speliuk

A more accurate spelling correction for the Ukrainian language.

Motivation

When a spell checker is used in systems that perform automatic spelling correction without human verification, the following questions arise:

  • How to avoid false corrections, i.e. cases where a real word that is missing from the vocabulary gets "corrected"? This is especially relevant for fusional languages such as Ukrainian, where a single lemma has many inflected forms.
  • How to find the single best correction for a misspelled word? Many spell checkers rely only on the frequency of candidates and their edit distance, ignoring the surrounding context.

To address these issues, we propose a system that is compatible with any spell checker but focuses on precision over recall.
We improve the accuracy of a spell checker by using these complementary models:

  • KenLM. The model is used for fast perplexity calculation to find the best candidate for a misspelled word.
  • Transformer-based NER pipeline. It is used to detect misspelled words.
  • SymSpell. As of now, this is the only supported spell checker.
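
How the three components could fit together is sketched below. This is only an illustration under assumptions: `detect`, `candidates`, and `score` are hypothetical callables standing in for the NER detector, SymSpell, and KenLM, and the function does not reflect Speliuk's internal API.

```python
from typing import Callable, List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets


def correct_text(
    text: str,
    detect: Callable[[str], List[Span]],
    candidates: Callable[[str], List[str]],
    score: Callable[[str], float],
) -> str:
    """Replace each flagged span with the candidate that gives the lowest score.

    All three callables are hypothetical stand-ins:
    detect     -> error detector (an NER model in Speliuk)
    candidates -> spell checker suggestions (SymSpell in Speliuk)
    score      -> perplexity-like score, lower is better (KenLM in Speliuk)
    """
    # Process spans right to left so earlier offsets stay valid after edits.
    for start, end in sorted(detect(text), reverse=True):
        options = candidates(text[start:end])
        if not options:
            continue  # no suggestion: leave the word untouched
        best = min(options, key=lambda c: score(text[:start] + c + text[end:]))
        text = text[:start] + best + text[end:]
    return text


# Toy stand-ins, for illustration only.
toy_detect = lambda t: [(7, 11)]
toy_candidates = lambda w: {"моее": ["моє", "може"]}.get(w, [])
toy_score = lambda s: 0.0 if "може це" in s else 1.0

result = correct_text("то він моее це", toy_detect, toy_candidates, toy_score)
print(result)
```

Words outside the detected spans are never passed to the spell checker in this sketch, which is one way to favour precision over recall.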

Installation

  1. For CPU-only inference, install the CPU version of PyTorch.
  2. Make sure you can compile Python extension modules (required for KenLM). On Debian-based Linux distributions, the Python development headers can be installed like this (on older systems the package may be named python-dev):
sudo apt-get install python3-dev
  3. Install Speliuk:
pip install speliuk

Usage

By default, Speliuk will use pre-trained models stored on Hugging Face.

>>> from speliuk.correct import Speliuk
>>> speliuk = Speliuk()
>>> speliuk.load()
>>> speliuk.correct("то він моее це зраабити для меніе?")
Correction(corrected_text='то він може це зробити для мене?', annotations=[Annotation(start=7, end=11, source_text='моее', suggestions=['може'], meta={}), Annotation(start=15, end=23, source_text='зраабити', suggestions=['зробити'], meta={}), Annotation(start=28, end=33, source_text='меніе', suggestions=['мене'], meta={})])
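
The output above suggests that `start` and `end` are character offsets into the input text. The following sketch shows how such annotations could be applied by hand, for example to accept only some of the suggestions. The `Annotation` dataclass here is a hypothetical stand-in that mirrors the printed fields; it is not imported from Speliuk.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Annotation:
    """Hypothetical stand-in mirroring the fields printed above."""
    start: int
    end: int
    source_text: str
    suggestions: List[str]
    meta: Dict = field(default_factory=dict)


def apply_annotations(text: str, annotations: List[Annotation]) -> str:
    """Rebuild the corrected text using the first suggestion of each annotation."""
    # Work right to left so that replacements do not shift earlier offsets.
    for ann in sorted(annotations, key=lambda a: a.start, reverse=True):
        # Sanity check: the offsets should point at the recorded source text.
        assert text[ann.start:ann.end] == ann.source_text
        if ann.suggestions:
            text = text[:ann.start] + ann.suggestions[0] + text[ann.end:]
    return text


text = "то він моее це зраабити для меніе?"
annotations = [
    Annotation(7, 11, "моее", ["може"]),
    Annotation(15, 23, "зраабити", ["зробити"]),
    Annotation(28, 33, "меніе", ["мене"]),
]
corrected = apply_annotations(text, annotations)
print(corrected)
```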

Speliuk can also be used as a component of a spaCy pipeline:

>>> import spacy
>>> from speliuk.correct import CorrectionPipe
>>> nlp = spacy.blank('uk')
>>> nlp.add_pipe('speliuk', config=dict(spacy_spelling_model_path='/my/custom/model'))
>>> doc = nlp("то він моее це зраабити для меніе?")
>>> doc._.speliuk_corrected
'то він може це зробити для мене?'
>>> doc.spans["speliuk_errors"]
[моее, зраабити, меніе]

Training Details

Spelling Error Detection

To detect spelling errors, a spaCy NER model is used.

It was trained on a combination of synthetic and gold-standard data:

  • For synthetic data generation, we used UberText as the source of base texts and nlpaug for error generation. In total, 10k samples from different categories were used.
  • For gold-standard data, we used spelling errors from the UA-GEC corpus.
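
The idea behind the synthetic data can be sketched as follows: inject character-level noise into clean text and record the spans of the corrupted words as labels for the detector. Note that this sketch does not use nlpaug; it is a standard-library stand-in, and the edit operations, the probability `p`, and the function names are all assumptions made for illustration.

```python
import random
from typing import List, Tuple


def corrupt_word(word: str, rng: random.Random) -> str:
    """Apply one random character edit: delete, duplicate, or swap neighbours."""
    i = rng.randrange(len(word) - 1)
    op = rng.choice(["delete", "duplicate", "swap"])
    if op == "delete":
        return word[:i] + word[i + 1:]
    if op == "duplicate":
        return word[:i] + word[i] + word[i:]
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


def make_sample(text: str, p: float, seed: int) -> Tuple[str, List[Tuple[int, int]]]:
    """Return noisy text plus (start, end) character spans of corrupted words."""
    rng = random.Random(seed)
    pieces: List[str] = []
    spans: List[Tuple[int, int]] = []
    offset = 0
    for word in text.split(" "):
        noisy = word
        if len(word) >= 3 and rng.random() < p:
            noisy = corrupt_word(word, rng)
        # Swapping two identical letters changes nothing, so label only real changes.
        if noisy != word:
            spans.append((offset, offset + len(noisy)))
        pieces.append(noisy)
        offset += len(noisy) + 1  # +1 for the separating space
    return " ".join(pieces), spans


noisy_text, error_spans = make_sample("він може це зробити для мене", p=0.5, seed=0)
print(noisy_text, error_spans)
```

The returned spans are in the same (start, end) character-offset form that NER training data typically uses, so each sample pairs a noisy sentence with the positions the detector should flag.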

Perplexity Calculation

We used KenLM for fast perplexity calculation. Instead of training a new model, we used the existing Yehor/kenlm-uk model, which was trained on UberText.
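
As a rough sketch of how perplexity can rank candidates: a language model assigns each full sentence a total log10 probability, perplexity normalises it by the token count, and the candidate yielding the lowest perplexity wins. The scorer below is a toy stand-in so the example runs without a model file; `best_candidate`, `toy_score`, and the `{}` template convention are assumptions for illustration, not Speliuk's implementation.

```python
from typing import Callable, Sequence


def perplexity(log10_prob: float, sentence: str) -> float:
    """Perplexity from a total log10 probability, counting one extra token for </s>."""
    n_tokens = len(sentence.split()) + 1
    return 10.0 ** (-log10_prob / n_tokens)


def best_candidate(
    template: str,
    candidates: Sequence[str],
    log10_score: Callable[[str], float],
) -> str:
    """Fill `{}` in the template with each candidate; keep the lowest-perplexity one."""
    def ppl(candidate: str) -> float:
        sentence = template.format(candidate)
        return perplexity(log10_score(sentence), sentence)
    return min(candidates, key=ppl)


# Toy scorer for illustration only. With the kenlm package and a model file,
# kenlm.Model("/path/to/model.bin").score could be passed instead
# (the path is a placeholder).
def toy_score(sentence: str) -> float:
    known_bigrams = {("він", "може"), ("може", "це")}
    words = sentence.split()
    hits = sum((a, b) in known_bigrams for a, b in zip(words, words[1:]))
    return -2.0 * (len(words) + 1) + hits  # more known bigrams -> higher log-probability


choice = best_candidate("він {} це", ["моє", "може"], toy_score)
print(choice)
```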

Spell Checker

We used SymSpell for error correction. The dictionary consists of the 500k most frequent words from the UberText corpus.
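
For readers unfamiliar with SymSpell, the symmetric delete idea behind it can be sketched in a few lines: index every dictionary word under all of its deletion variants, then look up the deletion variants of the input and rank the matches by edit distance and frequency. This is a simplified pure-Python illustration, not the SymSpell library itself; the class name `TinySymSpell`, the use of plain Levenshtein distance, and the toy frequencies are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def deletes(word: str, max_distance: int) -> Set[str]:
    """All strings reachable from `word` by deleting up to `max_distance` characters."""
    results = {word}
    frontier = {word}
    for _ in range(max_distance):
        frontier = {w[:i] + w[i + 1:] for w in frontier for i in range(len(w))}
        results |= frontier
    return results


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


class TinySymSpell:
    """Simplified symmetric delete index (illustrative, not the real library)."""

    def __init__(self, frequencies: Dict[str, int], max_distance: int = 2) -> None:
        self.frequencies = frequencies
        self.max_distance = max_distance
        self.index: Dict[str, Set[str]] = defaultdict(set)
        for word in frequencies:
            for key in deletes(word, max_distance):
                self.index[key].add(word)

    def lookup(self, term: str) -> List[Tuple[str, int, int]]:
        """Return (word, distance, frequency), closest and most frequent first."""
        found: Set[str] = set()
        for key in deletes(term, self.max_distance):
            found |= self.index.get(key, set())
        scored = [(w, edit_distance(term, w), self.frequencies[w]) for w in found]
        scored = [s for s in scored if s[1] <= self.max_distance]
        return sorted(scored, key=lambda s: (s[1], -s[2]))


# Toy frequency dictionary; the counts are made up for the example.
checker = TinySymSpell({"може": 900, "моє": 400, "море": 300, "зробити": 500})
suggestions = checker.lookup("моее")
print(suggestions)
```

Because this ranking looks only at edit distance and frequency, it cannot tell which of two equally close candidates fits the sentence, which is the gap the language model is meant to fill.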
