Speliuk is a spell checker for the Ukrainian language based on SymSpell and Language Models.

Speliuk

A more accurate spelling correction for the Ukrainian language.

Motivation

When a spell checker is used in systems that perform automatic spelling correction without human verification, the following questions arise:

How can false corrections be avoided, i.e. cases where a real word that is missing from the vocabulary gets "corrected"? This is especially relevant for fusional languages such as Ukrainian.
How can the single best correction for a misspelled word be found? Many spell checkers rely on candidate frequency and edit distance, discarding the surrounding context.

To address these issues, we propose a system that is compatible with any spell checker but focuses on precision over recall.
We improve the accuracy of a spell checker by using these complementary models:

KenLM. The model is used for fast perplexity calculation to find the best candidate for a misspelled word.
Transformer-based NER pipeline. The pipeline is used to detect misspelled words.
SymSpell. As of now, this is the only supported spell checker.
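
The way these three components fit together can be sketched roughly as follows. This is an illustration, not Speliuk's actual code: `correct_text` and the three `toy_*` functions are hypothetical names, and the toy functions are trivial stand-ins for the NER detector, SymSpell, and KenLM respectively.

```python
from typing import Callable, List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets


def correct_text(
    text: str,
    detect: Callable[[str], List[Span]],
    candidates: Callable[[str], List[str]],
    score: Callable[[str], float],
) -> str:
    """Replace each detected span with the candidate that scores lowest in context."""
    # Process spans right to left so earlier offsets stay valid after a replacement.
    for start, end in sorted(detect(text), reverse=True):
        options = candidates(text[start:end])
        if not options:
            continue  # precision over recall: leave the word untouched
        best = min(options, key=lambda c: score(text[:start] + c + text[end:]))
        text = text[:start] + best + text[end:]
    return text


# Toy stand-ins, for illustration only.
KNOWN_BIGRAMS = {("він", "може"), ("може", "це")}


def toy_detect(text: str) -> List[Span]:
    # Stand-in for the NER detector: flags one hard-coded misspelling.
    start = text.find("моее")
    return [(start, start + 4)] if start >= 0 else []


def toy_candidates(word: str) -> List[str]:
    # Stand-in for SymSpell: a fixed candidate list.
    return {"моее": ["моє", "може"]}.get(word, [])


def toy_score(sentence: str) -> float:
    # Stand-in for perplexity: more known bigrams means a lower (better) score.
    words = sentence.split()
    hits = sum((a, b) in KNOWN_BIGRAMS for a, b in zip(words, words[1:]))
    return -float(hits)


print(correct_text("він моее це", toy_detect, toy_candidates, toy_score))
```

Because only the detected spans are ever touched, a word the detector does not flag is never changed, even if it is missing from the spell checker's dictionary.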

Installation

For CPU-only inference, install the CPU version of PyTorch.
Make sure you can compile Python extension modules (required for KenLM). If you are on Linux, you can install the required Python development headers like this:

sudo apt-get install python3-dev

Install Speliuk:

pip install speliuk

Usage

By default, Speliuk will use pre-trained models stored on Hugging Face.

>>> from speliuk.correct import Speliuk
>>> speliuk = Speliuk()
>>> speliuk.load()
>>> speliuk.correct("то він моее це зраабити для меніе?")
Correction(corrected_text='то він може це зробити для мене?', annotations=[Annotation(start=7, end=11, source_text='моее', suggestions=['може'], meta={}), Annotation(start=15, end=23, source_text='зраабити', suggestions=['зробити'], meta={}), Annotation(start=28, end=33, source_text='меніе', suggestions=['мене'], meta={})])
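
The annotations carry character offsets into the original text, so a caller can apply suggestions selectively instead of taking `corrected_text` as a whole. The sketch below assumes only the fields visible in the output above. The `Annotation` dataclass is a local stand-in that mirrors that output, not the class shipped with Speliuk, and `apply_annotations` is a hypothetical helper.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Annotation:
    # Stand-in mirroring the fields printed above; not imported from speliuk.
    start: int
    end: int
    source_text: str
    suggestions: List[str]
    meta: Dict = field(default_factory=dict)


def apply_annotations(text: str, annotations: List[Annotation]) -> str:
    """Replace each annotated span with its first suggestion."""
    # Work right to left so earlier offsets remain valid after each replacement.
    for ann in sorted(annotations, key=lambda a: a.start, reverse=True):
        if text[ann.start:ann.end] != ann.source_text:
            raise ValueError(f"offsets do not match source text: {ann}")
        if ann.suggestions:
            text = text[:ann.start] + ann.suggestions[0] + text[ann.end:]
    return text


text = "то він моее це зраабити для меніе?"
annotations = [
    Annotation(7, 11, "моее", ["може"]),
    Annotation(15, 23, "зраабити", ["зробити"]),
    Annotation(28, 33, "меніе", ["мене"]),
]
print(apply_annotations(text, annotations))
```

Passing only a subset of the annotations applies only those corrections and leaves the rest of the text as it was.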

Speliuk can also be used as a component in a spaCy pipeline:

>>> import spacy
>>> from speliuk.correct import CorrectionPipe  # importing is assumed to register the 'speliuk' component
>>> nlp = spacy.blank('uk')
>>> nlp.add_pipe('speliuk', config=dict(spacy_spelling_model_path='/my/custom/model'))
>>> doc = nlp("то він моее це зраабити для меніе?")
>>> doc._.speliuk_corrected
'то він може це зробити для мене?'
>>> doc.spans["speliuk_errors"]
[моее, зраабити, меніе]

Training Details

Spelling Error Detection

To detect spelling errors, a spaCy NER model is used.

It was trained on a combination of synthetic and golden data:

For synthetic data generation, we used UberText as base texts and nlpaug for error generation. In total, 10k samples from different categories were used.
For golden data, we used spelling errors from the UA-GEC corpus.
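
To illustrate what synthetic training data for a span-based detector can look like, here is a minimal hand-rolled sketch. It does not use nlpaug, and the augmenters and settings actually used for Speliuk are not described here. The names `corrupt_word` and `make_sample`, the error operations, and the `error_rate` parameter are all assumptions made for the example.

```python
import random
from typing import List, Tuple

ALPHABET = "абвгґдеєжзиіїйклмнопрстуфхцчшщьюя"


def corrupt_word(word: str, rng: random.Random) -> str:
    """Apply one random character edit: insert, delete, substitute, or swap."""
    i = rng.randrange(len(word))
    op = rng.choice(["insert", "delete", "substitute", "swap"])
    if op == "insert":
        return word[:i] + rng.choice(ALPHABET) + word[i:]
    if op == "delete" and len(word) > 1:
        return word[:i] + word[i + 1:]
    if op == "swap" and i + 1 < len(word):
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    # Fallback for every other case: substitute one character.
    return word[:i] + rng.choice(ALPHABET) + word[i + 1:]


def make_sample(
    text: str, error_rate: float, seed: int = 0
) -> Tuple[str, List[Tuple[int, int]]]:
    """Return noisy text plus the character spans of the corrupted words."""
    rng = random.Random(seed)
    pieces: List[str] = []
    spans: List[Tuple[int, int]] = []
    offset = 0
    for word in text.split():
        noisy = corrupt_word(word, rng) if rng.random() < error_rate else word
        if noisy != word:  # an edit can be a no-op, e.g. swapping equal letters
            spans.append((offset, offset + len(noisy)))
        pieces.append(noisy)
        offset += len(noisy) + 1  # +1 for the joining space
    return " ".join(pieces), spans


noisy_text, error_spans = make_sample("він може це зробити", error_rate=0.5, seed=42)
print(noisy_text, error_spans)
```

The returned spans play the role of NER labels: each one marks a token that the detector should learn to flag as misspelled. Splitting on whitespace keeps punctuation attached to words, which a real pipeline would handle with a proper tokenizer.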

Perplexity Calculation

We used KenLM for fast perplexity calculation, relying on the existing Yehor/kenlm-uk model trained on UberText.
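
As a reminder of what is being compared, perplexity is the inverse probability of a sentence normalized by its length, so lower values mean the language model finds the sentence more plausible. KenLM reports scores as log10 probabilities. The sketch below shows the conversion and the selection step under that assumption. The `perplexity` helper is a hypothetical name, and the scores are made-up numbers rather than output from Yehor/kenlm-uk.

```python
from typing import Dict


def perplexity(log10_prob: float, num_tokens: int) -> float:
    """Perplexity from a total log10 probability over num_tokens tokens."""
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    return 10.0 ** (-log10_prob / num_tokens)


# Made-up log10 scores for two candidate sentences, each counted as 4 tokens
# (3 words plus an end-of-sentence token).
candidate_scores: Dict[str, float] = {
    "він може це": -8.0,
    "він моє це": -12.0,
}

ppl = {sentence: perplexity(score, 4) for sentence, score in candidate_scores.items()}
best = min(ppl, key=ppl.get)  # the lowest perplexity wins
print(best, ppl[best])
```

Normalizing by token count matters when candidates change the number of tokens, for example when a correction splits or merges words.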

Spell Checker

We used SymSpell for error correction. The dictionary consists of the 500k most frequent words from the UberText corpus.
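
A frequency dictionary of this kind can be built by counting tokens and keeping the most frequent ones. The following is a minimal sketch under simple assumptions (lowercasing, whitespace tokenization, alphabetic tokens only), and the actual preprocessing used for Speliuk may differ. The function names are hypothetical, and the one "term count" pair per line layout is the plain-text form commonly used for SymSpell dictionaries.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def build_frequency_dictionary(
    lines: Iterable[str], top_k: int
) -> List[Tuple[str, int]]:
    """Count lowercase alphabetic tokens and keep the top_k most frequent."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(tok for tok in line.lower().split() if tok.isalpha())
    return counts.most_common(top_k)


def to_dictionary_lines(entries: List[Tuple[str, int]]) -> str:
    """Serialize entries as one 'term count' pair per line."""
    return "\n".join(f"{term} {count}" for term, count in entries)


corpus = ["він може це зробити", "вона може це", "це може він"]
entries = build_frequency_dictionary(corpus, top_k=3)
print(to_dictionary_lines(entries))
```

Capping the dictionary at the most frequent words trades coverage for speed and memory. Rare but valid words then fall outside the vocabulary, which is the case the error detection step is meant to protect from false corrections.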

Release history

0.0.2 (this version): Sep 15, 2024
0.0.1: Sep 15, 2024

Download files

Source Distribution

speliuk-0.0.2.tar.gz (7.2 kB)

Uploaded Sep 15, 2024 Source

Built Distribution

speliuk-0.0.2-py3-none-any.whl (7.9 kB)

Uploaded Sep 15, 2024 Python 3

File details

Details for the file speliuk-0.0.2.tar.gz.

File metadata

Download URL: speliuk-0.0.2.tar.gz
Upload date: Sep 15, 2024
Size: 7.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: poetry/1.8.3 CPython/3.12.4 Linux/6.8.0-40-generic

File hashes

Hashes for speliuk-0.0.2.tar.gz
Algorithm	Hash digest
SHA256	`f27c2702bd7d523c8fec7137bc2193397a902be25a7509e49feacca948d799b0`
MD5	`1f2800940d29a90a7d8e3fd2ffb4a036`
BLAKE2b-256	`d07a9fb0a30389b9241c9aa9a043a20aba335ce976ea74181068e21f233253ce`

File details

Details for the file speliuk-0.0.2-py3-none-any.whl.

File metadata

Download URL: speliuk-0.0.2-py3-none-any.whl
Upload date: Sep 15, 2024
Size: 7.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: poetry/1.8.3 CPython/3.12.4 Linux/6.8.0-40-generic

File hashes

Hashes for speliuk-0.0.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`92ea6d70b7258e32cc3fc21416eeb772b2741e986376f5f7679c7856a3263918`
MD5	`3b8071d87c42e8c0d2afb34e145be447`
BLAKE2b-256	`f851236954ba00b98bf52d801e50f66c52440465fe5f889be33e234c3e9f71e0`
