Speliuk is a spell checker for the Ukrainian language based on SymSpell and Language Models.
Speliuk
A more accurate spelling correction for the Ukrainian language.
Motivation
When a spell checker is used in a system that corrects spelling automatically, without human verification, the following questions arise:
- How to avoid false corrections, i.e. cases where a real word that is missing from the vocabulary gets "corrected"? This is especially relevant for fusional languages such as Ukrainian, where a single word has many inflected forms.
- How to find the single best correction for a misspelled word? Many spell checkers rank candidates only by frequency and edit distance, discarding the surrounding context.
To address these issues, we propose a system that is compatible with any spell checker but focuses on precision over recall.
We improve the accuracy of a spell checker by using these complementary models:
- KenLM. The model is used for fast perplexity calculation to find the best candidate for a misspelled word.
- Transformer-based NER pipeline. The pipeline is used to detect misspelled words.
- SymSpell. As of now, this is the only supported spell checker.
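How the three models fit together can be sketched roughly as follows. This is a minimal illustration, not Speliuk's actual implementation: `correct_text` and the `toy_*` functions are hypothetical stand-ins for the NER detector, SymSpell, and KenLM.

```python
import re
from typing import Callable, List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def correct_text(
    text: str,
    detect: Callable[[str], List[Span]],
    suggest: Callable[[str], List[str]],
    perplexity: Callable[[str], float],
) -> str:
    """Replace each detected span with the candidate that yields the lowest perplexity."""
    # Process spans right to left so earlier offsets stay valid after a replacement.
    for start, end in sorted(detect(text), reverse=True):
        candidates = suggest(text[start:end])
        if not candidates:
            continue  # precision over recall: no candidate, no change
        best = min(candidates, key=lambda c: perplexity(text[:start] + c + text[end:]))
        text = text[:start] + best + text[end:]
    return text


# Toy stand-ins (assumptions for illustration only).
KNOWN = {"то", "він", "може", "це"}


def toy_detect(text: str) -> List[Span]:
    # Stand-in for the NER model: flag every word outside a tiny vocabulary.
    return [m.span() for m in re.finditer(r"\w+", text) if m.group() not in KNOWN]


def toy_suggest(word: str) -> List[str]:
    # Stand-in for SymSpell: a fixed candidate table.
    return {"моее": ["моє", "може"]}.get(word, [])


def toy_perplexity(sentence: str) -> float:
    # Stand-in for KenLM: prefer sentences containing a known bigram.
    return 1.0 if "він може" in sentence else 10.0


print(correct_text("то він моее це", toy_detect, toy_suggest, toy_perplexity))
# prints: то він може це
```

Only words flagged by the detector are ever changed, which is what keeps out-of-vocabulary but valid words untouched.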
Installation
- For CPU-only inference, install the CPU version of PyTorch.
- Make sure you can compile Python extension modules (required for KenLM). On Debian or Ubuntu, the Python development headers can be installed like this:
sudo apt-get install python3-dev
- Install Speliuk:
pip install speliuk
Usage
By default, Speliuk will use pre-trained models stored on Hugging Face.
>>> from speliuk.correct import Speliuk
>>> speliuk = Speliuk()
>>> speliuk.load()
>>> speliuk.correct("то він моее це зраабити для меніе?")
Correction(corrected_text='то він може це зробити для мене?', annotations=[Annotation(start=7, end=11, source_text='моее', suggestions=['може'], meta={}), Annotation(start=15, end=23, source_text='зраабити', suggestions=['зробити'], meta={}), Annotation(start=28, end=33, source_text='меніе', suggestions=['мене'], meta={})])
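The `start` and `end` fields in the output above are character offsets into the input string, with `end` exclusive. The sketch below shows how such annotations map back onto the text. It uses a local `Annotation` dataclass that only mirrors the printed fields (it is not imported from speliuk), and `apply_annotations` is a hypothetical helper.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Annotation:
    # Local mirror of the fields shown in the printed output; an assumption,
    # not the class shipped with speliuk.
    start: int
    end: int
    source_text: str
    suggestions: List[str]
    meta: Dict = field(default_factory=dict)


def apply_annotations(text: str, annotations: List[Annotation]) -> str:
    """Apply the first suggestion of each annotation to the text."""
    # Go right to left so that earlier offsets are not shifted by replacements.
    for ann in sorted(annotations, key=lambda a: a.start, reverse=True):
        if not ann.suggestions:
            continue
        if text[ann.start:ann.end] != ann.source_text:
            raise ValueError(f"offsets do not match source text: {ann}")
        text = text[:ann.start] + ann.suggestions[0] + text[ann.end:]
    return text


source = "то він моее це зраабити для меніе?"
annotations = [
    Annotation(7, 11, "моее", ["може"]),
    Annotation(15, 23, "зраабити", ["зробити"]),
    Annotation(28, 33, "меніе", ["мене"]),
]
print(apply_annotations(source, annotations))
# prints: то він може це зробити для мене?
```

A helper like this could be useful when only some of the suggested corrections should be accepted, for example after filtering the annotations.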
Speliuk can also be used as a component in a spaCy pipeline:
>>> import spacy
>>> from speliuk.correct import CorrectionPipe
>>> nlp = spacy.blank('uk')
>>> nlp.add_pipe('speliuk', config=dict(spacy_spelling_model_path='/my/custom/model'))
>>> doc = nlp("то він моее це зраабити для меніе?")
>>> doc._.speliuk_corrected
'то він може це зробити для мене?'
>>> doc.spans["speliuk_errors"]
[моее, зраабити, меніе]
Training Details
Spelling Error Detection
To detect spelling errors, a spaCy NER model is used.
It was trained on a combination of synthetic and golden data:
- For synthetic data generation, we used UberText as the source of base texts and nlpaug for error generation. In total, 10k samples from different categories were used.
- For golden data, we used spelling errors from the UA-GEC corpus.
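The idea behind the synthetic data can be illustrated with a small standard-library sketch. It does not use nlpaug and does not reproduce the actual training setup: `corrupt_word`, `make_sample`, the `"ERR"` label, and the error rate are all assumptions made for illustration. The sketch injects one random character edit into some words and records the character span of each corrupted word, which is the kind of annotation an NER model can be trained on.

```python
import random
import re
from typing import List, Tuple

ALPHABET = "абвгґдеєжзиіїйклмнопрстуфхцчшщьюя"


def corrupt_word(word: str, rng: random.Random) -> str:
    """Apply one random character edit: insert, delete, swap, or substitute."""
    i = rng.randrange(len(word))
    op = rng.choice(["insert", "delete", "swap", "substitute"])
    if op == "insert":
        return word[:i] + rng.choice(ALPHABET) + word[i:]
    if op == "delete" and len(word) > 1:
        return word[:i] + word[i + 1:]
    if op == "swap" and i + 1 < len(word):
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    # Substitution, also used as the fallback when another edit is not applicable.
    return word[:i] + rng.choice(ALPHABET) + word[i + 1:]


def make_sample(
    text: str, error_rate: float, rng: random.Random
) -> Tuple[str, List[Tuple[int, int, str]]]:
    """Return noisy text plus (start, end, label) spans for corrupted words."""
    parts: List[str] = []
    spans: List[Tuple[int, int, str]] = []
    cursor = 0  # position in the original text
    pos = 0     # position in the noisy text
    for match in re.finditer(r"\w+", text):
        gap = text[cursor:match.start()]
        parts.append(gap)
        pos += len(gap)
        word = match.group()
        if len(word) > 2 and rng.random() < error_rate:
            noisy = corrupt_word(word, rng)
            if noisy != word:  # an edit can be a no-op, e.g. swapping equal letters
                spans.append((pos, pos + len(noisy), "ERR"))
            word = noisy
        parts.append(word)
        pos += len(word)
        cursor = match.end()
    parts.append(text[cursor:])
    return "".join(parts), spans


noisy_text, error_spans = make_sample("він може це зробити", 0.5, random.Random(0))
print(noisy_text, error_spans)
```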
Perplexity Calculation
We used KenLM for quick perplexity calculation. Rather than training our own model, we used an existing one, Yehor/kenlm-uk, which was trained on UberText.
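For reference, perplexity can be computed from per-token log10 probabilities, which is the form in which n-gram language models such as KenLM report scores. The sketch below is a toy illustration of candidate ranking by perplexity: the bigram table, the `UNSEEN` penalty, and the function names are made up, and a real setup would obtain scores from the KenLM model instead.

```python
from typing import List

# Toy bigram log10 probabilities (invented numbers, for illustration only).
TOY_BIGRAM_LOG10 = {
    ("він", "може"): -0.5,
    ("може", "це"): -0.7,
}
UNSEEN = -4.0  # flat penalty for bigrams missing from the toy table


def perplexity(log10_probs: List[float]) -> float:
    """Perplexity = 10 ** (-(sum of log10 token probabilities) / token count)."""
    return 10.0 ** (-sum(log10_probs) / len(log10_probs))


def score_tokens(tokens: List[str]) -> List[float]:
    """Score each token given its predecessor, starting from a <s> marker."""
    padded = ["<s>"] + tokens
    return [TOY_BIGRAM_LOG10.get(pair, UNSEEN) for pair in zip(padded, padded[1:])]


def rank_candidates(left: List[str], candidates: List[str], right: List[str]) -> List[str]:
    """Order candidates by the perplexity of the sentence they produce, best first."""
    return sorted(candidates, key=lambda c: perplexity(score_tokens(left + [c] + right)))


print(rank_candidates(["він"], ["моє", "може"], ["це"]))
# prints: ['може', 'моє']
```

Because the whole sentence is scored, the surrounding context decides between candidates that frequency and edit distance alone could not separate.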
Spell Checker
We used SymSpell for error correction. The dictionary consists of the 500k most frequent words from the UberText corpus.
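A frequency dictionary of this kind can be built along the following lines. This is a minimal sketch, not the script used for Speliuk: the tokenization rule and the function names are assumptions. The output uses the plain-text "term count" layout that SymSpell frequency dictionaries follow, one entry per line.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Letters only, optionally joined by an apostrophe (as in "м'який").
WORD_RE = re.compile(r"[^\W\d_]+(?:['’ʼ][^\W\d_]+)*")


def build_frequency_dictionary(lines: Iterable[str], top_k: int) -> List[Tuple[str, int]]:
    """Count lowercased words and keep the top_k most frequent ones."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(WORD_RE.findall(line.lower()))
    return counts.most_common(top_k)


def to_symspell_lines(entries: List[Tuple[str, int]]) -> List[str]:
    """Format entries as 'term count' lines."""
    return [f"{term} {count}" for term, count in entries]


corpus = ["Він може це зробити.", "Вона може це."]
for entry in to_symspell_lines(build_frequency_dictionary(corpus, top_k=2)):
    print(entry)
```

Capping the dictionary at the most frequent words trades coverage for fewer spurious candidates; rare valid words that fall outside it are protected by the error detection step rather than by the dictionary.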
File details
Details for the file speliuk-0.0.2.tar.gz.
File metadata
- Download URL: speliuk-0.0.2.tar.gz
- Upload date:
- Size: 7.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.8.3 CPython/3.12.4 Linux/6.8.0-40-generic
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f27c2702bd7d523c8fec7137bc2193397a902be25a7509e49feacca948d799b0 |
| MD5 | 1f2800940d29a90a7d8e3fd2ffb4a036 |
| BLAKE2b-256 | d07a9fb0a30389b9241c9aa9a043a20aba335ce976ea74181068e21f233253ce |
File details
Details for the file speliuk-0.0.2-py3-none-any.whl.
File metadata
- Download URL: speliuk-0.0.2-py3-none-any.whl
- Upload date:
- Size: 7.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.8.3 CPython/3.12.4 Linux/6.8.0-40-generic
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 92ea6d70b7258e32cc3fc21416eeb772b2741e986376f5f7679c7856a3263918 |
| MD5 | 3b8071d87c42e8c0d2afb34e145be447 |
| BLAKE2b-256 | f851236954ba00b98bf52d801e50f66c52440465fe5f889be33e234c3e9f71e0 |