A spelling and grammar corrector for Icelandic

Overview

ReynirCorrect is a Python 3.x package for checking and correcting spelling and grammar in Icelandic text.

ReynirCorrect uses the Reynir package, by the same authors, to tokenize and parse text.

Token-level correction

ReynirCorrect can tokenize text and return a corrected token list. This catches token-level errors, such as spelling errors and erroneous phrases, but not grammatical errors.

Full grammar analysis

ReynirCorrect can also analyze text grammatically by attempting to parse it, after token-level correction. The parsing is done according to Reynir’s context-free grammar for Icelandic, augmented with additional production rules for common grammatical errors. The analysis returns a set of annotations (errors and suggestions) that apply to spans (consecutive tokens) within sentences in the resulting token list.

Status

This code is under development and has early beta status. It will eventually become the foundation of a spelling and grammar checker that will be open to the public on the Greynir.is website. It will also be available as an open-source Python package that can be installed using pip.

Example

To tokenize text with token-level correction (the text is not parsed in this case, so no grammar checking is done):

>>> from reynir_correct import tokenize
>>> g = tokenize("Af gefnu tilefni fékk fékk daninn vilja sýnum "
...     "framgengt við hana í auknu mæli.")
>>> for tok in g:
...     print("{0:10} {1}".format(tok.txt or "", tok.error_description))

Output:

Að         Orðasambandið 'Af gefnu tilefni' var leiðrétt í 'að gefnu tilefni'
gefnu
tilefni
fékk       Endurtekið orð ('fékk') var fellt burt
Daninn     Orð á að byrja á hástaf: 'daninn'
vilja      Orðasambandið 'vilja sýnum framgengt' var leiðrétt í 'vilja sínum framgengt'
sínum
framgengt
við
hana
í          Orðasambandið 'í auknu mæli' var leiðrétt í 'í auknum mæli'
auknum
mæli
.
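
When only the corrected tokens are of interest, the token stream can be filtered on its error description. The following is a minimal sketch, not part of the package: corrections() is a hypothetical helper, and it assumes only the two attributes used above, tok.txt and tok.error_description, with an empty or missing description taken to mean that no correction was made. The sample tokens are built by hand so that the sketch runs without the package installed.

```python
from types import SimpleNamespace


def corrections(tokens):
    """Return (text, description) pairs for tokens that carry a correction.

    Hypothetical helper: relies only on the `txt` and `error_description`
    attributes shown in the example above.
    """
    result = []
    for tok in tokens:
        # Treat a missing or empty description as "no correction"
        description = getattr(tok, "error_description", "") or ""
        if description:
            result.append((tok.txt or "", description))
    return result


# Hand-built stand-ins for a few tokens from the output above
# (assumption: real tokens expose the same two attributes).
sample = [
    SimpleNamespace(txt="fékk", error_description="Endurtekið orð ('fékk') var fellt burt"),
    SimpleNamespace(txt="Daninn", error_description="Orð á að byrja á hástaf: 'daninn'"),
    SimpleNamespace(txt="við", error_description=""),
]

for text, description in corrections(sample):
    print("{0:10} {1}".format(text, description))
```

With the package installed, passing the result of tokenize() to corrections() should behave the same way, provided the tokens expose these attributes as in the example.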

To get a list of spelling and grammar annotations for a sentence:

>>> from reynir_correct import check_single
>>> sent = check_single("Páli, vini mínum, langaði að horfa á sjónvarpið.")
>>> for annotation in sent.annotations:
...     print("{0}".format(annotation))

Output:

000-004: E003  Frumlag sagnarinnar 'að langa' á að vera í þolfalli en ekki í þágufalli

               [ The subject of the verb 'að langa/to want' should be in the
                  accusative case, not the dative case ]

Note that the annotation.start and annotation.end properties (here start is 0 and end is 4) contain the indices of the first and last tokens to which the annotation applies, so the span is inclusive at both ends. E003 is the error code.
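
Because the end index is inclusive, slicing a token list by an annotation needs end + 1. The sketch below illustrates this and also splits the printed annotation form into its parts. It is an illustration under stated assumptions, not the package's API: parse_annotation() and span_tokens() are hypothetical helpers, the pattern is inferred from the single output line shown above, and the token list is written by hand rather than produced by the package.

```python
import re

# Pattern for the printed form shown above: "000-004: E003  message".
# Assumption: start and end are zero-padded integers and the error code
# is a single run of non-space characters.
ANNOTATION_RE = re.compile(r"^(\d+)-(\d+):\s+(\S+)\s+(.*)$")


def parse_annotation(line):
    """Split a printed annotation into (start, end, code, message)."""
    match = ANNOTATION_RE.match(line)
    if match is None:
        raise ValueError("not an annotation line: {0!r}".format(line))
    start, end, code, message = match.groups()
    return int(start), int(end), code, message


def span_tokens(tokens, start, end):
    """Return the tokens covered by an annotation; `end` is inclusive."""
    return tokens[start:end + 1]


# Hand-written token texts for the example sentence (an assumption about
# how the sentence is split, not output from the package).
tokens = ["Páli", ",", "vini", "mínum", ",", "langaði",
          "að", "horfa", "á", "sjónvarpið", "."]

line = ("000-004: E003  Frumlag sagnarinnar 'að langa' á að vera "
        "í þolfalli en ekki í þágufalli")
start, end, code, message = parse_annotation(line)
print(code, span_tokens(tokens, start, end))
```

Under this hand-written tokenization, the span 0 to 4 covers the subject phrase "Páli, vini mínum," including both commas, which is the part of the sentence the error message refers to.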

Prerequisites

This package runs on CPython 3.4 or newer, and on PyPy 3.5 or newer.

Installation

To install this package:

$ pip3 install reynir-correct   # or pip install reynir-correct if Python 3 is your default

If you want to be able to edit the source, install the package in development mode as follows (assuming you have git installed):

$ git clone https://github.com/mideind/ReynirCorrect
$ cd ReynirCorrect
$ # [ Activate your virtualenv here if you have one ]
$ python setup.py develop

The package source code is now in ReynirCorrect/src/reynir_correct.

Tests

To run the built-in tests, install pytest, cd to your ReynirCorrect subdirectory (and optionally activate your virtualenv), then run:

$ python -m pytest
