UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian language

This package contains the UA-GEC data and code to work with it.

Python library

The ua_gec Python package bundles the corpus data with the code for working with it.

Getting started

The simplest way to install the package is with pip:

    $ pip install ua_gec==1.0

Alternatively, you can install it from source:

    $ cd python
    $ python setup.py develop

Iterating through corpus

Once installed, you may get annotated documents from Python code:

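    >>> from ua_gec import Corpus
    >>> corpus = Corpus(partition="train")
    >>> for doc in corpus:
    ...     print(doc.source)        # text before correction
    ...     print(doc.target)        # text after correction
    ...     print(doc.annotated)     # text with in-line annotations
    ...     print(doc.meta.region)   # region field from the document metadata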

Working with annotations

[The docs are under construction]

Train-test split

We expect users of the corpus to train and tune their models on the train split only (you are free to split it further into train and dev sets or to use cross-validation). Use the test split only for reporting the scores of your final models. Never optimize on the test set: do not tune hyperparameters on it, and do not use it for model selection in any way.
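
As an illustration of carving a dev set out of the train split, here is a minimal sketch. The helper name `split_train_dev`, the 20% dev fraction, and the seed are assumptions made for this example and are not part of the ua_gec package. It splits at the document level and works on any iterable of documents, so the stand-in strings below could be replaced with `Corpus(partition="train")`.

```python
import random


def split_train_dev(docs, dev_fraction=0.1, seed=0):
    """Shuffle `docs` deterministically and return (train, dev) lists.

    Hypothetical helper for illustration; not provided by ua_gec.
    """
    if not 0.0 < dev_fraction < 1.0:
        raise ValueError("dev_fraction must be strictly between 0 and 1")
    docs = list(docs)            # materialise the iterable so it can be shuffled
    rng = random.Random(seed)    # local RNG: the global random state is untouched
    rng.shuffle(docs)
    n_dev = max(1, int(len(docs) * dev_fraction))
    return docs[n_dev:], docs[:n_dev]


# Stand-in documents; with the real corpus you would pass Corpus(partition="train").
train, dev = split_train_dev([f"doc-{i}" for i in range(20)], dev_fraction=0.2, seed=42)
print(len(train), len(dev))
```

Fixing the seed keeps the dev set stable across runs, so scores from different experiments stay comparable. The test split is never touched by this procedure.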

See the Statistics section for per-split statistics.

Annotation format

Annotated files are text files that use the following in-text annotation format: {error=>edit:::error_type=Tag}, where error and edit stand for the text item before and after correction, respectively, and Tag denotes an error category (Grammar, Spelling, Punctuation, or Fluency).

Example of an annotated text:

    I {like=>likes:::error_type=Grammar} turtles.

An accompanying Python package, ua_gec, provides many tools for working with annotated texts. See its documentation for details.
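
To make the format concrete, the following is a minimal sketch of reading it with a regular expression. It is independent of the ua_gec package and does not reproduce its API. The names `ANNOTATION_RE` and `parse_annotated` are made up for this example, and the pattern assumes that annotations are not nested and that the annotated text contains no other curly braces.

```python
import re

# Matches {error=>edit:::error_type=Tag}.
# Assumption: annotations are not nested and contain no literal braces.
ANNOTATION_RE = re.compile(
    r"\{(?P<error>[^{}]*?)=>(?P<edit>[^{}]*?):::error_type=(?P<tag>\w+)\}"
)


def parse_annotated(text):
    """Return (source, target, edits) for a string in the in-text format.

    Hypothetical helper for illustration; not provided by ua_gec.
    """
    source, target, edits = [], [], []
    pos = 0
    for m in ANNOTATION_RE.finditer(text):
        plain = text[pos:m.start()]       # unannotated text before this edit
        source.extend([plain, m.group("error")])
        target.extend([plain, m.group("edit")])
        edits.append((m.group("error"), m.group("edit"), m.group("tag")))
        pos = m.end()
    tail = text[pos:]                     # unannotated text after the last edit
    source.append(tail)
    target.append(tail)
    return "".join(source), "".join(target), edits


source, target, edits = parse_annotated("I {like=>likes:::error_type=Grammar} turtles.")
print(source)  # I like turtles.
print(target)  # I likes turtles.
print(edits)   # [('like', 'likes', 'Grammar')]
```

For real work, prefer the tools in the ua_gec package, which are built for this format.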

Statistics

UA-GEC contains:

Split    Documents   Sentences    Tokens   Authors
train          851      18,225   285,247       416
test           160       2,490    43,432        76
TOTAL        1,011      20,715   328,779       492

The corpus statistics can be generated by running a script from the Python package (note that the ua-gec package must be installed first):

    $ python ./python/ua_gec/stats.py

Contributing

  • The data collection is an ongoing activity. You can always contribute your Ukrainian writings or complete one of the writing tasks at https://ua-gec-dataset.grammarly.ai/

  • Code and documentation improvements are welcome. Please open a pull request.

Contacts
