UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian language
This package contains the UA-GEC data and code to work with it.
Python library
The ua_gec Python package bundles the corpus data with the code to work with it.
Getting started
A simple way to install the package is with pip:
$ pip install ua_gec==1.0
Alternatively, you can install it from source:
$ cd python
$ python setup.py develop
Iterating through corpus
Once installed, you may get annotated documents from Python code:
>>> from ua_gec import Corpus
>>> corpus = Corpus(partition="train")
>>> for doc in corpus:
...     print(doc.source)
...     print(doc.target)
...     print(doc.annotated)
...     print(doc.meta.region)
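As a further illustration of iterating over documents, here is a minimal sketch that tallies documents by region. It relies only on the doc.meta.region attribute shown above. The helper name count_by_region and the region labels are hypothetical, and the stand-in documents exist only so the sketch runs without the corpus installed.

```python
from collections import Counter
from types import SimpleNamespace


def count_by_region(docs):
    """Tally documents by the value of doc.meta.region (hypothetical helper)."""
    return Counter(doc.meta.region for doc in docs)


# Stand-in documents with placeholder region labels (not real corpus values).
# With the package installed, pass Corpus(partition="train") instead.
fake_docs = [
    SimpleNamespace(meta=SimpleNamespace(region="region-a")),
    SimpleNamespace(meta=SimpleNamespace(region="region-b")),
    SimpleNamespace(meta=SimpleNamespace(region="region-a")),
]

region_counts = count_by_region(fake_docs)
for region, count in region_counts.most_common():
    print(region, count)
```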
Working with annotations
[The docs are under construction]
Train-test split
We expect users of the corpus to train and tune their models on the train split only (of course, you are free to further split it into train and dev sets or use cross-validation). Use the test split only for reporting the scores of your final models. Never optimize on the test set: do not tune hyperparameters on it, and please do not use it for model selection in any way.
See the Statistics section for per-split statistics.
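One possible way to carve a dev set out of the train split is a seeded, document-level shuffle, sketched below. The function name split_train_dev, the 10% dev fraction, and the seed are assumptions made for illustration, not part of the ua_gec package. The integers stand in for documents; with the package installed, you could pass Corpus(partition="train") instead.

```python
import random


def split_train_dev(docs, dev_fraction=0.1, seed=0):
    """Split documents into (train, dev) lists with a reproducible shuffle.

    Hypothetical helper: splitting by whole document keeps all sentences
    of a document on the same side of the split.
    """
    docs = list(docs)
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    rng.shuffle(docs)
    n_dev = max(1, int(len(docs) * dev_fraction))
    return docs[n_dev:], docs[:n_dev]


# Integers stand in for the 851 train documents.
train_docs, dev_docs = split_train_dev(range(851), dev_fraction=0.1, seed=0)
print(len(train_docs), len(dev_docs))
```

Fixing the seed keeps the dev set stable across runs, so scores from different experiments stay comparable. The test split is not touched at any point.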
Annotation format
Annotated files are text files that use the following in-text annotation format:
{error=>edit:::error_type=Tag}
where error and edit stand for the text item before and after correction, respectively, and Tag denotes an error category (Grammar, Spelling, Punctuation, or Fluency).
Example of an annotated text:
I {like=>likes:::error_type=Grammar} turtles.
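To make the format concrete, here is a minimal standalone sketch that reads it with a regular expression. It is not the ua_gec package's own implementation. The names ANNOTATION, to_source, to_target, and list_edits are hypothetical, and the pattern assumes annotations are not nested and that the annotated text contains no literal braces.

```python
import re

# Matches {error=>edit:::error_type=Tag}.
# Assumption: no nested annotations and no literal braces in the text.
ANNOTATION = re.compile(
    r"\{(?P<error>[^{}]*?)=>(?P<edit>[^{}]*?):::error_type=(?P<tag>\w+)\}"
)


def to_source(text):
    """Return the text before correction (keep the 'error' side)."""
    return ANNOTATION.sub(lambda m: m.group("error"), text)


def to_target(text):
    """Return the text after correction (keep the 'edit' side)."""
    return ANNOTATION.sub(lambda m: m.group("edit"), text)


def list_edits(text):
    """Return (error, edit, tag) triples in order of appearance."""
    return [
        (m.group("error"), m.group("edit"), m.group("tag"))
        for m in ANNOTATION.finditer(text)
    ]


example = "I {like=>likes:::error_type=Grammar} turtles."
print(to_source(example))
print(to_target(example))
print(list_edits(example))
```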
An accompanying Python package, ua_gec, provides many tools for working with annotated texts. See its documentation for details.
Statistics
UA-GEC contains:
Split | Documents | Sentences | Tokens | Authors |
---|---|---|---|---|
train | 851 | 18,225 | 285,247 | 416 |
test | 160 | 2,490 | 43,432 | 76 |
TOTAL | 1,011 | 20,715 | 328,779 | 492 |
The corpus statistics can be generated by running a script from the Python package (note that the ua-gec package must be installed first):
$ python ./python/ua_gec/stats.py
Contributing
- The data collection is an ongoing activity. You can always contribute your Ukrainian writings or complete one of the writing tasks at https://ua-gec-dataset.grammarly.ai/
- Code improvements and documentation are welcome. Please open a pull request.
Contacts