UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian language

This repository contains UA-GEC data and an accompanying Python library.

What's new

November 2022: Version 2.0 released, featuring more data and detailed annotations.
January 2021: Initial release.

See CHANGELOG.md for detailed updates.

Data

All corpus data and metadata are stored under ./data. It has two subfolders, one for each corpus version: gec-fluency and gec-only.

Each corpus version contains two subfolders, train and test, one per split. Each split stores the data in several representations:

./data/{gec-fluency,gec-only}/{train,test}/annotated stores documents in the annotated format

./data/{gec-fluency,gec-only}/{train,test}/source and ./data/{gec-fluency,gec-only}/{train,test}/target store the original and the corrected versions of documents. Text files in these directories are plain text with no annotation markup. These files were produced from the annotated data and are therefore redundant. We keep them because this format is convenient in some use cases.
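The layout above maps directly onto path construction. The following is a minimal sketch, assuming the repository has been checked out locally. The helper name `split_dir` and its `root` argument are illustrative and are not part of the ua_gec package.

```python
from pathlib import Path

VERSIONS = ("gec-fluency", "gec-only")
SPLITS = ("train", "test")
REPRESENTATIONS = ("annotated", "source", "target")


def split_dir(root, version, split, representation):
    """Build <root>/data/{version}/{split}/{representation}.

    Hypothetical helper: it only assembles the path described above
    and does not check that the directory exists.
    """
    if version not in VERSIONS:
        raise ValueError(f"unknown corpus version: {version!r}")
    if split not in SPLITS:
        raise ValueError(f"unknown split: {split!r}")
    if representation not in REPRESENTATIONS:
        raise ValueError(f"unknown representation: {representation!r}")
    return Path(root) / "data" / version / split / representation


# Directory holding the corrected test documents of the GEC-only version.
target_dir = split_dir(".", "gec-only", "test", "target")
print(target_dir.as_posix())
```

Validating the three components up front turns a typo such as "gec_only" into an immediate error instead of a silently empty directory listing.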

Metadata

./data/metadata.csv stores per-document metadata. It's a CSV file with the following fields:

id (str): document identifier;
author_id (str): document author identifier;
is_native (int): 1 if the author is a native speaker, 0 otherwise;
region (str): the author's region of birth. A special value "Інше" is used both for authors who were born outside Ukraine and authors who preferred not to specify their region.
gender (str): could be "Жіноча" (female), "Чоловіча" (male), or "Інша" (other);
occupation (str): one of "Технічна", "Гуманітарна", "Природнича", "Інша";
submission_type (str): one of "essay", "translation", or "text_donation";
source_language (str): for submissions of the "translation" type, this field indicates the source language of the translated text. Possible values are "de", "en", "fr", "ru", and "pl";
annotator_id (int): ID of the annotator who corrected the document;
partition (str): one of "test" or "train";
is_sensitive (int): 1 if the document contains profanity or offensive language.
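As a rough illustration of how these fields can be consumed, the sketch below reads metadata rows with the standard csv module and converts the fields declared as int. The two sample rows are made up, and the column order and the empty source_language for non-translations are assumptions; only the field names come from the list above.

```python
import csv
import io

# Two made-up rows that follow the field list above; not real corpus data.
SAMPLE = """id,author_id,is_native,region,gender,occupation,submission_type,source_language,annotator_id,partition,is_sensitive
0001,a1,1,Київська,Жіноча,Технічна,essay,,1,train,0
0002,a2,0,Інше,Чоловіча,Інша,translation,en,2,test,1
"""

INT_FIELDS = ("is_native", "annotator_id", "is_sensitive")


def load_metadata(fileobj):
    """Read metadata rows and convert the fields declared as int."""
    rows = []
    for row in csv.DictReader(fileobj):
        for field in INT_FIELDS:
            row[field] = int(row[field])
        rows.append(row)
    return rows


rows = load_metadata(io.StringIO(SAMPLE))
# Keep only training documents that are not flagged as sensitive.
clean_train = [r for r in rows if r["partition"] == "train" and not r["is_sensitive"]]
print([r["id"] for r in clean_train])
```

To read the real file, pass `open("data/metadata.csv", encoding="utf-8", newline="")` instead of the in-memory sample.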

Annotation format

Annotated files are text files that use the following in-text annotation format: {error=>edit:::error_type=Tag}, where error and edit stand for a text item before and after correction respectively, and Tag denotes an error category and an error subcategory in case of Grammar- and Fluency-related errors.

Example of an annotated sentence:

    I {likes=>like:::error_type=G/Number} turtles.
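For readers who want to see the format taken apart without the library, here is a minimal regex-based sketch. It assumes annotations are not nested and that braces, "=>" and ":::" do not occur inside the error or the edit text. The function name `parse` is illustrative; the ua_gec package has its own parser.

```python
import re

# Matches {error=>edit:::error_type=Tag}.
# Assumption: no nesting, and no braces inside any of the three parts.
ANNOTATION = re.compile(
    r"\{(?P<error>[^{}]*?)=>(?P<edit>[^{}]*?):::error_type=(?P<tag>[^{}]*)\}"
)


def parse(annotated):
    """Return (source, target, edits) for one annotated string."""
    source, target, edits = [], [], []
    pos = 0
    for m in ANNOTATION.finditer(annotated):
        plain = annotated[pos:m.start()]   # unannotated text before this edit
        source += [plain, m["error"]]      # source keeps the original text
        target += [plain, m["edit"]]       # target takes the correction
        edits.append((m["error"], m["edit"], m["tag"]))
        pos = m.end()
    tail = annotated[pos:]
    return "".join(source + [tail]), "".join(target + [tail]), edits


source, target, edits = parse("I {likes=>like:::error_type=G/Number} turtles.")
print(source)  # I likes turtles.
print(target)  # I like turtles.
print(edits)   # [('likes', 'like', 'G/Number')]
```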

The corpus uses the following error types:

Spelling: spelling errors;
Punctuation: punctuation errors.

Grammar-related errors:

G/Case: incorrect usage of case of any notional part of speech;
G/Gender: incorrect usage of gender of any notional part of speech;
G/Number: incorrect usage of number of any notional part of speech;
G/Aspect: incorrect usage of verb aspect;
G/Tense: incorrect usage of verb tense;
G/VerbVoice: incorrect usage of verb voice;
G/PartVoice: incorrect usage of participle voice;
G/VerbAForm: incorrect usage of an analytical verb form;
G/Prep: incorrect preposition usage;
G/Participle: incorrect usage of participles;
G/UngrammaticalStructure: digression from syntactic norms;
G/Comparison: incorrect formation of comparison degrees of adjectives and adverbs;
G/Conjunction: incorrect usage of conjunctions;
G/Other: other grammatical errors.

Fluency-related errors:

F/Style: style errors;
F/Calque: word-for-word translation from other languages;
F/Collocation: unnatural collocations;
F/PoorFlow: unnatural sentence flow;
F/Repetition: repetition of words;
F/Other: other fluency errors.
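Because Grammar and Fluency tags carry a "G/" or "F/" prefix while Spelling and Punctuation have no subcategory, tags can be collapsed to their top-level category with a simple split. The sketch below is illustrative: the `category` helper and the sample tag list are not part of the corpus or the library.

```python
from collections import Counter

PREFIXES = {"G": "Grammar", "F": "Fluency"}


def category(tag):
    """Map a tag such as "G/Case" to its top-level category."""
    prefix, sep, _ = tag.partition("/")
    if sep:  # tag has a subcategory: "G/..." or "F/..."
        return PREFIXES.get(prefix, prefix)
    return tag  # "Spelling" and "Punctuation" stand on their own


# Made-up sequence of tags, for illustration only.
tags = ["G/Case", "F/Style", "Spelling", "G/Number", "Punctuation", "F/Calque", "G/Case"]
counts = Counter(category(t) for t in tags)
print(counts["Grammar"], counts["Fluency"], counts["Spelling"], counts["Punctuation"])  # 3 2 1 1
```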

An accompanying Python package, ua_gec, provides many tools for working with annotated texts. See its documentation for details.

Train-test split

We expect users of the corpus to train and tune their models on the train split only. Feel free to further split it into train-dev (or use cross-validation).

Please use the test split only for reporting scores of your final model. In particular, never optimize on the test set. Do not tune hyperparameters on it. Do not use it for model selection in any way.
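One possible way to carve a dev set out of the train split is sketched below. Splitting by author, so that no author appears on both sides, is a design choice of this sketch and not something the corpus prescribes. The function names, the 10% default, and the sample IDs are all illustrative.

```python
import hashlib


def is_dev(author_id, dev_percent=10):
    """Deterministically assign an author to dev based on a hash of the ID."""
    digest = hashlib.sha256(author_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < dev_percent


def split_train_dev(docs, dev_percent=10):
    """Split (doc_id, author_id) pairs so that no author lands in both parts."""
    train, dev = [], []
    for doc_id, author_id in docs:
        (dev if is_dev(author_id, dev_percent) else train).append(doc_id)
    return train, dev


# Made-up document and author IDs from the train partition.
docs = [("0001", "a1"), ("0002", "a1"), ("0003", "a2"), ("0004", "a3")]
train_ids, dev_ids = split_train_dev(docs)
print(len(train_ids), len(dev_ids))
```

Hashing the author ID instead of shuffling keeps the assignment stable across runs and across corpus updates, so a document never migrates between train and dev. The resulting dev share is only approximately `dev_percent`, since it depends on how many documents each author contributed.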

The next section lists the per-split statistics.

Statistics

UA-GEC contains:

GEC+Fluency

Split	Documents	Sentences	Tokens	Authors	Errors
train	1,706	31,038	457,017	752	38,213
test	166	2,697	43,601	76	7,858
TOTAL	1,872	33,735	500,618	828	46,071

See stats.gec-fluency.txt for detailed statistics.

GEC-only

Split	Documents	Sentences	Tokens	Authors	Errors
train	1,706	31,046	457,004	752	30,049
test	166	2,704	43,605	76	6,169
TOTAL	1,872	33,750	500,609	828	36,218

See stats.gec-only.txt for detailed statistics.
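The figures in the two tables can be cross-checked and turned into an error density with a few lines of arithmetic. The sketch below copies the numbers from the tables above; the helper names are illustrative.

```python
# (documents, sentences, tokens, authors, errors) per split, copied from the tables above.
STATS = {
    "gec-fluency": {
        "train": (1706, 31038, 457017, 752, 38213),
        "test": (166, 2697, 43601, 76, 7858),
    },
    "gec-only": {
        "train": (1706, 31046, 457004, 752, 30049),
        "test": (166, 2704, 43605, 76, 6169),
    },
}


def totals(version):
    """Column-wise sum of the train and test rows."""
    rows = STATS[version]
    return tuple(a + b for a, b in zip(rows["train"], rows["test"]))


def errors_per_100_tokens(version, split):
    """Error density for one split: errors / tokens * 100."""
    _, _, tokens, _, errors = STATS[version][split]
    return 100 * errors / tokens


for version in STATS:
    print(version, totals(version))
    for split in ("train", "test"):
        density = errors_per_100_tokens(version, split)
        print(f"  {split}: {density:.2f} errors per 100 tokens")
```

The column sums reproduce the TOTAL rows of both tables.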

Python library

As an alternative to working with the data files directly, you can use a Python package called ua_gec. This package includes the data and has classes to iterate over documents, read metadata, work with annotations, etc.

Getting started

The package can be installed with pip:

    $ pip install ua_gec

Alternatively, you can install it from the source code:

    $ cd python
    $ python setup.py develop

Iterating through corpus

Once installed, you may get annotated documents from the Python code:

    >>> from ua_gec import Corpus
    >>> corpus = Corpus(partition="train", annotation_layer="gec-only")
    >>> for doc in corpus:
    ...     print(doc.source)         # "I likes it."
    ...     print(doc.target)         # "I like it."
    ...     print(doc.annotated)      # <AnnotatedText("I {likes=>like} it.")>
    ...     print(doc.meta.region)    # "Київська"

Note that the doc.annotated property is of type AnnotatedText. This class is described in the next section.

Working with annotations

ua_gec.AnnotatedText is a class that provides tools for processing annotated texts. It can iterate over annotations, get annotation error type, remove some of the annotations, and more.

Here is an example to get you started. It will remove all F/Style annotations from a text:

    >>> from ua_gec import AnnotatedText
    >>> text = AnnotatedText("I {likes=>like:::error_type=G/Number} it.")
    >>> for ann in text.iter_annotations():
    ...     print(ann.source_text)       # likes
    ...     print(ann.top_suggestion)    # like
    ...     print(ann.meta)              # {'error_type': 'G/Number'}
    ...     if ann.meta["error_type"] == "F/Style":
    ...         text.remove(ann)         # or `text.apply(ann)`

Multiple annotators

Some documents are annotated by multiple annotators. Such documents share doc_id but differ in doc.meta.annotator_id.

Currently, test sets for gec-fluency and gec-only are annotated by two annotators. The train sets contain 45 double-annotated docs.
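To find documents that have more than one annotated version, it is enough to group documents by doc_id. The sketch below uses a small stand-in record instead of real corpus documents, so the `Doc` type, its flattened `annotator_id` field, and the sample IDs are assumptions made for illustration.

```python
from collections import defaultdict, namedtuple

# Stand-in for corpus documents; only the two fields mentioned above are modelled.
Doc = namedtuple("Doc", ["doc_id", "annotator_id"])


def group_by_doc_id(docs):
    """Map each doc_id to the annotator IDs of its annotated versions, in input order."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc.doc_id].append(doc.annotator_id)
    return dict(groups)


docs = [Doc("0001", 1), Doc("0002", 1), Doc("0001", 2)]  # made-up IDs
groups = group_by_doc_id(docs)
# Keep only documents annotated more than once.
multi = {doc_id: anns for doc_id, anns in groups.items() if len(anns) > 1}
print(multi)  # {'0001': [1, 2]}
```

Grouping like this is useful when scoring against the test set, where each document has two reference corrections.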

Contributing

Data and code improvements are welcome. Please submit a pull request.

Citation

The accompanying paper is:

@misc{syvokon2021uagec,
      title={UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language},
      author={Oleksiy Syvokon and Olena Nahorna},
      year={2021},
      eprint={2103.16997},
      archivePrefix={arXiv},
      primaryClass={cs.CL}}

Contacts

