A tool to easily load Grammatical Error Correction (GEC) datasets.
gec-datasets
This library downloads and loads Grammatical Error Correction datasets through a single interface.
Usage
from gec_datasets import GECDatasets
gec = GECDatasets(
base_path='datasets/'
)
conll14 = gec.load('conll14')
assert conll14.srcs is not None
assert conll14.refs is not None
# The number of sentences is 1312.
assert len(conll14.srcs) == 1312
# CoNLL-2014 contains two official references.
assert len(conll14.refs) == 2
# Each reference also contains 1312 sentences.
assert len(conll14.refs[0]) == 1312
assert len(conll14.refs[1]) == 1312
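The `srcs`/`refs` layout shown by the assertions above (one list of source sentences, plus one sentence list per reference set) can be iterated in parallel. A small sketch with toy sentences standing in for the real corpus, since the actual data requires a download:

```python
# Toy stand-ins for the .srcs / .refs attributes of a loaded dataset;
# the real values would come from gec.load('conll14').
srcs = ["She go to school .", "He have a pen ."]
refs = [
    ["She goes to school .", "He has a pen ."],  # ref0
    ["She went to school .", "He has a pen ."],  # ref1
]

# zip pairs sentence i of the source with sentence i of every reference.
pairs = list(zip(srcs, *refs))
for src, *alts in pairs:
    print(src, "->", " | ".join(alts))
```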
Datasets are stored under `base_path`.
The first time a dataset is loaded it is downloaded automatically; thereafter it is loaded from the saved files.
datasets/
├── conll14
│ ├── ref0.txt
│ ├── ref1.txt
│ └── src.txt
├── bea19-dev
│ ├── ref0.txt
│ └── src.txt
├── bea19-test
│ └── src.txt
...
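Given the layout above (a `src.txt` plus any number of `ref0.txt`, `ref1.txt`, ... files, one sentence per line), a minimal loader can be sketched as follows. This is a hypothetical re-implementation for illustration, not the library's actual code:

```python
import pathlib
import tempfile

def load_plain(dataset_dir):
    """Read src.txt and every ref*.txt from a dataset directory,
    returning (srcs, refs) as lists of sentences."""
    dataset_dir = pathlib.Path(dataset_dir)
    srcs = dataset_dir.joinpath("src.txt").read_text().splitlines()
    refs = [p.read_text().splitlines()
            for p in sorted(dataset_dir.glob("ref*.txt"))]
    return srcs, refs

# Demonstrate on a throwaway directory that mimics datasets/conll14/.
with tempfile.TemporaryDirectory() as tmp:
    d = pathlib.Path(tmp)
    d.joinpath("src.txt").write_text("a b c\nd e f\n")
    d.joinpath("ref0.txt").write_text("a b c .\nd e f .\n")
    d.joinpath("ref1.txt").write_text("a b !\nd e !\n")
    srcs, refs = load_plain(d)

print(len(srcs), len(refs))
```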
Supported datasets
Public datasets
| ID (`.load(ID)`) | Description |
|---|---|
| 'conll13' | CoNLL-2013 test set [Ng+ 2013]. |
| 'conll14' | CoNLL-2014 test set [Ng+ 2014]. |
| 'jfleg-test' | JFLEG test set [Napoles+ 2017]. |
| 'jfleg-dev' | JFLEG development set [Napoles+ 2017]. |
| 'fce-test' | FCE test set [Yannakoudakis+ 2011]. |
| 'fce-dev' | FCE development set [Yannakoudakis+ 2011]. |
| 'fce-train' | FCE training set [Yannakoudakis+ 2011]. |
| 'cweb-g-test' | CWEB-G test set [Flachs+ 2020]. |
| 'cweb-g-dev' | CWEB-G development set [Flachs+ 2020]. |
| 'cweb-s-test' | CWEB-S test set [Flachs+ 2020]. |
| 'cweb-s-dev' | CWEB-S development set [Flachs+ 2020]. |
| 'bea19-test' | BEA-2019 shared task test set [Bryant+ 2019]. |
| 'bea19-dev' | BEA-2019 shared task development set [Bryant+ 2019]. It contains only source sentences. |
| 'wi-locness-train' | W&I+LOCNESS training set [Yannakoudakis+ 2018]. |
Non-public datasets
| ID (`.load(ID)`) | Description |
|---|---|
| 'nucle-train' | NUCLE training set [Dahlmeier+ 2013]. |
| 'lang8-train' | Lang-8 training set [Mizumoto+ 2012] [Tajiri+ 2012]. |
nucle-train
- Request data from HERE.
- You will receive an email with release3.3.tar.bz2 attached.
- Run `mkdir <base_path>/nucle-train/` and put the data at `<base_path>/nucle-train/release3.3.tar.bz2`.
- You can now use the data with `.load("nucle-train")`. The archive will be extracted automatically.
lang8-train
- Request data from HERE.
- You will receive an email titled "[NAIST Lang-8 Corpus of Learner English for the 14th BEA Shared Task]".
- Run `mkdir <base_path>/lang8-train/` and put the data at `<base_path>/lang8-train/lang8.bea19.tar.gz`.
- You can now use the data with `.load("lang8-train")`. The archive will be extracted automatically.
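The "extracted automatically" step above can be sketched with the standard `tarfile` module. The archive and its member name here are throwaway stand-ins built on the fly, not the real release files:

```python
import pathlib
import tarfile
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    base = pathlib.Path(tmp) / "nucle-train"
    base.mkdir()

    # Build a tiny stand-in archive where release3.3.tar.bz2 would sit.
    member = base / "data.txt"
    member.write_text("hello\n")
    archive = base / "release3.3.tar.bz2"
    with tarfile.open(archive, "w:bz2") as tar:
        tar.add(member, arcname="release3.3/data.txt")
    member.unlink()  # keep only the archive, as after the email download

    # Extract it next to itself, as a loader could do on first use.
    with tarfile.open(archive, "r:bz2") as tar:
        tar.extractall(base)

    extracted = (base / "release3.3" / "data.txt").read_text()

print(extracted)
```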
Download files
Source Distribution
gec_datasets-0.1.0.tar.gz (30.3 MB)
Built Distribution
gec_datasets-0.1.0-py3-none-any.whl (6.1 kB)
File details
Details for the file gec_datasets-0.1.0.tar.gz.
File metadata
- Download URL: gec_datasets-0.1.0.tar.gz
- Upload date:
- Size: 30.3 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.5.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | cfab77ecf1481de904b33f48e89b9363ff9090cfc094955621deeb0651dcacd2 |
| MD5 | 3c4022ba98e888bc2751f8eb750f5273 |
| BLAKE2b-256 | 27077321892e3e4cdcbcbe8457677ae46969753be52df9cfa72460eaa249e245 |
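To check a downloaded file against a published digest, the standard `hashlib` module is enough. A sketch, demonstrated on a throwaway file; for the real sdist you would hash `gec_datasets-0.1.0.tar.gz` and compare against the SHA256 value above:

```python
import hashlib
import tempfile

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large downloads need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Demo on a small temporary file with known contents.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"hello world")
digest = sha256_of(f.name)
print(digest)
```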
File details
Details for the file gec_datasets-0.1.0-py3-none-any.whl.
File metadata
- Download URL: gec_datasets-0.1.0-py3-none-any.whl
- Upload date:
- Size: 6.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.5.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b67cdabf0de14e1b8419202bc09d9297323aa898e61c7aa83fbc54b251459107 |
| MD5 | 4f6c9713d003f20e248772dfcac2b60e |
| BLAKE2b-256 | f434132cabb4d4afb74f870aeb15d1688b745903bcf4e9533a105dcceca69e8f |