
A tool to easily load Grammatical Error Correction datasets.


gec-datasets

This library makes it easy to download and load Grammatical Error Correction (GEC) datasets.

Install

pip install gec-datasets

Usage

API

from gec_datasets import GECDatasets
gec = GECDatasets(
    base_path='gec_datasets_base/'
)
conll14 = gec.load('conll14')

assert conll14.srcs is not None
assert conll14.refs is not None
# The number of sentences is 1312.
assert len(conll14.srcs) == 1312
# CoNLL-2014 contains two official references.
assert len(conll14.refs) == 2
# Each reference also contains 1312 sentences.
assert len(conll14.refs[0]) == 1312
assert len(conll14.refs[1]) == 1312
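Given the srcs/refs structure above, downstream evaluation code typically pairs each source sentence with the corresponding sentence from every reference. A minimal sketch of that pairing with dummy data (not using the library itself):

```python
# Pair each source sentence with its references, following the
# srcs/refs layout described above (dummy data for illustration).
srcs = ["He go to school .", "She like apples ."]
refs = [
    ["He goes to school .", "She likes apples ."],  # reference 0
    ["He went to school .", "She likes apples ."],  # reference 1
]

# For each source sentence, collect sentence i from every reference.
paired = [
    (src, [ref[i] for ref in refs])
    for i, src in enumerate(srcs)
]

for src, ref_sents in paired:
    print(src, "->", ref_sents)
```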

The available dataset IDs can be listed with:

import gec_datasets
print(gec_datasets.available())

CLI

You can pass multiple IDs of the datasets you want to download via the --ids option.

gecdatasets-download --base_path "gec_datasets_base/" --ids conll14 bea19-dev

The available dataset IDs can be listed with:

gecdatasets-available

In both the API and the CLI, datasets are stored under the directory given by base_path.
A dataset is downloaded automatically the first time it is requested; after that, it is loaded from the saved files.

When you call gec.load('sample'), gec-datasets simply reads <base_path>/sample/{src.txt, ref0.txt, ...}.

gec_datasets_base/
├── conll14
│   ├── ref0.txt
│   ├── ref1.txt
│   └── src.txt
├── bea19-dev
│   ├── ref0.txt
│   └── src.txt
├── bea19-test
│   └── src.txt
...
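Because the layout is this simple, the lookup convention can be sketched in plain Python. The sketch below is an illustration of the <base_path>/<id>/{src.txt, ref0.txt, ...} convention, not the library's actual implementation, and the load_local helper name is invented for this example:

```python
from pathlib import Path

def load_local(base_path: str, dataset_id: str):
    """Read srcs/refs following the <base_path>/<id>/{src.txt, ref0.txt, ...} layout."""
    d = Path(base_path) / dataset_id
    srcs = (d / "src.txt").read_text().splitlines()
    # References are numbered ref0.txt, ref1.txt, ...; some test sets have none.
    # (Lexicographic sort is fine for fewer than ten references.)
    refs = [
        p.read_text().splitlines()
        for p in sorted(d.glob("ref[0-9]*.txt"))
    ]
    return srcs, refs

# Example: create a tiny 'sample' dataset on disk and load it back.
sample = Path("gec_datasets_base") / "sample"
sample.mkdir(parents=True, exist_ok=True)
(sample / "src.txt").write_text("He go to school .\n")
(sample / "ref0.txt").write_text("He goes to school .\n")

srcs, refs = load_local("gec_datasets_base", "sample")
```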

Supported datasets

Public datasets

| ID (.load(ID)) | Description |
| --- | --- |
| 'conll13' | CoNLL-2013 test set [Ng+ 2013]. |
| 'conll14' | CoNLL-2014 test set [Ng+ 2014]. |
| 'jfleg-test' | JFLEG test set [Napoles+ 2017]. |
| 'jfleg-dev' | JFLEG development set [Napoles+ 2017]. |
| 'fce-test' | FCE test set [Yannakoudakis+ 2011]. |
| 'fce-dev' | FCE development set [Yannakoudakis+ 2011]. |
| 'fce-train' | FCE training set [Yannakoudakis+ 2011]. |
| 'cweb-g-test' | CWEB-G test set [Flachs+ 2020]. |
| 'cweb-g-dev' | CWEB-G development set [Flachs+ 2020]. |
| 'cweb-s-test' | CWEB-S test set [Flachs+ 2020]. |
| 'cweb-s-dev' | CWEB-S development set [Flachs+ 2020]. |
| 'bea19-test' | BEA-2019 shared task test set [Bryant+ 2019]. It contains only source sentences. |
| 'bea19-dev' | BEA-2019 shared task development set [Bryant+ 2019]. |
| 'wi-locness-train' | W&I+LOCNESS training set [Yannakoudakis+ 2018]. |

The following datasets are synthetic:

| ID (.load(ID)) | Description |
| --- | --- |
| 'troy-1bw-train' | Synthetic data based on the One Billion Word Benchmark for distillation [Tarnavskyi+ 2022]. |
| 'troy-1bw-dev' | Another split of the synthetic data based on the One Billion Word Benchmark for distillation [Tarnavskyi+ 2022]. |
| 'troy-blogs-train' | Synthetic data based on the Blog Authorship Corpus for distillation [Tarnavskyi+ 2022]. |
| 'troy-blogs-dev' | Another split of the synthetic data based on the Blog Authorship Corpus for distillation [Tarnavskyi+ 2022]. |
| 'pie-synthetic-a1' | Synthetic data based on the One Billion Word Benchmark [Awasthi+ 2019]. You can also specify a2, a3, a4, and a5. The attachment describes how the synthetic errors are generated. |

Non-public datasets

| ID (.load(ID)) | Description |
| --- | --- |
| 'nucle-train' | NUCLE training set [Dahlmeier+ 2013]. |
| 'lang8-train' | Lang-8 training set [Mizumoto+ 2012] [Tajiri+ 2012]. |

nucle-train

  • Request data from HERE.
  • You will receive an email with release3.3.tar.bz2 attached.
  • Create <base_path>/nucle/ and place the received file at <base_path>/nucle/release3.3.tar.bz2.
  • You can now use the data with .load("nucle-train"). The data will be extracted automatically.

lang8-train

  • Request data from HERE.
  • You will receive an email titled "[NAIST Lang-8 Corpus of Learner English for the 14th BEA Shared Task]".
  • Create <base_path>/lang8/ and place the received file at <base_path>/lang8/lang8.bea19.tar.gz.
  • You can now use the data with .load("lang8-train"). The data will be extracted automatically.
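Since both corpora must be placed manually, it can help to verify that the archives sit where .load() expects them before calling it. A small sketch that checks the paths from the steps above (the missing_archives helper is invented for this example and is not part of the library):

```python
from pathlib import Path

# Expected archive locations for the non-public datasets, per the steps above.
EXPECTED_ARCHIVES = {
    "nucle-train": Path("nucle") / "release3.3.tar.bz2",
    "lang8-train": Path("lang8") / "lang8.bea19.tar.gz",
}

def missing_archives(base_path: str):
    """Return the dataset IDs whose manually placed archive is not yet present."""
    base = Path(base_path)
    return [
        dataset_id
        for dataset_id, rel in EXPECTED_ARCHIVES.items()
        if not (base / rel).is_file()
    ]

# Lists any datasets whose archive has not been placed yet.
print(missing_archives("gec_datasets_base"))
```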
