
A tool to easily load Grammatical Error Correction datasets.


gec-datasets

This library makes it easy to download and load Grammatical Error Correction (GEC) datasets.

Install

pip install gec-datasets

Usage

API

from gec_datasets import GECDatasets
gec = GECDatasets(
    base_path='gec_datasets_base/'
)
conll14 = gec.load('conll14')

assert conll14.srcs is not None
assert conll14.refs is not None
# The number of sentences is 1312.
assert len(conll14.srcs) == 1312
# CoNLL-2014 contains two official references.
assert len(conll14.refs) == 2
# Each reference also contains 1312 sentences.
assert len(conll14.refs[0]) == 1312
assert len(conll14.refs[1]) == 1312
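Given the srcs/refs structure above, downstream evaluation code typically pairs each source sentence with the corresponding sentence from every reference. A minimal sketch of that pairing with dummy data (not using the library itself):

```python
# Pair each source sentence with its references, following the
# srcs/refs layout described above (dummy data for illustration).
srcs = ["He go to school .", "She like apples ."]
refs = [
    ["He goes to school .", "She likes apples ."],  # reference 0
    ["He went to school .", "She likes apples ."],  # reference 1
]

# For each source sentence, collect sentence i from every reference.
paired = [
    (src, [ref[i] for ref in refs])
    for i, src in enumerate(srcs)
]

for src, ref_sents in paired:
    print(src, "->", ref_sents)
```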

The available dataset IDs can be listed with:

import gec_datasets
print(gec_datasets.available())

CLI

You can pass multiple IDs of the datasets you want to download via the --ids option.

gecdatasets-download --base_path "gec_datasets_base/" --ids conll14 bea19-dev

The available dataset IDs can be listed with:

gecdatasets-available

In both the API and the CLI, datasets are stored under the directory given by base_path.
A dataset is downloaded automatically the first time it is requested; after that, it is loaded from the saved files.

When you call gec.load('sample'), gec-datasets simply reads <base_path>/sample/{src.txt, ref0.txt, ...}.

gec_datasets_base/
├── conll14
│   ├── ref0.txt
│   ├── ref1.txt
│   └── src.txt
├── bea19-dev
│   ├── ref0.txt
│   └── src.txt
├── bea19-test
│   └── src.txt
...
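Because the layout is this simple, the lookup convention can be sketched in plain Python. The sketch below is an illustration of the <base_path>/<id>/{src.txt, ref0.txt, ...} convention, not the library's actual implementation, and the load_local helper name is invented for this example:

```python
from pathlib import Path

def load_local(base_path: str, dataset_id: str):
    """Read srcs/refs following the <base_path>/<id>/{src.txt, ref0.txt, ...} layout."""
    d = Path(base_path) / dataset_id
    srcs = (d / "src.txt").read_text().splitlines()
    # References are numbered ref0.txt, ref1.txt, ...; some test sets have none.
    # (Lexicographic sort is fine for fewer than ten references.)
    refs = [
        p.read_text().splitlines()
        for p in sorted(d.glob("ref[0-9]*.txt"))
    ]
    return srcs, refs

# Example: create a tiny 'sample' dataset on disk and load it back.
sample = Path("gec_datasets_base") / "sample"
sample.mkdir(parents=True, exist_ok=True)
(sample / "src.txt").write_text("He go to school .\n")
(sample / "ref0.txt").write_text("He goes to school .\n")

srcs, refs = load_local("gec_datasets_base", "sample")
```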

Supported datasets

Public datasets

| ID (.load(ID)) | Description |
| --- | --- |
| 'conll13' | CoNLL-2013 test set [Ng+ 2013]. |
| 'conll14' | CoNLL-2014 test set [Ng+ 2014]. |
| 'jfleg-test' | JFLEG test set [Napoles+ 2017]. |
| 'jfleg-dev' | JFLEG development set [Napoles+ 2017]. |
| 'fce-test' | FCE test set [Yannakoudakis+ 2011]. |
| 'fce-dev' | FCE development set [Yannakoudakis+ 2011]. |
| 'fce-train' | FCE training set [Yannakoudakis+ 2011]. |
| 'cweb-g-test' | CWEB-G test set [Flachs+ 2020]. |
| 'cweb-g-dev' | CWEB-G development set [Flachs+ 2020]. |
| 'cweb-s-test' | CWEB-S test set [Flachs+ 2020]. |
| 'cweb-s-dev' | CWEB-S development set [Flachs+ 2020]. |
| 'bea19-test' | BEA-2019 shared task test set [Bryant+ 2019]. It contains only source sentences. |
| 'bea19-dev' | BEA-2019 shared task development set [Bryant+ 2019]. |
| 'wi-locness-train' | W&I+LOCNESS training set [Yannakoudakis+ 2018]. |

The following datasets are synthetic:

| ID (.load(ID)) | Description |
| --- | --- |
| 'troy-1bw-train' | Synthetic data based on the One Billion Word Benchmark for distillation [Tarnavskyi+ 2022]. |
| 'troy-1bw-dev' | Another split of the synthetic data based on the One Billion Word Benchmark for distillation [Tarnavskyi+ 2022]. |
| 'troy-blogs-train' | Synthetic data based on the Blog Authorship Corpus for distillation [Tarnavskyi+ 2022]. |
| 'troy-blogs-dev' | Another split of the synthetic data based on the Blog Authorship Corpus for distillation [Tarnavskyi+ 2022]. |
| 'pie-synthetic-a1' | Synthetic data based on the One Billion Word Benchmark [Awasthi+ 2019]. You can also specify a2, a3, a4, and a5. The attachment describes how the synthetic errors are generated. |

Non-public datasets

| ID (.load(ID)) | Description |
| --- | --- |
| 'nucle-train' | NUCLE training set [Dahlmeier+ 2013]. |
| 'lang8-train' | Lang-8 training set [Mizumoto+ 2012] [Tajiri+ 2012]. |

nucle-train

  • Request data from HERE.
  • You will receive an email with release3.3.tar.bz2 attached.
  • Create <base_path>/nucle/ and place the received file at <base_path>/nucle/release3.3.tar.bz2.
  • You can now use the data with .load("nucle-train"). The data will be extracted automatically.

lang8-train

  • Request data from HERE.
  • You will receive an email titled "[NAIST Lang-8 Corpus of Learner English for the 14th BEA Shared Task]".
  • Create <base_path>/lang8/ and place the received file at <base_path>/lang8/lang8.bea19.tar.gz.
  • You can now use the data with .load("lang8-train"). The data will be extracted automatically.
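Since both corpora must be placed manually, it can help to verify that the archives sit where .load() expects them before calling it. A small sketch that checks the paths from the steps above (the missing_archives helper is invented for this example and is not part of the library):

```python
from pathlib import Path

# Expected archive locations for the non-public datasets, per the steps above.
EXPECTED_ARCHIVES = {
    "nucle-train": Path("nucle") / "release3.3.tar.bz2",
    "lang8-train": Path("lang8") / "lang8.bea19.tar.gz",
}

def missing_archives(base_path: str):
    """Return the dataset IDs whose manually placed archive is not yet present."""
    base = Path(base_path)
    return [
        dataset_id
        for dataset_id, rel in EXPECTED_ARCHIVES.items()
        if not (base / rel).is_file()
    ]

# Lists any datasets whose archive has not been placed yet.
print(missing_archives("gec_datasets_base"))
```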
