
A tool for easily loading Grammatical Error Correction datasets.


gec-datasets

This library provides easy loading of Grammatical Error Correction (GEC) datasets.

Usage

from gec_datasets import GECDatasets
gec = GECDatasets(
    base_path='datasets/'
)
conll14 = gec.load('conll14')

assert conll14.srcs is not None
assert conll14.refs is not None
# The number of sentences is 1312.
assert len(conll14.srcs) == 1312
# CoNLL-2014 contains two official references.
assert len(conll14.refs) == 2
# Each reference also contains 1312 sentences.
assert len(conll14.refs[0]) == 1312
assert len(conll14.refs[1]) == 1312

Datasets are stored under base_path.
The first time a dataset is requested it is downloaded automatically; thereafter it is loaded from the saved files.

datasets/
├── conll14
│   ├── ref0.txt
│   ├── ref1.txt
│   └── src.txt
├── bea19-dev
│   ├── ref0.txt
│   └── src.txt
├── bea19-test
│   └── src.txt
...
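Once a dataset has been downloaded, its files are plain text, one sentence per line, so they can also be read directly. The sketch below shows this under the layout above; the helper name read_cached() is hypothetical (not part of the gec-datasets API), and it is demonstrated on a tiny mock directory rather than a real download.

```python
from pathlib import Path

def read_cached(base_path: str, dataset: str):
    """Return (srcs, refs) from <base_path>/<dataset>/{src.txt, ref*.txt}."""
    d = Path(base_path) / dataset
    srcs = (d / "src.txt").read_text(encoding="utf-8").splitlines()
    refs = [
        p.read_text(encoding="utf-8").splitlines()
        for p in sorted(d.glob("ref*.txt"))  # ref0.txt, ref1.txt, ...
    ]
    return srcs, refs

# Tiny mock dataset standing in for a real download.
mock = Path("datasets/mock")
mock.mkdir(parents=True, exist_ok=True)
(mock / "src.txt").write_text("This are a sentence .\n", encoding="utf-8")
(mock / "ref0.txt").write_text("This is a sentence .\n", encoding="utf-8")

srcs, refs = read_cached("datasets", "mock")
assert len(srcs) == len(refs[0]) == 1
```

Because each reference file is parallel to src.txt, line i of every ref file is a correction of line i of the source, which is what the length assertions in the Usage section rely on.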

Supported datasets

Public datasets

ID (argument to .load())  Description
'conll13'                 CoNLL-2013 test set [Ng+ 2013].
'conll14'                 CoNLL-2014 test set [Ng+ 2014].
'jfleg-test'              JFLEG test set [Napoles+ 2017].
'jfleg-dev'               JFLEG development set [Napoles+ 2017].
'fce-test'                FCE test set [Yannakoudakis+ 2011].
'fce-dev'                 FCE development set [Yannakoudakis+ 2011].
'fce-train'               FCE training set [Yannakoudakis+ 2011].
'cweb-g-test'             CWEB-G test set [Flachs+ 2020].
'cweb-g-dev'              CWEB-G development set [Flachs+ 2020].
'cweb-s-test'             CWEB-S test set [Flachs+ 2020].
'cweb-s-dev'              CWEB-S development set [Flachs+ 2020].
'bea19-test'              BEA-2019 shared task test set [Bryant+ 2019]. Contains only source sentences.
'bea19-dev'               BEA-2019 shared task development set [Bryant+ 2019].
'wi-locness-train'        W&I+LOCNESS training set [Yannakoudakis+ 2018].

Non-public datasets

ID (argument to .load())  Description
'nucle-train'             NUCLE training set [Dahlmeier+ 2013].
'lang8-train'             Lang-8 training set [Mizumoto+ 2012] [Tajiri+ 2012].

nucle-train

  • Request the data from HERE.
  • You will receive an email with release3.3.tar.bz2 attached.
  • Create <base_path>/nucle-train/ and place the archive at <base_path>/nucle-train/release3.3.tar.bz2.
  • You can now use the data with .load("nucle-train"); the archive is extracted automatically.
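The manual placement step can be sketched in the shell as below. The base_path value and the archive location are assumptions to adjust to your setup; here the archive is simulated with touch so the sketch runs end to end, whereas in practice it is the file received by email. lang8-train follows the same pattern with its own directory and archive name.

```shell
# Paths are placeholders; adjust to your setup.
base_path=datasets

# Simulate the archive received by email so the sketch is runnable;
# in practice, use the real release3.3.tar.bz2 from the data request.
touch release3.3.tar.bz2

# Place the archive where .load("nucle-train") expects to find it.
mkdir -p "$base_path/nucle-train"
mv release3.3.tar.bz2 "$base_path/nucle-train/release3.3.tar.bz2"
ls "$base_path/nucle-train"
```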

lang8-train

  • Request the data from HERE.
  • You will receive an email titled "[NAIST Lang-8 Corpus of Learner English for the 14th BEA Shared Task]".
  • Create <base_path>/lang8-train/ and place the archive at <base_path>/lang8-train/lang8.bea19.tar.gz.
  • You can now use the data with .load("lang8-train"); the archive is extracted automatically.


