A tool to easily load Grammatical Error Correction (GEC) datasets.
gec-datasets
This library downloads and loads Grammatical Error Correction datasets through a single interface.
Usage
from gec_datasets import GECDatasets
gec = GECDatasets(
base_path='datasets/'
)
conll14 = gec.load('conll14')
assert conll14.srcs is not None
assert conll14.refs is not None
# The number of sentences is 1312.
assert len(conll14.srcs) == 1312
# CoNLL-2014 contains two official references.
assert len(conll14.refs) == 2
# Each reference also contains 1312 sentences.
assert len(conll14.refs[0]) == 1312
assert len(conll14.refs[1]) == 1312
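The `srcs`/`refs` layout shown by the assertions above (one list of source sentences, plus one sentence list per reference set) can be iterated in parallel. A small sketch with toy sentences standing in for the real corpus, since the actual data requires a download:

```python
# Toy stand-ins for the .srcs / .refs attributes of a loaded dataset;
# the real values would come from gec.load('conll14').
srcs = ["She go to school .", "He have a pen ."]
refs = [
    ["She goes to school .", "He has a pen ."],  # ref0
    ["She went to school .", "He has a pen ."],  # ref1
]

# zip pairs sentence i of the source with sentence i of every reference.
pairs = list(zip(srcs, *refs))
for src, *alts in pairs:
    print(src, "->", " | ".join(alts))
```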
Datasets are stored under `base_path`.
The first time a dataset is loaded it is downloaded automatically; thereafter it is loaded from the saved files.
datasets/
├── conll14
│ ├── ref0.txt
│ ├── ref1.txt
│ └── src.txt
├── bea19-dev
│ ├── ref0.txt
│ └── src.txt
├── bea19-test
│ └── src.txt
...
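Given the layout above (a `src.txt` plus any number of `ref0.txt`, `ref1.txt`, ... files, one sentence per line), a minimal loader can be sketched as follows. This is a hypothetical re-implementation for illustration, not the library's actual code:

```python
import pathlib
import tempfile

def load_plain(dataset_dir):
    """Read src.txt and every ref*.txt from a dataset directory,
    returning (srcs, refs) as lists of sentences."""
    dataset_dir = pathlib.Path(dataset_dir)
    srcs = dataset_dir.joinpath("src.txt").read_text().splitlines()
    refs = [p.read_text().splitlines()
            for p in sorted(dataset_dir.glob("ref*.txt"))]
    return srcs, refs

# Demonstrate on a throwaway directory that mimics datasets/conll14/.
with tempfile.TemporaryDirectory() as tmp:
    d = pathlib.Path(tmp)
    d.joinpath("src.txt").write_text("a b c\nd e f\n")
    d.joinpath("ref0.txt").write_text("a b c .\nd e f .\n")
    d.joinpath("ref1.txt").write_text("a b !\nd e !\n")
    srcs, refs = load_plain(d)

print(len(srcs), len(refs))
```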
Supported datasets
Public datasets
| ID (`.load(ID)`) | Description |
|---|---|
| 'conll13' | CoNLL-2013 test set [Ng+ 2013]. |
| 'conll14' | CoNLL-2014 test set [Ng+ 2014]. |
| 'jfleg-test' | JFLEG test set [Napoles+ 2017]. |
| 'jfleg-dev' | JFLEG development set [Napoles+ 2017]. |
| 'fce-test' | FCE test set [Yannakoudakis+ 2011]. |
| 'fce-dev' | FCE development set [Yannakoudakis+ 2011]. |
| 'fce-train' | FCE training set [Yannakoudakis+ 2011]. |
| 'cweb-g-test' | CWEB-G test set [Flachs+ 2020]. |
| 'cweb-g-dev' | CWEB-G development set [Flachs+ 2020]. |
| 'cweb-s-test' | CWEB-S test set [Flachs+ 2020]. |
| 'cweb-s-dev' | CWEB-S development set [Flachs+ 2020]. |
| 'bea19-test' | BEA-2019 shared task test set [Bryant+ 2019]. |
| 'bea19-dev' | BEA-2019 shared task development set [Bryant+ 2019]. It contains only source sentences. |
| 'wi-locness-train' | W&I+LOCNESS training set [Yannakoudakis+ 2018]. |
Non-public datasets
| ID (`.load(ID)`) | Description |
|---|---|
| 'nucle-train' | NUCLE training set [Dahlmeier+ 2013]. |
| 'lang8-train' | Lang-8 training set [Mizumoto+ 2012] [Tajiri+ 2012]. |
nucle-train
- Request data from HERE.
- You will receive an email with release3.3.tar.bz2 attached.
- Run `mkdir <base_path>/nucle-train/` and put the data at `<base_path>/nucle-train/release3.3.tar.bz2`.
- You can now use the data with `.load("nucle-train")`. The archive will be extracted automatically.
lang8-train
- Request data from HERE.
- You will receive an email titled "[NAIST Lang-8 Corpus of Learner English for the 14th BEA Shared Task]".
- Run `mkdir <base_path>/lang8-train/` and put the data at `<base_path>/lang8-train/lang8.bea19.tar.gz`.
- You can now use the data with `.load("lang8-train")`. The archive will be extracted automatically.
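The "extracted automatically" step above can be sketched with the standard `tarfile` module. The archive and its member name here are throwaway stand-ins built on the fly, not the real release files:

```python
import pathlib
import tarfile
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    base = pathlib.Path(tmp) / "nucle-train"
    base.mkdir()

    # Build a tiny stand-in archive where release3.3.tar.bz2 would sit.
    member = base / "data.txt"
    member.write_text("hello\n")
    archive = base / "release3.3.tar.bz2"
    with tarfile.open(archive, "w:bz2") as tar:
        tar.add(member, arcname="release3.3/data.txt")
    member.unlink()  # keep only the archive, as after the email download

    # Extract it next to itself, as a loader could do on first use.
    with tarfile.open(archive, "r:bz2") as tar:
        tar.extractall(base)

    extracted = (base / "release3.3" / "data.txt").read_text()

print(extracted)
```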
Download files
Source Distribution
gec_datasets-0.1.0.tar.gz (30.3 MB)
Built Distribution
gec_datasets-0.1.0-py3-none-any.whl (6.1 kB)
File details
Details for the file gec_datasets-0.1.0.tar.gz.
File metadata
- Download URL: gec_datasets-0.1.0.tar.gz
- Upload date:
- Size: 30.3 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.5.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | cfab77ecf1481de904b33f48e89b9363ff9090cfc094955621deeb0651dcacd2 |
| MD5 | 3c4022ba98e888bc2751f8eb750f5273 |
| BLAKE2b-256 | 27077321892e3e4cdcbcbe8457677ae46969753be52df9cfa72460eaa249e245 |
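To check a downloaded file against a published digest, the standard `hashlib` module is enough. A sketch, demonstrated on a throwaway file; for the real sdist you would hash `gec_datasets-0.1.0.tar.gz` and compare against the SHA256 value above:

```python
import hashlib
import tempfile

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large downloads need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Demo on a small temporary file with known contents.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"hello world")
digest = sha256_of(f.name)
print(digest)
```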
File details
Details for the file gec_datasets-0.1.0-py3-none-any.whl.
File metadata
- Download URL: gec_datasets-0.1.0-py3-none-any.whl
- Upload date:
- Size: 6.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.5.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b67cdabf0de14e1b8419202bc09d9297323aa898e61c7aa83fbc54b251459107 |
| MD5 | 4f6c9713d003f20e248772dfcac2b60e |
| BLAKE2b-256 | f434132cabb4d4afb74f870aeb15d1688b745903bcf4e9533a105dcceca69e8f |