A tool to easily load Grammatical Error Correction datasets.
Project description
gec-datasets
This library makes it easy to download and load Grammatical Error Correction datasets.
Install
pip install gec-datasets
Usage
API
from gec_datasets import GECDatasets
gec = GECDatasets(base_path='gec_datasets_base/')
conll14 = gec.load('conll14')
assert conll14.srcs is not None
assert conll14.refs is not None
# The number of sentences is 1312.
assert len(conll14.srcs) == 1312
# CoNLL-2014 contains two official references.
assert len(conll14.refs) == 2
# Each reference also contains 1312 sentences.
assert len(conll14.refs[0]) == 1312
assert len(conll14.refs[1]) == 1312
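Continuing the example above, the loaded fields can be inspected directly. This is a small sketch assuming srcs and each entry of refs are line-aligned lists of strings, which the asserts above suggest:
# Show the first source sentence next to each official reference.
print("src: ", conll14.srcs[0])
for i, ref in enumerate(conll14.refs):
    print(f"ref{i}:", ref[0])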
Available ids can be found by:
import gec_datasets
print(gec_datasets.available())
CLI
You can specify multiple dataset ids to download via the --ids option:
gecdatasets-download --base_path "gec_datasets_base/" --ids conll14 bea19-dev
Available ids can be found by:
gecdatasets-available
In both the API and the CLI, datasets are stored under base_path. A dataset is downloaded automatically the first time it is requested and loaded from the saved files thereafter. When you call gec.load('sample'), gec-datasets simply reads <base_path>/sample/{src.txt|ref0.txt|...}:
gec_datasets_base/
├── conll14
│   ├── ref0.txt
│   ├── ref1.txt
│   └── src.txt
├── bea19-dev
│   ├── ref0.txt
│   └── src.txt
├── bea19-test
│   └── src.txt
...
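Because load() only reads this layout, a custom dataset can be dropped in as well. The following is a minimal sketch, assuming load() accepts any id whose directory exists under base_path (L33's 'sample' example suggests this); the id 'my-data' and the sentences are invented for illustration:
from pathlib import Path
from gec_datasets import GECDatasets

base = Path('gec_datasets_base/')
(base / 'my-data').mkdir(parents=True, exist_ok=True)

# One sentence per line; ref0.txt must stay line-aligned with src.txt.
(base / 'my-data' / 'src.txt').write_text('He go to school .\n')
(base / 'my-data' / 'ref0.txt').write_text('He goes to school .\n')

gec = GECDatasets(base_path='gec_datasets_base/')
my_data = gec.load('my-data')
print(my_data.srcs, my_data.refs)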
Supported datasets
Public datasets
| ID (.load(ID)) | Description |
|---|---|
| 'conll13' | CoNLL-2013 test set [Ng+ 2013]. |
| 'conll14' | CoNLL-2014 test set [Ng+ 2014]. |
| 'jfleg-test' | JFLEG test set [Napoles+ 2017]. |
| 'jfleg-dev' | JFLEG development set [Napoles+ 2017]. |
| 'fce-test' | FCE test set [Yannakoudakis+ 2011]. |
| 'fce-dev' | FCE development set [Yannakoudakis+ 2011]. |
| 'fce-train' | FCE training set [Yannakoudakis+ 2011]. |
| 'cweb-g-test' | CWEB-G test set [Flachs+ 2020]. |
| 'cweb-g-dev' | CWEB-G development set [Flachs+ 2020]. |
| 'cweb-s-test' | CWEB-S test set [Flachs+ 2020]. |
| 'cweb-s-dev' | CWEB-S development set [Flachs+ 2020]. |
| 'bea19-test' | BEA-2019 shared task test set [Bryant+ 2019]. It contains only source sentences. |
| 'bea19-dev' | BEA-2019 shared task development set [Bryant+ 2019]. |
| 'wi-locness-train' | W&I+LOCNESS training set [Yannakoudakis+ 2018]. |
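For example, several of the public sets above can be fetched in one loop. This sketch uses only the calls shown earlier; the chosen ids are arbitrary examples:
from gec_datasets import GECDatasets

gec = GECDatasets(base_path='gec_datasets_base/')

# Each dataset is downloaded on first use and loaded from disk afterwards.
for dataset_id in ['conll13', 'conll14', 'jfleg-dev']:
    data = gec.load(dataset_id)
    print(dataset_id, len(data.srcs), 'sentences,', len(data.refs), 'reference set(s)')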
The following datasets are synthetic.
| ID (.load(ID)) | Description |
|---|---|
| 'troy-1bw-train' | Synthetic data based on the One Billion Words Benchmark for distillation [Tarnavskyi+ 2022]. |
| 'troy-1bw-dev' | Another split of the synthetic data based on the One Billion Words Benchmark for distillation [Tarnavskyi+ 2022]. |
| 'troy-blogs-train' | Synthetic data based on the Blog Authorship Corpus for distillation [Tarnavskyi+ 2022]. |
| 'troy-blogs-dev' | Another split of the synthetic data based on the Blog Authorship Corpus for distillation [Tarnavskyi+ 2022]. |
| 'pie-synthetic-a1' | Synthetic data based on the One Billion Words Benchmark [Awasthi+ 2019]. You can also specify a2, a3, a4, and a5. The linked attachment describes how the synthetic errors are made. |
Non-public datasets
| ID (.load(ID)) | Description |
|---|---|
| 'nucle-train' | NUCLE training set [Dahlmeier+ 2013]. |
| 'lang8-train' | Lang-8 training set [Mizumoto+ 2012] [Tajiri+ 2012]. |
nucle-train
- Request the data from HERE.
- You will receive an email with release3.3.tar.bz2 attached.
- mkdir <base_path>/nucle/ and put the data as <base_path>/nucle/release3.3.tar.bz2.
- You can now use the data with .load("nucle-train"). The archive will be extracted automatically.
lang8-train
- Request the data from HERE.
- You will receive an email titled "[NAIST Lang-8 Corpus of Learner English for the 14th BEA Shared Task]".
- mkdir <base_path>/lang8/ and put the data as <base_path>/lang8/lang8.bea19.tar.gz.
- You can now use the data with .load("lang8-train"). The archive will be extracted automatically.
Project details
Download files
Download the file for your platform.
Source Distribution
gec_datasets-0.2.0.tar.gz (9.3 kB)
Built Distribution
gec_datasets-0.2.0-py3-none-any.whl (15.9 kB)
File details
Details for the file gec_datasets-0.2.0.tar.gz.
File metadata
- Download URL: gec_datasets-0.2.0.tar.gz
- Upload date:
- Size: 9.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.5.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6b173d56bf8066b70f56af7a98e8985bf8975510f8b82bc3e307dc7780ca71e7 |
| MD5 | 3d9bd722dce28c348d9d2fb19a8b0b9d |
| BLAKE2b-256 | 255f31d3604928bcc79f79204019034a95344e1dddf1ab2485598034498dac19 |
File details
Details for the file gec_datasets-0.2.0-py3-none-any.whl.
File metadata
- Download URL: gec_datasets-0.2.0-py3-none-any.whl
- Upload date:
- Size: 15.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.5.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f9b22bc0c132b58db9ee109b4bf6116682814a8709dbbb5d3700da8055c0d035 |
| MD5 | c1c4a59eae0e085cfee95d257707673d |
| BLAKE2b-256 | 6a9f6923752e36023c6a1ee7a3c4a06b2d80777ed0ca7f3aa3aec14cd70ab90d |