---
pretty_name: pywsd-datasets
license: mit
task_categories:
- token-classification
language:
- en
tags:
- word-sense-disambiguation
- wsd
- wordnet
- oewn
- semcor
- semeval
- senseval
configs:
- config_name: en-senseval2-aw
  data_files:
  - split: test
    path: data/en-senseval2-aw/test.parquet
- config_name: en-senseval3-aw
  data_files:
  - split: test
    path: data/en-senseval3-aw/test.parquet
- config_name: en-semeval2007-aw
  data_files:
  - split: test
    path: data/en-semeval2007-aw/test.parquet
- config_name: en-semeval2013-aw
  data_files:
  - split: test
    path: data/en-semeval2013-aw/test.parquet
- config_name: en-semeval2015-aw
  data_files:
  - split: test
    path: data/en-semeval2015-aw/test.parquet
- config_name: en-semcor
  data_files:
  - split: train
    path: data/en-semcor/train.parquet
- config_name: en-wngt
  data_files:
  - split: train
    path: data/en-wngt/train.parquet
- config_name: en-masc
  data_files:
  - split: train
    path: data/en-masc/train.parquet
- config_name: en-senseval2_ls
  data_files:
  - split: train
    path: data/en-senseval2_ls/train.parquet
  - split: test
    path: data/en-senseval2_ls/test.parquet
- config_name: en-senseval3_ls
  data_files:
  - split: train
    path: data/en-senseval3_ls/train.parquet
  - split: test
    path: data/en-senseval3_ls/test.parquet
- config_name: en-semeval2007_t17_ls
  data_files:
  - split: test
    path: data/en-semeval2007_t17_ls/test.parquet
---
# pywsd-datasets

Unified Word Sense Disambiguation benchmark datasets, normalized to modern `wn` lexicon sense IDs (`oewn:2024` for English, OMW for other languages). Companion to pywsd ≥ 1.3.0.
## What's shipped (v0.2)

English, test-only Raganato all-words benchmarks:

| Config | Instances | OEWN 2024 coverage |
|---|---|---|
| en-senseval2-aw | 2,282 | 99.43 % |
| en-senseval3-aw | 1,850 | 99.51 % |
| en-semeval2007-aw | 455 | 99.78 % |
| en-semeval2013-aw | 1,644 | 100.00 % |
| en-semeval2015-aw | 1,022 | 99.32 % |
English training corpora (via UFSAC v2.1):

| Config | Splits | Notes |
|---|---|---|
| en-semcor | train | see coverage_report |
| en-wngt | train | see coverage_report |
| en-masc | train | see coverage_report |
| en-senseval2_ls | train + test | lexical-sample |
| en-senseval3_ls | train + test | lexical-sample |
| en-semeval2007_t17_ls | test | lexical-sample |
Run `python -m pywsd_datasets.scripts.coverage_report` locally to get up-to-date OEWN resolution rates after rebuilding.
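For orientation, the coverage figure is simply the share of rows whose `sense_ids_wordnet` list is non-empty. A minimal stand-alone sketch of that computation, using toy rows in place of the real parquet files (this is not the packaged script):

```python
# Coverage = fraction of instances whose PWN 3.0 sense key resolved to at
# least one OEWN 2024 sense ID. Rows below are toy examples following the
# WSDInstance schema; the real script iterates the built parquet files.
def oewn_coverage(rows):
    resolved = sum(1 for r in rows if r["sense_ids_wordnet"])
    return resolved / len(rows) if rows else 0.0

rows = [
    {"instance_id": "d000.s000.t000", "sense_ids_wordnet": ["oewn-05646832-n"]},
    {"instance_id": "d000.s000.t001", "sense_ids_wordnet": []},  # unresolved key
    {"instance_id": "d000.s001.t000", "sense_ids_wordnet": ["oewn-00002684-n"]},
]
print(f"OEWN coverage: {oewn_coverage(rows):.2%}")  # → OEWN coverage: 66.67%
```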
## Install

```bash
pip install pywsd-datasets
```
## Use via HuggingFace datasets

```python
from datasets import load_dataset

# Raganato all-words evaluation set
ds = load_dataset("alvations/pywsd-datasets", "en-senseval2-aw")

# SemCor training data
ds = load_dataset("alvations/pywsd-datasets", "en-semcor")

ds["test"][0] if "test" in ds else ds["train"][0]
# {'instance_id': 'd000.s000.t000', 'dataset': 'senseval2_aw',
#  'split': 'test', 'lang': 'en',
#  'tokens': ['The', 'art', 'of', 'change-ringing', ...],
#  'target_idx': 1, 'target_lemma': 'art', 'target_pos': 'n',
#  'source_sense_id': 'art%1:09:00::',
#  'source_sense_system': 'pwn_sensekey_3.0',
#  'sense_ids_wordnet': ['oewn-05646832-n'],
#  'wordnet_lexicon': 'oewn:2024', ...}
```
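Once a row is loaded, the target token and its context fall straight out of `tokens` and `target_idx`. A small sketch on a hand-written toy row with the fields shown above (the helper name and window size are illustrative, not part of the package):

```python
# Toy row mirroring the schema printed above; with `datasets` installed you
# would pass a real row from load_dataset(...) instead.
row = {
    "tokens": ["The", "art", "of", "change-ringing", "is", "peculiar"],
    "target_idx": 1,
    "target_lemma": "art",
    "target_pos": "n",
}

def target_with_context(row, window=3):
    """Return the target token and a window of surrounding tokens."""
    i = row["target_idx"]
    lo, hi = max(0, i - window), i + window + 1
    return row["tokens"][i], row["tokens"][lo:hi]

word, ctx = target_with_context(row)
print(word, ctx)  # → art ['The', 'art', 'of', 'change-ringing', 'is']
```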
## Use via the loader package

```python
from pywsd_datasets.loaders.raganato import iter_instances as iter_raganato
from pywsd_datasets.loaders.ufsac import iter_instances as iter_ufsac

for inst in iter_raganato("senseval2"):
    print(inst.target_lemma, inst.sense_ids_wordnet)

for inst in iter_ufsac("semcor", "/path/to/ufsac-public-2.1"):
    print(inst.target_lemma, inst.sense_ids_wordnet)
```
## Rebuild locally

```bash
pip install pywsd-datasets[dev]

# Raganato only (always works, ~1 MB fetch from our GH release mirror)
python -m pywsd_datasets.scripts.build_all

# With UFSAC corpora — download ufsac-public-2.1 separately (see below)
python -m pywsd_datasets.scripts.build_all \
    --ufsac-root ~/.cache/pywsd-datasets/ufsac/ufsac-public-2.1

# Coverage report across every built parquet
python -m pywsd_datasets.scripts.coverage_report
```
## UFSAC download

UFSAC v2.1 is distributed as a single Google Drive tarball (`ufsac-public-2.1.tar.xz`, ~196 MB). Fetch it with gdown:

```bash
pip install gdown
mkdir -p ~/.cache/pywsd-datasets/ufsac
gdown 'https://drive.google.com/uc?id=1kwBMIDBTf6heRno9bdLvF-DahSLHIZyV' \
    -O ~/.cache/pywsd-datasets/ufsac/ufsac-public-2.1.tar.xz
cd ~/.cache/pywsd-datasets/ufsac && tar -xf ufsac-public-2.1.tar.xz
```
## Schema

Every row follows `WSDInstance`:

```
instance_id, dataset, split, task, lang,
tokens[], pos_tags[], lemmas[],
target_idx, target_lemma, target_pos,
source_sense_id, source_sense_system,
sense_ids_wordnet[], wordnet_lexicon,
doc_id, sent_id
```

`sense_ids_wordnet` is list-valued to handle multi-gold instances and any PWN 3.0 key that splits into multiple OEWN 2024 synsets.
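One common scoring convention under this schema counts a prediction correct if it matches any gold sense, and leaves rows with an empty gold list out of the denominator. A minimal sketch on toy instances (the skip-unresolved policy is an assumption for illustration, not something the package prescribes):

```python
# Score predictions against list-valued gold labels: a hit on ANY gold sense
# counts, and instances whose PWN 3.0 key never resolved (empty list) are
# excluded from the denominator rather than counted as errors.
def score(instances, predictions):
    correct = attempted = 0
    for inst, pred in zip(instances, predictions):
        gold = inst["sense_ids_wordnet"]
        if not gold:          # unresolved key: skipped
            continue
        attempted += 1
        correct += pred in gold
    return correct / attempted if attempted else 0.0

instances = [
    {"sense_ids_wordnet": ["oewn-05646832-n"]},
    {"sense_ids_wordnet": ["oewn-00033020-n", "oewn-00033615-n"]},  # multi-gold
    {"sense_ids_wordnet": []},  # unresolved, skipped
]
preds = ["oewn-05646832-n", "oewn-00033615-n", "oewn-99999999-n"]
print(score(instances, preds))  # → 1.0
```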
## Multilingual / XL-WSD / BabelNet — deferred

`loaders/xl_wsd.py` exists as a stub and raises `NotImplementedError`; `mappers/babelnet_to_wn.py` is similarly unused. Why:

- XL-WSD uses BabelNet synset IDs as gold labels; resolving them to modern wn lexicon IDs requires the BabelNet → PWN 3.0 bridge file, which is distributed only with a BabelNet academic license.
- XL-WSD itself is CC-BY-NC 4.0 — we don't redistribute the data.

Reviving this path requires (a) a BabelNet license, (b) loading `bn_to_wn.txt` via `babelnet_to_wn.load_bn_to_pwn3_map()`, (c) selecting per-language OMW lexicons via `mappers.omw_lookup.lexicon_for(lang)`, then (d) chaining through `pwn3_to_oewn.pwn3_sensekey_to_wn(key, lexicon=...)`. All four pieces are in place; wiring them is blocked on the BabelNet mapping file. See the module docstrings for details.
## Roadmap
- v0.2 (this release): Raganato all-words evaluation + UFSAC training corpora (SemCor, WNGT, MASC, Senseval lexical-sample).
- v0.3 (planned): WiC (CC-BY-NC — loader-only), CoarseWSD-20.
- Deferred: XL-WSD multilingual (needs BabelNet academic license).
## Citation

If you use these datasets, please cite the original sources:
- Raganato, Camacho-Collados, Navigli (2017). Word Sense Disambiguation: A Unified Evaluation Framework and Empirical Comparison. EACL.
- Vial, Lecouteux, Schwab (2018). UFSAC: Unification of Sense Annotated Corpora and Tools. LREC.
- Plus the specific evaluation or training set paper (Senseval-2 / 3, SemEval-2007 T17, SemEval-2013 T12, SemEval-2015 T13, SemCor, WNGT/Princeton Gloss Corpus, MASC).
## License

MIT for the code. Each dataset keeps its original license — see the source papers. The Raganato bundle and SemEval shared-task data are research-unrestricted; UFSAC is MIT.
## Sense-ID mapping details

PWN 3.0 sense keys are resolved against OEWN 2024 via `wn.compat.sensekey`. The few percent of keys that fail to resolve are typically WN 3.0 synsets that OEWN split, merged, or removed — those rows ship with an empty `sense_ids_wordnet` list so the coverage report can flag them. Background:

- Kafe (2023). Mapping Wordnets on the Fly with Permanent Sense Keys. arXiv:2303.01847.
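For context, a sense key such as `art%1:09:00::` is self-describing: the lemma, synset type, and lexicographer file are all encoded in the string, which is what makes cross-version resolution workable at all. A minimal parser sketch (actual resolution against OEWN 2024 goes through `wn.compat.sensekey`; this only unpacks the key's fields):

```python
# WordNet sense keys have the form lemma%ss_type:lex_filenum:lex_id:head:head_id,
# where ss_type 1=noun, 2=verb, 3=adjective, 4=adverb, 5=adjective satellite.
SS_TYPE_TO_POS = {"1": "n", "2": "v", "3": "a", "4": "r", "5": "s"}

def parse_sense_key(key):
    lemma, lex_sense = key.split("%", 1)
    fields = lex_sense.split(":")
    return {"lemma": lemma,
            "pos": SS_TYPE_TO_POS[fields[0]],
            "lex_filenum": fields[1],
            "lex_id": fields[2]}

print(parse_sense_key("art%1:09:00::"))
# → {'lemma': 'art', 'pos': 'n', 'lex_filenum': '09', 'lex_id': '00'}
```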
## Known issues

- The upstream Raganato zip at http://lcl.uniroma1.it/wsdeval/ serves a mismatched TLS cert; our loader prefers the mirror on this repo's GitHub release assets and falls back to the original URL over HTTP.
- UFSAC v2.1 is distributed as a Google Drive tarball; the loader assumes you have it unpacked locally. A future release may mirror it.