HuggingFace dataset-card front matter:

```yaml
---
pretty_name: pywsd-datasets
license: mit
task_categories:
  - token-classification
language:
  - en
tags:
  - word-sense-disambiguation
  - wsd
  - wordnet
  - oewn
  - semcor
  - semeval
  - senseval
configs:
  - config_name: en-senseval2-aw
    data_files:
      - split: test
        path: data/en-senseval2-aw/test.parquet
  - config_name: en-senseval3-aw
    data_files:
      - split: test
        path: data/en-senseval3-aw/test.parquet
  - config_name: en-semeval2007-aw
    data_files:
      - split: test
        path: data/en-semeval2007-aw/test.parquet
  - config_name: en-semeval2013-aw
    data_files:
      - split: test
        path: data/en-semeval2013-aw/test.parquet
  - config_name: en-semeval2015-aw
    data_files:
      - split: test
        path: data/en-semeval2015-aw/test.parquet
  - config_name: en-semcor
    data_files:
      - split: train
        path: data/en-semcor/train.parquet
  - config_name: en-wngt
    data_files:
      - split: train
        path: data/en-wngt/train.parquet
  - config_name: en-masc
    data_files:
      - split: train
        path: data/en-masc/train.parquet
  - config_name: en-senseval2_ls
    data_files:
      - split: train
        path: data/en-senseval2_ls/train.parquet
      - split: test
        path: data/en-senseval2_ls/test.parquet
  - config_name: en-senseval3_ls
    data_files:
      - split: train
        path: data/en-senseval3_ls/train.parquet
      - split: test
        path: data/en-senseval3_ls/test.parquet
  - config_name: en-semeval2007_t17_ls
    data_files:
      - split: test
        path: data/en-semeval2007_t17_ls/test.parquet
---
```

pywsd-datasets

Unified Word Sense Disambiguation benchmark datasets, normalized to modern wn lexicon sense IDs (oewn:2024 for English, OMW for other languages).

Companion to pywsd ≥ 1.3.0.

What's shipped (v0.2)

English test-only evaluation sets from the Raganato all-words benchmark:

| Config | Instances | OEWN 2024 coverage |
|---|---|---|
| en-senseval2-aw | 2,282 | 99.43 % |
| en-senseval3-aw | 1,850 | 99.51 % |
| en-semeval2007-aw | 455 | 99.78 % |
| en-semeval2013-aw | 1,644 | 100.00 % |
| en-semeval2015-aw | 1,022 | 99.32 % |

English, training corpora (via UFSAC v2.1):

| Config | Split | Notes |
|---|---|---|
| en-semcor | train | see coverage_report |
| en-wngt | train | see coverage_report |
| en-masc | train | see coverage_report |
| en-senseval2_ls | train + test | lexical-sample task |
| en-senseval3_ls | train + test | lexical-sample task |
| en-semeval2007_t17_ls | test | lexical-sample task |

Run python -m pywsd_datasets.scripts.coverage_report locally to get up-to-date OEWN resolution rates after rebuilding.
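The coverage percentages above, and the report script, both boil down to counting rows whose sense_ids_wordnet list is non-empty. A minimal re-implementation sketch over rows shaped like the dataset schema (illustrative only, not the script's actual code):

```python
from collections import defaultdict

def oewn_coverage(rows):
    """Per-dataset fraction of instances whose PWN 3.0 sense key
    resolved to at least one OEWN 2024 sense ID."""
    totals = defaultdict(int)
    resolved = defaultdict(int)
    for row in rows:
        totals[row["dataset"]] += 1
        if row["sense_ids_wordnet"]:  # empty list = unresolved key
            resolved[row["dataset"]] += 1
    return {name: resolved[name] / totals[name] for name in totals}
```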

Install

```shell
pip install pywsd-datasets
```

Use via HuggingFace datasets

```python
from datasets import load_dataset

# Raganato all-words evaluation set
ds = load_dataset("alvations/pywsd-datasets", "en-senseval2-aw")

# SemCor training data
ds = load_dataset("alvations/pywsd-datasets", "en-semcor")

ds["test"][0] if "test" in ds else ds["train"][0]
# {'instance_id': 'd000.s000.t000', 'dataset': 'senseval2_aw',
#  'split': 'test', 'lang': 'en',
#  'tokens': ['The', 'art', 'of', 'change-ringing', ...],
#  'target_idx': 1, 'target_lemma': 'art', 'target_pos': 'n',
#  'source_sense_id': 'art%1:09:00::',
#  'source_sense_system': 'pwn_sensekey_3.0',
#  'sense_ids_wordnet': ['oewn-05646832-n'],
#  'wordnet_lexicon': 'oewn:2024', ...}
```

Use via the loader package

```python
from pywsd_datasets.loaders.raganato import iter_instances as iter_raganato
from pywsd_datasets.loaders.ufsac import iter_instances as iter_ufsac

for inst in iter_raganato("senseval2"):
    print(inst.target_lemma, inst.sense_ids_wordnet)

for inst in iter_ufsac("semcor", "/path/to/ufsac-public-2.1"):
    print(inst.target_lemma, inst.sense_ids_wordnet)
```

Rebuild locally

```shell
pip install 'pywsd-datasets[dev]'   # quoted so zsh doesn't glob the brackets

# Raganato only (always works, ~1 MB fetch from our GH release mirror)
python -m pywsd_datasets.scripts.build_all

# With UFSAC corpora — download ufsac-public-2.1 separately (see below)
python -m pywsd_datasets.scripts.build_all \
    --ufsac-root ~/.cache/pywsd-datasets/ufsac/ufsac-public-2.1

# Coverage report across every built parquet:
python -m pywsd_datasets.scripts.coverage_report
```

UFSAC download

UFSAC v2.1 is distributed as a single Google Drive tarball (ufsac-public-2.1.tar.xz, ~196 MB). Fetch with gdown:

```shell
pip install gdown
mkdir -p ~/.cache/pywsd-datasets/ufsac
gdown 'https://drive.google.com/uc?id=1kwBMIDBTf6heRno9bdLvF-DahSLHIZyV' \
    -O ~/.cache/pywsd-datasets/ufsac/ufsac-public-2.1.tar.xz
cd ~/.cache/pywsd-datasets/ufsac && tar -xf ufsac-public-2.1.tar.xz
```

Schema

Every row follows WSDInstance:

```
instance_id, dataset, split, task, lang,
tokens[], pos_tags[], lemmas[],
target_idx, target_lemma, target_pos,
source_sense_id, source_sense_system,
sense_ids_wordnet[], wordnet_lexicon,
doc_id, sent_id
```

sense_ids_wordnet is list-valued to handle multi-gold instances and any PWN-3.0 key that splits into multiple OEWN 2024 synsets.
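Concretely, a scorer over this schema should credit a prediction that matches any gold ID and skip rows whose gold list is empty. A minimal sketch (field names come from the schema above; the metrics are the standard WSD precision/recall/F1, not this package's own scorer):

```python
def score(rows, predictions):
    """predictions: dict mapping instance_id -> predicted sense ID."""
    correct = attempted = scorable = 0
    for row in rows:
        gold = row["sense_ids_wordnet"]
        if not gold:                 # unresolved in OEWN 2024: skip
            continue
        scorable += 1
        pred = predictions.get(row["instance_id"])
        if pred is None:
            continue
        attempted += 1
        if pred in gold:             # any gold ID counts as correct
            correct += 1
    p = correct / attempted if attempted else 0.0
    r = correct / scorable if scorable else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return {"precision": p, "recall": r, "f1": f1}
```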

Multilingual / XL-WSD / BabelNet — deferred

loaders/xl_wsd.py exists as a stub and raises NotImplementedError. mappers/babelnet_to_wn.py is similarly unused. Why:

  • XL-WSD uses BabelNet synset IDs as gold labels; resolving them to modern wn lexicon IDs requires the BabelNet → PWN 3.0 bridge file, which is distributed only with a BabelNet academic license.
  • XL-WSD itself is CC-BY-NC 4.0 — we don't redistribute the data.

Reviving this path requires (a) a BabelNet license, (b) loading bn_to_wn.txt via babelnet_to_wn.load_bn_to_pwn3_map(), (c) selecting per-language OMW lexicons via mappers.omw_lookup.lexicon_for(lang), then (d) chaining through pwn3_to_oewn.pwn3_sensekey_to_wn(key, lexicon=...). All four pieces are in place — wiring them is blocked on the BabelNet mapping file. See the module docstrings for details.

Roadmap

  • v0.2 (this release): Raganato all-words evaluation + UFSAC training corpora (SemCor, WNGT, MASC, Senseval lexical-sample).
  • v0.3 (planned): WiC (CC-BY-NC — loader-only), CoarseWSD-20.
  • Deferred: XL-WSD multilingual (needs BabelNet academic license).

Citation

If you use these datasets, please cite the original sources:

  • Raganato, Camacho-Collados, Navigli (2017). Word Sense Disambiguation: A Unified Evaluation Framework and Empirical Comparison. EACL.
  • Vial, Lecouteux, Schwab (2018). UFSAC: Unification of Sense Annotated Corpora and Tools. LREC.
  • Plus the specific evaluation or training set paper (Senseval-2 / 3, SemEval-2007 T17, SemEval-2013 T12, SemEval-2015 T13, SemCor, WNGT/Princeton Gloss Corpus, MASC).

License

MIT for the code. Each dataset keeps its original license — see the source papers. Raganato bundle and SemEval shared-task data are research-unrestricted; UFSAC is MIT.

Sense-ID mapping details

PWN 3.0 sense keys are resolved against OEWN 2024 via wn.compat.sensekey. The small fraction of keys that fail to resolve (under 1 % on the evaluation sets above) are typically WN 3.0 synsets that OEWN split, merged, or removed; those rows ship with an empty sense_ids_wordnet list so the coverage report can flag them. Background:

  • Kafe (2023). Mapping Wordnets on the Fly with Permanent Sense Keys. arXiv:2303.01847.
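For readers unfamiliar with the source_sense_id format: PWN 3.0 sense keys follow the lemma%ss_type:lex_filenum:lex_id:head_word:head_id layout documented in senseidx(5WN). A small illustrative parser (not part of this package):

```python
_SS_TYPE_TO_POS = {"1": "n", "2": "v", "3": "a", "4": "r", "5": "s"}

def parse_sense_key(key):
    """Split a PWN 3.0 sense key into its named fields."""
    lemma, rest = key.split("%", 1)
    ss_type, lex_filenum, lex_id, head_word, head_id = rest.split(":")
    return {
        "lemma": lemma,
        "pos": _SS_TYPE_TO_POS[ss_type],   # "5" = adjective satellite
        "lex_filenum": lex_filenum,        # lexicographer file number
        "lex_id": lex_id,
        "head_word": head_word,            # non-empty only for satellites
        "head_id": head_id,
    }
```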

Known issues

  • The upstream Raganato zip at http://lcl.uniroma1.it/wsdeval/ serves a mismatched TLS cert; our loader prefers the mirror on this repo's GitHub release assets and falls back to the original URL over HTTP.
  • UFSAC v2.1 is distributed as a Google Drive tarball; the loader assumes you have it unpacked locally. A future release may mirror it.
