Unified Word Sense Disambiguation benchmark datasets, normalized to modern wn lexicon sense IDs (oewn:2024 and omw:*).

---
pretty_name: pywsd-datasets
license: mit
task_categories:
- token-classification
language:
- en
tags:
- word-sense-disambiguation
- wsd
- wordnet
- oewn
- semeval
- senseval
size_categories:
- 1K<n<10K
configs:
- config_name: en-senseval2-aw
  data_files:
  - split: test
    path: data/en-senseval2-aw/test.parquet
- config_name: en-senseval3-aw
  data_files:
  - split: test
    path: data/en-senseval3-aw/test.parquet
- config_name: en-semeval2007-aw
  data_files:
  - split: test
    path: data/en-semeval2007-aw/test.parquet
- config_name: en-semeval2013-aw
  data_files:
  - split: test
    path: data/en-semeval2013-aw/test.parquet
- config_name: en-semeval2015-aw
  data_files:
  - split: test
    path: data/en-semeval2015-aw/test.parquet
---

pywsd-datasets

Unified Word Sense Disambiguation benchmark datasets, normalized to modern wn lexicon sense IDs (oewn:2024 for English, OMW for other languages).

Companion to pywsd ≥ 1.3.0.

What's in v0.1 (English-only, expanding)

Config	Source	Instances	OEWN 2024 coverage
`en-senseval2-aw`	Raganato et al. 2017	2,282	99.43 %
`en-senseval3-aw`	Raganato et al. 2017	1,850	99.51 %
`en-semeval2007-aw`	Raganato et al. 2017	455	99.78 %
`en-semeval2013-aw`	Raganato et al. 2017	1,644	100.00 %
`en-semeval2015-aw`	Raganato et al. 2017	1,022	99.32 %
Total		7,253	99.59 %
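
The Total row is consistent with an instance-weighted mean of the per-config coverage figures. A minimal check, using only the numbers in the table above (the variable names are illustrative and not part of the package):

```python
# Instance counts and OEWN 2024 coverage (%) copied from the table above.
table = {
    "en-senseval2-aw": (2282, 99.43),
    "en-senseval3-aw": (1850, 99.51),
    "en-semeval2007-aw": (455, 99.78),
    "en-semeval2013-aw": (1644, 100.00),
    "en-semeval2015-aw": (1022, 99.32),
}

total_instances = sum(n for n, _ in table.values())
# Weight each config's coverage by its number of instances.
weighted_coverage = sum(n * pct for n, pct in table.values()) / total_instances

print(total_instances)              # 7253
print(round(weighted_coverage, 2))  # 99.59
```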

Install

pip install pywsd-datasets

Use via HuggingFace `datasets`

from datasets import load_dataset
ds = load_dataset("alvations/pywsd-datasets", "en-senseval2-aw")
ds["test"][0]
# {'instance_id': 'd000.s000.t000', 'dataset': 'senseval2_aw',
#  'split': 'test', 'lang': 'en',
#  'tokens': ['The', 'art', 'of', 'change-ringing', ...],
#  'target_idx': 1, 'target_lemma': 'art', 'target_pos': 'n',
#  'source_sense_id': 'art%1:09:00::',
#  'source_sense_system': 'pwn_sensekey_3.0',
#  'sense_ids_wordnet': ['oewn-05646832-n'],
#  'wordnet_lexicon': 'oewn:2024', ...}
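
Each row is a plain dict, so the target word can be recovered from tokens and target_idx. A small sketch, assuming target_idx is a 0-based index into tokens (this matches the sample row above, where index 1 is 'art'). highlight_target is a hypothetical helper, not part of the package, and the row below is a shortened copy of the sample with only the tokens shown above:

```python
def highlight_target(row: dict) -> str:
    """Return the sentence with the target token wrapped in [brackets]."""
    idx = row["target_idx"]
    return " ".join(
        f"[{tok}]" if i == idx else tok
        for i, tok in enumerate(row["tokens"])
    )

# Shortened copy of the sample row; the full token list is elided in the README.
row = {
    "tokens": ["The", "art", "of", "change-ringing"],
    "target_idx": 1,
    "target_lemma": "art",
    "sense_ids_wordnet": ["oewn-05646832-n"],
}

print(highlight_target(row))  # The [art] of change-ringing
```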

Use via the loader package

from pywsd_datasets.loaders.raganato import iter_instances
for inst in iter_instances("senseval2"):
    print(inst.target_lemma, inst.sense_ids_wordnet)

Rebuild locally

pip install "pywsd-datasets[dev]"                       # quoted so shells such as zsh do not expand the brackets
python -m pywsd_datasets.scripts.build_all              # parquet in data/
python -m pywsd_datasets.scripts.coverage_report        # OEWN 2024 resolution %
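
The resolution percentage can be read as the share of rows whose sense_ids_wordnet list is non-empty. The sketch below illustrates that reading on hand-made rows. It is an assumption about what the report measures, not the code of coverage_report, and coverage_pct is a hypothetical name:

```python
def coverage_pct(rows) -> float:
    """Percentage of rows with at least one resolved wordnet sense ID."""
    rows = list(rows)
    if not rows:
        return 0.0
    resolved = sum(1 for r in rows if r["sense_ids_wordnet"])
    return 100.0 * resolved / len(rows)

# Toy rows: three resolved, one unresolved (empty list). The IDs are made up.
toy_rows = [
    {"sense_ids_wordnet": ["oewn-00000001-n"]},
    {"sense_ids_wordnet": ["oewn-00000002-v", "oewn-00000003-v"]},
    {"sense_ids_wordnet": []},
    {"sense_ids_wordnet": ["oewn-00000004-a"]},
]

print(coverage_pct(toy_rows))  # 75.0
```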

Schema

Every row follows WSDInstance:

instance_id, dataset, split, task, lang,
tokens[], pos_tags[], lemmas[],
target_idx, target_lemma, target_pos,
source_sense_id, source_sense_system,
sense_ids_wordnet[], wordnet_lexicon,
doc_id, sent_id

sense_ids_wordnet is list-valued to handle multi-gold instances and any PWN-3.0 key that splits into multiple OEWN 2024 synsets.
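
Because gold labels are lists, a scorer has to decide what counts as a hit. A common convention in all-words WSD evaluation is to accept a prediction that matches any gold sense. The sketch below follows that convention and skips rows with no gold sense. Both choices are assumptions made for illustration, and accuracy is a hypothetical helper rather than a package function:

```python
def accuracy(gold_lists, predictions) -> float:
    """Fraction of scorable instances whose prediction is among the gold senses."""
    hits = scored = 0
    for gold, pred in zip(gold_lists, predictions):
        if not gold:          # no gold sense available: nothing to score against
            continue
        scored += 1
        hits += pred in gold  # multi-gold: any matching sense counts as correct
    return hits / scored if scored else 0.0

# Made-up sense IDs, for illustration only.
gold = [["s1"], ["s2", "s3"], [], ["s4"]]
pred = ["s1", "s3", "s9", "s5"]

print(accuracy(gold, pred))  # 0.6666666666666666
```

Skipping empty gold lists keeps unmapped rows from being counted as errors; a stricter evaluation could count them as misses instead.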

Roadmap

v0.2 — UFSAC corpora (SemCor, WNGT, MASC, OMSTI, Senseval lexical-sample).
v0.3 — WiC (CC-BY-NC), CoarseWSD-20.
v0.4 — XL-WSD multilingual via BabelNet → WordNet → OMW. Initial languages: it, de, fr, es, ja (then the full XL-WSD 18 over time).

Loader stubs already live under src/pywsd_datasets/loaders/ for ufsac and xl_wsd.

Citation

If you use these datasets, please cite the original sources:

Raganato, Camacho-Collados, Navigli (2017). Word Sense Disambiguation: A Unified Evaluation Framework and Empirical Comparison. EACL.
Plus the paper for whichever specific evaluation set you use (Senseval-2 / 3, SemEval-2007 T17, SemEval-2013 T12, SemEval-2015 T13).

License

MIT for the code. Each dataset keeps its original license; see the source papers. The Raganato bundle and the SemEval shared-task data are unrestricted for research use.

Sense-ID mapping details

PWN 3.0 sense keys are resolved against OEWN 2024 via wn.compat.sensekey. The small fraction of keys that fail to resolve (about 0.4 % overall, per the coverage table above) typically belong to WN 3.0 synsets that OEWN split, merged, or removed. Those rows ship with an empty sense_ids_wordnet list so the coverage report can flag them. Background:

Kafe (2023). Mapping Wordnets on the Fly with Permanent Sense Keys. arXiv:2303.01847.
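
For orientation, a PWN sense key such as art%1:09:00:: has the form lemma%ss_type:lex_filenum:lex_id:head_word:head_id, where ss_type encodes the part of speech (1 noun, 2 verb, 3 adjective, 4 adverb, 5 adjective satellite). The sketch below only splits a key into those fields. It does not perform the OEWN lookup, and parse_sensekey and SenseKey are hypothetical names, not the wn.compat.sensekey API:

```python
from typing import NamedTuple

# ss_type digit -> WordNet part-of-speech letter.
SS_TYPE_TO_POS = {"1": "n", "2": "v", "3": "a", "4": "r", "5": "s"}

class SenseKey(NamedTuple):
    lemma: str
    pos: str
    lex_filenum: str
    lex_id: str
    head_word: str
    head_id: str

def parse_sensekey(key: str) -> SenseKey:
    """Split 'lemma%ss_type:lex_filenum:lex_id:head_word:head_id' into fields."""
    lemma, _, rest = key.rpartition("%")
    ss_type, lex_filenum, lex_id, head_word, head_id = rest.split(":")
    return SenseKey(lemma, SS_TYPE_TO_POS[ss_type], lex_filenum, lex_id, head_word, head_id)

parsed = parse_sensekey("art%1:09:00::")
print(parsed.lemma, parsed.pos, parsed.lex_filenum, parsed.lex_id)  # art n 09 00
```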

Known issues

lcl.uniroma1.it serves a mismatched TLS certificate, so the loader fetches the Raganato bundle over plain HTTP. When this repo matures we'll mirror the zip to the HF Hub release assets.
The UFSAC v2.1 distribution is a Google Drive zip last updated in 2018; the same mirror plan applies once licensing is confirmed.

Release history

0.2.0: Apr 17, 2026
0.1.0: Apr 17, 2026 (this version)

Download files

Source Distribution

pywsd_datasets-0.1.0.tar.gz (17.1 kB)

Uploaded Apr 17, 2026 Source

Built Distribution

pywsd_datasets-0.1.0-py3-none-any.whl (18.2 kB)

Uploaded Apr 17, 2026 Python 3

File details

Details for the file pywsd_datasets-0.1.0.tar.gz.

File metadata

Download URL: pywsd_datasets-0.1.0.tar.gz
Upload date: Apr 17, 2026
Size: 17.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.4

File hashes

Hashes for pywsd_datasets-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`396a4c93a36ca1490c672e98e750fd94ba90dc9a671b1b4f294a07ec059277a9`
MD5	`eee669dd52de730d81c39bdb247f0c8b`
BLAKE2b-256	`0863522086fc1a94966fc5848b5ba1fabec53388aaac9c7f12286bb8f0865e51`

File details

Details for the file pywsd_datasets-0.1.0-py3-none-any.whl.

File metadata

Download URL: pywsd_datasets-0.1.0-py3-none-any.whl
Upload date: Apr 17, 2026
Size: 18.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.4

File hashes

Hashes for pywsd_datasets-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a776c4cb057a462c4f52ce492fefe010d378c18585051bdc6c2c78d4daed78ac`
MD5	`23f59ad80d91e7660373752870b812fd`
BLAKE2b-256	`d338243e8246612ffcc7e3ec7748ed8347618d751ac820ad48cfe63348b47de2`
