---
pretty_name: pywsd-datasets
license: mit
task_categories:
  - token-classification
language:
  - en
tags:
  - word-sense-disambiguation
  - wsd
  - wordnet
  - oewn
  - semeval
  - senseval
size_categories:
  - 1K<n<10K
configs:
  - config_name: en-senseval2-aw
    data_files:
      - split: test
        path: data/en-senseval2-aw/test.parquet
  - config_name: en-senseval3-aw
    data_files:
      - split: test
        path: data/en-senseval3-aw/test.parquet
  - config_name: en-semeval2007-aw
    data_files:
      - split: test
        path: data/en-semeval2007-aw/test.parquet
  - config_name: en-semeval2013-aw
    data_files:
      - split: test
        path: data/en-semeval2013-aw/test.parquet
  - config_name: en-semeval2015-aw
    data_files:
      - split: test
        path: data/en-semeval2015-aw/test.parquet
---

pywsd-datasets

Unified Word Sense Disambiguation benchmark datasets, normalized to modern wn lexicon sense IDs (oewn:2024 for English, OMW for other languages).

Companion to pywsd ≥ 1.3.0.

What's in v0.1 (English-only, expanding)

| Config            | Source               | Instances | OEWN 2024 coverage |
|-------------------|----------------------|-----------|--------------------|
| en-senseval2-aw   | Raganato et al. 2017 | 2,282     | 99.43 %            |
| en-senseval3-aw   | Raganato et al. 2017 | 1,850     | 99.51 %            |
| en-semeval2007-aw | Raganato et al. 2017 | 455       | 99.78 %            |
| en-semeval2013-aw | Raganato et al. 2017 | 1,644     | 100.00 %           |
| en-semeval2015-aw | Raganato et al. 2017 | 1,022     | 99.32 %            |
| Total             |                      | 7,253     | 99.59 %            |

Install

pip install pywsd-datasets

Use via HuggingFace datasets

from datasets import load_dataset
ds = load_dataset("alvations/pywsd-datasets", "en-senseval2-aw")
ds["test"][0]
# {'instance_id': 'd000.s000.t000', 'dataset': 'senseval2_aw',
#  'split': 'test', 'lang': 'en',
#  'tokens': ['The', 'art', 'of', 'change-ringing', ...],
#  'target_idx': 1, 'target_lemma': 'art', 'target_pos': 'n',
#  'source_sense_id': 'art%1:09:00::',
#  'source_sense_system': 'pwn_sensekey_3.0',
#  'sense_ids_wordnet': ['oewn-05646832-n'],
#  'wordnet_lexicon': 'oewn:2024', ...}
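As a sanity check on the schema, `target_idx` indexes into `tokens[]`: the surface token at that position is the word being disambiguated, and it lemmatizes to `target_lemma`. A minimal illustration using values copied from the sample row above (hand-built dict, nothing fetched from the Hub):

```python
# Values copied from the sample row printed above.
row = {
    "tokens": ["The", "art", "of", "change-ringing"],
    "target_idx": 1,
    "target_lemma": "art",
    "sense_ids_wordnet": ["oewn-05646832-n"],
}

# target_idx points into tokens[]; the token there is the disambiguation target.
surface = row["tokens"][row["target_idx"]]
print(surface)  # art
```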

Use via the loader package

from pywsd_datasets.loaders.raganato import iter_instances
for inst in iter_instances("senseval2"):
    print(inst.target_lemma, inst.sense_ids_wordnet)

Rebuild locally

pip install pywsd-datasets[dev]
python -m pywsd_datasets.scripts.build_all              # parquet in data/
python -m pywsd_datasets.scripts.coverage_report        # OEWN 2024 resolution %
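The coverage number reduces to a simple fraction: rows whose PWN 3.0 key resolved to at least one OEWN 2024 sense. A minimal sketch of that metric (an illustration of the idea, not the package's actual `coverage_report` implementation):

```python
def oewn_coverage(rows):
    """Percentage of instances whose source sense key resolved to >= 1
    OEWN 2024 sense. Unresolved rows ship with an empty sense_ids_wordnet
    list, so coverage is the fraction of rows with a non-empty list."""
    if not rows:
        return 0.0
    resolved = sum(1 for r in rows if r["sense_ids_wordnet"])
    return 100.0 * resolved / len(rows)

rows = [
    {"sense_ids_wordnet": ["oewn-05646832-n"]},
    {"sense_ids_wordnet": []},  # key that OEWN split, merged, or removed
    {"sense_ids_wordnet": ["oewn-00002684-n", "oewn-00003553-n"]},
]
print(f"{oewn_coverage(rows):.2f}")  # 66.67
```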

Schema

Every row follows WSDInstance:

instance_id, dataset, split, task, lang,
tokens[], pos_tags[], lemmas[],
target_idx, target_lemma, target_pos,
source_sense_id, source_sense_system,
sense_ids_wordnet[], wordnet_lexicon,
doc_id, sent_id

sense_ids_wordnet is list-valued to handle multi-gold instances and any PWN-3.0 key that splits into multiple OEWN 2024 synsets.
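One practical consequence of the list-valued gold: an evaluation script should count a prediction correct if it matches any of the gold IDs, and skip rows whose mapping failed (empty list). A hypothetical scorer along those lines (not shipped with the package):

```python
def score(predictions, gold):
    """Accuracy over scorable instances: a prediction is correct if it
    matches any gold sense ID; rows with an empty gold list are skipped."""
    scorable = correct = 0
    for pred, gold_ids in zip(predictions, gold):
        if not gold_ids:          # unresolved mapping: not scorable
            continue
        scorable += 1
        if pred in gold_ids:
            correct += 1
    return correct / scorable if scorable else 0.0

acc = score(
    ["oewn-05646832-n", "oewn-00001740-n"],
    [["oewn-05646832-n"], ["oewn-00001740-n", "oewn-00002098-v"]],
)
print(acc)  # 1.0
```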

Roadmap

  • v0.2 — UFSAC corpora (SemCor, WNGT, MASC, OMSTI, Senseval lexical-sample).
  • v0.3 — WiC (CC-BY-NC), CoarseWSD-20.
  • v0.4 — XL-WSD multilingual via BabelNet → WordNet → OMW. Initial languages: it, de, fr, es, ja (then the full XL-WSD 18 over time).

Loader stubs already live under src/pywsd_datasets/loaders/ for ufsac and xl_wsd.

Citation

If you use these datasets, please cite the original sources:

  • Raganato, Camacho-Collados, Navigli (2017). Word Sense Disambiguation: A Unified Evaluation Framework and Empirical Comparison. EACL.
  • Plus the paper for whichever specific evaluation set you use (Senseval-2 / 3, SemEval-2007 T17, SemEval-2013 T12, SemEval-2015 T13).

License

MIT for the code. Each dataset keeps its original license; see the source papers. The Raganato bundle and the SemEval shared-task data are unrestricted for research use.

Sense-ID mapping details

PWN 3.0 sense keys are resolved against OEWN 2024 via wn.compat.sensekey. The few percent of keys that fail to resolve are typically WN 3.0 synsets that OEWN split, merged, or removed — those rows ship with an empty sense_ids_wordnet list so the coverage report can flag them. Background:

  • Kafe, E. (2023). Mapping Wordnets on the Fly with Permanent Sense Keys. arXiv:2303.01847.
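For orientation, a PWN 3.0 sense key packs lemma%ss_type:lex_filenum:lex_id:head_word:head_id into a single string, where ss_type 1–5 maps to the parts of speech n/v/a/r/s. A minimal parser for that format (illustrative only; the actual resolution in this project goes through wn.compat.sensekey):

```python
# ss_type codes from the WordNet sense-index documentation (5 = adjective satellite).
SS_TYPES = {1: "n", 2: "v", 3: "a", 4: "r", 5: "s"}

def parse_sense_key(key):
    """Split a PWN 3.0 sense key into its documented fields."""
    lemma, lex_sense = key.split("%", 1)
    ss_type, lex_filenum, lex_id, head_word, head_id = lex_sense.split(":")
    return {
        "lemma": lemma,
        "pos": SS_TYPES[int(ss_type)],
        "lex_filenum": int(lex_filenum),
        "lex_id": int(lex_id),
        "head_word": head_word or None,          # only set for satellites
        "head_id": int(head_id) if head_id else None,
    }

print(parse_sense_key("art%1:09:00::"))
# {'lemma': 'art', 'pos': 'n', 'lex_filenum': 9, 'lex_id': 0,
#  'head_word': None, 'head_id': None}
```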

Known issues

  • lcl.uniroma1.it serves a TLS certificate with a mismatched hostname, so the loader fetches the Raganato bundle over plain HTTP. Once this repo matures, we'll mirror the zip as HF Hub release assets.
  • The UFSAC v2.1 distribution is a Google Drive zip last updated in 2018; the same mirror plan applies once licensing is confirmed.
