Word Sense Disambiguation models that have been converted from PyTorch Lightning to PyTorch with HuggingFace Hub Mixin

WSD Torch Models

This repository contains the code for the Word Sense Disambiguation (WSD) PyTorch models that have been trained and developed by the UCREL NLP Group at Lancaster University, UK.

Installation

Requires Python 3.10 or greater and torch>=2.2,<3.0. It is best to install the version of PyTorch you would like to use (e.g. the CPU or GPU build) before installing this package; otherwise you will get the default version of PyTorch for your operating system/setup.

pip install wsd-torch-models
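
If PyTorch is already installed, a check along the following lines can confirm that it falls inside the supported range before you install this package. This is a minimal sketch using only the standard library: the helper name torch_version_supported is hypothetical and not part of wsd-torch-models, and it only inspects the leading major.minor numbers of a version string such as the one reported by torch.__version__.

```python
def torch_version_supported(version: str) -> bool:
    """Return True if `version` looks like it satisfies torch>=2.2,<3.0.

    Hypothetical helper for illustration only. Local build suffixes such as
    "+cu121" or "+cpu" are ignored, and pre-release ordering is not modelled.
    """
    # Drop any local build suffix, e.g. "2.5.1+cu121" -> "2.5.1".
    public_version = version.split("+", maxsplit=1)[0]
    major_text, minor_text = public_version.split(".")[:2]
    major, minor = int(major_text), int(minor_text)
    # Tuple comparison checks the lower and upper bound in one expression.
    return (2, 2) <= (major, minor) < (3, 0)


if __name__ == "__main__":
    for candidate in ["2.1.2", "2.2.0", "2.5.1+cu121", "3.0.0"]:
        print(candidate, torch_version_supported(candidate))
```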

Models with examples

Here we list the various WSD models we have implemented and how to use them.

Bi-Encoder Model (BEM)

An inference-only implementation of the Bi-Encoder Model (BEM) for Word Sense Disambiguation from the paper Moving Down the Long Tail of Word Sense Disambiguation with Gloss Informed Bi-encoders. The bi-encoder encodes each word to disambiguate using its text context (e.g. the whole sentence or document), encodes every sense definition it is given, and returns the most similar sense definition(s) for each word. Unlike the original BEM model, we use the same model to encode both the text to disambiguate and the label definitions.
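
The ranking step described above can be sketched in a few lines. This is not the library's implementation: the vectors are invented toy values rather than real model embeddings, the function name rank_senses is hypothetical, and the use of a dot product as the similarity measure is an assumption.

```python
def dot(a: list[float], b: list[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def rank_senses(
    token_embedding: list[float],
    sense_embeddings: dict[str, list[float]],
    top_n: int,
) -> list[str]:
    """Return the `top_n` sense labels most similar to the token embedding."""
    ranked = sorted(
        sense_embeddings,
        key=lambda label: dot(token_embedding, sense_embeddings[label]),
        reverse=True,
    )
    return ranked[:top_n]


if __name__ == "__main__":
    # Toy 3-dimensional embeddings, made up for illustration only.
    bank_in_context = [0.9, 0.1, 0.2]
    sense_definitions = {
        "I1": [0.1, 0.9, 0.0],
        "W3": [0.8, 0.2, 0.1],
        "M4": [0.5, 0.0, 0.5],
    }
    print(rank_senses(bank_in_context, sense_definitions, top_n=2))
```

Because the sense definitions are encoded independently of the text, their embeddings can be computed once and reused for every token that needs disambiguating.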

These models were trained using the code from the GitHub repository https://github.com/UCREL/experimental-wsd and ported to this library for inference-only use, with easy saving to and loading from the HuggingFace Hub.

We currently have 4 pre-trained BEM models that predict sense labels from the USAS sense inventory. USAS contains 232 sense categories, which is very coarse in comparison to WordNet (approximately 117,000 senses). More details about these models and how they were trained can be found in our forthcoming paper. The models are:

ucrelnlp/PyMUSAS-Neural-Engish-Small-BEM - 17 million parameter English only model.
ucrelnlp/PyMUSAS-Neural-Engish-Base-BEM - 68 million parameter English only model.
ucrelnlp/PyMUSAS-Neural-Multilingual-Small-BEM - 140 million parameter Multilingual model.
ucrelnlp/PyMUSAS-Neural-Multilingual-Base-BEM - 307 million parameter Multilingual model.

An example of how to run them is shown below; this particular example uses the Small English model:

from transformers import AutoTokenizer
import torch

from wsd_torch_models.bem import BEM


if __name__ == "__main__": 
    wsd_model_name = "ucrelnlp/PyMUSAS-Neural-Engish-Small-BEM"
    wsd_model = BEM.from_pretrained(wsd_model_name)
    tokenizer = AutoTokenizer.from_pretrained(wsd_model_name)

    wsd_model.eval()
    # Change this to the device you would like to use, e.g. cpu
    model_device = "cpu"
    wsd_model.to(device=model_device)
    
    sentence = "The river bank was full of fish"
    sentence_tokens = sentence.split()
    
    with torch.inference_mode(mode=True):
        # sub_word_tokenizer can be None, in which case the appropriate tokenizer
        # is downloaded for you. It is generally better to pass the tokenizer in,
        # as it saves checking whether the tokenizer has already been downloaded.
        predictions = wsd_model.predict(sentence_tokens, sub_word_tokenizer=tokenizer, top_n=5)
        
        for sentence_token, semantic_tags in zip(sentence_tokens, predictions):
            print(f"Token: {sentence_token}")
            print("Most likely tags: ")
            for tag in semantic_tags:
                tag_definition = wsd_model.label_to_definition[tag]
                print(f"\t{tag}: {tag_definition}")
            print()

Output from running the code above:

Token: The
Most likely tags: 
        Z5: title: Grammatical bin description: Prepositions/adverbs/conjunctions, etc
        Z3: title: Other proper names description: Nouns that distinguish/identify a product, company, etc. (note – also includes acronyms)
        Z1: title: Personal names description: Nouns that distinguish/identify an individual (e.g. a first name and/or surname, a title of address)
        Z2: title: Geographical names description: Nouns that distinguish/identify a specific place (e.g. the name of a road, a city, a country, a continent, etc.)
        A7: title: Definite (+ modals) description: Abstract terms of modality (possibility, necessity, certainty, etc.)

Token: river
Most likely tags: 
        M4: title: Means of transport (Water) description: Terms depicting means of transport/ways of transporting and/or travelling (by water)
        W3: title: Geographical terms description: Geographical terms
        Z1: title: Personal names description: Nouns that distinguish/identify an individual (e.g. a first name and/or surname, a title of address)
        L2: title: Living creatures generally description: Terms relating to living creatures (e.g. non-human)
        Z2: title: Geographical names description: Nouns that distinguish/identify a specific place (e.g. the name of a road, a city, a country, a continent, etc.)

Token: bank
Most likely tags: 
        M4: title: Means of transport (Water) description: Terms depicting means of transport/ways of transporting and/or travelling (by water)
        I1: title: Money generally description: Terms relating to money generally
        Z1: title: Personal names description: Nouns that distinguish/identify an individual (e.g. a first name and/or surname, a title of address)
        Z2: title: Geographical names description: Nouns that distinguish/identify a specific place (e.g. the name of a road, a city, a country, a continent, etc.)
        W3: title: Geographical terms description: Geographical terms

Token: was
Most likely tags: 
        M4: title: Means of transport (Water) description: Terms depicting means of transport/ways of transporting and/or travelling (by water)
        W3: title: Geographical terms description: Geographical terms
        M3: title: Means of transport (Land) description: Terms depicting means of transport/ways of transporting and/or travelling (on land)
        K6: title: Children’s games and toys description: Terms relating to children’s games and toys
        H1: title: Architecture & kinds of houses & buildings description: Terms relating to buildings/habitats of various kinds, and their construction

Token: full
Most likely tags: 
        M4: title: Means of transport (Water) description: Terms depicting means of transport/ways of transporting and/or travelling (by water)
        Z1: title: Personal names description: Nouns that distinguish/identify an individual (e.g. a first name and/or surname, a title of address)
        W3: title: Geographical terms description: Geographical terms
        L3: title: Plants description: Terms relating to plants and plant-life
        Z3: title: Other proper names description: Nouns that distinguish/identify a product, company, etc. (note – also includes acronyms)

Token: of
Most likely tags: 
        Z1: title: Personal names description: Nouns that distinguish/identify an individual (e.g. a first name and/or surname, a title of address)
        Z3: title: Other proper names description: Nouns that distinguish/identify a product, company, etc. (note – also includes acronyms)
        Z2: title: Geographical names description: Nouns that distinguish/identify a specific place (e.g. the name of a road, a city, a country, a continent, etc.)
        O1.1: title: Substances and materials generally: Solid description: Terms depicting solid substances/materials
        L3: title: Plants description: Terms relating to plants and plant-life

Token: fish
Most likely tags: 
        L2: title: Living creatures generally description: Terms relating to living creatures (e.g. non-human)
        O1.1: title: Substances and materials generally: Solid description: Terms depicting solid substances/materials
        Z1: title: Personal names description: Nouns that distinguish/identify an individual (e.g. a first name and/or surname, a title of address)
        O2: title: Objects generally description: Terms relating to objects generally
        Z3: title: Other proper names description: Nouns that distinguish/identify a product, company, etc. (note – also includes acronyms)
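
If you only need the single most likely tag per token, the nested structure used in the example can be flattened as follows. This is a minimal sketch: the helper name top_tag_per_token is hypothetical, and it assumes, as the example and its output suggest, that the predictions contain one list of tag strings per token, ordered from most to least likely.

```python
def top_tag_per_token(
    tokens: list[str], predictions: list[list[str]]
) -> list[tuple[str, str]]:
    """Pair each token with the first (assumed most likely) predicted tag."""
    if len(tokens) != len(predictions):
        raise ValueError("Expected one list of predicted tags per token.")
    return [(token, tags[0]) for token, tags in zip(tokens, predictions)]


if __name__ == "__main__":
    # Values copied from the first three tokens of the output above.
    example_tokens = ["The", "river", "bank"]
    example_predictions = [["Z5", "Z3"], ["M4", "W3"], ["M4", "I1"]]
    print(top_tag_per_token(example_tokens, example_predictions))
```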

NOTE: the pre-trained models we have released come with the sense definitions they were trained to predict (the USAS sense definitions). If you would like to use a different set of sense definitions, see the wsd_torch_models.bem.BEM.embed_and_set_label_definitions method, which allows you to change the sense definitions the model will predict. We have not tested how well these models perform on zero-shot sense prediction, i.e. training on one sense inventory and predicting on data that uses a different sense inventory.

Training Data (BEM)

All of these models have been trained on a portion of the ucrelnlp/English-USAS-Mosaico dataset, specifically data/wikipedia_shard_0.jsonl.gz, which contains 1,083 English Wikipedia articles with 444,880 sentences and 6.6 million tokens, of which 5.3 million are silver-labelled tokens generated by an English rule-based semantic tagger.
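
A shard with the .jsonl.gz extension can be streamed without decompressing it to disk first. The sketch below uses only the standard library and assumes each non-empty line is a single JSON object; it makes no assumption about the field names inside the records, and the helper name iter_jsonl_gz is hypothetical. The demonstration writes its own tiny sample file rather than reading the real dataset.

```python
import gzip
import json
import tempfile
from collections.abc import Iterator
from pathlib import Path
from typing import Any


def iter_jsonl_gz(path: Path) -> Iterator[dict[str, Any]]:
    """Yield one parsed JSON object per non-empty line of a gzipped file."""
    with gzip.open(path, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            if line.strip():
                yield json.loads(line)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as temp_dir:
        # A made-up two-record sample, not the real dataset schema.
        sample_path = Path(temp_dir) / "sample.jsonl.gz"
        with gzip.open(sample_path, mode="wt", encoding="utf-8") as sample:
            sample.write('{"id": 1}\n{"id": 2}\n')
        for record in iter_jsonl_gz(sample_path):
            print(record)
```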

Model Architecture (BEM)

Parameter	17M English	68M English	140M Multilingual	307M Multilingual
Layers	7	19	22	22
Hidden Size	256	512	384	768
Intermediate Size	384	768	1152	1152
Attention Heads	4	8	6	12
Total Parameters	17M	68M	140M	307M
Non-embedding Parameters	3.9M	42.4M	42M	110M
Max Sequence Length	8,000	8,000	8,192	8,192
Vocabulary Size	50,368	50,368	256,000	256,000
Tokenizer	ModernBERT	ModernBERT	Gemma 2	Gemma 2
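
The gap between total and non-embedding parameters in the table can be sanity-checked from the vocabulary and hidden sizes. The sketch below assumes the token embedding matrix holds one hidden-size vector per vocabulary entry, which is an assumption about these models rather than something stated above; the function name is hypothetical.

```python
# Vocabulary and hidden sizes copied from the table above.
MODELS = {
    "17M English": {"vocabulary_size": 50_368, "hidden_size": 256},
    "68M English": {"vocabulary_size": 50_368, "hidden_size": 512},
    "140M Multilingual": {"vocabulary_size": 256_000, "hidden_size": 384},
    "307M Multilingual": {"vocabulary_size": 256_000, "hidden_size": 768},
}


def token_embedding_parameters(vocabulary_size: int, hidden_size: int) -> int:
    """Parameter count of a vocabulary_size x hidden_size embedding matrix."""
    return vocabulary_size * hidden_size


if __name__ == "__main__":
    for name, sizes in MODELS.items():
        count = token_embedding_parameters(**sizes)
        print(f"{name}: {count:,} embedding parameters")
```

For the 17M English model this gives 12,894,208 embedding parameters, close to the 17M - 3.9M = 13.1M implied by the table; the rounding of the reported totals accounts for the small difference.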

Evaluation (BEM)

We have evaluated the models on 5 datasets covering 5 different languages. 4 of these datasets are publicly available, whereas one (the Irish data) requires permission from the data owner to access. Top 1 and top 5 accuracy results for these models are shown below; for a more comprehensive comparison please see the technical report.

Dataset	17M English	68M English	140M Multilingual	307M Multilingual
Top 1
Chinese	-	-	42.2	47.9
English	66.4	70.1	66.0	70.2
Finnish	-	-	15.8	25.9
Irish	-	-	28.5	35.6
Welsh	-	-	21.7	42.0
Top 5
Chinese	-	-	66.3	70.4
English	87.6	90.0	88.9	90.1
Finnish	-	-	32.8	42.4
Irish	-	-	47.6	51.6
Welsh	-	-	40.8	56.4

The publicly available datasets can be found on HuggingFace Hub ucrelnlp/USAS-WSD.

Note the English models have not been evaluated on the non-English datasets as they are unlikely to be able to represent non-English text well or perform well on non-English data.
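
For reference, top 1 and top 5 accuracy can be computed as sketched below. This is an illustration and not the evaluation code behind the table: it assumes one gold tag per token, ranked predictions ordered from most to least likely, and scores expressed as percentages. Datasets that allow several acceptable tags per token would need a different matching rule.

```python
def top_k_accuracy(predictions: list[list[str]], gold: list[str], k: int) -> float:
    """Percentage of tokens whose gold tag is among the first k predictions."""
    if len(predictions) != len(gold):
        raise ValueError("Expected one gold tag per list of predictions.")
    if not gold:
        raise ValueError("Cannot compute accuracy on an empty dataset.")
    correct = sum(
        1 for ranked_tags, gold_tag in zip(predictions, gold)
        if gold_tag in ranked_tags[:k]
    )
    return 100 * correct / len(gold)


if __name__ == "__main__":
    # Made-up predictions and gold tags for four tokens.
    ranked = [["Z5", "Z3"], ["M4", "W3"], ["M4", "I1"], ["L2", "O1.1"]]
    gold_tags = ["Z5", "W3", "I1", "F1"]
    print(top_k_accuracy(ranked, gold_tags, k=1))
    print(top_k_accuracy(ranked, gold_tags, k=2))
```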

Development

Setup

You can either use the dev container with your favourite editor (e.g. VSCode) or create your setup locally. Below we demonstrate both.

Both setups share the same tools:

uv for Python packaging and development
make (OPTIONAL) for automation of tasks, not strictly required but makes life easier.

Dev Container

A dev container uses a docker container to create the required development environment. The Dockerfile we use for this dev container can be found at ./.devcontainer/Dockerfile. Running it locally requires docker to be installed. You can also run it in a cloud-based code editor; for a list of supported editors/cloud editors see the following webpage.

To run for the first time on a local VSCode editor (a slightly more detailed and better guide is available on the VSCode website):

Ensure docker is running.
Ensure the VSCode Dev Containers extension is installed in your VSCode editor.
Open the command palette (CMD + SHIFT + P) and then select Dev Containers: Rebuild and Reopen in Container.

You should now have everything you need to develop: uv, make, and, for VSCode, various extensions like Pylance.

If you have any trouble see the VSCode website.

Local

To run locally first ensure you have the following tools installed locally:

uv for Python packaging and development. (version 0.9.9)
make (OPTIONAL) for automation of tasks, not strictly required but makes life easier.
- Ubuntu: apt-get install make
- Mac: Xcode command line tools includes make else you can use brew.
- Windows: Various solutions proposed in this blog post on how to install on Windows, including Cygwin, and Windows Subsystem for Linux.

When developing on the project you will want to install the Python package locally in editable format with all the extra requirements. This can be done like so:

uv sync

Running linters and tests

This code base uses isort, flake8 and mypy to ensure that the code is consistently formatted and contains type hints. The isort and mypy settings can be found in ./pyproject.toml and the flake8 settings in ./.flake8. To run these linters:

make lint

To run the tests with code coverage (NOTE: these are the code coverage tests that the Continuous Integration (CI) reports at the top of this README):

make tests

Setting a different default python version

The default (recommended) Python version is given in ./.python-version, currently 3.13. It can be changed using the uv command:

uv python pin
# uv python pin 3.13

Converting PyTorch Lightning Model to PyTorch HuggingFace Model

Some of the WSD models were originally trained using PyTorch Lightning. This section details how we convert these models to PyTorch models with the HuggingFace PyTorch Model Hub Mixin, which allows a model to be easily saved to and loaded from the HuggingFace Hub, and how we upload the converted models to the HuggingFace Hub.
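
One step such a conversion typically involves is renaming the weights. PyTorch Lightning checkpoints store the weights under a "state_dict" key, with each weight name prefixed by the attribute name of the wrapped module. The sketch below shows that renaming step on plain dictionaries; it is not the conversion script's actual code, the "model." prefix and the key names are hypothetical, and in a real checkpoint the values would be tensors.

```python
def strip_key_prefix(state_dict: dict[str, object], prefix: str) -> dict[str, object]:
    """Keep only the keys starting with `prefix`, with the prefix removed."""
    return {
        key[len(prefix):]: value
        for key, value in state_dict.items()
        if key.startswith(prefix)
    }


if __name__ == "__main__":
    # A stand-in for a loaded Lightning checkpoint; lists replace tensors.
    lightning_checkpoint = {
        "global_step": 532637,
        "state_dict": {
            "model.encoder.weight": [0.1, 0.2],
            "model.encoder.bias": [0.0],
            "loss_fn.weight": [1.0],
        },
    }
    plain_state_dict = strip_key_prefix(lightning_checkpoint["state_dict"], "model.")
    print(sorted(plain_state_dict))
```

The resulting dictionary could then be loaded with load_state_dict into a torch.nn.Module subclass that also inherits from huggingface_hub.PyTorchModelHubMixin, which supplies the save_pretrained, from_pretrained and push_to_hub methods.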

PyMUSAS BEM models

The script has various arguments, which are detailed in its help section:

usage: convert_and_upload_bem_model.py [-h] [-r] [-t] [-m] hf_repository_id hf_branch model_checkpoint readme_template_path

Converts a PyTorch Lightning model to a PyTorch HuggingFace model and uploads it to the HuggingFace Hub. The script allows you to just update the model README, model tokenizer, the model itself, or any combinationof these options.

positional arguments:
  hf_repository_id      The repository ID to upload the model too on the HuggingFace Hub, e.g. ucrelnlp/PyMUSAS-Neural-Engish-Small-BEM
  hf_branch             The branch to upload the model too on the HuggingFace Hub, e.g. main, a branch named after the step the model was trained on.
  model_checkpoint      Path to the model checkpoint that you would like to upload
  readme_template_path  File path to the models README template

options:
  -h, --help            show this help message and exit
  -r, --update-readme   update model README
  -t, --update-tokenizer
                        update model tokenizer
  -m, --update-model    update model

To upload the model, tokenizer and README for all 4 models to the main branch:

uv run scripts/convert_and_upload_bem_model.py ucrelnlp/PyMUSAS-Neural-Engish-Small-BEM main checkpoints/bem_english_small/model-step=532637-validation_accuracy=0.99394.ckpt model_readmes/pymusas_bem.md -rmt

uv run scripts/convert_and_upload_bem_model.py ucrelnlp/PyMUSAS-Neural-Engish-Base-BEM main checkpoints/bem_english_base/model-step=532637-validation_accuracy=0.99669.ckpt model_readmes/pymusas_bem.md -rmt

uv run scripts/convert_and_upload_bem_model.py ucrelnlp/PyMUSAS-Neural-Multilingual-Small-BEM main checkpoints/bem_multilingual_small/model-step=392261-validation_accuracy=0.99615.ckpt model_readmes/pymusas_bem.md -rmt

uv run scripts/convert_and_upload_bem_model.py ucrelnlp/PyMUSAS-Neural-Multilingual-Base-BEM main checkpoints/bem_multilingual_base/model-step=240947-validation_accuracy=0.99625.ckpt model_readmes/pymusas_bem.md -rmt

To upload only an updated/new README:

uv run scripts/convert_and_upload_bem_model.py ucrelnlp/PyMUSAS-Neural-Engish-Small-BEM main checkpoints/bem_english_small/model-step=532637-validation_accuracy=0.99394.ckpt model_readmes/pymusas_bem.md -r
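
The way the combined short flags in the commands above are interpreted can be illustrated with argparse. The parser below is a reconstruction from the help text, not the script's source: that the script uses argparse with store_true flags is an assumption, and model.ckpt is a placeholder path.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Rebuild the command line interface described in the help text."""
    parser = argparse.ArgumentParser(prog="convert_and_upload_bem_model.py")
    parser.add_argument("hf_repository_id")
    parser.add_argument("hf_branch")
    parser.add_argument("model_checkpoint")
    parser.add_argument("readme_template_path")
    parser.add_argument("-r", "--update-readme", action="store_true",
                        help="update model README")
    parser.add_argument("-t", "--update-tokenizer", action="store_true",
                        help="update model tokenizer")
    parser.add_argument("-m", "--update-model", action="store_true",
                        help="update model")
    return parser


if __name__ == "__main__":
    positional = [
        "ucrelnlp/PyMUSAS-Neural-Engish-Small-BEM",
        "main",
        "model.ckpt",  # placeholder checkpoint path
        "model_readmes/pymusas_bem.md",
    ]
    # "-rmt" is shorthand for "-r -m -t": all three updates are enabled.
    print(build_parser().parse_args(positional + ["-rmt"]))
    # "-r" alone enables only the README update.
    print(build_parser().parse_args(positional + ["-r"]))
```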

Python packages that can be removed and replaced

As of Python version 3.11:

from typing_extensions import Self - the typing_extensions package can be removed and this can be replaced with from typing import Self
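
While Python 3.10 is still supported, a version-guarded import lets the same code use the standard library where it can. This is a small sketch rather than code from this package: the TagCollector class is hypothetical, and on Python 3.10 the example still needs typing_extensions to be installed.

```python
import sys

if sys.version_info >= (3, 11):
    from typing import Self
else:
    # Python 3.10 has no typing.Self, so fall back to the backport.
    from typing_extensions import Self


class TagCollector:
    """Hypothetical class whose methods return the instance for chaining."""

    def __init__(self) -> None:
        self.tags: list[str] = []

    def add(self, tag: str) -> Self:
        self.tags.append(tag)
        return self


if __name__ == "__main__":
    collector = TagCollector().add("Z5").add("M4")
    print(collector.tags)
```

Once Python 3.10 support is dropped, the guard and the typing_extensions dependency can both be removed, leaving only from typing import Self.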

Citation

Technical report is forthcoming.

Contact Information

Paul Rayson (p.rayson@lancaster.ac.uk)
Andrew Moore (a.p.moore@lancaster.ac.uk / andrew.p.moore94@gmail.com)
UCREL Research Centre (ucrel@lancaster.ac.uk) at Lancaster University.

Release history

0.1.3 - Apr 29, 2026
0.1.2 - Dec 10, 2025
0.1.1 - Dec 3, 2025
0.1.0 - Dec 2, 2025 (this version)
