
✖️MEN

xMEN is an extensible toolkit for Cross-lingual (x) Medical Entity Normalization. Through its compatibility with the BigBIO (BigScience Biomedical) framework, it can be used out of the box to run experiments with many open biomedical datasets. It can also be easily integrated with existing Named Entity Recognition (NER) pipelines.

Installation

xMEN is available on PyPI:

pip install xmen

Development

We use Poetry for building, testing and dependency management (see pyproject.toml).

📂 Data Loading

Usually, BigBIO-compatible datasets can just be loaded from the Hugging Face Hub:

from datasets import load_dataset
dataset = load_dataset("bigbio/distemist", "distemist_linking_bigbio_kb")
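
The bigbio_kb schema stores documents as passages plus entity mentions. To get a feel for the data, you can inspect a gold mention and its normalization:

# Look at the first gold mention of the first training document.
# In the bigbio_kb schema, entities carry offsets, surface text, and
# a 'normalized' list of database references (db_name / db_id).
example = dataset["train"][0]
entity = example["entities"][0]
print(entity["text"], entity["normalized"])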

Integration with NER Tools

To use xMEN with existing NER pipelines, you can also create a dataset at runtime.

spaCy

from xmen.data import from_spacy
docs = ... #  list of spaCy docs with entity spans
dataset = from_spacy(docs)
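
For example, a minimal end-to-end sketch (any spaCy pipeline that populates doc.ents will do; the model name below is only illustrative):

import spacy
from xmen.data import from_spacy

# Illustrative pipeline choice; any model with an NER component works
nlp = spacy.load("en_core_web_sm")
docs = list(nlp.pipe(["The patient was diagnosed with type 2 diabetes."]))
dataset = from_spacy(docs)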

🔧 Configuration and CLI

xMEN provides a convenient command line interface to prepare entity linking pipelines: it creates target dictionaries and pre-computes indices for linking mentions to the concepts they contain.

Run xmen help to get an overview of the available commands.

Configuration is done through .yaml files. For examples, see the conf folder.

📕 Creating Dictionaries

Run xmen dict to create dictionaries to link against. Although the most common use case is to create subsets of the UMLS, it also supports passing custom parser scripts for non-UMLS dictionaries.

Note: Creating UMLS subsets requires a local installation of the UMLS Metathesaurus (not only MRCONSO.RRF). In the examples, we assume that the environment variable $UMLS_HOME points to the installation path. You can either set this variable or replace the path with your local installation path.

UMLS Subsets

Example configuration for MedMentions:

name: medmentions

dict:
  umls:
    lang: 
      - en
    meta_path: ${oc.env:UMLS_HOME}/2017AA/META
    version: 2017AA
    semantic_types:
      - T005
      - T007
      - T017
      - T022
      - T031
      - T033
      - T037
      - T038
      - T058
      - T062
      - T074
      - T082
      - T091
      - T092
      - T097
      - T098
      - T103
      - T168
      - T170
      - T201
      - T204
    sabs:
      - CPT
      - FMA
      - GO
      - HGNC
      - HPO
      - ICD10
      - ICD10CM
      - ICD9CM
      - MDR
      - MSH
      - MTH
      - NCBI
      - NCI
      - NDDF
      - NDFRT
      - OMIM
      - RXNORM
      - SNOMEDCT_US

Running xmen dict conf/medmentions.yaml creates a .jsonl file from the described UMLS subset.
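
To sanity-check the output, you can peek at the first entry of the generated file; each line of the .jsonl file holds one concept record (the output path below is an assumption based on the config name):

import json

# Print the first concept record of the generated dictionary;
# the filename is assumed from the config name above.
with open("medmentions.jsonl") as f:
    print(json.loads(f.readline()))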

Using Custom Dictionaries

Parsing scripts for custom dictionaries can be provided with the --code option (examples can be found in the dicts folder).

Example configuration for DisTEMIST:

name: distemist

dict:
  custom:
    lang: 
      - es
    distemist_path: path/to/dictionary_distemist.tsv

Running xmen dict conf/distemist.yaml --code dicts/distemist.py --key distemist_gazetteer creates a .jsonl file from the custom DisTEMIST gazetteer.
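
Such a parser script essentially turns the source file into concept records. The sketch below is purely illustrative: the column names and the record schema expected by xmen dict are assumptions, so consult the scripts in the dicts folder for the real interface:

import csv

# Hypothetical sketch of a gazetteer parser; the column names ("code",
# "term") and the expected record format are assumptions, not the
# actual interface defined by the toolkit.
def distemist_gazetteer(distemist_path):
    with open(distemist_path, encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            yield {"code": row["code"], "term": row["term"]}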

🔎 Candidate Generation

The xmen index command is used to compute term indices from a dictionary created through the dict command. If an index already exists, you will be prompted to overwrite the existing file (or pass --overwrite).

xMEN provides implementations of different neural and non-neural candidate generators:

TF-IDF Weighted Character N-grams

Based on the implementation from scispaCy.

YAML file:

linker:
  candidate_generation:
    ngram:
      k: 100

Run xmen index my_config.yaml --ngram or xmen index my_config.yaml --all to create the index.

To use the linker at runtime, pass the index folder as an argument:

from xmen.linkers import TFIDFNGramLinker

ngram_linker = TFIDFNGramLinker(index_base_path=<path to index>, k=100)
predictions = ngram_linker.predict_batch(dataset)
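
The returned predictions mirror the input dataset, with ranked candidates attached to each mention. A quick way to inspect them (a sketch; assumes a bigbio_kb-style layout where candidates land in the entity's 'normalized' list):

# Show the top-ranked candidates for the first predicted mention
first_entity = predictions["train"][0]["entities"][0]
print(first_entity["normalized"][:5])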

Example usage: see notebooks/BioASQ_DisTEMIST.ipynb

SapBERT

Dense Retrieval based on SapBERT embeddings.

YAML file:

linker:
  candidate_generation:
    sapbert:
      embedding_model_name: cambridgeltl/SapBERT-UMLS-2020AB-all-lang-from-XLMR
      k: 1000

Run xmen index my_config.yaml --sapbert or xmen index my_config.yaml --all to create the FAISS index.

To use the linker at runtime, pass the embedding_model_name (usually the same as was used for creating the index) and the index folder as arguments. To make predictions on a batch of documents, you have to pass a batch size, as the SapBERT linker runs on the GPU by default:

from xmen.linkers import SapBERTLinker

sapbert_linker = SapBERTLinker(
    embedding_model_name = <name of the SapBERT model>,
    index_base_path = <path to index>,
    k = 1000
)
predictions = sapbert_linker.predict_batch(dataset, batch_size=128)

If you have loaded a YAML config as a dictionary, you may also just pass it as kwargs:

sapbert_linker = SapBERTLinker(**config)

Example usage: see notebooks/BioASQ_DisTEMIST.ipynb

Ensemble

Different candidate generators often work well for different kinds of entity mentions, and it can be helpful to combine their predictions.

In xMEN, this can be easily achieved with an EnsembleLinker:

from xmen.linkers import EnsembleLinker

ensemble_linker = EnsembleLinker()
ensemble_linker.add_linker('sapbert', sapbert_linker, k=10)
ensemble_linker.add_linker('ngram', ngram_linker, k=10)

You can call predict_batch on the EnsembleLinker just as with any other linker.
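
For example (a sketch; the positional arguments are the batch size and k, matching the reuse_preds call below):

# Same interface as the individual linkers
predictions_ensemble = ensemble_linker.predict_batch(dataset, 128, 100)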

Sometimes, you want to compare the ensemble performance to individual linkers and already have the candidate lists. To avoid recomputation, you can use the reuse_preds argument:

prediction = ensemble_linker.predict_batch(dataset, 128, 100, reuse_preds={'sapbert': predictions_sap, 'ngram': predictions_ngram})

Note: reuse_preds currently does not support Hugging Face DatasetDict objects, so you would have to call it on each split individually.
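
A simple workaround is to loop over the splits yourself (a sketch; assumes the cached predictions are keyed by the same split names):

# Per-split workaround for DatasetDict inputs
prediction = {
    split: ensemble_linker.predict_batch(
        dataset[split], 128, 100,
        reuse_preds={'sapbert': predictions_sap[split], 'ngram': predictions_ngram[split]},
    )
    for split in dataset
}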

Example usage: see notebooks/BioASQ_DisTEMIST.ipynb

🌀 Rerankers

Cross-Encoder Reranker

When labelled training data is available, a trainable reranker can substantially improve the ranking of candidate lists.

To train a cross-encoder, first create a dataset of mention / candidate pairs:

from xmen.reranking.cross_encoder import CrossEncoderReranker, CrossEncoderTrainingArgs
from xmen.knowledge_base import load_kb

# Load a KB from a pre-computed dictionary (jsonl) to obtain synonyms for concept encoding
kb = load_kb('path/to/my/dictionary.jsonl')

candidates = ... # predictions from a candidate generator (see above)
context_length = 128 # context window for mention encoding; more context increases the memory footprint

cross_enc_ds = CrossEncoderReranker.prepare_data(candidates, dataset, kb, context_length)

Then you can use this dataset to train a supervised reranking model:

from xmen.reranking.cross_encoder import CrossEncoderReranker, CrossEncoderTrainingArgs

cross_encoder_model = 'bert-base-multilingual-cased' # any BERT model, potentially language specific
n_epochs = 10 # number of epochs to train
output_dir = ... # Path to temp dir for writing model checkpoints

train_args = CrossEncoderTrainingArgs(cross_encoder_model, n_epochs)

rr = CrossEncoderReranker()
rr.fit(cross_enc_ds['train'].dataset, cross_enc_ds['validation'].dataset, output_dir=output_dir, training_args=train_args)

prediction = rr.rerank_batch(candidates['test'], cross_enc_ds['test'])

Example usage: see notebooks/BioASQ_DisTEMIST.ipynb

Rule-based Reranker

TODO

💡 Pre- and Post-Processing

We support various optional components for transforming input data and result sets.

📊 Evaluation

xMEN provides implementations of common entity linking metrics (e.g., a wrapper for neleval).
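
A hypothetical sketch of how an evaluation call might look (the entry point and signature here are assumptions; see the notebook below for the actual API):

# Assumed entry point; check the example notebook for the real interface
from xmen import evaluate

metrics = evaluate(dataset["test"], predictions["test"])
print(metrics)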

Example usage: see notebooks/BioASQ_DisTEMIST.ipynb

📈 Benchmark Results

TODO
