✖️MEN

xMEN is an extensible toolkit for Cross-lingual (x) Medical Entity Normalization. Through its compatibility with the BigBIO (BigScience Biomedical) framework, it can be used out-of-the box to run experiments with many open biomedical datasets. It can also be easily integrated with existing Named Entity Recognition (NER) pipelines.

Installation

xMEN is available through PyPI:

pip install xmen

Development

We use Poetry for building, testing and dependency management (see pyproject.toml).

🚀 Getting Started

A minimal pipeline highlighting the main components of xMEN can be found in notebooks/00_Getting_Started.ipynb.

📂 Data Loading

Usually, BigBIO-compatible datasets can just be loaded from the Hugging Face Hub:

from datasets import load_dataset
dataset = load_dataset("bigbio/distemist", "distemist_linking_bigbio_kb")
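The loaded dataset follows the BigBIO knowledge-base (bigbio_kb) schema. A quick way to inspect it (a minimal sketch, assuming the standard bigbio_kb fields):

# Print the gold entity mentions of the first training document
doc = dataset["train"][0]
for entity in doc["entities"]:
    print(entity["text"], entity["offsets"], entity["normalized"])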

Integration with NER Tools

To use xMEN with existing NER pipelines, you can also create a dataset at runtime.

spaCy

from xmen.data import from_spacy
docs = ... #  list of spaCy docs with entity spans
dataset = from_spacy(docs)
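For example, a minimal sketch that builds a doc with a manually assigned entity span (the label and offsets are illustrative):

import spacy
from xmen.data import from_spacy

nlp = spacy.blank("en")
doc = nlp("Patient presents with hypertension and type 2 diabetes.")
# Mark "hypertension" (characters 22-34) as an entity span
doc.ents = [doc.char_span(22, 34, label="DISEASE")]

dataset = from_spacy([doc])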

SpanMarker

from span_marker import SpanMarkerModel

sentences = ... # list of sentences
model = SpanMarkerModel.from_pretrained(...)
preds = model.predict(sentences)

from xmen.data import from_spans
dataset = from_spans(preds, sentences)

🔧 Configuration and CLI

xMEN provides a convenient command line interface for preparing entity linking pipelines: it creates target dictionaries and pre-computes indices for linking against the concepts they contain.

Run xmen help to get an overview of the available commands.

Configuration is done through .yaml files. For examples, see the /examples/conf folder.

📕 Creating Dictionaries

Run xmen dict to create dictionaries to link against. Although the most common use case is to create subsets of the UMLS, it also supports passing custom parser scripts for non-UMLS dictionaries.

Note: Creating UMLS subsets requires a local installation of the UMLS Metathesaurus (not just MRCONSO.RRF). In the examples, we assume that the environment variable $UMLS_HOME points to the installation path. You can either set this variable or replace the path with your local installation path.
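For example (the path is a placeholder):

export UMLS_HOME=/path/to/umls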

UMLS Subsets

Example configuration for MedMentions:

name: medmentions

dict:
  umls:
    lang: 
      - en
    meta_path: ${oc.env:UMLS_HOME}/2017AA/META
    version: 2017AA
    semantic_types:
      - T005
      - T007
      - T017
      - T022
      - T031
      - T033
      - T037
      - T038
      - T058
      - T062
      - T074
      - T082
      - T091
      - T092
      - T097
      - T098
      - T103
      - T168
      - T170
      - T201
      - T204
    sabs:
      - CPT
      - FMA
      - GO
      - HGNC
      - HPO
      - ICD10
      - ICD10CM
      - ICD9CM
      - MDR
      - MSH
      - MTH
      - NCBI
      - NCI
      - NDDF
      - NDFRT
      - OMIM
      - RXNORM
      - SNOMEDCT_US

Running xmen dict examples/conf/medmentions.yaml creates a .jsonl file from the UMLS subset described above.
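Each line of the output file describes one concept and its synonyms. A sketch of what a single entry might look like, assuming a scispaCy-style KB format (the exact field names are an assumption here):

{"concept_id": "C0020538", "canonical_name": "Hypertensive disease", "aliases": ["Hypertension", "High blood pressure"], "types": ["T047"], "definition": null}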

Using Custom Dictionaries

Parsing scripts for custom dictionaries can be provided with the --code option (examples can be found in the dicts folder).

Example configuration for DisTEMIST:

name: distemist

dict:
  custom:
    lang: 
      - es
    distemist_path: local_files/dictionary_distemist.tsv

Running xmen dict examples/conf/distemist.yaml --code examples/dicts/distemist.py creates a .jsonl file from the custom DisTEMIST gazetteer (which you can download from Zenodo and put into any folder, e.g., local_files).
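Conceptually, a parser script turns the source file into concept records that xMEN can serialize. A schematic of that step in plain Python (the column names of the DisTEMIST TSV are assumptions, and this is not the exact parser interface xMEN expects):

import pandas as pd

# Assumed columns: a concept "code" and a synonym "term" per row
df = pd.read_csv("local_files/dictionary_distemist.tsv", sep="\t")

# Group synonymous terms under their concept code
concepts = {code: sorted(set(group["term"])) for code, group in df.groupby("code")}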

🔎 Candidate Generation

The xmen index command is used to compute term indices from a dictionary created through the dict command. If an index already exists, you will be prompted to overwrite the existing file (or pass --overwrite).

xMEN provides implementations of several neural and non-neural candidate generators:

TF-IDF Weighted Character N-grams

Based on the implementation from scispaCy.

Run xmen index my_config.yaml --ngram or xmen index my_config.yaml --all to create the index.

To use the linker at runtime, pass the index folder as an argument:

from xmen.linkers import TFIDFNGramLinker

ngram_linker = TFIDFNGramLinker(index_base_path="/path/to/my/index/ngram", k=100)
predictions = ngram_linker.predict_batch(dataset)
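predict_batch returns the dataset with candidate concepts attached to the entities. A minimal sketch for inspecting the result, assuming candidates land in each entity's normalized field:

# Top candidates for the first entity of the first training document
doc = predictions["train"][0]
for candidate in doc["entities"][0]["normalized"][:5]:
    print(candidate)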

SapBERT

Dense Retrieval based on SapBERT embeddings.

YAML file (optional, if you want to configure another Transformer model):

linker:
  candidate_generation:
    sapbert:
      model_name: cambridgeltl/SapBERT-UMLS-2020AB-all-lang-from-XLMR

Run xmen index my_config.yaml --sapbert or xmen index my_config.yaml --all to create the FAISS index.

To use the linker at runtime, pass the index folder as an argument. To make predictions on a batch of documents, you have to pass a batch size, as the SapBERT linker runs on the GPU by default:

from xmen.linkers import SapBERTLinker

sapbert_linker = SapBERTLinker(
    index_base_path = "/path/to/my/index/sapbert",
    k = 1000
)
predictions = sapbert_linker.predict_batch(dataset, batch_size=128)

If you have loaded a YAML config as a dictionary-like object, you can also pass it as kwargs:

sapbert_linker = SapBERTLinker(**config)

By default, SapBERT assumes a CUDA device is available. To disable CUDA, pass cuda=False to the constructor.
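For example:

# Run SapBERT on the CPU by disabling CUDA
sapbert_linker = SapBERTLinker(
    index_base_path="/path/to/my/index/sapbert",
    k=1000,
    cuda=False,
)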

Ensemble

Different candidate generators often work well for different kinds of entity mentions, and it can be helpful to combine their predictions.

In xMEN, this can be easily achieved with an EnsembleLinker:

from xmen.linkers import EnsembleLinker

ensemble_linker = EnsembleLinker()
ensemble_linker.add_linker('sapbert', sapbert_linker, k=10)
ensemble_linker.add_linker('ngram', ngram_linker, k=10)

or (as a shortcut for the combination of TFIDFNGramLinker and SapBERTLinker):

from xmen.linkers import default_ensemble

ensemble_linker = default_ensemble("/path/to/my/index/")

You can call predict_batch on the EnsembleLinker just as with any other linker.

Sometimes you want to compare ensemble performance to that of individual linkers for which you already have candidate lists. To avoid recomputation, pass them through the reuse_preds argument:

prediction = ensemble_linker.predict_batch(dataset, 128, 100, reuse_preds={'sapbert': predictions_sap, 'ngram': predictions_ngram})

🌀 Entity Rankers

Cross-encoder Re-ranker

When labelled training data is available, a trainable re-ranker can substantially improve the ranking of candidate lists.

To train a cross-encoder model, first create a dataset of mention / candidate pairs:

from xmen.reranking.cross_encoder import CrossEncoderReranker, CrossEncoderTrainingArgs
from xmen import load_kb

# Load a KB from a pre-computed dictionary (jsonl) to obtain synonyms for concept encoding
kb = load_kb('path/to/my/dictionary.jsonl')

# Obtain prediction from candidate generator (see above)
candidates = linker.predict_batch(dataset)

ce_dataset = CrossEncoderReranker.prepare_data(candidates, dataset, kb)

Then you can use this dataset to train a supervised reranking model:

# Number of epochs to train
n_epochs = 10

# Any BERT model, potentially language-specific
cross_encoder_model = 'bert-base-multilingual-cased'

args = CrossEncoderTrainingArgs(n_epochs, cross_encoder_model)

rr = CrossEncoderReranker()

# Fit the model
rr.fit(args, ce_dataset['train'].dataset, ce_dataset['validation'].dataset)

# Predict on test set
prediction = rr.rerank_batch(candidates['test'], ce_dataset['test'])

Pre-trained Cross-encoders

We provide pre-trained models, based on automatically translated versions of MedMentions (see notebooks/01_Translation.ipynb).

Instead of fitting the Cross-encoder model, you can just load a pre-trained model, e.g., for French:

rr = CrossEncoderReranker.load('phlobo/xmen-fr-ce-medmentions', device=0)

The pre-trained models are available on the Hugging Face Hub: https://huggingface.co/models?library=xmen

💡 Pre- and Post-processing

We support various optional components for transforming input data and result sets in xmen.data.
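As an illustration of the kind of transformation such components perform, here is a schematic post-processing step in plain Python (not the xMEN API) that drops low-scoring candidates, assuming each candidate carries a score field:

def apply_threshold(entities, threshold=0.5):
    # Keep only candidate concepts scoring above the threshold
    for entity in entities:
        entity["normalized"] = [c for c in entity["normalized"] if c.get("score", 0.0) > threshold]
    return entities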

📊 Evaluation

xMEN provides implementations of common entity linking metrics (e.g., a wrapper for neleval) and utilities for error analysis.

from xmen.evaluation import evaluate, error_analysis

# Runs the evaluation
eval_results = evaluate(ground_truth, predictions)

# Performs error analysis
error_dataframe = error_analysis(ground_truth, predictions)

Citation

If you use xMEN in your work, please cite the following paper:

Florian Borchert, Ignacio Llorca, Roland Roller, Bert Arnrich, and Matthieu-P Schapranow. xMEN: A Modular Toolkit for Cross-Lingual Medical Entity Normalization. arXiv preprint arXiv:2310.11275 (2023). http://arxiv.org/abs/2310.11275.

BibTeX:

@article{borchert2023xmen,
      title={{xMEN}: A Modular Toolkit for Cross-Lingual Medical Entity Normalization},
      author={Florian Borchert and Ignacio Llorca and Roland Roller and Bert Arnrich and Matthieu-P. Schapranow},
      year={2023},
      url={https://arxiv.org/abs/2310.11275},
      journal={arXiv preprint arXiv:2310.11275}
}
