
An easy-to-use package for all your biomedical entity linking needs.


BioEL: A comprehensive package for training, evaluating, and benchmarking biomedical entity linking models.

Installation

conda create -n bioel python=3.9
conda activate bioel
pip install -e .  # run from the root of the cloned repository

Development Instructions

  1. Install as an editable package using pip as shown above.
  2. Add any new dependencies to setup.py.
  3. Add tests to tests/ directory.

Ontologies

Ontologies included in the package:

Resolving abbreviations

As a preprocessing step, we resolve abbreviations in the text using Ab3P, an abbreviation detector created for biomedical text. We ran abbreviation detection on the text of all documents in our benchmark, the results of which are stored in a large dictionary in data/abbreviations.json. In order to reproduce our abbreviation detection/resolution pipeline, please run the following:

from bioel.utils.solve_abbreviation.solve_abbreviation import create_abbrev
create_abbrev(output_dir, all_dataset)
# output_dir : path where abbreviations.json will be created
# all_dataset : datasets for which to compute the abbreviations
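To illustrate what this resolution step does, here is a minimal, self-contained sketch of expanding detected abbreviations in a string. The mapping and function below are invented for the example; BioEL's actual abbreviations.json layout and API may differ.

```python
import re

# Illustrative abbreviation expansion (not BioEL's implementation).
# We assume a per-document mapping of short form -> long form, as an
# Ab3P-style detector would produce; the example mapping is made up.
def expand_abbreviations(text, abbrev_map):
    """Replace each detected short form with its long form (whole words only)."""
    # Handle longer short forms first so no short form clobbers a longer one.
    for short, long_form in sorted(abbrev_map.items(), key=lambda kv: -len(kv[0])):
        text = re.sub(rf"\b{re.escape(short)}\b", long_form, text)
    return text

doc_abbrevs = {"AD": "Alzheimer's disease", "CNS": "central nervous system"}
print(expand_abbreviations("AD affects the CNS.", doc_abbrevs))
# prints: Alzheimer's disease affects the central nervous system.
```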

Example usage

# Import modules
from bioel.model import BioEL_Model
from bioel.evaluate import Evaluate

# load model
krissbert = BioEL_Model.load_krissbert(
    name="krissbert", params_file="path/to/params_krissbert.json",
)
# Look at data/params.json for more information about the parameters
krissbert.training() # train
krissbert.inference() # inference

abbreviations_path = "data/abbreviations.json"
dataset_names = ["ncbi_disease"]
model_names = ["krissbert"]
path_to_result = {
    "ncbi_disease": {
        "krissbert": "results/ncbi_disease.json"
    }
}
eval_strategies = ["basic"]

# Results
evaluator = Evaluate(dataset_names=dataset_names, 
                     model_names=model_names, 
                     path_to_result=path_to_result, 
                     abbreviations_path=abbreviations_path, 
                     eval_strategies=eval_strategies,
                     max_k=10,
                     )
evaluator.load_results()
evaluator.process_datasets()
evaluator.evaluate()
evaluator.plot_results()
evaluator.detailed_results()

These functions run the evaluation for all models and datasets. For error analysis with hit-index details, use the evaluator.error_analysis_dfs attribute. For detailed results on failure stage, accuracy per type, recall@k per type, MAP@k, and statistical significance (p-values), use evaluator.detailed_results_analysis.
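Since the paragraph above refers to recall@k and MAP@k, here is a generic sketch of how those metrics are typically computed over ranked candidate lists. This is an illustration only, not BioEL's evaluation code.

```python
# Generic recall@k and MAP@k over ranked candidate concept IDs
# (illustrative sketch, not BioEL's implementation).

def recall_at_k(ranked, gold, k):
    """1.0 if any gold concept appears in the top-k candidates, else 0.0."""
    return float(any(c in gold for c in ranked[:k]))

def average_precision_at_k(ranked, gold, k):
    """Average of the precision values at each rank where a gold concept is hit."""
    hits, score = 0, 0.0
    for i, c in enumerate(ranked[:k]):
        if c in gold:
            hits += 1
            score += hits / (i + 1)
    return score / min(len(gold), k) if gold else 0.0

def map_at_k(all_ranked, all_gold, k):
    """Mean of per-mention average precision at k."""
    aps = [average_precision_at_k(r, g, k) for r, g in zip(all_ranked, all_gold)]
    return sum(aps) / len(aps)
```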

Config files

Examples of config files for the different models are provided in the data/ directory.

Load the different ontologies

from bioel.ontology import BiomedicalOntology

##### ----------------- Load medic ----------------- #####

dataset_name = 'ncbi_disease'
medic_dict = {"name" : "medic",
            "filepath" : "path/to/medic"} # medic.tsv file

ontology = BiomedicalOntology.load_medic(**medic_dict)

##### ----------------- Load entrez ----------------- #####

dataset_name = "gnormplus" # or "nlm_gene"

entrez_dict = {"name" : "entrez",
             "filepath" : "path/to/entrez", # gene_info.tsv file
             "dataset" : f"{dataset_name}",}
ontology = BiomedicalOntology.load_entrez(**entrez_dict)

##### ----------------- Load MESH ----------------- #####

dataset_name = "nlmchem"
mesh_dict = {"name" : "mesh",
             "filepath" : "path/to/umls"}
ontology = BiomedicalOntology.load_mesh(**mesh_dict)

##### ----------------- Load UMLS (st21pv subset) ----------------- #####

dataset_name = "medmentions_st21pv"
umls_dict_st21pv = {
    "name": "umls",
    "filepath": "path/to/umls",
    "path_st21pv_cui": "data/umls_cuis_st21pv.json",
}
ontology = BiomedicalOntology.load_umls(**umls_dict_st21pv)

##### ----------------- Load UMLS (full) ----------------- #####

dataset_name = "medmentions_full"
umls_dict = {
    "name": "umls",
    "filepath": "path/to/umls",
}
ontology = BiomedicalOntology.load_umls(**umls_dict)

ArboEL

ArboEL operates in two stages: first, train the biencoder (load_arboel_biencoder); then, use the candidate results from the biencoder to train the crossencoder (load_arboel_crossencoder) and perform evaluation with the crossencoder.

BioBART/BioGenEL

BioBART and BioGenEL share the same entity linking module:

  • In order to fine-tune from BioBART, set the model_load_path parameter in the .json config file to GanjinZero/biobart-v2-large; this will load the pretrained weights from Hugging Face.

  • In order to fine-tune from BioGenEL's knowledge-base-guided pretrained weights, you must first download the pretrained weights from this link: https://drive.google.com/file/d/1TqvQRau1WPYE9hKfemKZr-9ptE-7USAH/view?usp=sharing and then set the model_load_path parameter in the .json config file to the path where you stored them.
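As a sketch, the BioBART option above corresponds to a config entry like the following. Only model_load_path is taken from the text; the other keys in the real config files under data/ are not shown here.

```json
{
  "model_load_path": "GanjinZero/biobart-v2-large"
}
```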
