Download NLP4BIA benchmarks and load datasets in their format

NLP4BIA Library

License: MIT

This repository provides a Python library for loading, processing, and utilizing biomedical datasets curated by the NLP4BIA research group at the Barcelona Supercomputing Center (BSC). The datasets are specifically designed for natural language processing (NLP) tasks in the biomedical domain.


Installation

pip install nlp4bia
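
To confirm that the installation succeeded, the installed version can be queried with Python's standard library. This is a minimal sketch; the helper name `installed_version` is made up for illustration and is not part of the library.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints a version string if the package is installed, otherwise None.
print(installed_version("nlp4bia"))
```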

Introduction

NLP4BIA is a Python package for working with curated biomedical NLP datasets in Spanish. Developed by the NLP4BIA research group at the Barcelona Supercomputing Center (BSC), it provides:

  • Dataset Loaders for public benchmarks like Distemist, Meddoplace, Medprocner, Symptemist.
  • Preprocessing Utilities such as deduplication, PDF parsing, and more.
  • Linking Tools to perform dense retrieval against medical gazetteers (e.g., SNOMED CT) using SentenceTransformers.

Whether you’re training new NLP models on Spanish clinical text, sanitizing raw medical documents, or performing terminology linking, NLP4BIA aims to streamline your workflow.

Available Dataset Loaders

The library currently provides loaders for the following public benchmark datasets:

1. Distemist

  • Description: A dataset for disease mention recognition and normalization in Spanish medical texts.
  • Zenodo Repository: Distemist Zenodo

2. Meddoplace

  • Description: A dataset for place name recognition in Spanish medical texts.
  • Zenodo Repository: Meddoplace Zenodo

3. Medprocner

  • Description: A dataset for procedure name recognition in Spanish medical texts.
  • Zenodo Repository: Medprocner Zenodo

4. Symptemist

  • Description: A dataset for symptom mention recognition in Spanish medical texts.
  • Zenodo Repository: Symptemist Zenodo

Dataset Columns

| Column Name | Type/Example | Description |
|---|---|---|
| filenameid | "12345_678" | Unique ID combining filename and character offsets. |
| mention_class | "ENFERMEDAD" | Class of the mention (disease, symptom, procedure, etc.). |
| span | "diabetes tipo 2" | Text span corresponding to the mention. |
| code | "44054006" | Normalized SNOMED CT code for the mention. |
| sem_rel | "EXACT" / "NARROW" / "COMPOSITE" | EXACT: the mention matches the associated term exactly. NARROW: the mention does not match exactly and is mapped to a parent term because its own concept is not in the ontology. COMPOSITE: the mention needs more than one code to be defined (e.g. 1243535+13452543). |
| is_abbreviation | True / False | Whether the mention is an abbreviation. |
| is_composite | True / False | Whether the mention is a composite term. |
| needs_context | True / False | Whether extra context is required to interpret the span. |
| extension_esp | "info adicional" | Extra fields specific to Spanish texts. |
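
Composite mentions carry several codes in a single `code` value. The sketch below splits such values into individual codes, assuming `+` is the separator, as in the example `1243535+13452543`. The helper `split_codes` and the sample rows are hypothetical and are not part of the library or of any real dataset.

```python
from __future__ import annotations


def split_codes(code: str) -> list[str]:
    """Split a possibly composite code such as "1243535+13452543" into its parts."""
    return [part.strip() for part in code.split("+") if part.strip()]


# Hypothetical rows using the columns described above (not real annotations).
rows = [
    {"span": "diabetes tipo 2", "code": "44054006", "sem_rel": "EXACT"},
    {"span": "mención compuesta", "code": "1243535+13452543", "sem_rel": "COMPOSITE"},
]

# Attach the list of individual codes to every row.
for row in rows:
    row["codes"] = split_codes(row["code"])

print([row["codes"] for row in rows])
```

Keeping the individual codes in a list makes it easier to look each one up in a gazetteer separately.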

Gazetteer Columns

| Column Name | Type/Example | Description |
|---|---|---|
| code | "44054006" | SNOMED CT code for the term. |
| language | "es" | Language of the term (e.g., "es", "en"). |
| term | "diabetes" | The term itself (string). |
| semantic_tag | "disorder" | Semantic tag associated with the term. |
| mainterm | True / False | Whether this is a primary (“preferred”) term or a synonym. |
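
Because a code can appear once per synonym, it is often useful to map each code to a single display term. The following sketch builds such a lookup from the `mainterm` flag. The function `preferred_terms` and the toy rows are assumptions made for illustration; they are not library code or real gazetteer entries.

```python
from __future__ import annotations

# Hypothetical gazetteer rows using the columns described above.
gazetteer = [
    {"code": "44054006", "language": "es", "term": "diabetes tipo 2", "mainterm": False},
    {"code": "44054006", "language": "es", "term": "diabetes mellitus tipo 2", "mainterm": True},
    {"code": "00000001", "language": "es", "term": "término de ejemplo", "mainterm": False},
]


def preferred_terms(rows: list[dict], language: str = "es") -> dict[str, str]:
    """Map each code to its main term, falling back to the first synonym seen."""
    lookup: dict[str, str] = {}
    for row in rows:
        if row["language"] != language:
            continue
        # A main term always wins; a synonym is only used if nothing is stored yet.
        if row["mainterm"] or row["code"] not in lookup:
            lookup[row["code"]] = row["term"]
    return lookup


print(preferred_terms(gazetteer))
```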

Quick Start Guide

Example Usage

Dataset Loaders

Here's how to use one of the dataset loaders, such as DistemistLoader:

from nlp4bia.datasets.benchmark.distemist import DistemistLoader

# Initialize loader
distemist_loader = DistemistLoader(lang="es", download_if_missing=True)

# Load and preprocess data
dis_df = distemist_loader.df
print(dis_df.head())

Dataset folders are automatically downloaded and extracted to the ~/.nlp4bia directory.
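
To see what has already been downloaded, the cache directory can be inspected with the standard library. This sketch only assumes the `~/.nlp4bia` location stated above; the helper `list_cached_datasets` is hypothetical, and the layout inside the directory is not documented here.

```python
from __future__ import annotations

from pathlib import Path


def list_cached_datasets(cache_dir: Path = Path.home() / ".nlp4bia") -> list[str]:
    """Return the sorted names of the entries inside the cache directory."""
    if not cache_dir.is_dir():
        # Nothing has been downloaded yet (or the directory was removed).
        return []
    return sorted(entry.name for entry in cache_dir.iterdir())


print(list_cached_datasets())
```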

Preprocessor

Deduplication

from nlp4bia.preprocessor.deduplicator import HashDeduplicator

# Define the list of files to deduplicate
ls_files = ["path/to/file1.txt", "path/to/file2.txt"]

# Instantiate the deduplicator, which processes the files using 8 worker processes.
hd = HashDeduplicator(ls_files, num_processes=8)

# Deduplicate the files and save the results to a CSV file
hd.get_deduplicated_files("path/to/deduplicated_contents.csv")
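
The internals of HashDeduplicator are not shown in this document. As a generic illustration of exact-duplicate detection by hashing, the sketch below groups documents by the SHA-256 digest of their content. It is an assumption that the library works along similar lines; the function `group_by_hash` and the sample texts are invented for this example.

```python
from __future__ import annotations

import hashlib


def group_by_hash(texts: dict[str, str]) -> dict[str, list[str]]:
    """Group document names by the SHA-256 digest of their content."""
    groups: dict[str, list[str]] = {}
    for name, content in texts.items():
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        groups.setdefault(digest, []).append(name)
    return groups


# Toy documents: a.txt and b.txt have identical content.
docs = {"a.txt": "dolor torácico", "b.txt": "dolor torácico", "c.txt": "fiebre"}

# Keep the first document of every group of identical contents.
unique = [names[0] for names in group_by_hash(docs).values()]
print(unique)
```

Hashing only catches byte-identical content; near-duplicates that differ in whitespace or casing would need normalization before hashing.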

Document Parser

PDFs

from nlp4bia.preprocessor.pdfparser import PDFParserMuPDF

# Define the path to the PDF file
pdf_path = "path/to/file.pdf"

# Instantiate the PDF parser
pdf_parser = PDFParserMuPDF(pdf_path)

# Extract the text from the PDF file
pdf_text = pdf_parser.extract_text()

Linking

Perform dense retrieval using the DenseRetriever class:

from sentence_transformers import SentenceTransformer
from nlp4bia.datasets.benchmark.medprocner import MedprocnerLoader, MedprocnerGazetteer
from nlp4bia.linking.retrievers import DenseRetriever

# Load the dataset and gazetteer
df_proc = MedprocnerLoader().df
gaz_proc = MedprocnerGazetteer().df
gaz_proc = gaz_proc.sort_values(by=["code", "mainterm"], 
                                ascending=[True, False]) # Make sure mainterms are first

# Load the model
model_name = "path/to/model"
st_model = SentenceTransformer(model_name)

# Create the vector database from the first 100 gazetteer terms
# (the same slice is passed to retrieve_top_k below)
vector_db = st_model.encode(gaz_proc["term"].tolist()[:100], 
                            show_progress_bar=True, 
                            convert_to_tensor=True, 
                            normalize_embeddings=True)

# Initialize the retriever
biencoder = DenseRetriever(vector_db=vector_db, model=st_model)
biencoder.retrieve_top_k(["reparación de un desprendimiento de la retina"], 
                          gaz_proc.iloc[:100], 
                          k=10, 
                          input_format="text")
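
Conceptually, dense retrieval over normalized embeddings reduces to ranking candidates by dot product, which equals cosine similarity for unit-length vectors. The sketch below shows that idea with toy two-dimensional vectors and the standard library only. It is not the implementation of DenseRetriever; the functions `normalize` and `top_k` are hypothetical names.

```python
from __future__ import annotations

import heapq
import math


def normalize(vec: list[float]) -> list[float]:
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def top_k(query: list[float], db: list[list[float]], k: int) -> list[tuple[int, float]]:
    """Return (index, score) pairs for the k database vectors most similar to the query."""
    scores = [(sum(q * d for q, d in zip(query, row)), i) for i, row in enumerate(db)]
    return [(i, score) for score, i in heapq.nlargest(k, scores)]


# Toy "vector database" and query; real embeddings would come from a model.
db = [normalize(v) for v in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0])]
query = normalize([0.9, 0.1])

for index, score in top_k(query, db, k=2):
    print(index, round(score, 3))
```

The returned indices can then be used to look up the matching rows (code and term) in the gazetteer, provided the gazetteer rows are in the same order as the encoded terms.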

Contributing

Contributions to expand the dataset loaders or improve existing functionality are welcome! Please open an issue or submit a pull request.


License

This project is licensed under the MIT License. See the LICENSE file for details.


References

If you use this library or its datasets in your research, please cite the corresponding Zenodo repositories or related publications.
