Download NLP4BIA benchmarks and load datasets in their format

NLP4BIA Library

This repository provides a Python library for loading, processing, and utilizing biomedical datasets curated by the NLP4BIA research group at the Barcelona Supercomputing Center (BSC). The datasets are specifically designed for natural language processing (NLP) tasks in the biomedical domain.

Available Dataset Loaders

The library currently supports the following dataset loaders, which are part of public benchmarks:

1. Distemist

Description: A dataset for disease mention recognition and normalization in Spanish medical texts.
Zenodo Repository: Distemist Zenodo

2. Meddoplace

Description: A dataset for place name recognition in Spanish medical texts.
Zenodo Repository: Meddoplace Zenodo

3. Medprocner

Description: A dataset for procedure name recognition in Spanish medical texts.
Zenodo Repository: Medprocner Zenodo

4. Symptemist

Description: A dataset for symptom mention recognition in Spanish medical texts.
Zenodo Repository: Symptemist Zenodo

Dataset Columns

filenameid: Unique identifier combining filename and offset information.
mention_class: The class of the mention (e.g., disease, symptom, etc.).
span: Text span corresponding to the mention.
code: The normalized code for the mention (usually a SNOMED CT code).
sem_rel: Semantic relationships associated with the mention.
is_abbreviation: Indicates if the mention is an abbreviation.
is_composite: Indicates if the mention is a composite term.
needs_context: Indicates if the mention requires additional context.
extension_esp: Additional information specific to Spanish texts.

Gazetteer Columns

code: Normalized code for the term.
language: Language of the term.
term: The term itself.
semantic_tag: Semantic tag associated with the term.
mainterm: Indicates if the term is a primary term.
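
As a rough illustration of how the gazetteer columns fit together, the sketch below builds a code-to-main-term lookup from plain row dictionaries. The rows are invented placeholders (not real gazetteer entries or real codes), main_term_by_code is a hypothetical helper that is not part of nlp4bia, and the sketch assumes mainterm is truthy for primary terms.

```python
from typing import Iterable, Mapping


def main_term_by_code(rows: Iterable[Mapping[str, object]]) -> dict[str, str]:
    """Map each code to its main term, falling back to the first term seen."""
    lookup: dict[str, str] = {}
    has_main: set[str] = set()
    for row in rows:
        code = str(row["code"])
        term = str(row["term"])
        if row["mainterm"] and code not in has_main:
            # A main term always wins over a previously stored synonym.
            lookup[code] = term
            has_main.add(code)
        elif code not in lookup:
            # No term stored yet for this code: keep this one as a fallback.
            lookup[code] = term
    return lookup


# Invented placeholder rows that only reuse the gazetteer column names.
rows = [
    {"code": "C001", "language": "es", "term": "synonym a", "semantic_tag": "procedure", "mainterm": 0},
    {"code": "C001", "language": "es", "term": "main a", "semantic_tag": "procedure", "mainterm": 1},
    {"code": "C002", "language": "es", "term": "only b", "semantic_tag": "procedure", "mainterm": 0},
]
print(main_term_by_code(rows))
```

With a loaded gazetteer DataFrame, equivalent row dictionaries could be obtained with the pandas call to_dict("records").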

Installation

pip install nlp4bia

Quick Start Guide

Example Usage

Dataset Loaders

Here's how to use one of the dataset loaders, such as DistemistLoader:

from nlp4bia.datasets.benchmark.distemist import DistemistLoader

# Initialize loader
distemist_loader = DistemistLoader(lang="es", download_if_missing=True)

# Load and preprocess data
dis_df = distemist_loader.df
print(dis_df.head())

Dataset folders are automatically downloaded and extracted to the ~/.nlp4bia directory.
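
To see what has already been downloaded, a small helper along these lines can list the folders in that directory. It is only a sketch: list_cached_datasets is a hypothetical name, not a function provided by nlp4bia, and nothing is assumed about the layout inside each folder.

```python
from pathlib import Path


def list_cached_datasets(cache_dir: Path = Path.home() / ".nlp4bia") -> list[str]:
    """Return the names of the sub-folders found in the cache directory."""
    if not cache_dir.is_dir():
        return []  # Nothing has been downloaded yet.
    return sorted(p.name for p in cache_dir.iterdir() if p.is_dir())


print(list_cached_datasets())
```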

Preprocessor

Deduplication

from nlp4bia.preprocessor.deduplicator import HashDeduplicator

# Define the list of files to deduplicate
ls_files = ["path/to/file1.txt", "path/to/file2.txt"]

# Instantiate the deduplicator; it processes the files using 8 processes.
hd = HashDeduplicator(ls_files, num_processes=8)

# Deduplicate the files and save the results to a CSV file
hd.get_deduplicated_files("path/to/deduplicated_contents.csv")
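
For readers unfamiliar with the technique, the general idea of hash-based deduplication can be sketched in a few lines. This is not the implementation of HashDeduplicator: the choice of SHA-256, the in-memory list of texts, and the name deduplicate_texts are all assumptions made for illustration.

```python
import hashlib


def deduplicate_texts(texts: list[str]) -> list[str]:
    """Keep the first occurrence of each distinct text, compared by SHA-256 digest."""
    seen: set[str] = set()
    unique: list[str] = []
    for text in texts:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


docs = ["informe clínico A", "informe clínico B", "informe clínico A"]
print(deduplicate_texts(docs))
```

Comparing fixed-size digests instead of full contents keeps the set of seen items small, which matters when the inputs are large files.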

Document Parser

PDFs

from nlp4bia.preprocessor.pdfparser import PDFParserMuPDF

# Define the path to the PDF file
pdf_path = "path/to/file.pdf"

# Instantiate the PDF parser
pdf_parser = PDFParserMuPDF(pdf_path)

# Extract the text from the PDF file
pdf_text = pdf_parser.extract_text()

Linking

Perform dense retrieval using the DenseRetriever class:

from sentence_transformers import SentenceTransformer
from nlp4bia.datasets.benchmark.medprocner import MedprocnerLoader, MedprocnerGazetteer
from nlp4bia.linking.retrievers import DenseRetriever

# Load the dataset and gazetteer
df_proc = MedprocnerLoader().df
gaz_proc = MedprocnerGazetteer().df
gaz_proc = gaz_proc.sort_values(by=["code", "mainterm"], 
                                ascending=[True, False]) # Make sure mainterms are first

# Load the model
model_name = "path/to/model"
st_model = SentenceTransformer(model_name)

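# Create the vector database (only the first 100 terms are encoded here)
vector_db = st_model.encode(
    gaz_proc["term"].tolist()[:100],
    show_progress_bar=True,
    convert_to_tensor=True,
    normalize_embeddings=True,
)

# Initialize the retriever and fetch the 10 closest gazetteer entries for a mention.
# The gazetteer slice must cover the same rows that were encoded above.
biencoder = DenseRetriever(vector_db=vector_db, model=st_model)
results = biencoder.retrieve_top_k(
    ["reparación de un desprendimiento de la retina"],
    gaz_proc.iloc[:100],
    k=10,
    input_format="text",
)
print(results)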
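
The internals of DenseRetriever are not described here, but the underlying idea of dense retrieval is to rank candidates by the similarity between their embeddings and the query embedding. The sketch below shows that ranking step with cosine similarity on toy vectors. The functions normalize and top_k are hypothetical helpers, and the 2-dimensional vectors stand in for real sentence embeddings.

```python
import math


def normalize(vec: list[float]) -> list[float]:
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def top_k(query: list[float], vectors: list[list[float]], k: int) -> list[tuple[int, float]]:
    """Return (index, cosine similarity) pairs for the k most similar vectors."""
    q = normalize(query)
    scores = [
        (i, sum(a * b for a, b in zip(q, normalize(v))))
        for i, v in enumerate(vectors)
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]


# Toy "embeddings"; real sentence embeddings have hundreds of dimensions.
candidates = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(top_k([2.0, 0.1], candidates, k=2))
```

Once vectors are normalized to unit length, cosine similarity reduces to a dot product, which is why the example above passes normalize_embeddings=True when encoding the gazetteer terms.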

Contributing

Contributions to expand the dataset loaders or improve existing functionality are welcome! Please open an issue or submit a pull request.

License

This project is licensed under the MIT License. See the LICENSE file for details.

References

If you use this library or its datasets in your research, please cite the corresponding Zenodo repositories or related publications.

Instructions for Maintainers

Update the version in nlp4bia/__init__.py and in pyproject.toml.
Remove the dist folder (rm -rf dist).
Build the package (python -m build).
Check the package (twine check dist/*).
Upload the package (twine upload dist/*).
Install the package (pip install nlp4bia).

Note: building and uploading require the build and twine packages:

pip install build twine
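
Because the version has to be updated in two places, a quick consistency check before building can catch a forgotten edit. The following is a minimal sketch, not part of the repository: read_versions and versions_match are hypothetical helpers, and the sketch assumes the version is written as __version__ = "x.y.z" in nlp4bia/__init__.py and as version = "x.y.z" in pyproject.toml.

```python
import re
from pathlib import Path


def read_versions(repo_root: Path) -> tuple[str, str]:
    """Return the versions declared in nlp4bia/__init__.py and pyproject.toml."""
    init_text = (repo_root / "nlp4bia" / "__init__.py").read_text(encoding="utf-8")
    toml_text = (repo_root / "pyproject.toml").read_text(encoding="utf-8")
    # Assumed formats: __version__ = "x.y.z" and version = "x.y.z" at line start.
    init_match = re.search(r'^__version__\s*=\s*["\']([^"\']+)["\']', init_text, re.MULTILINE)
    toml_match = re.search(r'^version\s*=\s*["\']([^"\']+)["\']', toml_text, re.MULTILINE)
    if init_match is None or toml_match is None:
        raise ValueError("could not find a version string in one of the files")
    return init_match.group(1), toml_match.group(1)


def versions_match(repo_root: Path) -> bool:
    """Check that both files declare the same version."""
    init_version, toml_version = read_versions(repo_root)
    return init_version == toml_version
```

Calling versions_match(Path(".")) from the repository root would then return False if only one of the two files was updated.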
