Download NLP4BIA benchmarks and load datasets in their format

NLP4BIA Library

This repository provides a Python library for loading, processing, and utilizing biomedical datasets curated by the NLP4BIA research group at the Barcelona Supercomputing Center (BSC). The datasets are specifically designed for natural language processing (NLP) tasks in the biomedical domain.


Available Dataset Loaders

The library currently provides loaders for the following datasets, each of which is part of a public benchmark:

1. Distemist

  • Description: A dataset for disease mention recognition and normalization in Spanish medical texts.
  • Zenodo Repository: Distemist Zenodo

2. Meddoplace

  • Description: A dataset for place name recognition in Spanish medical texts.
  • Zenodo Repository: Meddoplace Zenodo

3. Medprocner

  • Description: A dataset for procedure name recognition in Spanish medical texts.
  • Zenodo Repository: Medprocner Zenodo

4. Symptemist

  • Description: A dataset for symptom mention recognition in Spanish medical texts.
  • Zenodo Repository: Symptemist Zenodo

Installation

pip install nlp4bia

Quick Start Guide

Example Usage

Here's how to use one of the dataset loaders, such as DistemistLoader:

from nlp4bia.datasets.benchmark.distemist import DistemistLoader

# Initialize loader
distemist_loader = DistemistLoader(lang="es", download_if_missing=True)

# Load and preprocess data
dis_df = distemist_loader.df
print(dis_df.head())

Dataset folders are automatically downloaded and extracted to the ~/.nlp4bia directory.
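
To see what has already been downloaded, the cache directory can be inspected with the standard library. The following is a minimal sketch that assumes only the ~/.nlp4bia location stated above. The helper name list_cached_datasets is hypothetical and is not part of nlp4bia, and the folder names inside the directory are not documented here.

```python
from pathlib import Path


def list_cached_datasets(root: Path = Path.home() / ".nlp4bia") -> list[str]:
    """Return the names of the sub-folders found under the cache directory.

    Hypothetical helper for illustration; not provided by nlp4bia.
    """
    if not root.is_dir():
        # Nothing has been downloaded yet, or the directory was removed.
        return []
    # Only directories are reported; loose files are ignored.
    return sorted(p.name for p in root.iterdir() if p.is_dir())


if __name__ == "__main__":
    print(list_cached_datasets())
```

An empty list simply means that no loader has been initialized with download_if_missing=True on this machine yet.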

Column Descriptions

Dataset Columns

  • filenameid: Unique identifier combining filename and offset information.
  • mention_class: The class of the mention (e.g., disease, symptom, etc.).
  • span: Text span corresponding to the mention.
  • code: The normalized code for the mention (usually a SNOMED CT code).
  • sem_rel: Semantic relationships associated with the mention.
  • is_abbreviation: Indicates if the mention is an abbreviation.
  • is_composite: Indicates if the mention is a composite term.
  • needs_context: Indicates if the mention requires additional context.
  • extension_esp: Additional information specific to Spanish texts.
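
The columns above lend themselves to simple filtering and counting. The sketch below works on plain dictionaries, such as those obtained by exporting rows with pandas' DataFrame.to_dict("records"), assuming the loader's df attribute is a pandas DataFrame. All row values are invented for illustration (the codes are placeholders, not real SNOMED CT codes), and the helper names are hypothetical.

```python
from collections import Counter

# Invented rows that only mirror some of the column names listed above.
# The filenameid format and the code values are placeholders.
rows = [
    {"filenameid": "doc1#0#8", "mention_class": "disease", "span": "diabetes", "code": "C001"},
    {"filenameid": "doc1#20#25", "mention_class": "symptom", "span": "fiebre", "code": "C100"},
    {"filenameid": "doc2#3#11", "mention_class": "disease", "span": "diabetes", "code": "C001"},
    {"filenameid": "doc2#30#42", "mention_class": "disease", "span": "hipertensión", "code": "C002"},
]


def filter_by_class(rows, mention_class):
    """Keep only the mentions whose mention_class matches."""
    return [r for r in rows if r["mention_class"] == mention_class]


def most_common_codes(rows, n=3):
    """Count how often each normalized code occurs and return the top n."""
    return Counter(r["code"] for r in rows).most_common(n)


diseases = filter_by_class(rows, "disease")
print(most_common_codes(diseases, n=2))  # [('C001', 2), ('C002', 1)]
```

The actual values stored in mention_class and code depend on the dataset, so the filter string should be checked against the loaded data first.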

Gazetteer Columns

  • code: Normalized code for the term.
  • language: Language of the term.
  • term: The term itself.
  • semantic_tag: Semantic tag associated with the term.
  • mainterm: Indicates if the term is a primary term.

Contributing

Contributions to expand the dataset loaders or improve existing functionality are welcome! Please open an issue or submit a pull request.


License

This project is licensed under the MIT License. See the LICENSE file for details.


References

If you use this library or its datasets in your research, please cite the corresponding Zenodo repositories or related publications.


Instructions for Maintainers

  1. Update the version in nlp4bia/__init__.py and in pyproject.toml.
  2. Remove the dist folder (rm -rf dist).
  3. Build the package (python -m build).
  4. Check the package (twine check dist/*).
  5. Upload the package (twine upload dist/*).
  6. Install the package (pip install nlp4bia).
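
Step 1 requires the same version string in two files, which is easy to get wrong. The sketch below checks that both agree before building. It assumes that nlp4bia/__init__.py contains a line of the form __version__ = "..." and that pyproject.toml contains a version = "..." line; neither layout is confirmed by this README, and the function names are hypothetical. A regular expression is used for brevity, and a proper TOML parser would be more robust.

```python
import re


def init_version(init_text):
    """Extract the version from text like: __version__ = "2.1.1"."""
    m = re.search(r'^__version__\s*=\s*["\']([^"\']+)["\']', init_text, re.MULTILINE)
    return m.group(1) if m else None


def pyproject_version(pyproject_text):
    """Extract the version from a line like: version = "2.1.1"."""
    m = re.search(r'^version\s*=\s*["\']([^"\']+)["\']', pyproject_text, re.MULTILINE)
    return m.group(1) if m else None


def versions_match(init_text, pyproject_text):
    """True only if both versions were found and are identical."""
    a = init_version(init_text)
    b = pyproject_version(pyproject_text)
    return a is not None and a == b


# Example with inline text; in practice, read the two files with
# pathlib.Path(...).read_text() from the repository root.
print(versions_match('__version__ = "2.1.1"\n',
                     '[project]\nname = "nlp4bia"\nversion = "2.1.1"\n'))  # True
```

Running such a check before step 3 avoids building and uploading a release whose metadata and package version disagree.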

Download files

Source Distribution

nlp4bia-2.1.1.tar.gz (10.9 kB)

Uploaded Source

Built Distribution

nlp4bia-2.1.1-py3-none-any.whl (14.0 kB)

Uploaded Python 3

File details

Details for the file nlp4bia-2.1.1.tar.gz.

File metadata

  • Download URL: nlp4bia-2.1.1.tar.gz
  • Upload date:
  • Size: 10.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.10.12

File hashes

Hashes for nlp4bia-2.1.1.tar.gz
Algorithm Hash digest
SHA256 71607afa396442c968145b1befddee718d50053bbcdefec53d0104b9471b645f
MD5 78780e9bd1625f54a5e96785e2ab9c6a
BLAKE2b-256 2b381dfa2164b31404b8b01b271a7df4d7111a5fb04efec2c9c53dfb637a501f

File details

Details for the file nlp4bia-2.1.1-py3-none-any.whl.

File metadata

  • Download URL: nlp4bia-2.1.1-py3-none-any.whl
  • Upload date:
  • Size: 14.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.10.12

File hashes

Hashes for nlp4bia-2.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 a47d164361cb86ca869362f1920f4b5b32277d59120674c0e5e5bfddc62393c8
MD5 db7174b5b8a3f52d26c5526e58e565bb
BLAKE2b-256 047196edc82b87c6efc2c0178d1d7f6851b98ba7059b4eec5b97632b4b0480d7
