Biomedical Named Entity Recognition and Entity Linking for Enterprise use cases


Intended Audience
- Healthcare Industry
License
- OSI Approved :: Apache Software License
Natural Language
- English
Operating System
- OS Independent
Programming Language
- Python :: 3 :: Only
Topic
- Scientific/Engineering :: Medical Science Apps.
Typing
- Typed

Project description

Maturity level-1

Kazu - Biomedical NLP Framework

Welcome to Kazu (Korea AstraZeneca University), a Python biomedical NLP framework built in collaboration with Korea University and designed to handle production workloads.

This library aims to simplify the use of state-of-the-art NLP research in production systems. Some of the research it contains is our own, but most of it comes from the community, for which we are immensely grateful.

If you want to use Kazu, please cite our EMNLP 2022 publication! (citation link)

A live web demo (Swagger UI) is available at http://kazu.korea.ac.kr/

Please click here for the TinyBERN2 training and evaluation code

Quickstart

Install

Python version 3.9 or higher is required (tested with Python 3.9).

Either:

pip install kazu

or download the wheel from the release page and install locally.

If you intend to run mypy on your own codebase, consider installing Kazu with:

pip install kazu[typed]

This will pull in typing stubs for kazu's dependencies (such as types-requests for Requests) so that mypy has access to as much relevant typing information as possible when type checking your codebase. Otherwise (depending on mypy config), you may see errors when running mypy like:

.venv/lib/python3.10/site-packages/kazu/steps/linking/post_processing/xref_manager.py:10: error: Library stubs not installed for "requests" [import]
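
To check ahead of a mypy run whether stub distributions are present, here is a minimal sketch using the standard library's importlib.metadata. The helper name missing_stub_packages is hypothetical and not part of Kazu, and only types-requests is named above; any other package names you pass in are your own assumption about which stubs you need.

```python
from importlib import metadata


def missing_stub_packages(names):
    """Return the distribution names from `names` that are not installed."""
    missing = []
    for name in names:
        try:
            # metadata.version raises PackageNotFoundError for unknown distributions
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # types-requests is the stub package mentioned above; extend the list as needed
    print(missing_stub_packages(["types-requests"]))
```

An empty list suggests the listed stubs are installed in the current environment; anything returned is a candidate for the "Library stubs not installed" error.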

Getting the model pack

For most functionality, you will also need the Kazu model pack. Each model pack is tied to a specific release and can be found on the release page. Once downloaded, extract the archive and point the KAZU_MODEL_PACK environment variable at it:

export KAZU_MODEL_PACK=<path to the extracted archive>

Kazu is highly configurable (using Hydra), although it comes preconfigured with defaults appropriate for most literature processing use cases. To use these defaults and process a simple document:

import os
from pathlib import Path

import hydra
from hydra.utils import instantiate

from kazu.data.data import Document
from kazu.pipeline import Pipeline
from kazu.utils.constants import HYDRA_VERSION_BASE

# the hydra config is kept in the model pack
cdir = Path(os.environ["KAZU_MODEL_PACK"]).joinpath("conf")


@hydra.main(version_base=HYDRA_VERSION_BASE, config_path=str(cdir), config_name="config")
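def kazu_test(cfg):
    # build the pipeline described by the Hydra config
    pipeline: Pipeline = instantiate(cfg.Pipeline)
    text = "EGFR mutations are often implicated in lung cancer"
    doc = Document.create_simple_document(text)
    # the pipeline takes a list of documents; entities are read back from doc afterwards
    pipeline([doc])
    print(doc.get_entities())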


if __name__ == "__main__":
    kazu_test()
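
The script above reads KAZU_MODEL_PACK when the module is loaded, so an unset variable surfaces as a bare KeyError. Below is a minimal sketch of a friendlier check, assuming only what the script shows (the Hydra config lives in a conf directory inside the model pack). The helper name find_conf_dir is hypothetical and not part of Kazu.

```python
import os
from pathlib import Path


def find_conf_dir(environ=os.environ):
    """Return the model pack's conf directory, or raise a descriptive error."""
    pack = environ.get("KAZU_MODEL_PACK")
    if not pack:
        raise RuntimeError("KAZU_MODEL_PACK is not set; export it to the extracted model pack")
    conf_dir = Path(pack) / "conf"
    if not conf_dir.is_dir():
        raise RuntimeError(f"no conf directory found at {conf_dir}")
    return conf_dir
```

With this in place, the cdir assignment in the script could become cdir = find_conf_dir(), so a misconfigured environment fails with a message that names the problem.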

Documentation

Find our docs here

License

Licensed under Apache 2.0.

Kazu includes elements under compatible licenses (full licenses are in relevant files or as indicated):

- Some elements are a modification of code licensed under MIT by Explosion.AI - see the README here.
- The doc build process (conf.py's linkcode_resolve function) uses code modified from pandas, in turn modified from numpy. See PANDAS_LICENSE.txt and NUMPY_LICENSE.txt.
- Elements of the model distillation code are inspired by or modified from Huawei Noah's Ark Lab TinyBERT and DMIS-Lab's BioBERT. See the details in dataprocessor.py, models.py and tiny_transformer.py.
- PLSapbertModel is inspired by the code from sapbert, licensed under MIT. See the file for details, and see the SapBert section below regarding use of the model.
- GildaUtils in the string_normalizer.py file is modified from Gilda. See the file for full details including the full BSD 2-Clause license.
- The AbbreviationFinderStep uses KazuAbbreviationDetector, which is a modified version of SciSpacy's abbreviation finding algorithm, licensed under Apache 2.0 - see the files for full details.
- The JWTAuthenticationBackend Starlette Middleware in jwtauth.py is originally from starlette-jwt, licensed under BSD 3-Clause.
- The AddRequestIdMiddleware Starlette Middleware in req_id_header.py is modified from 'CustomHeaderMiddleware' in the Starlette Middleware docs. This is licensed under BSD 3-Clause along with the rest of Starlette.
- The kazu-jvm folder includes files like gradlew and gradlew.bat distributed by Gradle under Apache 2.0 - see the files for details.
- kazu/data/data.py contains AutoNameEnum, which is AutoName from the Python Enum Docs licensed under Zero-Clause BSD.

Dataset licences

Under Creative Commons Attribution-Share Alike 3.0 Unported Licence

ChEMBL

ChEMBL data is from http://www.ebi.ac.uk/chembl - the version of ChEMBL is ChEMBL_29

CLO

CLO data is from http://www.ebi.ac.uk/ols/ontologies/clo - downloaded 18th October 2021

UBERON

UBERON data is from http://www.ebi.ac.uk/ols/ontologies/uberon - downloaded 18th October 2021

Under Creative Commons Attribution 4.0 International Licence

MONDO

MONDO data is from http://www.ebi.ac.uk/ols/ontologies/mondo - downloaded 29th July 2022

CELLOSAURUS

CELLOSAURUS data is from https://www.cellosaurus.org/ - downloaded 8th November 2021

Gene Ontology

Gene Ontology data is from the version archived at https://zenodo.org/record/7186998#.Y2OcR-zP3iM

Other licenced datasets and models

OPEN TARGETS

Open Targets datasets are kindly provided by www.opentargets.org and are free for commercial use; see https://platform-docs.opentargets.org/licence

Ochoa, D. et al. (2021). Open Targets Platform: supporting systematic drug–target identification and prioritisation. Nucleic Acids Research. https://doi.org/10.1093/nar/gkaa1027

STANZA

The Stanza framework:

Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton and Christopher D. Manning. 2020. Stanza: A Python Natural Language Processing Toolkit for Many Human Languages. In Association for Computational Linguistics (ACL) System Demonstrations. 2020. https://arxiv.org/abs/2003.07082

Biomedical NLP models are derived from:

Yuhao Zhang, Yuhui Zhang, Peng Qi, Christopher D. Manning, Curtis P. Langlotz. Biomedical and Clinical English Model Packages in the Stanza Python NLP Library, Journal of the American Medical Informatics Association. 2021. https://doi.org/10.1093/jamia/ocab090

SCISPACY

Biomedical scispacy models are derived from:

Mark Neumann, Daniel King, Iz Beltagy, Waleed Ammar. ScispaCy: Fast and Robust Models for Biomedical Natural Language Processing. Proceedings of the 18th BioNLP Workshop and Shared Task, ACL 2019. https://www.aclweb.org/anthology/W19-5034

SAPBERT

Kazu uses a distilled form of SAPBERT, from:

Fangyu Liu, Ehsan Shareghi, Zaiqiao Meng, Marco Basaldella, Nigel Collier. Self-Alignment Pretraining for Biomedical Entity Representations. NAACL 2021. https://aclanthology.org/2021.naacl-main.334/

SETH

Kazu's SethStep uses Py4j to call the SETH mutation finder.

Thomas, P., Rocktäschel, T., Hakenberg, J., Mayer, L., and Leser, U. (2016). SETH detects and normalizes genetic variants in text. Bioinformatics. http://dx.doi.org/10.1093/bioinformatics/btw234

kazu 0.1.0
