SpaCy pipeline and models for Hebrew text

HebSpaCy

A custom spaCy pipeline for Hebrew text including a transformer-based multitask NER model that recognizes 16 entity types in Hebrew, including GPE, PER, LOC and ORG.


License: MIT

Installation

To use the package, install both hebspacy and a model, preferably in a virtual environment:

# Create conda env (optional)
conda create --name hebspacy python=3.8
conda activate hebspacy

# Install hebspacy
pip install hebspacy

# Download and install the model (see the available models below)
pip install </path/to/download>
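
After installing, a quick check that everything is importable can save a confusing error later. The snippet below is a minimal sketch using only the standard library. It assumes the model wheel installs a top-level package named he_ner_news_trf (the name later passed to spacy.load); the helper name is_installed is purely illustrative.

```python
from importlib.util import find_spec


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package with this name can be found."""
    return find_spec(package_name) is not None


# Assumption: the model is installed under the same name used with spacy.load().
for name in ("spacy", "hebspacy", "he_ner_news_trf"):
    status = "found" if is_installed(name) else "MISSING"
    print(f"{name}: {status}")
```

If any line reports MISSING, re-run the corresponding pip install step inside the active environment.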

Available Models

Model: he_ner_news_trf
Description: A full spaCy pipeline for Hebrew text including a multitask NER model trained against the BMC and NEMO corpora. Read more here.
Install URL: Download

Getting started

import spacy

nlp = spacy.load("he_ner_news_trf")
text = """מרגלית דהן
מספר זהות 11278904-5

2/12/2001
ביקור חוזר מ18.11.2001
במסגרת בירור פלפיטציות ואי סבילות למאמצים,מנורגיות קשות ע"ר שרירנים- ביצעה מעבדה שהדגימה:
המוגלובין 9, מיקרוציטי היפוכרומטי עם RDW 19,
פריטין 10, סטורציית טרנספרין 8%. 
מבחינת עומס נגיפי HIV- undetectable ומקפידה על HAART
"""

doc = nlp(text)
for entity in doc.ents:
    print(f"{entity.text} \t {entity.label_}: {entity._.confidence_score:.4f} ({entity.start_char},{entity.end_char})")

>>> מרגלית דהן	 PERS: 0.9999 (0,10)
>>> 2/12/2001 	 DATE: 0.9897 (33,42)
>>> מ18.11.2001 	 DATE: 0.8282 (54,65)
>>> 8% 	 PERCENT: 0.9932 (230,232)
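
Since every entity carries a confidence_score extension, low-confidence predictions can be filtered out before further processing. The helper below is a small sketch, not part of hebspacy: the function name and the 0.9 default are assumptions chosen for illustration.

```python
def confident_entities(ents, threshold=0.9):
    """Keep only entities whose confidence score is at least `threshold`.

    `ents` is any iterable of objects exposing `._.confidence_score`,
    such as `doc.ents` from the example above.
    """
    return [ent for ent in ents if ent._.confidence_score >= threshold]


# Usage with the pipeline loaded above:
# for ent in confident_entities(doc.ents):
#     print(ent.text, ent.label_)
```

With the scores printed above, a threshold of 0.9 would drop the second DATE entity (0.8282) and keep the other three.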

he_ner_news_trf

he_ner_news_trf is a multitask model built from AlephBert and two NER-focused heads, each trained against a different NER-annotated Hebrew corpus:

  1. NEMO corpus - annotations of the Hebrew Treebank (Haaretz newspaper) for the widely used OntoNotes entity categories: GPE (geo-political entity), PER (person), LOC (location), ORG (organization), FAC (facility), EVE (event), WOA (work-of-art), ANG (language), DUC (product).
  2. BMC corpus - annotations of articles from Israeli newspapers and websites (Haaretz newspaper, Maariv newspaper, Channel 7) for the common entity categories: PERS (person), LOC (location), ORG (organization), DATE (date), TIME (time), MONEY (money), PERCENT (percent), MISC__AFF (misc affiliation), MISC__ENT (misc entity), MISC_EVENT (misc event).
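
The two label sets above overlap only partly, which matters when combining the heads' output. The sketch below copies the labels from the two lists and shows which ones are shared. The ALIASES table is a hypothetical choice: treating PERS and PER as the same type is one way to arrive at 16 distinct entity types, but the project's own counting is not stated here.

```python
# Labels copied from the two corpus descriptions above.
NEMO_LABELS = {"GPE", "PER", "LOC", "ORG", "FAC", "EVE", "WOA", "ANG", "DUC"}
BMC_LABELS = {
    "PERS", "LOC", "ORG", "DATE", "TIME", "MONEY", "PERCENT",
    "MISC__AFF", "MISC__ENT", "MISC_EVENT",
}

# Hypothetical alias table: assume both heads' person labels mean the same thing.
ALIASES = {"PERS": "PER"}


def normalize(label: str) -> str:
    """Map a head-specific label to a shared name (identity if no alias)."""
    return ALIASES.get(label, label)


shared = NEMO_LABELS & BMC_LABELS
distinct = {normalize(label) for label in NEMO_LABELS | BMC_LABELS}

print(sorted(shared))   # ['LOC', 'ORG']
print(len(distinct))    # 16
```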

The model was developed and trained using the Hugging Face and PyTorch libraries, and was later integrated into a spaCy pipeline.

Model integration

The output model was split into three weight files: the transformer embeddings, the BMC head, and the NEMO head. Each component was packaged in a separate pipe and integrated into the custom pipeline. A custom NER head consolidation pipe, added last, resolves conflicts and overlaps between the two heads' predictions and sets the Doc.ents property.
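
To make the consolidation idea concrete, here is a minimal sketch of one possible strategy: rank candidate spans from both heads by confidence and greedily keep those that do not overlap an already kept span. This is an illustrative assumption, not the rule hebspacy actually implements; the Candidate class and the consolidate function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    start: int    # start character offset (inclusive)
    end: int      # end character offset (exclusive)
    label: str
    score: float
    head: str     # which NER head produced it, e.g. "nemo" or "bmc"


def overlaps(a: Candidate, b: Candidate) -> bool:
    """True if the two character ranges share at least one position."""
    return a.start < b.end and b.start < a.end


def consolidate(candidates):
    """Greedy merge: highest score first, ties broken by longer span."""
    ranked = sorted(candidates, key=lambda c: (-c.score, -(c.end - c.start)))
    kept = []
    for cand in ranked:
        if not any(overlaps(cand, k) for k in kept):
            kept.append(cand)
    # Return in document order, as Doc.ents would be.
    return sorted(kept, key=lambda c: c.start)
```

A greedy, score-first merge is easy to reason about, but other policies (preferring one head for certain labels, for instance) are equally plausible.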

To access the entities recognized by each NER head, use the Doc._.<ner_head> property (e.g., doc._.nemo_ents and doc._.bmc_ents).
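
For comparing what each head found on the same text, the per-head properties can be gathered into a single dictionary. The helper below is a sketch under the assumption that doc._.nemo_ents and doc._.bmc_ents each hold an iterable of span-like objects with text and label_ attributes; the function name is illustrative.

```python
def entities_by_head(doc, heads=("nemo_ents", "bmc_ents")):
    """Collect (text, label) pairs per NER head from the Doc extensions.

    Assumes each extension is an iterable of objects with `.text` and `.label_`.
    """
    return {
        head: [(span.text, span.label_) for span in getattr(doc._, head)]
        for head in heads
    }


# Usage with the pipeline from "Getting started":
# for head, ents in entities_by_head(nlp(text)).items():
#     print(head, ents)
```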


Contribution

You are welcome to contribute to the hebspacy project and introduce new features or models. Please follow the pipeline codebase instructions and the model training and packaging guidelines.


HebSpaCy is an open-source project developed by 8400 The Health Network.
