SpaCy pipeline and models for Hebrew text
HebSpaCy
A custom spaCy pipeline for Hebrew text with a transformer-based multitask NER model that recognizes 16 entity types, including GPE, PER, LOC, and ORG.
Installation
To run the package you will need to install both the package and a model, preferably in a virtual environment:

```bash
# Create conda env (optional)
conda create --name hebspacy python=3.8
conda activate hebspacy

# Install hebspacy
pip install hebspacy

# Download and install the model (see available models below)
pip install </path/to/download>
```
Available Models
| Model | Description | Install URL |
|---|---|---|
| he_ner_news_trf | A full spaCy pipeline for Hebrew text, including a multitask NER model trained on the BMC and NEMO corpora. Read more here. | Download |
Getting started
```python
import spacy

nlp = spacy.load("he_ner_news_trf")

text = """מרגלית דהן
מספר זהות 11278904-5
2/12/2001
ביקור חוזר מ18.11.2001
במסגרת בירור פלפיטציות ואי סבילות למאמצים,מנורגיות קשות ע"ר שרירנים- ביצעה מעבדה שהדגימה:
המוגלובין 9, מיקרוציטי היפוכרומטי עם RDW 19,
פריטין 10, סטורציית טרנספרין 8%.
מבחינת עומס נגיפי HIV- undetectable ומקפידה על HAART
"""

doc = nlp(text)
for entity in doc.ents:
    print(f"{entity.text} \t {entity.label_}: {entity._.confidence_score:.4f} ({entity.start_char},{entity.end_char})")
```

```
>>> מרגלית דהן PERS: 0.9999 (0,10)
>>> 2/12/2001 DATE: 0.9897 (33,42)
>>> מ18.11.2001 DATE: 0.8282 (54,65)
>>> 8% PERCENT: 0.9932 (230,232)
```
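Since each recognized entity carries a confidence score, downstream code can threshold the results. Below is a minimal sketch using plain `(text, label, confidence)` tuples in place of spaCy `Span` objects; the `filter_entities` helper is hypothetical, not part of the hebspacy API:

```python
from typing import List, Tuple

Entity = Tuple[str, str, float]  # (text, label, confidence)

def filter_entities(ents: List[Entity], min_confidence: float = 0.9) -> List[Entity]:
    """Keep only entities whose confidence meets the threshold."""
    return [e for e in ents if e[2] >= min_confidence]

# With a loaded pipeline, this list would be built from doc.ents, e.g.
# ents = [(e.text, e.label_, e._.confidence_score) for e in doc.ents]
ents = [
    ("מרגלית דהן", "PERS", 0.9999),
    ("2/12/2001", "DATE", 0.9897),
    ("מ18.11.2001", "DATE", 0.8282),
    ("8%", "PERCENT", 0.9932),
]
print(filter_entities(ents))  # drops the 0.8282 DATE entity
```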
he_ner_news_trf
'he_ner_news_trf' is a multitask model built from AlephBert and two NER-focused heads, each trained on a different NER-annotated Hebrew corpus:
- NEMO corpus - annotations of the Hebrew Treebank (Haaretz newspaper) for the widely used OntoNotes entity categories: GPE (geo-political entity), PER (person), LOC (location), ORG (organization), FAC (facility), EVE (event), WOA (work-of-art), ANG (language), DUC (product).
- BMC corpus - annotations of articles from Israeli newspapers and websites (Haaretz, Maariv, Channel 7) for the common entity categories: PERS (person), LOC (location), ORG (organization), DATE (date), TIME (time), MONEY (money), PERCENT (percent), MISC__AFF (misc affiliation), MISC__ENT (misc entity), MISC_EVENT (misc event).
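For quick reference when inspecting `doc.ents`, the two label sets above can be captured as plain dictionaries. This is a convenience sketch based on the corpus descriptions, not part of the hebspacy API:

```python
# Entity label codes produced by the two NER heads, mapped to
# human-readable descriptions (taken from the corpus documentation above).
NEMO_LABELS = {
    "GPE": "geo-political entity",
    "PER": "person",
    "LOC": "location",
    "ORG": "organization",
    "FAC": "facility",
    "EVE": "event",
    "WOA": "work-of-art",
    "ANG": "language",
    "DUC": "product",
}

BMC_LABELS = {
    "PERS": "person",
    "LOC": "location",
    "ORG": "organization",
    "DATE": "date",
    "TIME": "time",
    "MONEY": "money",
    "PERCENT": "percent",
    "MISC__AFF": "misc affiliation",
    "MISC__ENT": "misc entity",
    "MISC_EVENT": "misc event",
}

def describe(label: str) -> str:
    """Return a readable description for an entity label from either head."""
    return NEMO_LABELS.get(label) or BMC_LABELS.get(label, "unknown")
```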
The model was developed and trained using the Hugging Face and PyTorch libraries, and was later integrated into a spaCy pipeline.
Model integration
The output model was split into three weight files: the transformer embeddings, the BMC head, and the NEMO head.
The components were each packaged in a separate pipe and integrated into the custom pipeline.
Furthermore, a custom NER head consolidation pipe was added at the end of the pipeline to resolve conflicting and overlapping predictions between the heads; it sets the Doc.ents property.
To access the entities recognized by each NER head, use the Doc._.<ner_head> property (e.g., doc._.nemo_ents and doc._.bmc_ents).
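The consolidation step can be illustrated with a standalone sketch. The resolution rule below (keep the higher-confidence span when predictions from the two heads overlap) is an assumption for illustration only; the actual pipe's logic may differ. Spans are modeled as simple `(start, end, label, score)` tuples rather than spaCy `Span` objects:

```python
from typing import List, Tuple

Span = Tuple[int, int, str, float]  # (start_char, end_char, label, confidence)

def consolidate(nemo_ents: List[Span], bmc_ents: List[Span]) -> List[Span]:
    """Merge two heads' predictions, dropping the lower-confidence span
    whenever two candidate spans overlap (illustrative rule only)."""
    # Consider all candidates, strongest first.
    candidates = sorted(nemo_ents + bmc_ents, key=lambda s: s[3], reverse=True)
    kept: List[Span] = []
    for start, end, label, score in candidates:
        overlaps = any(start < k_end and k_start < end for k_start, k_end, _, _ in kept)
        if not overlaps:
            kept.append((start, end, label, score))
    return sorted(kept)  # restore document order

# Example: the two heads disagree on the person span; the
# higher-confidence BMC prediction wins.
nemo = [(0, 10, "PER", 0.91)]
bmc = [(0, 7, "PERS", 0.97), (33, 42, "DATE", 0.99)]
print(consolidate(nemo, bmc))  # → [(0, 7, 'PERS', 0.97), (33, 42, 'DATE', 0.99)]
```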
Contribution
You are welcome to contribute to the hebspacy project and introduce new features and models.
Kindly follow the pipeline codebase instructions and the model training and packaging guidelines.
HebSpaCy is an open-source project developed by 8400 The Health Network.