NLStruct
Natural language structuring library. Currently, it implements a nested NER model and a span classification model, but other algorithms might follow.
If you find this library useful in your research, please consider citing:
@phdthesis{wajsburt:tel-03624928,
  TITLE = {{Extraction and normalization of simple and structured entities in medical documents}},
  AUTHOR = {Wajsb{\"u}rt, Perceval},
  URL = {https://hal.archives-ouvertes.fr/tel-03624928},
  SCHOOL = {{Sorbonne Universit{\'e}}},
  YEAR = {2021},
  MONTH = Dec,
  KEYWORDS = {nlp ; structure ; extraction ; normalization ; clinical ; multilingual},
  TYPE = {Theses},
  PDF = {https://hal.archives-ouvertes.fr/tel-03624928/file/updated_phd_thesis_PW.pdf},
  HAL_ID = {tel-03624928},
  HAL_VERSION = {v1},
}
Features
- processes large documents seamlessly: it automatically handles tokenization and sentence splitting
- don't train twice: an automatic caching mechanism detects when an experiment has already been run
- stop & resume with checkpoints
- easy import and export of data (see the sketch after this list)
- handles nested or overlapping entities
- multi-label classification of recognized entities
- strict or relaxed multi-label end-to-end retrieval metrics
- pretty logging with rich-logger
- heavily customizable, without config files (see train_ner.py)
- built on top of transformers and pytorch_lightning
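For instance, a brat/standoff folder can be loaded into plain Python dicts before training or prediction. The following is a minimal sketch, assuming that load_from_brat (used in the examples below) yields one dict per document with doc_id and text fields; the path is a placeholder.

from nlstruct.datasets import load_from_brat

# Load a brat/standoff corpus into a list of document dicts.
docs = list(load_from_brat("path/to/brat/train"))
print(len(docs), "documents loaded")
print(docs[0]["doc_id"], docs[0]["text"][:80])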
Training models
How to train a NER model
from nlstruct.recipes import train_ner
model = train_ner(
    dataset={
        "train": "path to your train brat/standoff data",
        "val": 0.05,  # or path to your validation data
        # "test": "path to your test data",  # optional
    },
    finetune_bert=False,
    seed=42,
    bert_name="camembert/camembert-base",
    fasttext_file="",
    gpus=0,
    xp_name="my-xp",
    return_model=True,
)
model.save_pretrained("model.pt")
How to use it
from nlstruct import load_pretrained
from nlstruct.datasets import load_from_brat, export_to_brat
ner = load_pretrained("model.pt")
ner.eval()
ner.predict({"doc_id": "doc-0", "text": "Je lui prescris du lorazepam."})
# Out:
# {'doc_id': 'doc-0',
#  'text': 'Je lui prescris du lorazepam.',
#  'entities': [{'entity_id': 0,
#                'label': ['substance'],
#                'attributes': [],
#                'fragments': [{'begin': 19,
#                               'end': 28,
#                               'label': 'substance',
#                               'text': 'lorazepam'}],
#                'confidence': 0.9998705969553088}]}
export_to_brat(ner.predict(load_from_brat("path/to/brat/test")), filename_prefix="path/to/exported_brat")
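Predictions are plain Python dicts, so they can be post-processed directly before (or instead of) exporting to brat. The sketch below is illustrative and uses only the load_pretrained, load_from_brat and predict calls already shown; it assumes predict over a loaded brat folder yields one dict per document in the format printed above, and the paths are placeholders.

from nlstruct import load_pretrained
from nlstruct.datasets import load_from_brat

ner = load_pretrained("model.pt")
ner.eval()

# Flatten each predicted document into (doc_id, label, fragment text) rows.
rows = []
for doc in ner.predict(load_from_brat("path/to/brat/test")):
    for entity in doc["entities"]:
        for fragment in entity["fragments"]:
            rows.append((doc["doc_id"], fragment["label"], doc["text"][fragment["begin"]:fragment["end"]]))
print(rows[:5])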
How to train a NER model followed by a span classification model
from nlstruct.recipes import train_qualified_ner
model = train_qualified_ner(
    dataset={
        "train": "path to your train brat/standoff data",
        "val": 0.05,  # or path to your validation data
        # "test": "path to your test data",  # optional
    },
    finetune_bert=False,
    seed=42,
    bert_name="camembert/camembert-base",
    fasttext_file="",
    gpus=0,
    xp_name="my-xp",
    return_model=True,
)
model.save_pretrained("model.pt")
Ensembling
Easily ensemble multiple models (same architecture, different seeds):
from nlstruct import load_pretrained
from nlstruct.datasets import load_from_brat, export_to_brat

model1 = load_pretrained("model-1.pt")
model2 = load_pretrained("model-2.pt")
model3 = load_pretrained("model-3.pt")
ensemble = model1.ensemble_with([model2, model3]).cuda()
export_to_brat(ensemble.predict(load_from_brat("path/to/brat/test")), filename_prefix="path/to/exported_brat")
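The ensemble is used exactly like a single model, as the predict call above suggests. A minimal sketch, assuming predict also accepts one document dict for the ensemble as it does for a base model:

# Single-document prediction with the ensemble, mirroring the single-model example above.
print(ensemble.predict({"doc_id": "doc-0", "text": "Je lui prescris du lorazepam."}))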
Advanced use
Should you need to further configure the training of a model, please modify directly one of the recipes located in the recipes folder.
Install
This project is still under development and subject to change.
pip install nlstruct==0.2.0
Project details
Download files
Download the file for your platform.
Source Distribution
- nlstruct-0.2.0.tar.gz (90.7 kB)
Built Distribution
- nlstruct-0.2.0-py3-none-any.whl (104.1 kB)
File details
Details for the file nlstruct-0.2.0.tar.gz.
File metadata
- Download URL: nlstruct-0.2.0.tar.gz
- Upload date:
- Size: 90.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.9
File hashes
Algorithm | Hash digest
---|---
SHA256 | 013a220cf35fff434a8ac704bf46e2192b2be43039c1738cd18db5a8b917fd91
MD5 | 39e9f1972e07724abdacf13c40d1a75f
BLAKE2b-256 | 5c17046f40653c059e7b13da662eeb59d83df35ed08d2f36f39fb7fd2133e994
File details
Details for the file nlstruct-0.2.0-py3-none-any.whl.
File metadata
- Download URL: nlstruct-0.2.0-py3-none-any.whl
- Upload date:
- Size: 104.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.9
File hashes
Algorithm | Hash digest
---|---
SHA256 | 0f42d083f44bd964c9638d4e1d554dfc8c9af0d61a4091ddd0cf1685ede5b3b7
MD5 | bb7a9005d67e1029ff1e7c3b4b861dbe
BLAKE2b-256 | 841e2135e102f947800f8ccc58c1adc97dfee53f54c9a028a509df4e60698066
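To verify a download against the digests above, you can recompute the SHA256 locally with the Python standard library. A minimal sketch, assuming the source archive has been downloaded to the current directory:

import hashlib

# Recompute the SHA256 of the downloaded sdist and compare it to the published digest.
expected = "013a220cf35fff434a8ac704bf46e2192b2be43039c1738cd18db5a8b917fd91"
with open("nlstruct-0.2.0.tar.gz", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()
print("OK" if digest == expected else "MISMATCH")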