NLStruct
Natural language structuring library. It currently implements a nested NER model and a span classification model; other algorithms may follow.
If you find this library useful in your research, please consider citing:
@phdthesis{wajsburt:tel-03624928,
  TITLE = {{Extraction and normalization of simple and structured entities in medical documents}},
  AUTHOR = {Wajsb{\"u}rt, Perceval},
  URL = {https://hal.archives-ouvertes.fr/tel-03624928},
  SCHOOL = {{Sorbonne Universit{\'e}}},
  YEAR = {2021},
  MONTH = Dec,
  KEYWORDS = {nlp ; structure ; extraction ; normalization ; clinical ; multilingual},
  TYPE = {Theses},
  PDF = {https://hal.archives-ouvertes.fr/tel-03624928/file/updated_phd_thesis_PW.pdf},
  HAL_ID = {tel-03624928},
  HAL_VERSION = {v1},
}
Features
- processes large documents seamlessly: it automatically handles tokenization and sentence splitting.
- do not train twice: an automatic caching mechanism detects when an experiment has already been run
- stop & resume with checkpoints
- easy import and export of data
- handles nested or overlapping entities
- multi-label classification of recognized entities
- strict or relaxed multi-label end-to-end retrieval metrics
- pretty logging with rich-logger
- heavily customizable, without config files (see train_ner.py)
- built on top of transformers and pytorch_lightning
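To illustrate the difference between strict and relaxed retrieval metrics mentioned above, here is a minimal sketch. It is not nlstruct's implementation: the `(begin, end, label)` span representation and the helper names `spans_match` and `precision_recall_f1` are assumptions made for this example, and "relaxed" is taken to mean "same label and any character overlap", which is one common definition among several.

```python
from typing import List, Tuple

# Hypothetical span representation: (begin, end, label), end exclusive.
Span = Tuple[int, int, str]


def spans_match(pred: Span, gold: Span, strict: bool = True) -> bool:
    """Return True if a predicted span counts as a hit for a gold span."""
    if pred[2] != gold[2]:
        return False  # labels must agree in both modes
    if strict:
        # Strict: boundaries must be identical.
        return pred[0] == gold[0] and pred[1] == gold[1]
    # Relaxed (assumed definition): the two spans share at least one character.
    return pred[0] < gold[1] and gold[0] < pred[1]


def precision_recall_f1(preds: List[Span], golds: List[Span], strict: bool = True):
    """Compute span-level precision, recall and F1 under the chosen matching mode."""
    matched_preds = sum(any(spans_match(p, g, strict) for g in golds) for p in preds)
    matched_golds = sum(any(spans_match(p, g, strict) for p in preds) for g in golds)
    precision = matched_preds / len(preds) if preds else 0.0
    recall = matched_golds / len(golds) if golds else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


golds = [(0, 5, "drug"), (10, 20, "dose")]
preds = [(0, 5, "drug"), (12, 20, "dose")]  # second span has a shifted start

print(precision_recall_f1(preds, golds, strict=True))
print(precision_recall_f1(preds, golds, strict=False))
```

In this toy example the second prediction overlaps its gold span without matching its boundaries, so it only counts under relaxed matching.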
Training models
How to train a NER model
from nlstruct.recipes import train_ner

model = train_ner(
    dataset={
        "train": "path to your train brat/standoff data",
        "val": 0.05,  # or path to your validation data
        # "test": # and optional path to your test data
    },
    finetune_bert=False,
    seed=42,
    bert_name="camembert/camembert-base",
    fasttext_file="",
    gpus=0,
    xp_name="my-xp",
)
model.save_pretrained("model.pt")
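The `"val": 0.05` entry above suggests that a fraction of the training documents can be held out for validation. How nlstruct performs this split is not described here, so the following is only a sketch of a seeded document-level split under that assumption; `split_train_val` is a hypothetical helper, not part of the library.

```python
import random
from typing import List, Sequence, Tuple


def split_train_val(
    docs: Sequence[str], val_fraction: float = 0.05, seed: int = 42
) -> Tuple[List[str], List[str]]:
    """Hold out a random fraction of documents, reproducibly for a given seed."""
    rng = random.Random(seed)  # local RNG, so the global random state is untouched
    indices = list(range(len(docs)))
    rng.shuffle(indices)
    n_val = round(len(docs) * val_fraction)
    val_idx = set(indices[:n_val])
    # Preserve the original document order inside each split.
    train = [d for i, d in enumerate(docs) if i not in val_idx]
    val = [d for i, d in enumerate(docs) if i in val_idx]
    return train, val


docs = [f"doc-{i}" for i in range(100)]
train_docs, val_docs = split_train_val(docs, val_fraction=0.05, seed=42)
print(len(train_docs), len(val_docs))
```

Fixing the seed keeps the held-out set identical across runs, which matters when comparing experiments.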
How to use it
from nlstruct import load_pretrained
from nlstruct.datasets import load_from_brat, export_to_brat

ner = load_pretrained("model.pt")
export_to_brat(ner.predict(load_from_brat("path/to/brat/test")), filename_prefix="path/to/exported_brat")
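For readers unfamiliar with the brat standoff format used for import and export, the sketch below parses the entity lines of a `.ann` file by hand. It is independent of nlstruct (whose `load_from_brat` should be preferred in practice); `parse_brat_entities` and the sample annotations are made up for illustration. It relies only on the standard brat convention that text-bound annotations look like `T1<TAB>label begin end<TAB>text`, with `;` separating the fragments of a discontinuous entity.

```python
from typing import Dict, List


def parse_brat_entities(ann_text: str) -> List[Dict]:
    """Extract text-bound (T) annotations from the content of a brat .ann file."""
    entities = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # skip relations (R), attributes (A), notes (#), blank lines
        ent_id, label_and_offsets, text = line.split("\t", 2)
        label, offsets = label_and_offsets.split(" ", 1)
        # Discontinuous entities list several "begin end" pairs separated by ';'.
        fragments = [tuple(int(x) for x in frag.split()) for frag in offsets.split(";")]
        entities.append({"id": ent_id, "label": label, "fragments": fragments, "text": text})
    return entities


# Made-up sample: T1 is nested inside T2, and A1 is an attribute line.
sample = (
    "T1\tdrug 0 9\tibuprofen\n"
    "T2\ttreatment 0 16\tibuprofen 400 mg\n"
    "A1\tNegated T1\n"
)

for entity in parse_brat_entities(sample):
    print(entity["id"], entity["label"], entity["fragments"], entity["text"])
```

The sample shows why nested entities matter: the `drug` span sits entirely inside the `treatment` span, a case a flat tagging scheme cannot represent.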
How to train a NER model followed by a span classification model
from nlstruct.recipes import train_qualified_ner

model = train_qualified_ner(
    dataset={
        "train": "path to your train brat/standoff data",
        "val": 0.05,  # or path to your validation data
        # "test": # and optional path to your test data
    },
    finetune_bert=False,
    seed=42,
    bert_name="camembert/camembert-base",
    fasttext_file="",
    gpus=0,
    xp_name="my-xp",
)
model.save_pretrained("model.pt")
Ensembling
Easily ensemble multiple models (same architecture, different seeds):
from nlstruct import load_pretrained
from nlstruct.datasets import load_from_brat, export_to_brat

model1 = load_pretrained("model-1.pt")
model2 = load_pretrained("model-2.pt")
model3 = load_pretrained("model-3.pt")
ensemble = model1.ensemble_with([model2, model3]).cuda()  # .cuda() requires a GPU
export_to_brat(ensemble.predict(load_from_brat("path/to/brat/test")), filename_prefix="path/to/exported_brat")
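How `ensemble_with` combines the models internally is not documented here. As a rough intuition only, a common way to ensemble models that share an architecture is to average their per-label scores before taking a decision. The sketch below shows that idea on plain dictionaries; `average_label_scores` and the scores are hypothetical and do not reflect nlstruct's internals.

```python
from typing import Dict, List


def average_label_scores(per_model_scores: List[Dict[str, float]]) -> Dict[str, float]:
    """Average the score of each label across models (missing labels count as 0)."""
    labels = set().union(*per_model_scores)
    n_models = len(per_model_scores)
    return {
        label: sum(scores.get(label, 0.0) for scores in per_model_scores) / n_models
        for label in sorted(labels)
    }


# Made-up scores for one candidate span, from three models trained with different seeds.
scores = [
    {"drug": 0.9, "dose": 0.1},
    {"drug": 0.6, "dose": 0.4},
    {"drug": 0.3, "dose": 0.7},
]
averaged = average_label_scores(scores)
best_label = max(averaged, key=averaged.get)
print(averaged, best_label)
```

Averaging tends to smooth out seed-specific errors: here one model prefers `dose`, but the averaged scores still favor `drug`.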
Advanced use
Should you need to further configure the training of a model, directly modify one of the recipes in the recipes folder.
Install
This project is still under development and subject to change.
pip install nlstruct==0.1.0