State-of-the-art Information Extraction in PyTorch
Project description
PyTorch-IE: State-of-the-art Information Extraction in PyTorch
🤯 What's this about?
This is an experimental framework that aims to combine the lessons learned from five years of information extraction research.
- Focus on the core task: The main goal is to develop information extraction methods not dataset loading and evaluation logic. We use external well-maintained libraries for non-core functionality. PyTorch-Lightning for training and logging, Huggingface datasets for dataset reading, and Huggingface evaluate for evaluation (coming soon).
- Sharing is caring: Being able to quickly and easily share models is key to promote your work and facilitate further research. All models developed in PyTorch-IE can be easily shared via the Huggingface model hub. This further allows to quickly build demos based on Huggingface spaces, gradio or streamlit.
- Unified document format: A unified document format allows for quick experimentation on any dataset or task.
- Beyond sentence level: Most information extraction frameworks assume text inputs at a sentence granularity. We do not make any assumption on the granularity but generally aim for document-level information extraction.
- Beyond unstructured text: Unstructured text is only one possible area for information extraction. We developed the framework to also support information extraction from semi-structured text (e.g. HTML), two-dimensional text (e.g. OCR'd images), and images.
- Character-level annotation and evaluation: Many information extraction frameworks annotate and evaluate on a token level. We believe that annotation and evaluation should be done on a character level as this also considers the suitability of the tokenizer for the task.
- Make no assumptions on the structure of models: The last years have seen many different and creative approaches to information extraction and a framework that imposes a structure on those will most certainly be to limiting. With PyTorch-iE you have full control over how a document is prepared for a model and how the model is structured. The logic is self-contained and thus can be easily shared and inspected by others. The only assumption we make is that the input is a document and the output are targets (training) or annotations (inference).
🚀️ Quickstart
$ pip install pytorch-ie
🔭 Demos
Task | Link (Huggingface Spaces) |
---|---|
Named Entity Recognition (Span-based) | LINK |
Joint Named Entity Recognition and Relation Classification | LINK |
📚 Datasets
We parse all datasets into a common format that can be loaded directly from the model hub via Huggingface datasets. The documents are cached in an arrow table and serialized / deserialized on the fly. Any changes or preprocessing applied to the documents will be cached as well.
import datasets
dataset = datasets.load_dataset("pie/conll2003")
print(dataset["train"][0])
# >>> CoNLL2003Document(text='EU rejects German call to boycott British lamb .', id='0', metadata={})
dataset["train"][0].entities
# >>> AnnotationList([LabeledSpan(start=0, end=2, label='ORG', score=1.0), LabeledSpan(start=11, end=17, label='MISC', score=1.0), LabeledSpan(start=34, end=41, label='MISC', score=1.0)])
entity = dataset["train"][0].entities[1]
print(f"[{entity.start}, {entity.end}] {entity}")
# >>> [11, 17] German
⚡️ Example
Note: Setting num_workers=0
in the pipeline is only necessary when running an example in an
interactive python session. The reason is that multiprocessing doesn't play well with the interactive python
interpreter, see here
for details.
Span-classification-based Named Entity Recognition
from dataclasses import dataclass
from pytorch_ie.annotations import LabeledSpan
from pytorch_ie.auto import AutoPipeline
from pytorch_ie.core import AnnotationList, annotation_field
from pytorch_ie.documents import TextDocument
@dataclass
class ExampleDocument(TextDocument):
entities: AnnotationList[LabeledSpan] = annotation_field(target="text")
document = ExampleDocument(
"“Making a super tasty alt-chicken wing is only half of it,” said Po Bronson, general partner at SOSV and managing director of IndieBio."
)
# see below for the long version
ner_pipeline = AutoPipeline.from_pretrained("pie/example-ner-spanclf-conll03", device=-1, num_workers=0)
ner_pipeline(document, predict_field="entities")
for entity in document.entities.predictions:
print(f"{entity} -> {entity.label}")
# Result:
# IndieBio -> ORG
# Po Bronson -> PER
# SOSV -> ORG
To create the same pipeline as above without AutoPipeline
:
from pytorch_ie.auto import AutoTaskModule, AutoModel
from pytorch_ie.pipeline import Pipeline
model_name_or_path = "pie/example-ner-spanclf-conll03"
ner_taskmodule = AutoTaskModule.from_pretrained(model_name_or_path)
ner_model = AutoModel.from_pretrained(model_name_or_path)
ner_pipeline = Pipeline(model=ner_model, taskmodule=ner_taskmodule, device=-1, num_workers=0)
Or, without Auto
classes at all:
from pytorch_ie.pipeline import Pipeline
from pytorch_ie.models import TransformerSpanClassificationModel
from pytorch_ie.taskmodules import TransformerSpanClassificationTaskModule
model_name_or_path = "pie/example-ner-spanclf-conll03"
ner_taskmodule = TransformerSpanClassificationTaskModule.from_pretrained(model_name_or_path)
ner_model = TransformerSpanClassificationModel.from_pretrained(model_name_or_path)
ner_pipeline = Pipeline(model=ner_model, taskmodule=ner_taskmodule, device=-1, num_workers=0)
⚡️️️️ More Examples
Text-classification-based Relation Extraction
from dataclasses import dataclass
from pytorch_ie.annotations import BinaryRelation, LabeledSpan
from pytorch_ie.auto import AutoPipeline
from pytorch_ie.core import AnnotationList, annotation_field
from pytorch_ie.documents import TextDocument
@dataclass
class ExampleDocument(TextDocument):
entities: AnnotationList[LabeledSpan] = annotation_field(target="text")
relations: AnnotationList[BinaryRelation] = annotation_field(target="entities")
document = ExampleDocument(
"“Making a super tasty alt-chicken wing is only half of it,” said Po Bronson, general partner at SOSV and managing director of IndieBio."
)
re_pipeline = AutoPipeline.from_pretrained("pie/example-re-textclf-tacred", device=-1, num_workers=0)
for start, end, label in [(65, 75, "PER"), (96, 100, "ORG"), (126, 134, "ORG")]:
document.entities.append(LabeledSpan(start=start, end=end, label=label))
re_pipeline(document, predict_field="relations", batch_size=2)
for relation in document.relations.predictions:
print(f"({relation.head} -> {relation.tail}) -> {relation.label}")
# Result:
# (Po Bronson -> SOSV) -> per:employee_of
# (Po Bronson -> IndieBio) -> per:employee_of
# (SOSV -> Po Bronson) -> org:top_members/employees
# (IndieBio -> Po Bronson) -> org:top_members/employees
✨📚✨ Read the full documentation
🔧 Development Setup
🏅 Acknowledgements
- This package is based on the sourcery-ai/python-best-practices-cookiecutter and cjolowicz/cookiecutter-hypermodern-python project templates.
📃 Citation
If you find the framework useful please consider citing it:
@misc{alt2022pytorchie,
author={Christoph Alt, Arne Binder},
title = {PyTorch-IE: State-of-the-art Information Extraction in PyTorch},
year = {2022},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/ChristophAlt/pytorch-ie}}
}
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for pytorch_ie-0.7.0-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 667f528f2c70f1e4e8f0fefa97ba29c530ec533b345d191fdb3a2c43116b4c27 |
|
MD5 | 480301ed59e880b27fc33b32e0369839 |
|
BLAKE2b-256 | 65acc903e7dc77faaace50c2069a4bd2a5a328ba75963f8fe99710b6e51b1f1c |