A toolkit for event extraction.

A comprehensive, unified and modular event extraction toolkit.

Table of Contents
Overview
- Highlights
Installation
Easy Start
Train your Own Model with OmniEvent
Supported Datasets & Models & Contests
- Datasets
- Models
- Contests
Experiments

Overview

OmniEvent is a powerful open-source toolkit for event extraction, including event detection and event argument extraction. We comprehensively cover various paradigms and provide fair and unified evaluations on widely-used English and Chinese datasets. Modular implementations make OmniEvent highly extensible.

Highlights

Comprehensive Capability
- Supports end-to-end Event Extraction as well as its two subtasks individually: Event Detection and Event Argument Extraction.
- Covers various paradigms: Token Classification, Sequence Labeling, MRC (QA) and Seq2Seq.
- Implements Transformer-based (BERT, T5, etc.) and classical (DMCNN, CRF, etc.) models.
- Both Chinese and English are supported for all event extraction sub-tasks, paradigms and models.
Unified Benchmark & Evaluation
- Various datasets are processed into a unified format.
- Predictions of different paradigms are all converted into a unified candidate set for fair evaluations.
- Four evaluation modes (gold, loose, default, strict) cover the different evaluation settings used in previous work.
Modular Implementation
- All models are decomposed into four modules:
  - Input Engineering: Prepare inputs and support various input engineering methods like prompting.
  - Backbone: Encode text into hidden states.
  - Aggregation: Fuse hidden states (e.g., select [CLS], pooling, GCN) into the final event representation.
  - Output Head: Map the event representation to the final outputs, such as Linear, CRF, MRC head, etc.
- You can combine and reimplement different modules to design and implement your own new model.
Big Model Training & Inference
- Efficient training and inference of big event extraction models are supported with BMTrain.
Easy to Use & Highly Extensible
- Open datasets can be downloaded and processed with a single command.
- Fully compatible with 🤗 Transformers and its Trainer.
- Users can easily reproduce existing models and build customized models with OmniEvent.
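
The four-module decomposition listed under Modular Implementation can be pictured as a chain of small, swappable callables. The sketch below is a toy illustration in plain Python, not OmniEvent's code: every name in it (`mark_trigger`, `toy_backbone`, `max_pool`, `linear_head`, `build_model`) is hypothetical, and the two-dimensional "hidden states" merely stand in for what a real encoder would produce.

```python
from typing import Callable, List

Vector = List[float]

def mark_trigger(tokens: List[str], start: int, end: int) -> List[str]:
    """Input engineering: wrap the candidate trigger span in marker tokens."""
    return tokens[:start] + ["<event>"] + tokens[start:end] + ["</event>"] + tokens[end:]

def toy_backbone(tokens: List[str]) -> List[Vector]:
    """Backbone stand-in: one 2-d state per token (token length, is-marker flag)."""
    return [[float(len(t)), 1.0 if t.startswith("<") else 0.0] for t in tokens]

def max_pool(hidden: List[Vector]) -> Vector:
    """Aggregation: element-wise max over all token states."""
    return [max(column) for column in zip(*hidden)]

def linear_head(weights: List[Vector], labels: List[str]) -> Callable[[Vector], str]:
    """Output head: score each label with a dot product and return the best one."""
    def head(representation: Vector) -> str:
        scores = [sum(w * x for w, x in zip(row, representation)) for row in weights]
        return labels[scores.index(max(scores))]
    return head

def build_model(input_fn, backbone, aggregate, head):
    """Compose the four modules; each one can be replaced independently."""
    def model(tokens: List[str], start: int, end: int) -> str:
        return head(aggregate(backbone(input_fn(tokens, start, end))))
    return model

# Arbitrary toy weights; a trained head would learn these.
model = build_model(mark_trigger, toy_backbone, max_pool,
                    linear_head([[1.0, 0.0], [0.0, 1.0]], ["attack", "NA"]))
print(model(["troops", "launched", "an", "assault"], 3, 4))
```

Swapping `max_pool` for another aggregation function, or `linear_head` for a different head, leaves the other three modules untouched, which is the point of the decomposition.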

Installation

With pip

This repository is tested on Python 3.9+ and PyTorch 1.12.1+. OmniEvent can be installed with pip as follows:

pip install OmniEvent

From source

To install from a local copy of the source, run the following from the repository root:

pip install .

If you want to edit the repository, install it in editable mode instead:

pip install -e .
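
After installing, it can be useful to confirm that the environment meets the stated requirements. The following is a minimal sketch using only the standard library; the helper names `installed_version` and `python_is_supported` are made up for this example, and the distribution names passed to it are assumed to match what pip reports.

```python
import sys
from importlib import metadata
from typing import Optional, Tuple

def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

def python_is_supported(minimum: Tuple[int, int] = (3, 9)) -> bool:
    """Check the running interpreter against a minimum (major, minor) version."""
    return sys.version_info >= minimum

print("Python OK:", python_is_supported())
for name in ("OmniEvent", "torch"):
    print(name, installed_version(name) or "not installed")
```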

Easy Start

OmniEvent provides several off-the-shelf models for the users. Examples are shown below.

Make sure you have installed OmniEvent as instructed above. Note that it may take a few minutes to download the checkpoint the first time you run a model.

>>> from OmniEvent.infer import infer

>>> # Event Extraction (EE) Task
>>> text = "2022年北京市举办了冬奥会"  # "Beijing hosted the Winter Olympics in 2022"
>>> results = infer(text=text, task="EE")
>>> print(results[0]["events"])
[
    {
        "type": "组织行为开幕", "trigger": "举办", "offset": [8, 10],
        "arguments": [
            {   "mention": "2022年", "offset": [0, 5], "role": "时间"},
            {   "mention": "北京市", "offset": [5, 8], "role": "地点"},
            {   "mention": "冬奥会", "offset": [11, 14], "role": "活动名称"}
        ]
    }
]

>>> text = "U.S. and British troops were moving on the strategic southern port city of Basra \
Saturday after a massive aerial assault pounded Baghdad at dawn"

>>> # Event Detection (ED) Task
>>> results = infer(text=text, task="ED")
>>> print(results[0]["events"])
[
    { "type": "attack", "trigger": "assault", "offset": [113, 120]},
    { "type": "injure", "trigger": "pounded", "offset": [121, 128]}
]

>>> # Event Argument Extraction (EAE) Task
>>> results = infer(text=text, triggers=[("assault", 113, 120), ("pounded", 121, 128)], task="EAE")
>>> print(results[0]["events"])
[
    {
        "type": "attack", "trigger": "assault", "offset": [113, 120],
        "arguments": [
            {   "mention": "U.S.", "offset": [0, 4], "role": "attacker"},
            {   "mention": "British", "offset": [9, 16], "role": "attacker"},
            {   "mention": "Saturday", "offset": [81, 89], "role": "time"}
        ]
    },
    {
        "type": "injure", "trigger": "pounded", "offset": [121, 128],
        "arguments": [
            {   "mention": "U.S.", "offset": [0, 4], "role": "attacker"},
            {   "mention": "Saturday", "offset": [81, 89], "role": "time"},
            {   "mention": "British", "offset": [9, 16], "role": "attacker"}
        ]
    }
]
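
In the English examples above, each offset appears to be a pair of character indices with an exclusive end, so that `text[start:end]` reproduces the trigger or mention. That reading is inferred from the printed output, not from documented behavior. Under that assumption, the sketch below shows how Event Detection output could be turned into the `triggers` argument used for EAE, and how offsets could be sanity-checked. The helpers `to_triggers` and `offset_mismatches` are hypothetical and not part of OmniEvent.

```python
from typing import Dict, List, Tuple

text = ("U.S. and British troops were moving on the strategic southern port city of Basra "
        "Saturday after a massive aerial assault pounded Baghdad at dawn")

# Event Detection output copied from the example above.
ed_events = [
    {"type": "attack", "trigger": "assault", "offset": [113, 120]},
    {"type": "injure", "trigger": "pounded", "offset": [121, 128]},
]

def to_triggers(events: List[Dict]) -> List[Tuple[str, int, int]]:
    """Turn ED events into (trigger, start, end) tuples."""
    return [(e["trigger"], e["offset"][0], e["offset"][1]) for e in events]

def offset_mismatches(text: str, events: List[Dict]) -> List[str]:
    """List every trigger or argument whose offset does not slice back to its text."""
    bad = []
    for event in events:
        spans = [(event["trigger"], event["offset"])]
        spans += [(a["mention"], a["offset"]) for a in event.get("arguments", [])]
        for surface, (start, end) in spans:
            if text[start:end] != surface:
                bad.append(surface)
    return bad

print(to_triggers(ed_events))
print(offset_mismatches(text, ed_events))
```

An empty list from `offset_mismatches` means every span lines up with the input text.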

Train your Own Model with OmniEvent

OmniEvent can help users easily train and evaluate their customized models on specific datasets.

We show a step-by-step example of using OmniEvent to train and evaluate an Event Detection model on the ACE-EN dataset in the Seq2Seq paradigm. More examples are available in the examples directory.

Step 1: Process the dataset into the unified format

We provide standard data processing scripts for several commonly used datasets. Check out the details in scripts/data_processing.

dataset=ace2005-en  # the dataset name
cd scripts/data_processing/$dataset
bash run.sh

Step 2: Set up the customized configurations

We keep track of the configurations of dataset, model and training parameters via a single *.yaml file. See ./configs for details.

>>> from OmniEvent.arguments import DataArguments, ModelArguments, TrainingArguments, ArgumentParser
>>> from OmniEvent.input_engineering.seq2seq_processor import type_start, type_end

>>> parser = ArgumentParser((ModelArguments, DataArguments, TrainingArguments))
>>> model_args, data_args, training_args = parser.parse_yaml_file(yaml_file="config/all-datasets/ed/s2s/ace-en.yaml")

>>> training_args.output_dir = 'output/ACE2005-EN/ED/seq2seq/t5-base/'
>>> data_args.markers = ["<event>", "</event>", type_start, type_end]
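
To make the "one YAML file, three argument objects" idea concrete, here is a small sketch of how a flat configuration mapping might be split across several dataclasses by field name. It is an illustration only and makes no claim about how `ArgumentParser.parse_yaml_file` is implemented. The classes `ModelArgs`, `DataArgs`, `TrainArgs` and the function `split_config` are hypothetical, and the dictionary stands in for the parsed contents of a YAML file with placeholder values.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Dict, List, Tuple

@dataclass
class ModelArgs:
    model_type: str = ""
    model_name_or_path: str = ""

@dataclass
class DataArgs:
    train_file: str = ""
    markers: List[str] = field(default_factory=list)

@dataclass
class TrainArgs:
    output_dir: str = ""

def split_config(config: Dict[str, Any], classes: Tuple[type, ...]) -> Tuple[Any, ...]:
    """Give each dataclass the keys that match its fields; reject unknown keys."""
    used = set()
    parsed = []
    for cls in classes:
        names = {f.name for f in fields(cls)}
        kwargs = {k: v for k, v in config.items() if k in names}
        used.update(kwargs)
        parsed.append(cls(**kwargs))
    unknown = set(config) - used
    if unknown:
        raise ValueError(f"Unknown config keys: {sorted(unknown)}")
    return tuple(parsed)

# Placeholder values standing in for the contents of a YAML file.
config = {"model_type": "t5", "model_name_or_path": "t5-base",
          "train_file": "train.jsonl", "output_dir": "output/"}
model_args, data_args, training_args = split_config(config, (ModelArgs, DataArgs, TrainArgs))
training_args.output_dir = "output/ACE2005-EN/ED/seq2seq/t5-base/"  # override after parsing
print(model_args, data_args, training_args, sep="\n")
```

Rejecting unknown keys catches typos in a configuration file early, before any training starts.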

Step 3: Initialize the model and tokenizer

OmniEvent supports various backbones. The users can specify the model and tokenizer in the config file and initialize them as follows.

>>> from OmniEvent.backbone.backbone import get_backbone
>>> from OmniEvent.model.model import get_model

>>> backbone, tokenizer, config = get_backbone(model_type=model_args.model_type,
                                               model_name_or_path=model_args.model_name_or_path,
                                               tokenizer_name=model_args.model_name_or_path,
                                               markers=data_args.markers,
                                               new_tokens=data_args.markers)
>>> model = get_model(model_args, backbone)
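
The `markers` passed as `new_tokens` are special strings such as `<event>`. Presumably they have to be registered in the tokenizer's vocabulary, with the embedding table grown to match. The toy sketch below shows that bookkeeping in plain Python. It does not describe what `get_backbone` does internally, and `add_markers` and `resize_embeddings` are hypothetical helpers.

```python
from typing import Dict, List

def add_markers(vocab: Dict[str, int], markers: List[str]) -> int:
    """Add unseen markers to the vocabulary in place; return how many were added."""
    added = 0
    for marker in markers:
        if marker not in vocab:
            vocab[marker] = len(vocab)
            added += 1
    return added

def resize_embeddings(table: List[List[float]], new_size: int) -> None:
    """Append zero-initialised rows until the table has `new_size` rows."""
    dim = len(table[0])
    while len(table) < new_size:
        table.append([0.0] * dim)

vocab = {"<pad>": 0, "the": 1, "assault": 2}
embeddings = [[0.0, 0.0], [0.1, 0.2], [0.3, 0.4]]

# The duplicate "<event>" is ignored, so only two tokens are added.
added = add_markers(vocab, ["<event>", "</event>", "<event>"])
resize_embeddings(embeddings, len(vocab))
print(added, vocab["</event>"], len(embeddings))
```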

Step 4: Initialize the dataset and evaluation metric

OmniEvent prepares the DataProcessor and the corresponding evaluation metrics for different tasks and paradigms.

Note that the metrics here are paradigm-dependent and are not used for the final unified evaluation.

>>> from OmniEvent.input_engineering.seq2seq_processor import EDSeq2SeqProcessor
>>> from OmniEvent.evaluation.metric import compute_seq_F1

>>> train_dataset = EDSeq2SeqProcessor(data_args, tokenizer, data_args.train_file)
>>> eval_dataset = EDSeq2SeqProcessor(data_args, tokenizer, data_args.validation_file)
>>> metric_fn = compute_seq_F1
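
As a rough picture of what a paradigm-dependent metric for generated output might compute, the sketch below scores predicted (event type, trigger) pairs against gold pairs with a micro-averaged F1 on a 0 to 100 scale. It is an assumption-laden illustration and not the implementation of `compute_seq_F1`. The function name `micro_f1` and the pair representation are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

Event = Tuple[str, str]  # (event type, trigger word)

def micro_f1(preds: List[List[Event]], golds: List[List[Event]]) -> float:
    """Micro-averaged F1 (0-100) over per-sentence multisets of events."""
    n_correct = n_pred = n_gold = 0
    for pred, gold in zip(preds, golds):
        pred_counts, gold_counts = Counter(pred), Counter(gold)
        # Multiset intersection counts each gold event at most once.
        n_correct += sum((pred_counts & gold_counts).values())
        n_pred += len(pred)
        n_gold += len(gold)
    if n_correct == 0:
        return 0.0
    precision = n_correct / n_pred
    recall = n_correct / n_gold
    return 100 * 2 * precision * recall / (precision + recall)

golds = [[("attack", "assault"), ("injure", "pounded")], []]
preds = [[("attack", "assault")], [("attack", "fire")]]
print(round(micro_f1(preds, golds), 2))
```

Here one of two predictions is correct and one of two gold events is found, so precision and recall are both 0.5.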

Step 5: Define Trainer and train

OmniEvent adopts Trainer from 🤗 Transformers for training and evaluation.

>>> from OmniEvent.trainer_seq2seq import Seq2SeqTrainer

>>> trainer = Seq2SeqTrainer(
        args=training_args,
        model=model,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        compute_metrics=metric_fn,
        data_collator=train_dataset.collate_fn,
        tokenizer=tokenizer,
    )
>>> trainer.train()

Step 6: Unified Evaluation

Since the metrics in Step 4 depend on the paradigm, it is not fair to directly compare the performance of models in different paradigms.

OmniEvent evaluates models of different paradigms in a unified manner, where the predictions of different models are converted to predictions on the same candidate sets and then evaluated.

>>> from OmniEvent.evaluation.utils import predict, get_pred_s2s
>>> from OmniEvent.evaluation.convert_format import get_ace2005_trigger_detection_s2s

>>> logits, labels, metrics, test_dataset = predict(trainer=trainer, tokenizer=tokenizer, data_class=EDSeq2SeqProcessor,
                                                    data_args=data_args, data_file=data_args.test_file,
                                                    training_args=training_args)
>>> # paradigm-dependent metrics
>>> print("{} test performance before converting: {}".format(test_dataset.dataset_name, metrics["test_micro_f1"]))
ACE2005-EN test performance before converting: 66.4215686224377

>>> preds = get_pred_s2s(logits, tokenizer)
>>> # convert to the unified prediction and evaluate
>>> pred_labels = get_ace2005_trigger_detection_s2s(preds, labels, data_args.test_file, data_args, None)
ACE2005-EN test performance after converting: 67.41016109045849
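
The idea of converting predictions onto a shared candidate set can be sketched as follows. Here every word of a sentence is treated as a candidate trigger, generated (type, trigger) pairs are projected onto those candidates, and the resulting label sequence is scored against gold labels. This is a simplified illustration under those assumptions. The actual converter may align predictions differently, and `to_candidate_labels` and `positive_f1` are hypothetical names.

```python
from typing import List, Tuple

def to_candidate_labels(words: List[str], generated: List[Tuple[str, str]],
                        na_label: str = "NA") -> List[str]:
    """Project generated (type, trigger) pairs onto one label per candidate word."""
    labels = [na_label] * len(words)
    for event_type, trigger in generated:
        for i, word in enumerate(words):
            if word == trigger and labels[i] == na_label:
                labels[i] = event_type  # first still-unlabelled match wins
                break
    return labels

def positive_f1(pred: List[str], gold: List[str], na_label: str = "NA") -> float:
    """Micro F1 (0-100) over candidates, ignoring the negative class."""
    tp = sum(p == g != na_label for p, g in zip(pred, gold))
    n_pred = sum(p != na_label for p in pred)
    n_gold = sum(g != na_label for g in gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 100 * 2 * precision * recall / (precision + recall)

words = "a massive aerial assault pounded Baghdad at dawn".split()
gold = ["NA", "NA", "NA", "attack", "injure", "NA", "NA", "NA"]
# The second generated trigger does not occur in the sentence.
generated = [("attack", "assault"), ("attack", "bombing")]

pred = to_candidate_labels(words, generated)
print(pred)
print(round(positive_f1(pred, gold), 2))
```

A generated trigger that matches no candidate is dropped during conversion, which is one reason a score can differ before and after converting.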

For datasets whose test set annotations are not public, such as MAVEN and LEVEN, OmniEvent provides scripts to generate submission files. See dump_result.py for details.

Supported Datasets & Models & Contests

Continually updated. Contributions are welcome!

Datasets

Language	Domain	Task	Dataset
English	General	ED	MAVEN
English	General	ED, EAE	ACE-EN
English	General	ED, EAE	ACE-DYGIE
English	General	ED, EAE	RichERE (KBP+ERE)
Chinese	Legal	ED	LEVEN
Chinese	General	ED, EAE	DuEE
Chinese	General	ED, EAE	ACE-ZH
Chinese	Financial	ED, EAE	FewFC

Models

Paradigm
- Token Classification (TC)
- Sequence Labeling (SL)
- Sequence to Sequence (Seq2Seq)
- Machine Reading Comprehension (MRC)
Backbone
- CNN / LSTM
- Transformers (BERT, T5, etc.)
Aggregation
- Select [CLS]
- Dynamic/Max Pooling
- Marker
- GCN
Head
- Linear / CRF / MRC heads

Contests

OmniEvent plans to support various event extraction contests. Currently, we support the following contests, and the list is continually updated!

Experiments

We implement and evaluate state-of-the-art methods on some popular benchmarks using OmniEvent.

The results of all Event Detection experiments are shown in the table below.

The full results can be accessed via the links below.

Language	Domain	Benchmark	Paradigm	Dev F1 (Paradigm-based)	Dev F1 (Unified)	Test F1 (Paradigm-based)	Test F1 (Unified)
English	General	MAVEN	TC	--	68.80	--	68.64
English	General	MAVEN	SL	66.75	67.90	--	68.64
English	General	MAVEN	S2S	61.23	61.86	--	61.86
English	General	ACE-EN	TC	--	80.47	--	74.13
English	General	ACE-EN	SL	77.72	79.44	74.86	75.63
English	General	ACE-EN	S2S	75.88	76.73	73.09	72.97
English	General	ACE-DYGIE	TC	--	73.61	--	68.63
English	General	ACE-DYGIE	SL	71.58	71.75	68.63	68.63
English	General	ACE-DYGIE	S2S	71.61	72.08	65.41	65.99
English	General	RichERE	TC	--	68.75	--	51.43
English	General	RichERE	SL	68.46	66.05	50.13	50.77
English	General	RichERE	S2S	63.21	62.74	50.07	51.35
Chinese	General	ACE-ZH	TC	--	79.76	--	75.77
Chinese	General	ACE-ZH	SL	75.41	75.88	72.23	75.93
Chinese	General	ACE-ZH	S2S	69.45	73.17	63.37	71.61
Chinese	General	DuEE	TC	--	92.20	--	--
Chinese	General	DuEE	SL	85.95	89.62	--	--
Chinese	General	DuEE	S2S	81.61	85.85	--	--
Chinese	Legal	LEVEN	TC	--	85.18	--	85.23
Chinese	Legal	LEVEN	SL	81.09	84.16	--	84.66
Chinese	Legal	LEVEN	S2S	78.14	81.29	--	81.41
Chinese	Financial	FewFC	TC	--	69.28	--	67.15
Chinese	Financial	FewFC	SL	71.13	63.75	68.99	62.31
Chinese	Financial	FewFC	S2S	69.89	74.46	69.16	71.33
