Skip to main content

Convenient interface that provides structured representations of the PET dataset hosted on Huggingface

Project description

A structured interface to interact with the PET-dataset hosted on huggingface.

Created by Patrizio Bellan.


Interacting with the data hosted on HuggingFace could be difficult since the data has a strict format. For example, getting the list of PET activities of a PET document requires a user to create a custom script that scans the dataset, extracts the words and their NER tags, and combines them. In addition, documents are stored in the different, non-always continuous samples in the HuggingFace dataset. Thus, conducting experiments with the PET Dataset could become a time-intensive operation. To alleviate such difficulties, we developed the PET dataset reader, a Python package that makes the interaction with the dataset easy. This package is composed of three different modules: TokenClassification module, RelationExtraction module, and ProcessInformation module.

TokenClassification Module

This module is composed of a Python class that allows users to extract structured information at the token levels. This class has specific methods to get all the PET elements of a specific category. We briefly introduce the principal methods implemented in this module.

  1. GetDocumentNames This method returns a list of the document names of the dataset.

  2. GetDocumentText This method returns the textual description of a document.

  3. GetTokens This method returns the text of a sentence in the form of a list of words of a given sentence ID.

  4. GetNerTagLabels This method provides the list of NER tags of a sentence, document, or entire dataset. Since the NER tags are stored as numbers in the dataset, we created specific methods to convert the number into a textual tag. For example, the method emph{GetPrefixAndLabel} returns the NER marker (B, I, or O) and the tag text (e.g., Activity) of a specific NER tag number.

  5. Statistics This method provides the statistics about the PET elements annotated.

In addition, specific methods were implemented to get the list of elements of a given category. For example, the method emph{GetActivity} returns all the PETactivity of a specific document or the entire dataset. Similarly, the method emph{GetActivityData} returns the PETactivitydata.

RelationExtraction Module

This module is composed of a Python class that allows users to extract structured information about the PET relations annotated in the dataset, e.g., PET Uses relation. This class has specific methods to get all the PET relations of a specific category. We briefly introduce the principal methods implemented in this module.

  1. GetNerLabels This method returns the NER tag IDs of a given document.

  2. GetRelations This method provides a list of PET relations of a given document.

  3. GetSentencesWithIdsAndNerTagLabels This method provides a user with a list of sentences composed of word tokens and the corresponding NER tags.

  4. Statistics This method provides the statistics about the PET relations.

ProcessInformation Module

This module contains the methods developed to obtain a structured representation of a document in the form of a graph, e.g., in the form of a Directly Follows Graph. The module has six main methods:

  1. GetRawActivityLabels returns the activity labels (PET activity + PET Acitity Data) as their are annotated in the text.

  2. GetDFG returns the directlyfollows graph representation of the annotations of a document. This graph is composed of behavioral elements only.

  3. GetKG_DFGActivityData provides the DFG representation of a document enhanced with the PETactivitydata elements.

  4. GetKG_DFGPerformsActors provides the DFG graph representation of a document enhanced with the actorperformer information.

  5. GetPerformsActors returns a graph representation of the DFG graph of a document enhanced with actorperformer relations.

  6. GetKnowledgeGraph returns a graph representation of a document representing the information about the behavioral elements, the activity data elements, and the actor performer elements.

How to Load the PET dataset

Token-classification task

from datasets import load_dataset

modelhub_dataset = load_dataset("patriziobellan/PET", name='token-classification')

Relations-extraction task

from datasets import load_dataset

modelhub_dataset = load_dataset("patriziobellan/PET", name='relations-extraction')

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

petdatasetreader-0.0.2.tar.gz (9.8 kB view hashes)

Uploaded Source

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page