A convenient interface that provides structured representations of the PET dataset hosted on Hugging Face

Project description

A structured interface for interacting with the PET dataset hosted on Hugging Face.

Created by Patrizio Bellan.


Interacting with the data hosted on Hugging Face can be difficult because the data has a strict format. For example, getting the list of PET activities of a document requires a custom script that scans the dataset, extracts the words and their NER tags, and combines them. In addition, a document is stored across separate samples that are not always contiguous in the Hugging Face dataset. As a result, running experiments with the PET dataset can become time-consuming. To alleviate these difficulties, we developed the PET dataset reader, a Python package that makes interacting with the dataset easy. The package is composed of three modules: the TokenClassification module, the RelationExtraction module, and the ProcessInformation module.
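To make the kind of custom script mentioned above concrete, here is a minimal sketch that does not use the package itself. It assumes the NER tags have already been converted to BIO strings such as B-Activity and I-Activity; the function name extract_spans and the sample sentence are hypothetical and only serve as illustration.

```python
def extract_spans(tokens, tags, label):
    """Collect the token spans tagged B-<label> followed by I-<label> tags."""
    spans, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == f"B-{label}":
            # A new span starts; close the previous one if it is still open.
            if current:
                spans.append(current)
            current = [token]
        elif tag == f"I-{label}" and current:
            # The open span continues.
            current.append(token)
        else:
            # Any other tag ends the open span.
            if current:
                spans.append(current)
            current = []
    if current:
        spans.append(current)
    return [" ".join(span) for span in spans]


# Hypothetical sentence with hand-written BIO tags (not taken from the dataset).
tokens = ["The", "clerk", "checks", "the", "claim", "and", "sends", "it"]
tags = ["B-Actor", "I-Actor", "B-Activity", "B-Activity Data",
        "I-Activity Data", "O", "B-Activity", "B-Activity Data"]

activities = extract_spans(tokens, tags, "Activity")
activity_data = extract_spans(tokens, tags, "Activity Data")
```

The package is meant to spare users from writing and maintaining this kind of loop for every element category.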

TokenClassification Module

This module is composed of a Python class that allows users to extract structured information at the token level. The class has specific methods to get all the PET elements of a given category. We briefly introduce the principal methods implemented in this module.

  1. GetDocumentNames This method returns a list of the document names of the dataset.

  2. GetDocumentText This method returns the textual description of a document.

  3. GetTokens This method returns the text of the sentence with a given sentence ID as a list of words.

  4. GetNerTagLabels This method provides the list of NER tags of a sentence, a document, or the entire dataset. Since the NER tags are stored as numbers in the dataset, we created specific methods to convert a number into a textual tag. For example, the method GetPrefixAndLabel returns the NER marker (B, I, or O) and the tag text (e.g., Activity) of a specific NER tag number.

  5. Statistics This method provides the statistics about the PET elements annotated.

In addition, specific methods were implemented to get the list of elements of a given category. For example, the method GetActivity returns all the PET Activity elements of a specific document or of the entire dataset. Similarly, the method GetActivityData returns the PET Activity Data elements.
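The number-to-tag conversion described for GetPrefixAndLabel can be pictured with the following standalone sketch. It is not the package's implementation: the TAG_NAMES table, its ordering, and the function split_tag are assumptions made for illustration, and the real dataset defines its own mapping.

```python
# Hypothetical id-to-tag table; the real dataset defines its own ordering.
TAG_NAMES = ["O", "B-Actor", "I-Actor", "B-Activity", "I-Activity"]


def split_tag(tag_id, tag_names=TAG_NAMES):
    """Return the (marker, label) pair of a numeric NER tag."""
    tag = tag_names[tag_id]
    if tag == "O":
        # Tokens outside any element carry no label.
        return "O", ""
    # Split only on the first hyphen so multi-word labels stay intact.
    prefix, label = tag.split("-", 1)
    return prefix, label


outside = split_tag(0)
begin_activity = split_tag(3)
```

Under these assumptions, tag number 3 resolves to the marker B and the label Activity, while tag number 0 resolves to the marker O with an empty label.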

RelationExtraction Module

This module is composed of a Python class that allows users to extract structured information about the PET relations annotated in the dataset, e.g., the PET Uses relation. The class has specific methods to get all the PET relations of a given category. We briefly introduce the principal methods implemented in this module.

  1. GetNerLabels This method returns the NER tag IDs of a given document.

  2. GetRelations This method provides a list of PET relations of a given document.

  3. GetSentencesWithIdsAndNerTagLabels This method provides a user with a list of sentences composed of word tokens and the corresponding NER tags.

  4. Statistics This method provides the statistics about the PET relations.
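As a rough illustration of getting the relations of one category, the sketch below groups a list of relations by type. The triple layout (source, relation type, target), the relation names, and the helper group_relations are assumptions made for this example; the format actually returned by GetRelations may differ.

```python
from collections import defaultdict


def group_relations(relations):
    """Group (source, relation_type, target) triples by relation type."""
    grouped = defaultdict(list)
    for source, relation_type, target in relations:
        grouped[relation_type].append((source, target))
    return dict(grouped)


# Hypothetical relations for a two-activity document.
relations = [
    ("checks", "uses", "the claim"),
    ("checks", "actor performer", "The clerk"),
    ("checks", "flow", "sends"),
    ("sends", "uses", "it"),
]

grouped = group_relations(relations)
uses_relations = grouped["uses"]
```

Counting the entries of each group gives the kind of per-category figures that a statistics method would report.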

ProcessInformation Module

This module contains the methods developed to obtain a structured representation of a document in the form of a graph, e.g., a Directly Follows Graph (DFG). The module has six main methods:

  1. GetRawActivityLabels returns the activity labels (PET Activity + PET Activity Data) as they are annotated in the text.

  2. GetDFG returns the directly-follows graph (DFG) representation of the annotations of a document. This graph is composed of behavioral elements only.

  3. GetKG_DFGActivityData provides the DFG representation of a document enhanced with the PET Activity Data elements.

  4. GetKG_DFGPerformsActors provides the DFG representation of a document enhanced with the actor performer information.

  5. GetPerformsActors returns a graph representation of the DFG of a document enhanced with actor performer relations.

  6. GetKnowledgeGraph returns a graph representation of a document representing the information about the behavioral elements, the activity data elements, and the actor performer elements.
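The idea of a directly-follows graph, and of enhancing it with actor performer information, can be sketched without the package as follows. The adjacency-dictionary representation, the input formats, and the helpers build_dfg and attach_performers are assumptions for illustration and do not reflect the data structures the module actually returns.

```python
def build_dfg(flow_pairs):
    """Build an adjacency dictionary from (source, target) flow pairs."""
    graph = {}
    for source, target in flow_pairs:
        # Register both nodes so that final activities also appear in the graph.
        graph.setdefault(source, [])
        graph.setdefault(target, [])
        if target not in graph[source]:
            graph[source].append(target)
    return graph


def attach_performers(graph, performers):
    """Annotate each node with its performer (None when unknown)."""
    return {
        activity: {"next": list(successors), "performer": performers.get(activity)}
        for activity, successors in graph.items()
    }


# Hypothetical flow pairs and performers of a small process description.
flow_pairs = [("check claim", "approve claim"), ("check claim", "reject claim"),
              ("approve claim", "send payment")]
performers = {"check claim": "clerk", "send payment": "accounting"}

dfg = build_dfg(flow_pairs)
enhanced = attach_performers(dfg, performers)
```

Keeping the behavioral graph separate from the annotation step mirrors the split between a plain DFG and the enhanced variants listed above.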

How to Load the PET dataset

Token-classification task

from datasets import load_dataset

modelhub_dataset = load_dataset("patriziobellan/PET", name='token-classification')

Relations-extraction task

from datasets import load_dataset

modelhub_dataset = load_dataset("patriziobellan/PET", name='relations-extraction')
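Once the samples are loaded, the sentences of a document may be spread over non-contiguous samples, so a typical first step is to regroup them. The sketch below works on plain dictionaries rather than on the loaded dataset; the field names "document name", "sentence-ID", and "tokens" are assumptions about the sample layout and should be checked against the dataset's actual features.

```python
from collections import defaultdict


def group_by_document(samples):
    """Return {document name: list of token lists ordered by sentence ID}."""
    documents = defaultdict(list)
    for sample in samples:
        documents[sample["document name"]].append(sample)
    return {
        name: [row["tokens"] for row in sorted(rows, key=lambda r: r["sentence-ID"])]
        for name, rows in documents.items()
    }


# Hypothetical samples in which the sentences of doc-1 are not adjacent.
samples = [
    {"document name": "doc-1", "sentence-ID": 1, "tokens": ["Then", "it", "ends"]},
    {"document name": "doc-2", "sentence-ID": 0, "tokens": ["A", "claim", "arrives"]},
    {"document name": "doc-1", "sentence-ID": 0, "tokens": ["The", "process", "starts"]},
]

documents = group_by_document(samples)
```

The same function could be applied to an iterable of loaded samples, provided the field names match.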

Source Distribution

petdatasetreader-0.0.2.tar.gz (9.8 kB view details)

Uploaded Source

File details

Details for the file petdatasetreader-0.0.2.tar.gz.

File metadata

  • Download URL: petdatasetreader-0.0.2.tar.gz
  • Upload date:
  • Size: 9.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.0 CPython/3.11.9

File hashes

Hashes for petdatasetreader-0.0.2.tar.gz
Algorithm Hash digest
SHA256 6b8cbe23f511228e6182571d71556ef5b97886872a5d878be46b274dad6e9645
MD5 fb55b5e82bc4b5968e2eb527a13e041b
BLAKE2b-256 cac5ebd19aef1363df7e536e3c01400f485c23bac47427b49cbec91709ee1d1c
