petdatasetreader

A convenient interface that provides structured representations of the PET dataset hosted on HuggingFace.

Project description

A structured interface to interact with the PET dataset hosted on HuggingFace.

Working directly with the data hosted on HuggingFace can be difficult because the data follows a strict format. For example, getting the list of PET activities of a document requires a custom script that scans the dataset, extracts the words and their NER tags, and combines them. In addition, the sentences of a document are stored in different samples of the HuggingFace dataset, and these samples are not always contiguous. As a result, conducting experiments with the PET dataset can become time-consuming. To alleviate these difficulties, we developed the PET dataset reader, a Python package that simplifies interaction with the dataset. The package is composed of three modules: the TokenClassification module, the RelationExtraction module, and the ProcessInformation module.
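To illustrate the kind of manual work described above, here is a minimal sketch of regrouping scattered sentence samples into documents. It runs on a small in-memory list rather than the real dataset, and the field names ("document", "sentence_id", "tokens") are assumptions chosen for illustration, not the dataset's confirmed schema.

```python
from collections import defaultdict

# Hypothetical samples: each one holds a single sentence, and sentences of the
# same document are neither adjacent nor in order. Field names are assumptions.
samples = [
    {"document": "doc-1", "sentence_id": 1, "tokens": ["It", "ships", "the", "goods", "."]},
    {"document": "doc-2", "sentence_id": 0, "tokens": ["A", "clerk", "checks", "the", "claim", "."]},
    {"document": "doc-1", "sentence_id": 0, "tokens": ["The", "office", "receives", "an", "order", "."]},
]


def group_by_document(samples):
    """Collect the samples of each document and sort them by sentence ID."""
    grouped = defaultdict(list)
    for sample in samples:
        grouped[sample["document"]].append(sample)
    return {
        name: sorted(sentences, key=lambda s: s["sentence_id"])
        for name, sentences in grouped.items()
    }


documents = group_by_document(samples)

# Rebuild the (tokenized) text of one document from its ordered sentences.
doc1_text = " ".join(tok for sent in documents["doc-1"] for tok in sent["tokens"])
print(doc1_text)
```

The package hides this bookkeeping behind its own methods; the sketch only shows why a helper of this kind is needed.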

TokenClassification Module

This module consists of a Python class that allows users to extract structured information at the token level. The class has dedicated methods to get all the PET elements of a specific category. We briefly introduce the principal methods implemented in this module.

GetDocumentNames This method returns the list of document names in the dataset.
GetDocumentText This method returns the textual description of a document.
GetTokens This method returns the text of a sentence, identified by its sentence ID, as a list of words.
GetNerTagLabels This method provides the list of NER tags of a sentence, a document, or the entire dataset. Since the NER tags are stored as numbers in the dataset, we created specific methods to convert a number into a textual tag. For example, the method GetPrefixAndLabel returns the NER marker (B, I, or O) and the tag text (e.g., Activity) of a specific NER tag number.
Statistics This method provides statistics about the annotated PET elements.

In addition, specific methods were implemented to get the list of elements of a given category. For example, the method GetActivity returns all the PET Activity elements of a specific document or of the entire dataset. Similarly, the method GetActivityData returns the PET Activity Data elements.
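The conversion from numeric tags to labels and the extraction of elements by category can be sketched as follows. This is not the package's implementation: the label list, its ordering, and the helper names prefix_and_label and extract_spans are hypothetical, and the real dataset defines its own tag numbering.

```python
# Hypothetical label list; the index of each entry plays the role of the NER tag number.
LABELS = [
    "O",
    "B-Actor", "I-Actor",
    "B-Activity", "I-Activity",
    "B-Activity Data", "I-Activity Data",
]


def prefix_and_label(tag_id):
    """Split a numeric tag into its marker (B, I, or O) and its tag text."""
    tag = LABELS[tag_id]
    if tag == "O":
        return "O", ""
    prefix, label = tag.split("-", 1)
    return prefix, label


def extract_spans(tokens, tag_ids, wanted):
    """Join consecutive B/I tokens of the wanted category into text spans."""
    spans, current = [], []
    for token, tag_id in zip(tokens, tag_ids):
        prefix, label = prefix_and_label(tag_id)
        if label == wanted and prefix == "B":
            if current:  # a new span starts right after another one
                spans.append(" ".join(current))
            current = [token]
        elif label == wanted and prefix == "I" and current:
            current.append(token)
        else:
            if current:  # any other tag closes the open span
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


tokens = ["The", "clerk", "checks", "the", "claim", "and", "sends", "it"]
tag_ids = [1, 2, 3, 5, 6, 0, 3, 5]

activities = extract_spans(tokens, tag_ids, "Activity")
activity_data = extract_spans(tokens, tag_ids, "Activity Data")
print(activities, activity_data)
```

Splitting on the first hyphen only keeps multi-word labels such as "Activity Data" intact.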

RelationExtraction Module

This module consists of a Python class that allows users to extract structured information about the PET relations annotated in the dataset, e.g., the PET Uses relation. The class has dedicated methods to get all the PET relations of a specific category. We briefly introduce the principal methods implemented in this module.

GetNerLabels This method returns the NER tag IDs of a given document.
GetRelations This method provides a list of PET relations of a given document.
GetSentencesWithIdsAndNerTagLabels This method provides a user with a list of sentences composed of word tokens and the corresponding NER tags.
Statistics This method provides the statistics about the PET relations.
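As a rough illustration of what relation-level statistics involve, the sketch below counts and groups relations by type. The (source, type, target) triple layout and the relation type names are assumptions made for the example; the structure actually returned by GetRelations may differ.

```python
from collections import Counter, defaultdict

# Hypothetical relations of one document as (source, relation type, target) triples.
relations = [
    ("checks", "uses", "the claim"),
    ("the clerk", "actor performer", "checks"),
    ("checks", "flow", "sends"),
    ("sends", "uses", "it"),
]

# Number of relations per type.
counts = Counter(rel_type for _, rel_type, _ in relations)

# Relations grouped by type, keeping only the (source, target) pairs.
by_type = defaultdict(list)
for source, rel_type, target in relations:
    by_type[rel_type].append((source, target))

print(counts)
print(by_type["uses"])
```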

ProcessInformation Module

This module contains the methods developed to obtain a structured representation of a document in the form of a graph, e.g., a Directly-Follows Graph (DFG). The module has six main methods:

GetRawActivityLabels returns the activity labels (PET Activity + PET Activity Data) as they are annotated in the text.
GetDFG returns the directly-follows graph representation of the annotations of a document. This graph is composed of behavioral elements only.
GetKG_DFGActivityData provides the DFG representation of a document enhanced with the PET Activity Data elements.
GetKG_DFGPerformsActors provides the DFG representation of a document enhanced with the actor-performer information.
GetPerformsActors returns a graph representation of the DFG of a document enhanced with actor-performer relations.
GetKnowledgeGraph returns a graph representation of a document that captures the behavioral elements, the activity data elements, and the actor-performer elements.
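The idea of a directly-follows graph, and of enhancing it with further relation types, can be sketched with a plain adjacency dictionary. This is an illustration under stated assumptions, not the package's algorithm: the helper build_graph, the triple layout, and the relation type names ("flow", "uses", "actor performer") are hypothetical.

```python
def build_graph(relations, relation_types=("flow",)):
    """Build an adjacency dict from the relations whose type is in relation_types."""
    graph = {}
    for source, rel_type, target in relations:
        if rel_type in relation_types:
            graph.setdefault(source, []).append(target)
            graph.setdefault(target, [])  # make sure sink nodes appear as keys
    return graph


# Hypothetical relations of one document as (source, relation type, target) triples.
relations = [
    ("receive order", "flow", "check stock"),
    ("check stock", "flow", "ship goods"),
    ("check stock", "flow", "reject order"),
    ("check stock", "uses", "stock list"),
    ("clerk", "actor performer", "check stock"),
]

# Behavioral elements only: which element directly follows which.
dfg = build_graph(relations)

# The same graph enhanced with activity data and actor-performer edges.
knowledge_graph = build_graph(relations, ("flow", "uses", "actor performer"))

print(dfg)
print(knowledge_graph)
```

Selecting the relation types per call mirrors the way the methods above differ only in which kinds of elements the resulting graph includes.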

How to Load the PET dataset

Token-classification task

from datasets import load_dataset

modelhub_dataset = load_dataset("patriziobellan/PET", name='token-classification')

Relations-extraction task

from datasets import load_dataset

modelhub_dataset = load_dataset("patriziobellan/PET", name='relations-extraction')

Project details

Release history

0.0.2 (this version): May 23, 2024
0.0.2a10 (pre-release): Nov 9, 2022
0.0.2a9 (pre-release): Oct 27, 2022
0.0.1: Jul 15, 2022
0.0.1a1 (pre-release): Jun 20, 2022

Download files

Source Distribution

petdatasetreader-0.0.2.tar.gz (9.8 kB view details)

Uploaded May 23, 2024 Source

File details

Details for the file petdatasetreader-0.0.2.tar.gz.

File metadata

Download URL: petdatasetreader-0.0.2.tar.gz
Upload date: May 23, 2024
Size: 9.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.0 CPython/3.11.9

File hashes

Hashes for petdatasetreader-0.0.2.tar.gz
Algorithm	Hash digest
SHA256	`6b8cbe23f511228e6182571d71556ef5b97886872a5d878be46b274dad6e9645`
MD5	`fb55b5e82bc4b5968e2eb527a13e041b`
BLAKE2b-256	`cac5ebd19aef1363df7e536e3c01400f485c23bac47427b49cbec91709ee1d1c`
