
pie-datasets


Building Scripts for PyTorch-IE Datasets, also see here.

Setup

pip install pie-datasets

To install the latest version from GitHub:

pip install git+https://git@github.com/ArneBinder/pie-datasets.git

Usage

Use a PIE dataset

import datasets

dataset = datasets.load_dataset("pie/conll2003")

print(dataset["train"][0])
# >>> CoNLL2003Document(text='EU rejects German call to boycott British lamb .', id='0', metadata={})

dataset["train"][0].entities
# >>> AnnotationLayer([LabeledSpan(start=0, end=2, label='ORG', score=1.0), LabeledSpan(start=11, end=17, label='MISC', score=1.0), LabeledSpan(start=34, end=41, label='MISC', score=1.0)])

entity = dataset["train"][0].entities[1]

print(f"[{entity.start}, {entity.end}] {entity}")
# >>> [11, 17] German
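
Since indexing a split yields plain Python document objects, you can work with the annotations directly. As a minimal sketch (not part of the original example) that counts entity labels over the train split, using only the indexing shown above plus the standard-library Counter:

from collections import Counter

train = dataset["train"]

# Tally entity labels across the split; train[i] is a CoNLL2003Document,
# so train[i].entities is its AnnotationLayer of LabeledSpans.
label_counts = Counter(
    entity.label for i in range(len(train)) for entity in train[i].entities
)
print(label_counts.most_common())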

Available PIE datasets

See here for a list of available datasets.

How to create your own PIE dataset

PIE datasets are built on top of Huggingface datasets. For instance, consider the conll2003 dataset from the Huggingface Hub, and in particular its dataset loading script. To create a PIE dataset from it, you have to implement:

  1. A Document class. This will be the type of individual dataset examples.
from dataclasses import dataclass

from pytorch_ie.annotations import LabeledSpan
from pytorch_ie.core import AnnotationLayer, annotation_field
from pytorch_ie.documents import TextBasedDocument

@dataclass
class CoNLL2003Document(TextBasedDocument):
    entities: AnnotationLayer[LabeledSpan] = annotation_field(target="text")

Here we derive from TextBasedDocument, which has a simple text string as its base annotation target. CoNLL2003Document adds a single annotation layer called entities that consists of LabeledSpans referencing the text field of the document. You can add further annotation layers by adding AnnotationLayer fields, which may also reference (i.e. target) other annotation layers (see the relation-layer sketch at the end of this section). The package pytorch_ie.annotations contains some predefined annotation types and the package pytorch_ie.documents defines some document types that you can use as base classes.

  2. A dataset config. This is similar to creating a Huggingface dataset config.
import datasets

class CoNLL2003Config(datasets.BuilderConfig):
    """BuilderConfig for CoNLL2003"""

    def __init__(self, **kwargs):
        """BuilderConfig for CoNLL2003.
        Args:
          **kwargs: keyword arguments forwarded to super.
        """
        super().__init__(**kwargs)
  3. A dataset builder class. This should inherit from pie_datasets.GeneratorBasedBuilder, which is a wrapper around the Huggingface dataset builder class with some utility functionality to work with PyTorch-IE Documents. The key elements to implement are DOCUMENT_TYPE, BASE_DATASET_PATH, and _generate_document.
from pytorch_ie.utils.span import tokens_and_tags_to_text_and_labeled_spans
from pie_datasets import GeneratorBasedBuilder

class Conll2003(GeneratorBasedBuilder):
    # Specify the document type. This will be the class of individual dataset examples.
    DOCUMENT_TYPE = CoNLL2003Document

    # The Huggingface identifier that points to the base dataset. This may be any string that works
    # as path with Huggingface `datasets.load_dataset`.
    BASE_DATASET_PATH = "conll2003"

    # The builder configs, see https://huggingface.co/docs/datasets/dataset_script for further information.
    BUILDER_CONFIGS = [
        CoNLL2003Config(
            name="conll2003", version=datasets.Version("1.0.0"), description="CoNLL2003 dataset"
        ),
    ]

    # [Optional] Define additional keyword arguments which will be passed to `_generate_document` below.
    def _generate_document_kwargs(self, dataset):
        return {"int_to_str": dataset.features["ner_tags"].feature.int2str}

    # Define how a Pytorch-IE Document will be created from a Huggingface dataset example.
    def _generate_document(self, example, int_to_str):
        doc_id = example["id"]
        tokens = example["tokens"]
        ner_tags = [int_to_str(tag) for tag in example["ner_tags"]]

        text, ner_spans = tokens_and_tags_to_text_and_labeled_spans(tokens=tokens, tags=ner_tags)

        document = CoNLL2003Document(text=text, id=doc_id)

        for span in sorted(ner_spans, key=lambda span: span.start):
            document.entities.append(span)

        return document

The full script can be found here: dataset_builders/pie/conll2003/conll2003.py. Note that, to load the dataset with datasets.load_dataset, the script has to be located in a directory with the same name (as is the case for standard Huggingface dataset loading scripts).
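
For example, assuming the script lives at dataset_builders/pie/conll2003/conll2003.py as above, a sketch of loading it from the local directory (recent versions of Huggingface datasets may additionally require trust_remote_code=True):

import datasets

# Point load_dataset at the directory that contains the equally named script.
dataset = datasets.load_dataset("dataset_builders/pie/conll2003")
print(dataset["train"][0])

Coming back to the document type from step 1: annotation layers can target other layers, not just the text. A hypothetical variant of the document type that adds a binary relation layer on top of the entity layer (BinaryRelation is one of the predefined annotation types in pytorch_ie.annotations; the class and field names here are illustrative):

from dataclasses import dataclass

from pytorch_ie.annotations import BinaryRelation, LabeledSpan
from pytorch_ie.core import AnnotationLayer, annotation_field
from pytorch_ie.documents import TextBasedDocument

@dataclass
class CoNLL2003DocumentWithRelations(TextBasedDocument):
    entities: AnnotationLayer[LabeledSpan] = annotation_field(target="text")
    # This layer targets the entity layer instead of the raw text.
    relations: AnnotationLayer[BinaryRelation] = annotation_field(target="entities")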

Development

Setup

  1. This project is built with Poetry. See here for installation instructions.
  2. Get the code and switch into the project directory:
    git clone https://github.com/ArneBinder/pie-datasets
    cd pie-datasets
    
  3. Create a virtual environment and install the dependencies:
    poetry install
    

Finally, to run any of the below commands, you need to activate the virtual environment:

poetry shell

Note: You can also run commands in the virtual environment without activating it first: poetry run <command>.

Code Formatting, Linting and Static Type Checking

pre-commit run -a

Testing

Run all tests with coverage:

pytest --cov --cov-report term-missing
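
If you are iterating on a single dataset builder, you can narrow the run with pytest's -k expression filter and skip coverage reporting (the expression below is illustrative):

pytest -k conll2003 --no-cov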

Releasing

  1. Create the release branch: git switch --create release main
  2. Increase the version: poetry version <PATCH|MINOR|MAJOR>, e.g. poetry version patch for a patch release. If the release contains new features or breaking changes, bump the minor version (this project has no major release yet); if it contains only bugfixes, bump the patch version. See Semantic Versioning for more information.
  3. Commit the changes: git commit --message="release <NEW VERSION>" pyproject.toml, e.g. git commit --message="release 0.13.0" pyproject.toml
  4. Push the changes to GitHub: git push origin release
  5. Create a PR for that release branch on GitHub.
  6. Wait until all checks have passed successfully.
  7. Merge the PR into the main branch. This triggers the GitHub Action that creates all relevant release artefacts and also uploads them to PyPI.
  8. Cleanup: Delete the release branch. This is important, because otherwise the next release will fail.
