
Hugging Face Dataset Structurer

Hugging Face Dataset Structurer is a Python wrapper that simplifies deploying multi-config datasets to the Hugging Face Hub. We developed this tool because we found the official documentation lacking in detail: it creates the impression that you must manually write a dataset loading script. You don't need to do that! This tool does it for you.

Installation

pip install -U hf-dataset-structurer

Quickstart

In this example, we will create a bundle for Portuguese NER. The official Hugging Face HAREM dataset entry actually corresponds to the First HAREM collection. We will create a bundle that also attaches the Second HAREM, available here. This dataset has two labelling schemes, DEFAULT and SELECTIVE; SELECTIVE is just a coarse-grained version of DEFAULT. These two schemes make this dataset a good candidate to demonstrate the power of this tool.

from datasets import load_dataset
from hf_dataset_structurer.DatasetStructure import DatasetStructure

structurer = DatasetStructure("<<TARGET Hugging Face Dataset Name>>")

# Iterate both Labelling Schemes
for config in ["default", "selective"]:
    # Load Official HAREM Dataset
    primeiro_harem = load_dataset("harem", config)
    
    # Start Structuring Process
    structurer.add_dataset(primeiro_harem['train'], f"primeiro_harem_{config}", split="train")

# Load Second HAREM Datasets

second_harem_default = load_dataset("arubenruben/segundo_harem_default")
second_harem_selective = load_dataset("arubenruben/segundo_harem_selective")

# Notice the function used now is add_dataset_dict. A DatasetDict
# (https://huggingface.co/docs/datasets/v2.15.0/en/package_reference/main_classes#datasets.DatasetDict)
# is a native Hugging Face object that represents a dictionary of datasets.
structurer.add_dataset_dict(second_harem_default, "segundo_harem_default")
structurer.add_dataset_dict(second_harem_selective, "segundo_harem_selective")

# Push to Hugging Face Hub
structurer.push_to_hub()

# After creating the bundle, you can attach a dataset card describing it.
structurer.attach_dataset_card(
    language="pt",
    license="cc-by-4.0",
    annotations_creators=["expert-generated"],
    task_categories=["token-classification"],
    tasks_ids=["named-entity-recognition"],
    pretty_name="HAREM",
    multilinguality='monolingual'
)
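
Once the bundle is pushed, consumers can load each config directly with load_dataset, using the config names registered above. A minimal sketch, with the repository name still a placeholder for whatever you passed to DatasetStructure:

from datasets import load_dataset

# The repository name is the placeholder used above; the config name is the one
# registered with add_dataset_dict.
segundo_default = load_dataset("<<TARGET Hugging Face Dataset Name>>", "segundo_harem_default")
print(segundo_default)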

API Reference

# Initializes a new instance of the DatasetStructure class.
__init__(self, repo_name: str) -> None

# Accepts a DatasetDict and a config_name and adds it to the dataset structure.
add_dataset_dict(self, dataset_dict: DatasetDict, config_name: str) -> None

# Similar to add_dataset_dict, but accepts a Dataset and a split. Internally, it creates a DatasetDict and calls add_dataset_dict.
add_dataset(self, dataset: Dataset, config_name: str, split: str = "train") -> None

# Attaches a dataset card to the dataset structure.
attach_dataset_card(self, language: str,
                    license: str,
                    annotations_creators: str,
                    task_categories: str,
                    tasks_ids: str,
                    pretty_name: str,
                    multilinguality: str = 'monolingual') -> None

# Pushes the dataset structure and dataset card to the Hugging Face Hub.
push_to_hub(self, private: bool = False) -> None
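
The private flag of push_to_hub controls the repository's visibility on the Hub. Below is a minimal sketch of a private upload, assuming you are already authenticated (for example via huggingface-cli login) and using a hypothetical repository name:

from datasets import load_dataset
from hf_dataset_structurer.DatasetStructure import DatasetStructure

# "my-user/harem-private" is a hypothetical repository name.
structurer = DatasetStructure("my-user/harem-private")

# Register a single config from the official HAREM dataset.
structurer.add_dataset(load_dataset("harem", "default")["train"], "harem_default", split="train")

# Create the repository as private on the Hub.
structurer.push_to_hub(private=True)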

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

License

MIT

Acknowledgements

This tool was developed by Ruben Almeida as part of the PT-Pump-Up project. PT-Pump-Up is funded by INESC TEC and the Portuguese Government through the Fundação para a Ciência e a Tecnologia (FCT), and aims to build Portuguese NLP resources and tools to support the development of NLP applications for Portuguese.
