Skip to main content

A package to structure datasets in the Hugging Face Hub

Project description

Hugging Face Dataset Structurer

The Hugging Face Dataset Structurer is a Python wrapper designed to streamline the deployment of multi-config datasets to the Hugging Face Hub. This tool simplifies the process by automating the creation of dataset loading scripts, addressing gaps in the official documentation.

Installation

pip install -U hf-dataset-structurer

Quickstart

Let's demonstrate the tool's capabilities using the Portuguese NER scenario. We'll create a consolidated dataset by merging the official HuggingFace HAREM dataset entry, representing the findings of the First-HAREM meeting, with a complementary "second-HAREM" dataset, accessible here. This "second-HAREM," not initially included in the original HuggingFace efforts, brings valuable additional data. Both HAREM datasets offer two labeling schemes: DEFAULT and SELECTIVE, making them an ideal showcase for highlighting our tool's capabilities.

from datasets import load_dataset, concatenate_datasets, DatasetDict 
from hf_dataset_structurer.DatasetStructure import DatasetStructure

structurer = DatasetStructure("<<TARGET Hugging Face Dataset Name>>")

# Iterate both Labelling Schemes
for config in ["default", "selective"]:
    # Load Official HAREM Dataset
    primeiro_harem = load_dataset("harem", config)
    
    # Start Structuring Process
    structurer.add_dataset(primeiro_harem['train'], f"primeiro_harem_{config}", split="train")

# Load Second HAREM Datasets

second_harem_default = load_dataset("arubenruben/segundo_harem_default")
second_harem_selective = load_dataset("arubenruben/segundo_harem_selective")

# Notice the function used now is add_dataset_dict. A [DatasetDict](https://huggingface.co/docs/datasets/v2.15.0/en/package_reference/main_classes#datasets.DatasetDict) is a native HuggingFace object that represents a dictionary of datasets.
structurer.add_dataset_dict(second_harem_default, "segundo_harem_default")
structurer.add_dataset_dict(second_harem_selective, "segundo_harem_selective")

# Push to Hugging Face Hub
structurer.push_to_hub()

# After creating the bundle. You can append a dataset card to it.
# Create Dataset Card to describe the dataset
structurer.attach_dataset_card(
    language="pt",
    license="cc-by-4.0",
    annotations_creators=["expert-generated"],
    task_categories=["token-classification"],
    tasks_ids=["named-entity-recognition"],
    pretty_name="HAREM",
    multilinguality='monolingual'
)

API Reference

# Initializes a new instance of the DatasetStructure class.
__init__(self, repo_name: str) -> None

# Accepts a DatasetDict and a config_name and adds it to the dataset structure.
add_dataset_dict(self, dataset_dict: DatasetDict, config_name: str) -> None

# Similar to add_dataset_dict, but accepts a Dataset and a split. Internally, it creates a DatasetDict and calls add_dataset_dict.
add_dataset(self, dataset: Dataset, config_name: str, split: str = "train") -> None

# Attaches a dataset card to the dataset structure.
attach_dataset_card(self, language: str,
                    license: str,
                    annotations_creators: str,
                    task_categories: str,
                    tasks_ids: str,
                    pretty_name: str,
                    multilinguality: str = 'monolingual') -> None

# Pushes the dataset structure and dataset card to the Hugging Face Hub.
push_to_hub(self, private: bool = False) -> None

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

License

MIT

Acknowledgements

This tool was developed by Ruben Almeida as part of the Project PT-Pump-Up. PT-Pump-Up is a project funded by INESC TEC and the Portuguese Government through the Fundacao para a Ciencia e a Tecnologia (FCT) that aims to build Portuguese NLP resources and tools to support the development of NLP applications for Portuguese.

References

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

hf_dataset_structurer-0.0.6.tar.gz (4.6 kB view details)

Uploaded Source

Built Distribution

hf_dataset_structurer-0.0.6-py3-none-any.whl (5.9 kB view details)

Uploaded Python 3

File details

Details for the file hf_dataset_structurer-0.0.6.tar.gz.

File metadata

  • Download URL: hf_dataset_structurer-0.0.6.tar.gz
  • Upload date:
  • Size: 4.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.12.1

File hashes

Hashes for hf_dataset_structurer-0.0.6.tar.gz
Algorithm Hash digest
SHA256 ff1bc67536efcc7d3320272be0e485fbb436cd4837dbfc6afa2853ff92190803
MD5 3c6256be6ceace79f94a1a65b92cd3c8
BLAKE2b-256 036b2ca7eae6bcdbad403f5aa46299bba41f50da40e1cff3be2ac1c4de5161ac

See more details on using hashes here.

Provenance

File details

Details for the file hf_dataset_structurer-0.0.6-py3-none-any.whl.

File metadata

File hashes

Hashes for hf_dataset_structurer-0.0.6-py3-none-any.whl
Algorithm Hash digest
SHA256 f795238efbe28139f95e8b9e65ab54e7bd4d02fb59170532097fe1a77422ba9c
MD5 40ffaac9f0269bfa26f4029c288d3e7f
BLAKE2b-256 98bfcba743c133323ab37642dcd497e39ec8a4aad051c59c3b6226d2fe635590

See more details on using hashes here.

Provenance

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page