
This is a helper library to push data to the Hugging Face Hub.


huggify-data

Introduction

huggify-data 📦 is a Python library 🐍 designed to simplify the process of uploading datasets 📊 to the Hugging Face Hub 🤗. It allows you to verify ✅, process 🔄, and push 🚀 your pandas DataFrame directly to Hugging Face, making it easier to share and collaborate 🤝 on datasets.

Installation

To use huggify-data, make sure the library is installed. You can install it with pip:

pip install huggify-data

Usage

Here's a step-by-step guide on how to use huggify-data:

  1. Import the necessary libraries:

import pandas as pd
from huggify_data import DataFrameUploader

  2. Load your DataFrame:

Make sure your DataFrame has columns named questions and answers (a minimal example follows this list).

df = pd.read_csv('/content/toy_data.csv')

  3. Initialize the DataFrameUploader:

Provide your Hugging Face token, desired repository name, and username.

uploader = DataFrameUploader(df, hf_token="<huggingface-token-here>", repo_name='<desired-repo-name>', username='<your-username>')

  4. Process your data:

Convert the DataFrame into a DatasetDict object.

uploader.process_data()

  5. Push to the Hugging Face Hub:

Upload your processed data to the Hugging Face Hub.

uploader.push_to_hub()
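
As a reference for step 2, here is a minimal DataFrame with the required columns (the example rows are invented purely for illustration):

import pandas as pd

df = pd.DataFrame({
    'questions': ['What does huggify-data do?'],
    'answers': ['It pushes a pandas DataFrame to the Hugging Face Hub.'],
})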

Example: Generate question-answer pairs from a PDF

Here's a complete example illustrating how to use huggify-data to scrape a PDF and turn it into question-answer pairs. The block of code below scrapes the PDF, generates question-answer pairs, and returns them as a pandas DataFrame, which you can then save locally as a .csv file (see the note after the code).

# Example usage:
from huggify_data import PDFQnAGenerator

pdf_path = "path_of_pdf.pdf"
openai_api_key = "sk-API_KEY_HERE"
generator = PDFQnAGenerator(pdf_path, openai_api_key)
generator.process_scraped_content()
generator.generate_questions_answers()
df = generator.convert_to_dataframe()
print(df)
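
To actually write the generated pairs to disk as described above, save the DataFrame with pandas (the filename here is just an example):

df.to_csv('toy_data.csv', index=False)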

Example: Upload a DataFrame to the Hugging Face Hub

Here's a complete example illustrating how to use the huggify-data library to upload a DataFrame:

import pandas as pd
from datasets import Dataset, DatasetDict
from huggingface_hub import HfApi, create_repo
from huggify_data import DataFrameUploader

# Example usage:
df = pd.read_csv('/content/toy_data.csv')
uploader = DataFrameUploader(df, hf_token="<huggingface-token-here>", repo_name='<desired-repo-name>', username='<your-username>')
uploader.process_data()
uploader.push_to_hub()
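
Once the push completes, the dataset can be loaded back from the Hub (the repository id below simply mirrors the placeholders used above):

from datasets import load_dataset

dataset = load_dataset('<your-username>/<desired-repo-name>')
print(dataset)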

Class Details

DataFrameUploader

DataFrameUploader is the main class provided by huggify-data.

Initialization

uploader = DataFrameUploader(df, hf_token="<huggingface-token-here>", repo_name='<desired-repo-name>', username='<your-username>')
  • df: A pandas DataFrame containing the data.
  • hf_token: Your Hugging Face API token.
  • repo_name: The desired name for the Hugging Face repository.
  • username: Your Hugging Face username.

Methods

  • verify_dataframe():
    • Checks if the DataFrame has columns named questions and answers.
    • Raises a ValueError if the columns are not present.
  • process_data():
    • Verifies the DataFrame.
    • Converts the data into a DatasetDict object.
  • push_to_hub():
    • Creates a repository on the Hugging Face Hub.
    • Pushes the DatasetDict to the repository (see the sketch after this list for an illustration of the flow).
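
For a feel of what these steps involve, here is a minimal sketch of the equivalent plain-Python flow. The standalone functions below are illustrative only and are not the library's actual source:

import pandas as pd
from datasets import Dataset, DatasetDict
from huggingface_hub import create_repo

# Illustrative sketch only -- not DataFrameUploader's actual implementation.
def verify_dataframe(df: pd.DataFrame) -> None:
    # Raise if the required columns are missing, mirroring verify_dataframe().
    if not {'questions', 'answers'}.issubset(df.columns):
        raise ValueError("DataFrame must have 'questions' and 'answers' columns")

def process_data(df: pd.DataFrame) -> DatasetDict:
    # Wrap the verified DataFrame in a DatasetDict, mirroring process_data().
    verify_dataframe(df)
    return DatasetDict({'train': Dataset.from_pandas(df)})

def push_to_hub(data: DatasetDict, repo_id: str, hf_token: str) -> None:
    # Create the dataset repo (if needed) and upload it, mirroring push_to_hub().
    create_repo(repo_id, token=hf_token, repo_type='dataset', exist_ok=True)
    data.push_to_hub(repo_id, token=hf_token)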

License

This project is licensed under the MIT License. See the LICENSE file for more details.

Contributing

Contributions are welcome! Please open an issue or submit a pull request if you have any improvements or suggestions.

Contact

For any questions or support, please contact [your-email@example.com].
