This is a helper library to push data to the Hugging Face Hub.
huggify-data
Introduction
huggify-data 📦 is a Python library 🐍 designed to simplify the process of uploading datasets 📊 to the Hugging Face Hub 🤗. It allows you to verify ✅, process 🔄, and push 🚀 your pandas DataFrame directly to Hugging Face, making it easier to share and collaborate 🤝 on datasets.
Installation
To use huggify-data, ensure you have the necessary libraries installed. You can install them using pip:
pip install huggify-data
Usage
Here's a step-by-step guide on how to use huggify-data:
- Import the necessary libraries:
import pandas as pd
from huggify_data import DataFrameUploader
- Load your DataFrame:
Make sure your DataFrame has columns named questions and answers.
df = pd.read_csv('/content/toy_data.csv')
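If you are building the frame yourself rather than reading a CSV, a minimal sketch of the expected shape looks like this (the column names questions and answers are required; the row contents here are purely illustrative):

```python
import pandas as pd

# A minimal illustrative DataFrame with the two required columns,
# "questions" and "answers".
df = pd.DataFrame({
    "questions": ["What is huggify-data?", "What does it upload?"],
    "answers": [
        "A helper library for pushing data to the Hugging Face Hub.",
        "A pandas DataFrame, converted to a DatasetDict.",
    ],
})

# Confirm the required columns are present before uploading.
missing = {"questions", "answers"} - set(df.columns)
print(sorted(missing))  # → []
```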
- Initialize the DataFrameUploader:
Provide your Hugging Face token, desired repository name, and username.
uploader = DataFrameUploader(df, hf_token="<huggingface-token-here>", repo_name='<desired-repo-name>', username='<your-username>')
- Process your data:
Convert the DataFrame into a DatasetDict object.
uploader.process_data()
- Push to Hugging Face Hub:
Upload your processed data to the Hugging Face Hub.
uploader.push_to_hub()
Examples
Here's a complete example illustrating how to use huggify-data to scrape a PDF and save its content as question-answer pairs in a .csv file. The block of code below scrapes the PDF, converts the result into a .csv, and saves the file locally.
from huggify_data.scrape_modules import *
# Example usage:
pdf_path = "path_of_pdf.pdf"
openai_api_key = "sk-API_KEY_HERE"
generator = PDFQnAGenerator(pdf_path, openai_api_key)
generator.process_scraped_content()
generator.generate_questions_answers()
df = generator.convert_to_dataframe()
print(df)
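The resulting DataFrame can then be written out as the .csv file mentioned above. A small sketch using a stand-in DataFrame (the file name toy_data.csv is just an example):

```python
import pandas as pd

# Stand-in for the DataFrame returned by convert_to_dataframe().
df = pd.DataFrame({"questions": ["What is this?"], "answers": ["A demo row."]})

# Save locally, then reload to confirm the round trip keeps both columns.
df.to_csv("toy_data.csv", index=False)
reloaded = pd.read_csv("toy_data.csv")
print(list(reloaded.columns))  # → ['questions', 'answers']
```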
Here's a complete example to illustrate how to use the huggify-data library:
import pandas as pd
from datasets import Dataset, DatasetDict
from huggingface_hub import HfApi, create_repo
from huggify_data.push_modules import DataFrameUploader
# Example usage:
df = pd.read_csv('/content/toy_data.csv')
uploader = DataFrameUploader(df, hf_token="<huggingface-token-here>", repo_name='<desired-repo-name>', username='<your-username>')
uploader.process_data()
uploader.push_to_hub()
Class Details
DataFrameUploader
DataFrameUploader is the main class provided by huggify-data.
Initialization
uploader = DataFrameUploader(df, hf_token="<huggingface-token-here>", repo_name='<desired-repo-name>', username='<your-username>')
- df: A pandas DataFrame containing the data.
- hf_token: Your Hugging Face API token.
- repo_name: The desired name for the Hugging Face repository.
- username: Your Hugging Face username.
Methods
- verify_dataframe():
  - Checks if the DataFrame has columns named questions and answers.
  - Raises a ValueError if the columns are not present.
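The column check performed by verify_dataframe() can be sketched roughly as follows (an illustration inferred from the documented behaviour, not the library's actual source):

```python
import pandas as pd

def verify_dataframe(df: pd.DataFrame) -> None:
    """Illustrative re-creation of the documented column check."""
    missing = {"questions", "answers"} - set(df.columns)
    if missing:
        raise ValueError(f"DataFrame is missing required columns: {sorted(missing)}")

# A frame with both columns passes silently.
verify_dataframe(pd.DataFrame({"questions": ["q"], "answers": ["a"]}))

# A frame without them raises ValueError.
try:
    verify_dataframe(pd.DataFrame({"question": ["q"]}))
    raised = False
except ValueError:
    raised = True
print(raised)  # → True
```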
- process_data():
  - Verifies the DataFrame.
  - Converts the data into a DatasetDict object.
- push_to_hub():
  - Creates a repository on the Hugging Face Hub.
  - Pushes the DatasetDict to the repository.
License
This project is licensed under the MIT License. See the LICENSE file for more details.
Contributing
Contributions are welcome! Please open an issue or submit a pull request if you have any improvements or suggestions.
Contact
For any questions or support, please contact [your-email@example.com].