
This is a helper library to push data to the Hugging Face Hub.


huggify-data

Introduction

huggify-data 📦 is a Python library 🐍 designed to simplify the process of scraping .pdf documents, generating question-answer pairs with the OpenAI API, and uploading datasets 📊 to the Hugging Face Hub 🤗. It allows you to verify ✅, process 🔄, and push 🚀 your pandas DataFrame directly to Hugging Face, making it easier to share and collaborate 🤝 on datasets. In addition, the latest version lets users fine-tune a Llama-2 model on their proprietary data.


Installation

To use huggify-data, ensure you have the necessary libraries installed. You can install them using pip:

pip install huggify-data

Examples

Here's a complete example illustrating how to use huggify-data to scrape a PDF and turn it into question-answer pairs. The block of code below scrapes the document, generates question-answer pairs with OpenAI, and collects them into a pandas DataFrame; saving the result locally as a .csv is shown immediately after.

from huggify_data.scrape_modules import *

# Example usage:
pdf_path = "path_of_pdf.pdf"  # path to the source PDF
openai_api_key = "<sk-API_KEY_HERE>"  # your OpenAI API key
generator = PDFQnAGenerator(pdf_path, openai_api_key)
generator.process_scraped_content()  # extract and clean the PDF text
generator.generate_questions_answers()  # generate question-answer pairs via OpenAI
df = generator.convert_to_dataframe()  # collect the pairs into a pandas DataFrame
print(df)
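
The snippet above prints the resulting DataFrame; to save it locally as a .csv file (the filename below is illustrative), pandas' built-in writer is sufficient:

df.to_csv("qna_pairs.csv", index=False)  # write the question-answer pairs to a local .csv (illustrative filename)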

Here's a complete example illustrating how to use the huggify-data library to push data (assuming an existing .csv file with questions and answers columns) to the Hugging Face Hub:

import pandas as pd
from huggify_data.push_modules import DataFrameUploader

# Example usage:
df = pd.read_csv('/content/toy_data.csv')
uploader = DataFrameUploader(
    df,
    hf_token="<huggingface-token-here>",
    repo_name='<desired-repo-name>',
    username='<your-username>',
)
uploader.process_data()
uploader.push_to_hub()
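
Once the push completes, you can sanity-check the upload by loading the dataset back from the Hub with the datasets library (the repo id below is illustrative and should match <your-username>/<desired-repo-name>):

from datasets import load_dataset

ds = load_dataset("<your-username>/<desired-repo-name>")  # illustrative repo id
print(ds)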

Here's a complete example illustrating how to use the huggify-data library to fine-tune a Llama-2 model (assuming you already have a dataset repository on the Hugging Face Hub, such as one created with DataFrameUploader above):

from google.colab import userdata  # the original snippet runs in Colab; outside Colab, supply your token directly
from huggify_data.train_modules import LlamaTrainer  # assumed import path, mirroring scrape_modules / push_modules

# Param
model_name = "NousResearch/Llama-2-7b-chat-hf"  # recommended base model
dataset_name = "eagle0504/sample_toy_data_v9"  # give a desired name, i.e. <hf_user_id>/<desired_name>
new_model = "youthless-homeless-shelter-web-scrape-dataset-v4"  # give a desired name
huggingface_token = userdata.get('HF_TOKEN')

# Initiate
trainer = LlamaTrainer(model_name, dataset_name, new_model, huggingface_token)
peft_config = trainer.configure_lora()  # LoRA adapter configuration
training_args = trainer.configure_training_arguments(num_train_epochs=2)

# Training
trainer.train_model(training_args, peft_config)

# Merge the LoRA weights into the base model and save | Run this in a new cell
trainer.merge_and_save_model()

To make an inference, please follow the example below:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="eagle0504/youthless-homeless-shelter-web-scrape-dataset-v4")  # same name as above
response = pipe("### Human: What is YSA? ### Assistant: ")
print(response[0]["generated_text"])  # full prompt plus completion
print(response[0]["generated_text"].split("### ")[-1])  # just the assistant's portion

License

This project is licensed under the MIT License. See the LICENSE file for more details.

Contributing

Contributions are welcome! Please open an issue or submit a pull request if you have any improvements or suggestions.

Contact

For any questions or support, please contact [eagle0504@gmail.com](mailto:eagle0504@gmail.com).
