This is a helper library to push data to HuggingFace.

huggify-data

Introduction

huggify-data 📦 is a Python library 🐍 designed to simplify the process of scraping .pdf documents, generating question-answer pairs with the OpenAI API, conversing with the document, and uploading datasets 📊 to the Hugging Face Hub 🤗. The library lets you verify ✅, process 🔄, and push 🚀 your pandas DataFrame directly to Hugging Face, making it easier to share and collaborate 🤝 on datasets. The latest version also lets you fine-tune the Llama2 model on your own proprietary data, enhancing its capabilities even further. As the name suggests, huggify-data aims to wrap your data workflow in warmth, comfort, and user-friendly interactions, making data handling feel as reassuring and pleasant as a hug.

Repo

You can access the repo here: ✨ Huggify Data ✨

Installation

To use huggify-data, ensure you have the necessary libraries installed. You can easily install them using pip:

pip install huggify-data

Notebooks

We have made tutorial notebooks available to guide you through the process step-by-step:

  • Step 1: Scrape any .pdf file and generate question-answer pairs. Link
  • Step 2: Fine-tune the Llama2 model on customized data. Link
  • Step 3: Perform inference on customized data. Link

Examples

Here's a complete example illustrating how to use huggify-data to scrape a PDF, generate question-answer pairs from its content, and save them locally as a .csv file:

from huggify_data.scrape_modules import *

# Example usage:
pdf_path = "path_of_pdf.pdf"
openai_api_key = "<sk-API_KEY_HERE>"
generator = PDFQnAGenerator(pdf_path, openai_api_key)
generator.process_scraped_content()
generator.generate_questions_answers()
df = generator.convert_to_dataframe()
print(df)

# Save the question-answer pairs locally; the file name is illustrative
df.to_csv("qna_pairs.csv", index=False)

Once you have a .csv file or a pd.DataFrame from the previous chunk of code, you can run the following code to iteratively generate a list of .md files from it.

from huggify_data.bot_modules import ChatBot
bot = ChatBot(api_key=openai_api_key)

from huggify_data.generate_md_modules import *

# This code generates a list of .md files from the DataFrame
# Make sure you are in the desired output directory first
markdown_generator = MarkdownGenerator(bot, df)
markdown_generator.generate_markdown()

After the .md files are generated, you can navigate here to build a chatbot that reads them in iteratively with a Retrieval-Augmented Generation (RAG) pipeline built on llama_index.
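The ingestion half of that pipeline can be sketched without extra dependencies. The helper below (its name and the flat-directory layout are illustrative assumptions, not part of huggify-data) simply gathers the generated .md files as text; a llama_index pipeline would then chunk, embed, and index these documents:

```python
from pathlib import Path

def load_markdown_docs(directory="."):
    """Collect generated .md files as (filename, text) pairs.

    This covers only the ingestion step of a RAG pipeline; the texts
    would subsequently be chunked, embedded, and indexed (e.g. with
    llama_index's VectorStoreIndex).
    """
    docs = []
    for md_file in sorted(Path(directory).glob("*.md")):
        docs.append((md_file.name, md_file.read_text(encoding="utf-8")))
    return docs
```

With llama_index installed, SimpleDirectoryReader performs the same file gathering, and VectorStoreIndex.from_documents builds the retrieval index on top of it.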

Once you have created a DataFrame of question-answer pairs, you can have a conversation with your data:

from huggify_data.bot_modules import *

current_prompt = "<question_about_the_document>"
chatbot = ChatBot(api_key=openai_api_key)
response = chatbot.run_rag(openai_api_key, current_prompt, df, top_n=2)
print(response)
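For intuition, the top_n argument presumably controls how many question-answer rows are retrieved as context before the model answers. A toy stand-in for that retrieval step (word-overlap scoring instead of real embeddings, a deliberate simplification rather than the library's actual implementation) might look like:

```python
import re

def retrieve_top_n(prompt, qa_pairs, top_n=2):
    """Return the top_n (question, answer) pairs whose questions share
    the most words with the prompt.

    Toy stand-in for embedding-based retrieval: production RAG scores
    similarity with vector embeddings, not raw word overlap.
    """
    tokenize = lambda text: set(re.findall(r"\w+", text.lower()))
    prompt_words = tokenize(prompt)
    ranked = sorted(
        qa_pairs,
        key=lambda pair: len(prompt_words & tokenize(pair[0])),
        reverse=True,
    )
    return ranked[:top_n]
```

The retrieved pairs are then stuffed into the prompt sent to the model, which is what grounds the response in your document.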

Moreover, you can push your data to the cloud. Here's a complete example illustrating how to use huggify-data to push data (assuming an existing .csv file with columns questions and answers) to the Hugging Face Hub:

import pandas as pd

from huggify_data.push_modules import DataFrameUploader

# Example usage:
df = pd.read_csv('/content/toy_data.csv')
uploader = DataFrameUploader(df, hf_token="<huggingface-token-here>", repo_name='<desired-repo-name>', username='<your-username>')
uploader.process_data()
uploader.push_to_hub()
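The uploader assumes the DataFrame carries questions and answers columns. A toy .csv in that schema can be built like this (file name and contents are illustrative):

```python
import pandas as pd

# Build a toy DataFrame in the schema DataFrameUploader expects:
# a 'questions' column and an 'answers' column.
toy_df = pd.DataFrame({
    "questions": ["What does huggify-data do?", "How is it installed?"],
    "answers": [
        "It pushes question-answer datasets to the Hugging Face Hub.",
        "pip install huggify-data",
    ],
})
toy_df.to_csv("toy_data.csv", index=False)

# Round-trip to confirm the expected columns survive.
df = pd.read_csv("toy_data.csv")
print(list(df.columns))
```

Once pushed, collaborators can pull the dataset back with datasets.load_dataset("<your-username>/<desired-repo-name>").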

Here's a complete example illustrating how to use the huggify-data library to fine-tune a Llama2 model (assuming you have a directory from Hugging Face ready):

from huggify_data.train_modules import *
from google.colab import userdata  # Colab secrets helper; outside Colab, set the token directly

# Parameters
model_name = "NousResearch/Llama-2-7b-chat-hf" # Recommended base model
dataset_name = "eagle0504/sample_toy_data_v9" # Desired name, e.g., <hf_user_id>/<desired_name>
new_model = "youthless-homeless-shelter-web-scrape-dataset-v4" # Desired name
huggingface_token = userdata.get('HF_TOKEN')  # reads the token from Colab secrets

# Initiate
trainer = LlamaTrainer(model_name, dataset_name, new_model, huggingface_token)
peft_config = trainer.configure_lora()
training_args = trainer.configure_training_arguments(num_train_epochs=1)

# Train
trainer.train_model(training_args, peft_config)

# Inference
some_model, some_tokenizer = trainer.load_model_and_tokenizer(
    base_model_path="NousResearch/Llama-2-7b-chat-hf",
    new_model_path=new_model,  # the fine-tuned model saved above
)

prompt = "hi, tell me a joke"
response = trainer.generate_response(
    some_model,
    some_tokenizer,
    prompt,
    max_len=200)
print(response)

To perform inference, please follow the example below:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="eagle0504/youthless-homeless-shelter-web-scrape-dataset-v4") # Same name as above
response = pipe("### Human: What is YSA? ### Assistant: ")
print(response[0]["generated_text"])
print(response[0]["generated_text"].split("### ")[-1])
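The split("### ") trick above works because the fine-tuning data uses a ### Human: … ### Assistant: … prompt format. A small helper (the function name is illustrative) makes the extraction slightly more robust by also stripping the Assistant: prefix:

```python
def extract_assistant_reply(generated_text):
    """Return the text after the last '### ' marker, minus the
    'Assistant:' prefix and surrounding whitespace."""
    last_chunk = generated_text.split("### ")[-1]
    if last_chunk.startswith("Assistant:"):
        last_chunk = last_chunk[len("Assistant:"):]
    return last_chunk.strip()

print(extract_assistant_reply(
    "### Human: What is YSA? ### Assistant: YSA is a youth shelter."
))  # YSA is a youth shelter.
```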

License

This project is licensed under the MIT License. See the LICENSE file for more details.

Contributing

Contributions are welcome! Please open an issue or submit a pull request if you have any improvements or suggestions.

Contact

For any questions or support, please contact [eagle0504@gmail.com](mailto:eagle0504@gmail.com).

About Me

Hello there! I'm excited to share a bit about myself and my projects. Check out these links for more information:

Feel free to explore and connect with me! 😊

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

huggify_data-0.4.4.tar.gz (12.4 kB)

Built Distribution

huggify_data-0.4.4-py3-none-any.whl (14.1 kB)

