This is a helper library to push data to HuggingFace.
huggify-data
Introduction
huggify-data 📦 is a Python library 🐍 designed to simplify the process of scraping .pdf documents, generating question-answer pairs with OpenAI, conversing with your documents, and uploading datasets 📊 to the Hugging Face Hub 🤗. The library lets you verify ✅, process 🔄, and push 🚀 a pandas DataFrame directly to Hugging Face, making it easier to share and collaborate 🤝 on datasets. Additionally, the new version lets you fine-tune the Llama2 model on your proprietary data, extending its capabilities even further. As the name suggests, huggify-data wraps your data workflow in warmth, comfort, and user-friendly interactions, making data handling feel as reassuring and pleasant as a hug.
Repo
You can access the repo here: ✨ Huggify Data ✨
Installation
To use huggify-data, ensure you have the necessary libraries installed. You can easily install them using pip:
```shell
pip install huggify-data
```
Notebooks
We have made tutorial notebooks available to guide you through the process step-by-step:
- Step 1: Scrape any .pdf file and generate question-answer pairs. Link
- Step 2: Fine-tune the Llama2 model on customized data. Link
- Step 3: Perform inference on customized data. Link
Examples
Here's a complete example illustrating how to use huggify-data to scrape a PDF and generate question-answer pairs. The following block of code will scrape the content and convert it into a pandas DataFrame of question-answer pairs:
```python
from huggify_data.scrape_modules import *

# Example usage:
pdf_path = "path_of_pdf.pdf"
openai_api_key = "<sk-API_KEY_HERE>"
generator = PDFQnAGenerator(pdf_path, openai_api_key)
generator.process_scraped_content()
generator.generate_questions_answers()
df = generator.convert_to_dataframe()
print(df)
```
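The resulting DataFrame can be persisted to a .csv file with plain pandas. The sketch below uses a small hypothetical frame in place of `generator.convert_to_dataframe()`, and the filename is just an example:

```python
import pandas as pd

# Hypothetical Q&A frame standing in for generator.convert_to_dataframe()
df = pd.DataFrame({
    "questions": ["What is huggify-data?"],
    "answers": ["A helper library for pushing datasets to Hugging Face."],
})

df.to_csv("qna_pairs.csv", index=False)  # save locally
reloaded = pd.read_csv("qna_pairs.csv")  # round-trip check
print(reloaded.shape)  # -> (1, 2)
```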
Once you have a .csv file or a pd.DataFrame from the previous block of code, you can run the following code to iteratively generate a list of .md files from it.
```python
from huggify_data.bot_modules import ChatBot
from huggify_data.generate_md_modules import *

bot = ChatBot(api_key=openai_api_key)

# This code generates a list of .md files.
# Make sure you are in the desired output directory first.
markdown_generator = MarkdownGenerator(bot, df)
markdown_generator.generate_markdown()
```
After the .md files are generated, you can navigate here to build a chatbot that reads them in via a Retrieval-Augmented Generation (RAG) pipeline using llama_index.
Once you have created a data frame of question-answer pairs, you can have a conversation with your data:
```python
from huggify_data.bot_modules import *

current_prompt = "<question_about_the_document>"
chatbot = ChatBot(api_key=openai_api_key)
response = chatbot.run_rag(openai_api_key, current_prompt, df, top_n=2)
print(response)
```
You can also push the data to the cloud. Here's a complete example illustrating how to use huggify-data to push data (assuming an existing .csv file with columns questions and answers) to the Hugging Face Hub:
```python
import pandas as pd

from huggify_data.push_modules import DataFrameUploader

# Example usage:
df = pd.read_csv('/content/toy_data.csv')
uploader = DataFrameUploader(
    df,
    hf_token="<huggingface-token-here>",
    repo_name='<desired-repo-name>',
    username='<your-username>',
)
uploader.process_data()
uploader.push_to_hub()
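Before uploading, it can help to sanity-check that the DataFrame actually has the questions and answers columns the example above assumes. The check below is a plain-pandas sketch, not part of the huggify-data API:

```python
import pandas as pd

def validate_qna_frame(df: pd.DataFrame) -> None:
    """Raise if df is missing the columns the upload example expects."""
    required = {"questions", "answers"}
    missing = required - set(df.columns)
    if missing:
        raise ValueError(f"DataFrame is missing columns: {sorted(missing)}")
    if df.empty:
        raise ValueError("DataFrame has no rows to upload")

# Passes silently for a well-formed frame:
df = pd.DataFrame({"questions": ["What is YSA?"], "answers": ["A shelter."]})
validate_qna_frame(df)
```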
Here's a complete example illustrating how to use the huggify-data library to fine-tune a Llama2 model (assuming you have a directory from Hugging Face ready):
```python
from huggify_data.train_modules import *
from google.colab import userdata  # HF token stored in Colab secrets

# Parameters
model_name = "NousResearch/Llama-2-7b-chat-hf"  # Recommended base model
dataset_name = "eagle0504/sample_toy_data_v9"  # Desired name, e.g., <hf_user_id>/<desired_name>
new_model = "youthless-homeless-shelter-web-scrape-dataset-v4"  # Desired name
huggingface_token = userdata.get('HF_TOKEN')

# Initiate
trainer = LlamaTrainer(model_name, dataset_name, new_model, huggingface_token)
peft_config = trainer.configure_lora()
training_args = trainer.configure_training_arguments(num_train_epochs=1)

# Train
trainer.train_model(training_args, peft_config)

# Inference
some_model, some_tokenizer = trainer.load_model_and_tokenizer(
    base_model_path="NousResearch/Llama-2-7b-chat-hf",
    new_model_path="ysa-test-july-4-v3",
)
prompt = "hi, tell me a joke"
response = trainer.generate_response(
    some_model,
    some_tokenizer,
    prompt,
    max_len=200,
)
print(response)
```
To perform inference, please follow the example below:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="eagle0504/youthless-homeless-shelter-web-scrape-dataset-v4")  # Same name as above
response = pipe("### Human: What is YSA? ### Assistant: ")
print(response[0]["generated_text"])
print(response[0]["generated_text"].split("### ")[-1])
```
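Because text-generation pipelines echo the prompt, the final split above isolates the assistant's turn. A small reusable helper (plain Python, not part of huggify-data, and assuming the "### Human: ... ### Assistant: ..." prompt format used above) might look like:

```python
def extract_assistant_reply(generated_text: str) -> str:
    """Return the text after the last '### Assistant:' marker.

    Assumes the '### Human: ... ### Assistant: ...' prompt format
    shown in the inference example above.
    """
    marker = "### Assistant:"
    if marker in generated_text:
        return generated_text.split(marker)[-1].strip()
    return generated_text.strip()  # no marker: return text unchanged

sample = "### Human: What is YSA? ### Assistant: YSA is a youth shelter."
print(extract_assistant_reply(sample))  # -> YSA is a youth shelter.
```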
License
This project is licensed under the MIT License. See the LICENSE file for more details.
Contributing
Contributions are welcome! Please open an issue or submit a pull request if you have any improvements or suggestions.
Contact
For any questions or support, please contact [eagle0504@gmail.com](mailto:eagle0504@gmail.com).
About Me
Hello there! I'm excited to share a bit about myself and my projects. Check out these links for more information:
- 🏠 Personal Site: ✨ y-yin.io ✨
- 🎓 Education Site: 📚 Future Minds 📚
Feel free to explore and connect with me! 😊
File details
Details for the file huggify_data-0.4.4.tar.gz.
File metadata
- Download URL: huggify_data-0.4.4.tar.gz
- Upload date:
- Size: 12.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.8.3 CPython/3.11.1 Windows/10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b6a47f67c52d7b3edcb0a86f315ba9a643c532e417260c784b60d552ce892d03 |
| MD5 | e827d3af7710661cd6963ea8701229a8 |
| BLAKE2b-256 | 7bb3f9a3cb22145c822894c66c8178edda039c9f6db8e5917398d886061ae2e4 |
File details
Details for the file huggify_data-0.4.4-py3-none-any.whl.
File metadata
- Download URL: huggify_data-0.4.4-py3-none-any.whl
- Upload date:
- Size: 14.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.8.3 CPython/3.11.1 Windows/10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 408c56a901e535c431c685ffce415b5a399e611c332a6cdd98278519f6c0d43d |
| MD5 | ce13f980595bae2fcb79ed6aefd281da |
| BLAKE2b-256 | 1417ebd486fd6bccd2e3880857bf750012aefc910f0775454b343a3342782ca5 |