This is a helper library to push data to HuggingFace.
huggify-data
Introduction
huggify-data 📦 is a Python library 🐍 designed to simplify the process of scraping .pdf documents, generating question-answer pairs with OpenAI, conversing with the documents, and uploading datasets 📊 to the Hugging Face Hub 🤗. This library allows you to verify ✅, process 🔄, and push 🚀 your pandas DataFrame directly to Hugging Face, making it easier to share and collaborate 🤝 on datasets. In addition, the new version enables users to fine-tune the Llama2 model on their proprietary data, extending its capabilities even further. As the name suggests, the huggify-data package wraps your data workflow in warmth, comfort, and user-friendly interactions, making data handling feel as reassuring and pleasant as a hug.
Repo
You can access the repo here: ✨ Huggify Data ✨
Installation
To use huggify-data, ensure you have the necessary libraries installed. You can easily install them using pip:
pip install huggify-data
Notebooks
We have made tutorial notebooks available to guide you through the process step-by-step:
- Step 1: Scrape any .pdf file and generate question-answer pairs. Link
- Step 2: Fine-tune the Llama2 model on customized data. Link
- Step 3: Perform inference on customized data. Link
Examples
Here's a complete example illustrating how to use huggify-data to scrape a PDF and save it as question-answer pairs in a .csv file. The following block of code will scrape the content, convert it into a .csv, and save the file locally:
from huggify_data.scrape_modules import *
# Example usage:
pdf_path = "path_of_pdf.pdf"
openai_api_key = "<sk-API_KEY_HERE>"
generator = PDFQnAGenerator(pdf_path, openai_api_key)
generator.process_scraped_content()
generator.generate_questions_answers()
df = generator.convert_to_dataframe()
print(df)
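To actually save the file locally, as mentioned above, write the DataFrame out with pandas (the file name here is just a placeholder):

# Save the question-answer pairs locally as a .csv file
df.to_csv("qna_pairs.csv", index=False)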
When you have a .csv file or a pd.DataFrame from the previous chunk of code, you can run the following code to iteratively generate a list of .md files from it:
from huggify_data.bot_modules import ChatBot
from huggify_data.generate_md_modules import *

bot = ChatBot(api_key=openai_api_key)
# This code will generate a list of .md files
# Please make sure you are in the desired directory
markdown_generator = MarkdownGenerator(bot, df)
markdown_generator.generate_markdown()
After the .md files are generated, one can navigate to here to build a chatbot that iteratively reads them in through a Retrieval-Augmented Generation (RAG) pipeline using llama_index.
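As a minimal sketch of such a pipeline (not part of huggify-data; it assumes the generated .md files sit in a local docs/ folder, a recent llama_index with the llama_index.core layout, and an OpenAI key available in the environment):

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Load every .md file from the (assumed) docs/ directory
documents = SimpleDirectoryReader("docs", required_exts=[".md"]).load_data()

# Build an in-memory vector index and turn it into a query engine
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

# Ask a question grounded in the retrieved .md content
response = query_engine.query("<question_about_the_document>")
print(response)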
Once you have created a data frame of question-answer pairs, you can have a conversation with your data:
from huggify_data.bot_modules import *
current_prompt = "<question_about_the_document>"
chatbot = ChatBot(api_key=openai_api_key)
response = chatbot.run_rag(openai_api_key, current_prompt, df, top_n=2)
print(response)
Moreover, you can push the data to the cloud. Here's a complete example illustrating how to use the huggify-data library to push data (assuming an existing .csv file with columns questions and answers) to the Hugging Face Hub:
from huggify_data.push_modules import DataFrameUploader
import pandas as pd
# Example usage:
df = pd.read_csv('/content/toy_data.csv')
uploader = DataFrameUploader(df, hf_token="<huggingface-token-here>", repo_name='<desired-repo-name>', username='<your-username>')
uploader.process_data()
uploader.push_to_hub()
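After the push completes, you can sanity-check the upload by loading the dataset back from the Hub with the datasets library (the repo id below is a placeholder built from your username and repo name):

from datasets import load_dataset

# Load the freshly pushed dataset back from the Hugging Face Hub
ds = load_dataset("<your-username>/<desired-repo-name>")
print(ds)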
Here's a complete example illustrating how to use the huggify-data library to fine-tune a Llama2 model (assuming you have a dataset repo on the Hugging Face Hub ready):
from huggify_data.train_modules import *
# Parameters
model_name = "NousResearch/Llama-2-7b-chat-hf" # Recommended base model
dataset_name = "eagle0504/sample_toy_data_v9" # Desired name, e.g., <hf_user_id>/<desired_name>
new_model = "youthless-homeless-shelter-web-scrape-dataset-v4" # Desired name
from google.colab import userdata  # this example reads the token from Colab secrets
huggingface_token = userdata.get('HF_TOKEN')
# Initiate
trainer = LlamaTrainer(model_name, dataset_name, new_model, huggingface_token)
peft_config = trainer.configure_lora()
training_args = trainer.configure_training_arguments(num_train_epochs=1)
# Train
trainer.train_model(training_args, peft_config)
# Inference
some_model, some_tokenizer = trainer.load_model_and_tokenizer(
    base_model_path="NousResearch/Llama-2-7b-chat-hf",
    new_model_path="ysa-test-july-4-v3",  # should match the fine-tuned model name from above
)
prompt = "hi, tell me a joke"
response = trainer.generate_response(
    some_model,
    some_tokenizer,
    prompt,
    max_len=200,
)
print(response)
To perform inference, please follow the example below:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("text-generation", model="eagle0504/youthless-homeless-shelter-web-scrape-dataset-v4") # Same name as above
response = pipe("### Human: What is YSA? ### Assistant: ")
print(response[0]["generated_text"])
print(response[0]["generated_text"].split("### ")[-1])
License
This project is licensed under the MIT License. See the LICENSE file for more details.
Contributing
Contributions are welcome! Please open an issue or submit a pull request if you have any improvements or suggestions.
Contact
For any questions or support, please contact [eagle0504@gmail.com](mailto:eagle0504@gmail.com).
About Me
Hello there! I'm excited to share a bit about myself and my projects. Check out these links for more information:
- 🏠 Personal Site: ✨ y-yin.io ✨
- 🎓 Education Site: 📚 Future Minds 📚
Feel free to explore and connect with me! 😊